[
https://issues.apache.org/jira/browse/BEAM-7018?focusedWorklogId=284236&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-284236
]
ASF GitHub Bot logged work on BEAM-7018:
----------------------------------------
Author: ASF GitHub Bot
Created on: 29/Jul/19 12:41
Start Date: 29/Jul/19 12:41
Worklog Time Spent: 10m
Work Description: robertwb commented on issue #8859: [BEAM-7018] Added
Regex transform for PythonSDK
URL: https://github.com/apache/beam/pull/8859#issuecomment-515974498
On Wed, Jul 24, 2019 at 6:30 PM Shoaib Zafar <[email protected]>
wrote:
> *@mszb* commented on this pull request.
> ------------------------------
>
> In sdks/python/apache_beam/transforms/util.py
> <https://github.com/apache/beam/pull/8859#discussion_r306906519>:
>
> > +      group: (optional) name of the group, it can be integer or a string value.
> +    """
> +    regex = Regex._regex_compile(regex)
> +    group = group or 0
> +
> +    def _process(element):
> +      r = regex.search(element)
> +      if r:
> +        yield r.group(group)
> +    return pcoll | FlatMap(_process)
> +
> +  @staticmethod
> +  @typehints.with_input_types(str)
> +  @typehints.with_output_types(str)
> +  @ptransform_fn
> +  def find_all(pcoll, regex):
>
> So what I understand from your feedback is:
>
> 1. for the 'Regex.find_all()', we need to change re.finditer to
> re.findall and return the result as a list.
> 2. create a new method Regex.finditer() which returns "match object"
> instances.
>
> The issue with the 'MatchObject' is that we cannot pickle it. It fails
> with the error PicklingError: Can't pickle <built-in method match of
> _sre.SRE_Pattern
>
Oh, that is really unfortunate :(.
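For reference, the limitation can be reproduced outside Beam. The following is a minimal sketch using only the standard library `pickle` module; Beam's own pickler may report the failure with a different exception or message, so the error text here should not be expected to match the one quoted above.

```python
import pickle
import re

pattern = re.compile(r"a(b*)")
match = pattern.search("abb ax abbb")

# Compiled patterns round-trip through pickle (they are re-compiled on load).
restored = pickle.loads(pickle.dumps(pattern))

# Match objects do not; the exception type and message vary by Python version.
try:
    pickle.dumps(match)
    match_is_picklable = True
except (TypeError, pickle.PicklingError):
    match_is_picklable = False

# Plain strings and tuples extracted from the match are safe to emit.
payload = pickle.dumps((match.group(0), match.groups()))
```

This is why any transform output has to be built from strings and tuples pulled out of the match, rather than the match itself.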
> So either we can make a custom match object class and return it
>
This seems nicest from an API perspective, but I'm worried about the
maintenance in faithfully emulating the match object over time.
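To make the trade-off concrete, here is a rough sketch of what such a class could look like. `MatchSnapshot` and `snapshot` are hypothetical names, not part of Beam or of the PR, and the class copies only a small subset of the match object API, which is exactly where the maintenance concern comes from.

```python
import pickle
import re
from typing import Dict, NamedTuple, Optional, Tuple


class MatchSnapshot(NamedTuple):
    """Hypothetical picklable copy of part of a match object (not a Beam API)."""

    groups: Tuple[Optional[str], ...]  # index 0 holds the whole match
    named: Dict[str, Optional[str]]    # named groups, as in groupdict()
    span: Tuple[int, int]              # (start, end) of the whole match

    def group(self, key=0):
        # Mirror match.group() for a single integer index or group name.
        if isinstance(key, int):
            return self.groups[key]
        return self.named[key]


def snapshot(match):
    """Copy the picklable parts out of a live match object."""
    return MatchSnapshot((match.group(0),) + match.groups(),
                         match.groupdict(), match.span())


snap = snapshot(re.search(r"a(?P<bs>b*)", "abb ax abbb"))
restored = pickle.loads(pickle.dumps(snap))
```

Everything the real match object offers beyond this (`start()`, `end()`, `expand()`, multi-argument `group()`, and so on) would have to be added and kept in step by hand.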
> or use KV pairs, with K as the matched string and V as a tuple
> containing all groups?
> For example:
> let's say the string is "abb ax abbb" and the regex is "a(b*)", so the
> results would be like:
> [ KV("abb", ('bb',)), KV("a", ('',)), KV("abbb", ('bbb',)) ]
> or maybe we can include group 0 in the values as well.
> [ KV("abb", ('abb', 'bb')), KV("a", ('a', '')), KV("abbb", ('abbb',
> 'bbb')) ]
>
In Python there's no such thing as a KV, it's just a 2-tuple.
Here's what I would do: many other transforms have an (optional) group
parameter, defaulting to 0. I would do that for find_all as well, so the
result would be simply
["abb", "a", "abbb"]
You could also have a special value (say, a constant like Regex.ALL) that
would return all the groups, i.e.
[('abb', 'bb'), ('a', ''), ('abbb', 'bbb')]
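
A minimal sketch of this suggestion, written against plain `re` on a single string rather than as a Beam transform. The function name `find_all` follows the PR, but the `text` argument and the module-level `ALL` sentinel are assumptions made for illustration (the comment only proposes "a constant like Regex.ALL").

```python
import re

ALL = object()  # hypothetical sentinel standing in for the suggested Regex.ALL


def find_all(text, regex, group=0):
    """Return one result per non-overlapping match of regex in text."""
    pattern = re.compile(regex)
    results = []
    for m in pattern.finditer(text):
        if group is ALL:
            # Group 0 (the whole match) followed by every capture group.
            results.append((m.group(0),) + m.groups())
        else:
            results.append(m.group(group))
    return results


print(find_all("abb ax abbb", "a(b*)"))
print(find_all("abb ax abbb", "a(b*)", group=1))
print(find_all("abb ax abbb", "a(b*)", group=ALL))
```

Every value produced is a string or a tuple of strings, so the results stay picklable. In a transform, the same loop body would presumably sit inside the function passed to FlatMap, as in the quoted diff.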
Issue Time Tracking
-------------------
Worklog Id: (was: 284236)
Time Spent: 9h 40m (was: 9.5h)
> Regex transform for Python SDK
> ------------------------------
>
> Key: BEAM-7018
> URL: https://issues.apache.org/jira/browse/BEAM-7018
> Project: Beam
> Issue Type: New Feature
> Components: sdk-py-core
> Reporter: Rose Nguyen
> Assignee: Shehzaad Nakhoda
> Priority: Minor
> Time Spent: 9h 40m
> Remaining Estimate: 0h
>
> PTransforms to use Regular Expressions to process elements in a PCollection.
> It should offer the same API as its Java counterpart:
> [https://github.com/apache/beam/blob/11a977b8b26eff2274d706541127c19dc93131a2/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/Regex.java]