[
https://issues.apache.org/jira/browse/PIG-4507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497873#comment-14497873
]
Adrien Bidault commented on PIG-4507:
-------------------------------------
Thanks for your answer, I really appreciate it.
We have thought of a similar technique. However, the punctuation cleaning in
our case has a particularity: we need to remove all the dots except those that
belong to a "word" in a given list (for instance, since ".net" is part of this
list, the dot that belongs to this word must not be cleaned out). This list is
pretty extensive, so looping through it with a condition seems tedious and
inefficient. This is where the idea of the REGEX treatment stemmed from.
At the moment we are doing this, but it does not work well because the list of
"reserved terms" may contain entries made of two words, and a comparison
against a single token never matches them.
Example:
After tokenizing we have (.net)
(3.0)
and if the list only keeps the two-word term (.net 3.0), it cannot match here.
Hence the need to apply the REGEX to the entire string (not to a collection of
tokens).
clean2 = FOREACH clean1 GENERATE id, FLATTEN(TOKENIZE(query)) as query;
clean3 = FILTER clean2 by query MATCHES '(.net)|(.net 3.0)|(.net 4.0)|.*(\\w+).*' ;
(this is just an extract of the REGEX)
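A minimal sketch of the whole-string idea, written in Python purely for
illustration (it is not Pig code). The RESERVED list and the clean() helper are
hypothetical names, and the three terms are only the ones quoted above. The
pattern tries the reserved terms first, with their dots escaped, and only then
falls back to matching a single punctuation character, so a reserved term is
kept whole while every other punctuation character is dropped. Whether the same
pattern can be passed unchanged to a Pig function such as REPLACE is an
untested assumption.

```python
import re

# Hypothetical reserved-term list; the real list is much longer.
RESERVED = [".net 3.0", ".net 4.0", ".net"]

# Longest terms first, so that ".net 3.0" is tried before ".net".
_terms = sorted(RESERVED, key=len, reverse=True)

# Group 1 captures a reserved term (dots escaped by re.escape).
# The second alternative matches one character that is neither a word
# character nor whitespace, i.e. a punctuation or special symbol.
PATTERN = re.compile(
    "(" + "|".join(re.escape(t) for t in _terms) + r")|[^\w\s]"
)

def clean(query):
    # Keep reserved terms untouched; replace any other match with nothing.
    return PATTERN.sub(lambda m: m.group(1) or "", query)
```

Because the regex runs over the full string, a two-word term such as
".net 3.0" is seen as one unit, which is not possible once the string has been
split into tokens.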
Regards
Adrien
> Problem with REGEX which just match for the first word
> ------------------------------------------------------
>
> Key: PIG-4507
> URL: https://issues.apache.org/jira/browse/PIG-4507
> Project: Pig
> Issue Type: Bug
> Affects Versions: 0.12.0
> Environment: IBM Infosphere BigInsights v3.0.0.1
> Reporter: Adrien Bidault
> Original Estimate: 6h
> Remaining Estimate: 6h
>
> I am trying to eliminate punctuation and special symbols from a string using
> REGEX of a type "(\\w+)". The problem is that this REGEX treatment is applied
> to the first word of the string only.
> Example:
> clean3 = FOREACH clean1 GENERATE id, REGEX_EXTRACT_ALL('toto, likes ... to play ', '(\\w+)');
> It just returns "toto" instead of "toto likes to play"
> Would you guys have any ideas?
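
For reference, the result expected in the description above corresponds to
applying the pattern repeatedly over the whole string and joining the matches.
A minimal sketch in Python, for illustration only: words_only() is a
hypothetical helper, and nothing here describes how Pig's REGEX_EXTRACT_ALL
behaves internally.

```python
import re

def words_only(text):
    # re.findall scans the whole string and returns every non-overlapping
    # run of word characters, not just the first one.
    return " ".join(re.findall(r"\w+", text))
```

Called on 'toto, likes ... to play ', this returns "toto likes to play".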