[ 
https://issues.apache.org/jira/browse/PIG-4507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497873#comment-14497873
 ] 

Adrien Bidault commented on PIG-4507:
-------------------------------------

Thanks for your answer, I really appreciate it.

We have thought of a similar technique. However, the punctuation cleaning has a 
particularity in our case: we need to remove all the dots except those that 
belong to a "word" in a given list (for instance, since ".net" is part of this 
list, the dot in that word must not be cleaned out). The list is fairly 
extensive, so looping through it with a condition seems tedious and 
inefficient. This is where the idea of the REGEX treatment came from.

At the moment we are doing the following, but it does not work well because the 
list of "reserved terms" may contain terms made of two words, and a 
comparison against a single token never matches those.
Example:
Now we have the tokens (.net)
                       (3.0)
If the list keeps the pair of terms as (.net 3.0), it cannot match here.
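
To make the mismatch concrete, here is a small sketch in Python (not Pig, only 
to illustrate the logic). The tokenize function is a rough stand-in for 
TOKENIZE, assuming it splits on whitespace, commas, double quotes, parentheses 
and asterisks; the reserved list and the sample query are made up.

```python
import re

# Hypothetical reserved terms; the real list is much longer.
RESERVED = {".net", ".net 3.0"}

def tokenize(text):
    # Rough stand-in for TOKENIZE (assumed delimiters: whitespace , " ( ) *).
    return [t for t in re.split(r'[\s,"()*]+', text) if t]

tokens = tokenize("migration to .net 3.0")

# Compare each token against the reserved list, one token at a time.
matched = [t for t in tokens if t in RESERVED]

print(tokens)   # ['migration', 'to', '.net', '3.0']
print(matched)  # ['.net']
```

The two-word term ".net 3.0" is never found because it no longer exists as a 
single token once the string has been split.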

Hence the need to apply the REGEX to the entire string (not to a 
collection of tokens).

clean2 = FOREACH clean1 GENERATE id, FLATTEN(TOKENIZE(query)) as query;
-- the pattern below is only an extract of the full REGEX
clean3 = FILTER clean2 by query MATCHES '(.net)|(.net 3.0)|(.net 4.0)|.*(\\w+).*';
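
What we would like instead is a single pass over the whole string. Below is a 
minimal sketch of that idea in Python (not Pig, and not something we have 
running): one pattern whose first alternative captures a reserved term and 
whose second alternative matches any single punctuation character. Reserved 
terms are kept, everything else that matches is replaced by a space. The 
names RESERVED_TERMS, build_pattern and clean are hypothetical, and the three 
terms are only an extract of the list.

```python
import re

# Hypothetical extract of the reserved-terms list.
RESERVED_TERMS = [".net", ".net 3.0", ".net 4.0"]

def build_pattern(terms):
    # Longest terms first, so ".net 3.0" is tried before ".net".
    ordered = sorted(terms, key=len, reverse=True)
    protected = "|".join(re.escape(t) for t in ordered)
    # Group 1: a reserved term not followed by a word character (kept).
    # Second alternative: one character that is neither word nor whitespace.
    return re.compile("(" + protected + r")(?!\w)|[^\w\s]")

PATTERN = build_pattern(RESERVED_TERMS)

def clean(text):
    # Keep reserved terms, turn any other punctuation into a space.
    cleaned = PATTERN.sub(lambda m: m.group(1) or " ", text)
    # Collapse the runs of whitespace left behind.
    return " ".join(cleaned.split())

print(clean("toto,  likes ... to play "))     # toto likes to play
print(clean("we use .net 3.0, and (.net)!"))  # we use .net 3.0 and .net
print(clean("version 3.0 only"))              # version 3 0 only
```

The last call shows that a dot is only protected when it is part of a reserved 
term: "3.0" on its own loses its dot. Whether something equivalent can be 
written with Pig's own regex functions on the whole string is exactly what we 
are trying to find out.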

Regards

Adrien


> Problem with REGEX which just match for the first word
> ------------------------------------------------------
>
>                 Key: PIG-4507
>                 URL: https://issues.apache.org/jira/browse/PIG-4507
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.12.0
>         Environment: IBM Infosphere BigInsights v3.0.0.1
>            Reporter: Adrien Bidault
>   Original Estimate: 6h
>  Remaining Estimate: 6h
>
> I am trying to eliminate punctuation and special symbols from a string using 
> a REGEX of the type "(\\w+)". The problem is that this REGEX treatment is 
> applied to the first word of the string only.
> Example:
> clean3 = FOREACH clean1 GENERATE id, REGEX_EXTRACT_ALL('toto,  likes ... to play ', '(\\w+)');
> It just returns "toto" instead of "toto likes to play".
> Would you guys have any ideas?



