[ 
https://issues.apache.org/jira/browse/HIVE-519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12714102#action_12714102
 ] 

Martin Dittus commented on HIVE-519:
------------------------------------

To reiterate -- the behaviour reported is that records with certain character 
sequences (in this example: an HTTP request line with parentheses in the 
requested path) take many orders of magnitude longer to process than usual. 

This looks like a result of Java's regex implementation being a backtracking, 
NFA-based matcher, which can take exponential time in worst-case scenarios 
(probably including this one). See e.g. 
http://weblogs.java.net/blog/tomwhite/archive/2006/03/a_faster_java_r.html for 
some background.

There are essentially two options to avoid this: alter the expression, or use a 
different regex library. The former looks feasible here.

I'll use TestTCTLSeparatedProtocol.test1ApacheLogFormat() as an example.

This is the pattern generated for this test:

(?:^| )(("|\[|\])(?:[^("|\[|\])]+|("|\[|\])("|\[|\]))*("|\[|\])|[^ ]*)

Note the sub-expression: ...[^("|\[|\])]+...

I.e., it builds a negated character class ("[^...]") that unfortunately 
doesn't contain a list of characters, but instead another pattern group in 
parentheses, namely the value of QUOTE_CHAR: ("|\[|\])
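
To illustrate the point above, here is a minimal sketch (not code from Hive; 
the class name CharClassDemo and its helper are made up for this example). 
Inside a Java character class the characters "(", "|" and ")" are ordinary 
members, so the generated class excludes them along with the quote characters.

```java
import java.util.regex.Pattern;

public class CharClassDemo {
    // The class as currently generated: the group syntax is read literally.
    static final String GENERATED = "[^(\"|\\[|\\])]";
    // The class that was presumably intended: only the quote characters.
    static final String INTENDED = "[^\"\\[\\]]";

    // True if the negated class rejects the given single-character string.
    static boolean excludes(String charClass, String ch) {
        return !Pattern.matches(charClass, ch);
    }

    public static void main(String[] args) {
        for (String ch : new String[] {"a", "\"", "[", "]", "(", ")", "|"}) {
            System.out.println(ch
                + "  generated excludes: " + excludes(GENERATED, ch)
                + "  intended excludes: " + excludes(INTENDED, ch));
        }
    }
}
```

Both classes reject the three quote characters; only the generated one also 
rejects parentheses and the pipe.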

After I manually converted this sub-expression to the legitimate "[^"\[\]]+" 
character class, the pattern matcher performed well against Johan's test 
case.
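
A rough way to compare the two patterns side by side is sketched below. The 
log line is made up for illustration (it is not Johan's test case) and is 
deliberately short, so that the generated pattern still finishes quickly; the 
class and method names are hypothetical as well.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuoteClassComparison {
    // Pattern as generated for the test, copied from the comment above.
    static final String GENERATED =
        "(?:^| )((\"|\\[|\\])(?:[^(\"|\\[|\\])]+|(\"|\\[|\\])(\"|\\[|\\]))*(\"|\\[|\\])|[^ ]*)";
    // Same pattern with the sub-expression replaced by a real character class.
    static final String FIXED =
        "(?:^| )((\"|\\[|\\])(?:[^\"\\[\\]]+|(\"|\\[|\\])(\"|\\[|\\]))*(\"|\\[|\\])|[^ ]*)";

    // Made-up sample line with parentheses in the requested path.
    static final String LINE =
        "127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] \"GET /a(b)/c HTTP/1.0\" 200 2326";

    // Collect group 1 of every successive match, i.e. one entry per field.
    static List<String> fields(String regex, String line) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(line);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println("generated: " + fields(GENERATED, LINE));
        System.out.println("fixed:     " + fields(FIXED, LINE));
    }
}
```

With this input the generated pattern gives up on the quoted request at the 
"(" and splits it at its spaces, while the corrected one keeps it as a single 
field. Longer requests make the failed attempt correspondingly more expensive.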

Tbh I'm not sure the character class that currently gets generated is what was 
intended: inside "[...]" the characters "(", "|" and ")" are treated as 
literals, so the class also excludes parentheses and pipes. That is an 
unintended side effect, and it fits the parentheses in the failing request 
path. To implement this properly the pattern builder would need access to two 
different representations of the QUOTE_CHAR parameter (as a grouped expression 
and as a character class), where currently there is only one. 
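
One possible shape for such a builder is sketched below. This is only an 
illustration under assumptions: it is not the TCTLSeparatedProtocol code, the 
names QuotePatternSketch, asGroup, asNegatedClass and build are invented here, 
and it assumes the quote characters are available as individual chars rather 
than as one pre-built expression.

```java
import java.util.regex.Pattern;

public class QuotePatternSketch {
    // Backslash-escape anything that is not a letter or digit. Java allows a
    // backslash before any non-alphabetic character, in or out of a class.
    static String escape(char c) {
        return Character.isLetterOrDigit(c) ? String.valueOf(c) : "\\" + c;
    }

    // Grouped representation, e.g. (\"|\[|\])
    static String asGroup(char[] quoteChars) {
        StringBuilder sb = new StringBuilder("(");
        for (int i = 0; i < quoteChars.length; i++) {
            if (i > 0) {
                sb.append('|');
            }
            sb.append(escape(quoteChars[i]));
        }
        return sb.append(')').toString();
    }

    // Negated character class representation, e.g. [^\"\[\]]
    static String asNegatedClass(char[] quoteChars) {
        StringBuilder sb = new StringBuilder("[^");
        for (char c : quoteChars) {
            sb.append(escape(c));
        }
        return sb.append(']').toString();
    }

    // Assemble a field pattern of the same shape as the one quoted above.
    static Pattern build(char separator, char[] quoteChars) {
        String q = asGroup(quoteChars);
        String notQ = asNegatedClass(quoteChars);
        String sep = escape(separator);
        return Pattern.compile(
            "(?:^|" + sep + ")(" + q + "(?:" + notQ + "+|" + q + q + ")*" + q
            + "|[^" + sep + "]*)");
    }

    public static void main(String[] args) {
        Pattern p = build(' ', new char[] {'"', '[', ']'});
        System.out.println(p.pattern());
    }
}
```

Deriving both forms from the same list of characters keeps them in sync, which 
a single pre-built QUOTE_CHAR string cannot guarantee.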

(You probably need to apply HIVE-520 first to make the 
TestTCTLSeparatedProtocol unit test run.)


> Regex processing gets stuck when querying weblogs
> -------------------------------------------------
>
>                 Key: HIVE-519
>                 URL: https://issues.apache.org/jira/browse/HIVE-519
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Serializers/Deserializers
>    Affects Versions: 0.4.0
>            Reporter: Johan Oskarsson
>            Priority: Critical
>             Fix For: 0.4.0
>
>         Attachments: hive-stdout
>
>
> When running a simple query on a table similar to the apachelog table here: 
> http://wiki.apache.org/hadoop/Hive/UserGuide the regular expression matcher 
> gets stuck in an infinite loop.

