[ https://issues.apache.org/jira/browse/HIVE-519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12714102#action_12714102 ]
Martin Dittus commented on HIVE-519:
------------------------------------

To reiterate: the behaviour reported is that records with certain character sequences (in this example, an HTTP request line with parentheses in the requested path) take many orders of magnitude longer to process than usual.

This looks like a result of Java's regex implementation using a backtracking Non-deterministic Finite Automaton (NFA), which performs badly in worst-case scenarios (probably including this one). See e.g. http://weblogs.java.net/blog/tomwhite/archive/2006/03/a_faster_java_r.html for some background.

There are essentially two options to avoid this: alter the expression, or use a different regex library.

There may be a way to do the former. I'll use TestTCTLSeparatedProtocol.test1ApacheLogFormat() as an example. This is the pattern generated for this test:

    (?:^| )(("|\[|\])(?:[^("|\[|\])]+|("|\[|\])("|\[|\]))*("|\[|\])|[^ ]*)

Note the sub-expression:

    ...[^("|\[|\])]+...

I.e., it builds a negated character class ("[^...]") that unfortunately doesn't contain a list of characters, but instead another pattern group in parentheses, namely the value of QUOTE_CHAR:

    ("|\[|\])

After I manually converted this sub-expression to the legitimate "[^"\[\]]+" character class, the pattern matcher performed admirably against Johan's test case.

To be honest, I'm not sure whether the character class that currently gets generated is valid; at minimum it may have some unintended side effects.

To implement this properly, the pattern builder would need access to two different representations of the QUOTE_CHAR parameter (as a grouped expression and as a character class), where currently there is only one.

(You probably need to apply HIVE-520 first to make the TestTCTLSeparatedProtocol unit test run.)
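
The difference between the two negated character classes quoted above can be checked in isolation with java.util.regex. The following is only a minimal sketch, not the actual TCTLSeparatedProtocol code: the class name QuoteClassSketch, its constants and the firstField helper are hypothetical names introduced for illustration, and the field pattern is simply the one quoted above with the hand-written class substituted in.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; none of these names exist in Hive.
final class QuoteClassSketch {
    // QUOTE_CHAR as the grouped alternation quoted in the comment: ("|\[|\])
    static final String QUOTE_GROUP = "(\"|\\[|\\])";

    // Assumed second representation of the same three characters,
    // written so that it can be placed inside [...]: "\[\]
    static final String QUOTE_CLASS = "\"\\[\\]";

    // Negated class as currently generated: the group pasted between "[^" and "]".
    static final Pattern GENERATED_CLASS = Pattern.compile("[^" + QUOTE_GROUP + "]");

    // Negated class after the manual conversion described in the comment.
    static final Pattern FIXED_CLASS = Pattern.compile("[^" + QUOTE_CLASS + "]");

    // The quoted field pattern with only the negated class replaced.
    static final Pattern FIXED_FIELD = Pattern.compile(
        "(?:^| )(" + QUOTE_GROUP
        + "(?:[^" + QUOTE_CLASS + "]+|" + QUOTE_GROUP + QUOTE_GROUP + ")*"
        + QUOTE_GROUP + "|[^ ]*)");

    // Returns the first field matched by FIXED_FIELD, or null if nothing matches.
    static String firstField(String line) {
        Matcher m = FIXED_FIELD.matcher(line);
        return m.find() ? m.group(1) : null;
    }
}
```

With java.util.regex the generated class does compile, but inside the brackets "(", "|" and ")" are treated as literal characters, so they are excluded along with the quote characters. GENERATED_CLASS therefore rejects "(", while FIXED_CLASS accepts it and still rejects the double quote. That seems consistent with request paths containing parentheses being the ones that trigger the slow matching, although the sketch deliberately avoids running the original pattern against such input.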

> Regex processing gets stuck when querying weblogs
> -------------------------------------------------
>
>                 Key: HIVE-519
>                 URL: https://issues.apache.org/jira/browse/HIVE-519
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Serializers/Deserializers
>    Affects Versions: 0.4.0
>            Reporter: Johan Oskarsson
>            Priority: Critical
>             Fix For: 0.4.0
>
>         Attachments: hive-stdout
>
>
> When running a simple query on a table similar to the apachelog table here:
> http://wiki.apache.org/hadoop/Hive/UserGuide the regular expression matcher
> gets stuck in an infinite loop.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.