[ https://issues.apache.org/jira/browse/PIG-3395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13724076#comment-13724076 ]
Cheolsoo Park commented on PIG-3395:
------------------------------------
I have been testing my patch as per Rohini's suggestion, and it works
correctly. Here is what I tested:
* (cast) or (pcond and pcond) or (pcond and pcond)
* (pcond and pcond) or (cast) or (pcond and pcond)
* (pcond and pcond) or (pcond and pcond) or (cast)
In all these cases, the entire filter is rejected due to the cast expression,
which is the same as before.
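To make the three cases above concrete, here is a minimal sketch in Python. It is not how PColFilterExtractor is implemented; the node classes (PCond, Cast, BoolOp) and the can_push_down function are hypothetical names, and the sketch assumes the simplest possible rule, namely that a cast anywhere in the expression rejects the whole filter. It only models the outcome reported for these top-level OR cases.

```python
from dataclasses import dataclass
from typing import Tuple


# Hypothetical expression nodes for illustration; these are not Pig classes.
@dataclass(frozen=True)
class PCond:
    """A condition on a partition column, e.g. dateint == 20130719."""
    text: str


@dataclass(frozen=True)
class Cast:
    """A cast expression, assumed here to block partition filter push down."""
    text: str


@dataclass(frozen=True)
class BoolOp:
    op: str                       # "and" or "or"
    children: Tuple[object, ...]


def can_push_down(expr) -> bool:
    """Return True only if no cast appears anywhere in the expression."""
    if isinstance(expr, Cast):
        return False
    if isinstance(expr, PCond):
        return True
    if isinstance(expr, BoolOp):
        return all(can_push_down(child) for child in expr.children)
    raise TypeError(f"unexpected node: {expr!r}")


def pair(a: str, b: str) -> BoolOp:
    """Build (pcond and pcond)."""
    return BoolOp("and", (PCond(a), PCond(b)))


cast = Cast("cast")
cases = [
    # (cast) or (pcond and pcond) or (pcond and pcond)
    BoolOp("or", (cast, pair("p1", "p2"), pair("p3", "p4"))),
    # (pcond and pcond) or (cast) or (pcond and pcond)
    BoolOp("or", (pair("p1", "p2"), cast, pair("p3", "p4"))),
    # (pcond and pcond) or (pcond and pcond) or (cast)
    BoolOp("or", (pair("p1", "p2"), pair("p3", "p4"), cast)),
]
results = [can_push_down(case) for case in cases]
```

In this model the position of the cast among the OR branches does not matter: all three cases evaluate to False, while the same filter without the cast evaluates to True.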
Adding test cases is a bit more involved because the test helper function
isn't written for such conditions. But I will add a few test cases.
> Large filter expression makes Pig hang
> --------------------------------------
>
> Key: PIG-3395
> URL: https://issues.apache.org/jira/browse/PIG-3395
> Project: Pig
> Issue Type: Bug
> Components: impl
> Reporter: Cheolsoo Park
> Assignee: Cheolsoo Park
> Fix For: 0.12
>
> Attachments: PIG-3395.patch, thread_dump.txt
>
>
> Currently, partition filter push down is quite costly. For example, if you
> have many nested or/and expressions, Pig hangs:
> {code}
> base = load '<partitioned table>' using MyStorage();
> filt = filter base by
> (dateint == 20130719 and batchid == 'merged_1' and hour IN (19,20,21,22,23))
> or
> (dateint == 20130720 and batchid == 'merged_1' and hour IN
> (0,1,2,3,4,5,6,7,8))
> or
> (dateint == 20130720 and batchid == 'merged_2' and hour == 7)
> or
> (dateint == 20130720 and batchid == 'merged_1' and hour IN
> (9,10,11,12,13,14,15,16,17,18,19,20,21,22,23))
> or
> (dateint == 20130721 and batchid == 'merged_1' and hour IN
> (0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23))
> or
> (dateint == 20130722 and batchid == 'merged_1' and hour IN
> (0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16));
> dump filt;
> {code}
> Note that the IN operator is converted to nested ORs by the Pig parser.
> Looking at the thread dump, I found that it creates almost 60 stack frames and
> puts the JVM under heavy load. (I will attach the full stack trace.)
> {code}
> <repeated ...>
> at org.apache.pig.newplan.PColFilterExtractor.visit(PColFilterExtractor.java:504)
> at org.apache.pig.newplan.PColFilterExtractor.visit(PColFilterExtractor.java:237)
> at org.apache.pig.newplan.PColFilterExtractor.visit(PColFilterExtractor.java:504)
> at org.apache.pig.newplan.PColFilterExtractor.visit(PColFilterExtractor.java:214)
> at org.apache.pig.newplan.PColFilterExtractor.visit(PColFilterExtractor.java:504)
> at org.apache.pig.newplan.PColFilterExtractor.visit(PColFilterExtractor.java:211)
> at org.apache.pig.newplan.PColFilterExtractor.visit(PColFilterExtractor.java:108)
> {code}
> Although the filter expression can be simplified, it seems possible to make
> PColFilterExtractor more efficient.
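As a rough illustration of the note in the quoted description that IN is converted to nested ORs, the Python sketch below expands an IN list into a chain of binary ORs and measures how deep the resulting tree is. The tuple representation and the names expand_in, depth and count_ors are hypothetical, and the left-nested shape is an assumption; the exact tree the Pig parser builds may differ. The point is only that a recursive visitor has to descend one level per nested OR.

```python
def expand_in(column, values):
    """Rewrite `column IN (v1, v2, ...)` as a left-nested chain of binary ORs.

    The left-nested shape is an assumption made for illustration.
    """
    if not values:
        raise ValueError("IN list must not be empty")
    expr = ("==", column, values[0])
    for value in values[1:]:
        expr = ("or", expr, ("==", column, value))
    return expr


def depth(expr):
    """Number of nodes on the longest path from the root to a leaf."""
    if expr[0] in ("or", "and"):
        return 1 + max(depth(expr[1]), depth(expr[2]))
    return 1  # an equality comparison is a leaf in this sketch


def count_ors(expr):
    """Number of binary OR nodes in the expression."""
    if expr[0] in ("or", "and"):
        own = 1 if expr[0] == "or" else 0
        return own + count_ors(expr[1]) + count_ors(expr[2])
    return 0


# hour IN (0,1,...,23), as in the dateint == 20130721 branch of the script
hours = expand_in("hour", list(range(24)))
hours_depth = depth(hours)
hours_ors = count_ors(hours)
```

Under these assumptions, the 24-value IN list becomes 23 binary ORs nested 24 levels deep, before it is combined with the surrounding and/or conditions.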