[jira] Commented: (PIG-1659) sortinfo is not set for store if there is a filter after ORDER BY
[ https://issues.apache.org/jira/browse/PIG-1659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916998#action_12916998 ] Daniel Dai commented on PIG-1659: - We should set sortInfo after optimization, so we should add SetSortInfo after the optimization of the new logical plan. This code is missing. sortinfo is not set for store if there is a filter after ORDER BY - Key: PIG-1659 URL: https://issues.apache.org/jira/browse/PIG-1659 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Yan Zhou Assignee: Daniel Dai Fix For: 0.8.0 This has caused 6 (of 7) failures in the Zebra test TestOrderPreserveVariableTable. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
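The comment above amounts to an ordering constraint on compiler passes: a SetSortInfo step must run after the optimizer, because the optimizer may insert a filter between the ORDER BY and the STORE. A minimal sketch of that back-walk, using hypothetical toy classes (not Pig's actual visitor API):

```python
# Hypothetical sketch, not Pig's classes: SetSortInfo walks back from each
# STORE through order-preserving operators (such as the FILTER inserted by
# the optimizer in this issue) to find an ORDER BY, then records sort info.

class Op:
    def __init__(self, name, pred=None):
        self.name = name
        self.pred = pred          # single predecessor is enough for the sketch
        self.sort_info = None

def set_sort_info(store):
    """Attach sort info to the store if its input chain ends in a sort."""
    node = store.pred
    # Skip operators that preserve sort order.
    while node is not None and node.name == "filter":
        node = node.pred
    if node is not None and node.name == "sort":
        store.sort_info = "ascending"   # placeholder for a real SortInfo object

# Plan after optimization: sort -> filter -> store
sort = Op("sort")
filt = Op("filter", pred=sort)
store = Op("store", pred=filt)
set_sort_info(store)
```

Run before optimization, the pass would see `sort -> store` directly; the point of the fix is that it must also tolerate whatever the optimizer inserts in between.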
[jira] Updated: (PIG-1659) sortinfo is not set for store if there is a filter after ORDER BY
[ https://issues.apache.org/jira/browse/PIG-1659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1659: Attachment: PIG-1659-1.patch sortinfo is not set for store if there is a filter after ORDER BY - Key: PIG-1659 URL: https://issues.apache.org/jira/browse/PIG-1659 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Yan Zhou Assignee: Daniel Dai Fix For: 0.8.0 Attachments: PIG-1659-1.patch This has caused 6 (of 7) failures in the Zebra test TestOrderPreserveVariableTable. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1542) log level not propogated to MR task loggers
[ https://issues.apache.org/jira/browse/PIG-1542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12917079#action_12917079 ] Daniel Dai commented on PIG-1542: - Yes, -d xxx should be treated as -Ddebug=xxx. And system properties already have higher priority in the current code. (And in my mind, we should deprecate -d in favor of -Ddebug) log level not propogated to MR task loggers --- Key: PIG-1542 URL: https://issues.apache.org/jira/browse/PIG-1542 Project: Pig Issue Type: Bug Reporter: Thejas M Nair Assignee: niraj rai Fix For: 0.8.0 Attachments: PIG-1542.patch, PIG-1542_1.patch, PIG-1542_2.patch Specifying -d DEBUG does not affect the logging of the MR tasks. This was fixed earlier in PIG-882. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
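The precedence being discussed can be sketched as a small resolver: treat -d xxx as shorthand for -Ddebug=xxx, and let an explicit debug system property win. The function below is a hypothetical illustration, not Pig's actual option-parsing code:

```python
# Hypothetical resolver illustrating the precedence from the comment above:
# a -Ddebug system property beats the legacy -d flag, which beats the default.

def resolve_log_level(system_props, cli_d_flag=None, default="INFO"):
    # System properties already have higher priority in the current code.
    if "debug" in system_props:
        return system_props["debug"]
    # -d xxx is treated as shorthand for -Ddebug=xxx.
    if cli_d_flag is not None:
        return cli_d_flag
    return default
```

With this ordering, `pig -Ddebug=WARN -d DEBUG` would resolve to WARN, which matches the "system properties win" behavior described.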
[jira] Commented: (PIG-1638) sh output gets mixed up with the grunt prompt
[ https://issues.apache.org/jira/browse/PIG-1638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916725#action_12916725 ] Daniel Dai commented on PIG-1638: - +1 sh output gets mixed up with the grunt prompt - Key: PIG-1638 URL: https://issues.apache.org/jira/browse/PIG-1638 Project: Pig Issue Type: Bug Components: grunt Affects Versions: 0.8.0 Reporter: niraj rai Assignee: niraj rai Priority: Minor Fix For: 0.8.0 Attachments: PIG-1638_0.patch Many times, the grunt prompt gets mixed up with the sh output, e.g. grunt sh ls 000 autocomplete bin build build.xml grunt CHANGES.txt conf contrib In the above case, grunt is mixed up with the output. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1638) sh output gets mixed up with the grunt prompt
[ https://issues.apache.org/jira/browse/PIG-1638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1638: Status: Resolved (was: Patch Available) Hadoop Flags: [Reviewed] Resolution: Fixed Patch committed to both trunk and 0.8 branch. sh output gets mixed up with the grunt prompt - Key: PIG-1638 URL: https://issues.apache.org/jira/browse/PIG-1638 Project: Pig Issue Type: Bug Components: grunt Affects Versions: 0.8.0 Reporter: niraj rai Assignee: niraj rai Priority: Minor Fix For: 0.8.0 Attachments: PIG-1638_0.patch Many times, the grunt prompt gets mixed up with the sh output, e.g. grunt sh ls 000 autocomplete bin build build.xml grunt CHANGES.txt conf contrib In the above case, grunt is mixed up with the output. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1652) TestSortedTableUnion and TestSortedTableUnionMergeJoin fail on trunk due to estimateNumberOfReducers bug
TestSortedTableUnion and TestSortedTableUnionMergeJoin fail on trunk due to estimateNumberOfReducers bug Key: PIG-1652 URL: https://issues.apache.org/jira/browse/PIG-1652 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.8.0 Reporter: Daniel Dai Fix For: 0.8.0 TestSortedTableUnion and TestSortedTableUnionMergeJoin fail on trunk due to the input size estimation. Here is the stack of TestSortedTableUnionMergeJoin: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1002: Unable to store alias records3 at org.apache.pig.PigServer.storeEx(PigServer.java:877) at org.apache.pig.PigServer.store(PigServer.java:815) at org.apache.pig.PigServer.openIterator(PigServer.java:727) at org.apache.hadoop.zebra.pig.TestSortedTableUnionMergeJoin.testStorer(TestSortedTableUnionMergeJoin.java:203) Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2043: Unexpected error during execution. at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:326) at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1197) at org.apache.pig.PigServer.storeEx(PigServer.java:873) Caused by: java.lang.IllegalArgumentException: java.net.URISyntaxException: Illegal character in scheme name at index 69: org.apache.hadoop.zebra.pig.TestSortedTableUnionMergeJoin.testStorer1,file: at org.apache.hadoop.fs.Path.initialize(Path.java:140) at org.apache.hadoop.fs.Path.<init>(Path.java:126) at org.apache.hadoop.fs.Path.<init>(Path.java:50) at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:963) at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966) at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966) at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966) at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966) at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966) at 
org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966) at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966) at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966) at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966) at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966) at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966) at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966) at org.apache.hadoop.fs.FileSystem.globStatusInternal(FileSystem.java:902) at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:866) at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:844) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.getTotalInputFileSize(JobControlCompiler.java:715) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.estimateNumberOfReducers(JobControlCompiler.java:688) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.SampleOptimizer.visitMROp(SampleOptimizer.java:140) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceOper.visit(MapReduceOper.java:246) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceOper.visit(MapReduceOper.java:41) at org.apache.pig.impl.plan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:69) at org.apache.pig.impl.plan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:71) at org.apache.pig.impl.plan.DepthFirstWalker.walk(DepthFirstWalker.java:52) at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.SampleOptimizer.visit(SampleOptimizer.java:69) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.compile(MapReduceLauncher.java:491) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:116) at 
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:301) Caused by: java.net.URISyntaxException: Illegal character in scheme name at index 69: org.apache.hadoop.zebra.pig.TestSortedTableUnionMergeJoin.testStorer1,file: at java.net.URI$Parser.fail(URI.java:2809) at java.net.URI$Parser.checkChars(URI.java:2982) at java.net.URI$Parser.parse(URI.java:3009) at java.net.URI.<init>(URI.java:736) at org.apache.hadoop.fs.Path.initialize(Path.java:137) The reason is we are trying to
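For context, the estimation that triggers this code path is a simple bytes-per-reducer heuristic: total input size divided by a configured chunk size, capped by a maximum. A sketch of that arithmetic, with defaults mirroring Pig's pig.exec.reducers.bytes.per.reducer (1GB) and pig.exec.reducers.max (999) settings; the constants here are illustrative, and the actual failure above happens earlier, while globbing the input location to compute the total size:

```python
import math

# Sketch of the bytes-per-reducer heuristic behind estimateNumberOfReducers.
# Defaults mirror pig.exec.reducers.bytes.per.reducer and
# pig.exec.reducers.max; treat the exact values as assumptions.

def estimate_reducers(total_input_bytes,
                      bytes_per_reducer=1000000000,  # 1GB per reducer
                      max_reducers=999):
    reducers = int(math.ceil(total_input_bytes / float(bytes_per_reducer)))
    # At least one reducer, and never more than the configured cap.
    return max(1, min(reducers, max_reducers))
```

The bug in this issue is that getTotalInputFileSize received a location string that was not a filesystem path at all, so the globbing step threw before any of this arithmetic ran.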
[jira] Created: (PIG-1653) Scripting UDF fails if the path to script is an absolute path
Scripting UDF fails if the path to script is an absolute path - Key: PIG-1653 URL: https://issues.apache.org/jira/browse/PIG-1653 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Daniel Dai Fix For: 0.8.0 The following script fails: {code} register '/homes/jianyong/pig/aaa/scriptingudf.py' using jython as myfuncs; a = load '/user/pig/tests/data/singlefile/studenttab10k' using PigStorage() as (name, age, gpa:double); b = foreach a generate myfuncs.square(gpa); dump b; {code} If we change the register to use a relative path (such as aaa/scriptingudf.py), it succeeds. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
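For reference, a minimal scriptingudf.py that would satisfy the myfuncs.square(gpa) call in the script above might look like the sketch below. The original file's contents are not shown in the issue, so the function body and schema string are assumptions; Pig's Jython engine supplies the outputSchema decorator when the script is registered, and a no-op fallback keeps the file importable in plain Python:

```python
# scriptingudf.py -- hypothetical reconstruction of the UDF from this issue.
try:
    outputSchema              # injected by Pig's Jython engine on register
except NameError:
    def outputSchema(schema): # no-op fallback so the file runs standalone
        def wrap(func):
            return func
        return wrap

@outputSchema("sq:double")
def square(x):
    # Pig passes null fields as None; propagate them rather than crash.
    if x is None:
        return None
    return x * x
```

Note the bug is independent of the UDF body: the same file works when registered with a relative path and fails with an absolute one.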
[jira] Commented: (PIG-1637) Combiner not use because optimizor inserts a foreach between group and algebric function
[ https://issues.apache.org/jira/browse/PIG-1637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915880#action_12915880 ] Daniel Dai commented on PIG-1637: - test-patch result for PIG-1637-2.patch: [exec] +1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 3 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings. Combiner not use because optimizor inserts a foreach between group and algebric function Key: PIG-1637 URL: https://issues.apache.org/jira/browse/PIG-1637 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Daniel Dai Assignee: Daniel Dai Fix For: 0.8.0 Attachments: PIG-1637-1.patch, PIG-1637-2.patch The following script does not use the combiner after the new optimization change. {code} A = load ':INPATH:/pigmix/page_views' using org.apache.pig.test.udf.storefunc.PigPerformanceLoader() as (user, action, timespent, query_term, ip_addr, timestamp, estimated_revenue, page_info, page_links); B = foreach A generate user, (int)timespent as timespent, (double)estimated_revenue as estimated_revenue; C = group B all; D = foreach C generate SUM(B.timespent), AVG(B.estimated_revenue); store D into ':OUTPATH:'; {code} This is because, after the group, the optimizer detects that the group key is not used afterward, so it adds a foreach statement after C. 
This is what it looks like after optimization: {code} A = load ':INPATH:/pigmix/page_views' using org.apache.pig.test.udf.storefunc.PigPerformanceLoader() as (user, action, timespent, query_term, ip_addr, timestamp, estimated_revenue, page_info, page_links); B = foreach A generate user, (int)timespent as timespent, (double)estimated_revenue as estimated_revenue; C = group B all; C1 = foreach C generate B; D = foreach C1 generate SUM(B.timespent), AVG(B.estimated_revenue); store D into ':OUTPATH:'; {code} That cancels the combiner optimization for D. The way to solve the issue is to merge the C1 we inserted with D. Currently, we do not merge these two foreach statements. The reason is that one output of the first foreach (B) is referred to twice in D, and the current rule assumes that after the merge we would need to calculate B twice in D. Actually, C1 only does projection, with no calculation of B. Merging C1 and D will not result in calculating B twice. So C1 and D should be merged. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
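The argument that merging is safe can be made concrete: a pure projection evaluated once can feed both aggregate expressions, so nothing is computed twice even though D refers to B twice. A toy Python model (hypothetical names, mirroring the relations in the script above):

```python
# Toy model of the merge argument: C1 is a pure projection of B, and D
# references C1's output twice (SUM and AVG). Evaluating the projection
# once and reusing the result shows no double computation occurs.

calls = {"count": 0}

def project_B(group_record):
    # C1 = foreach C generate B;  -- projection only, no computation
    calls["count"] += 1
    return group_record["B"]

def d_over(bag):
    # D = foreach C1 generate SUM(B.timespent), AVG(B.estimated_revenue);
    total = sum(t["timespent"] for t in bag)
    avg = sum(t["estimated_revenue"] for t in bag) / len(bag)
    return total, avg

record = {"B": [{"timespent": 2, "estimated_revenue": 1.0},
                {"timespent": 4, "estimated_revenue": 3.0}]}

# Merged form: evaluate the projection once, then use the result twice.
bag = project_B(record)
result = d_over(bag)
```

The optimizer rule being fixed assumed the double reference would force B to be evaluated twice; the model shows that for a projection-only foreach this never happens.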
[jira] Commented: (PIG-1579) Intermittent unit test failure for TestScriptUDF.testPythonScriptUDFNullInputOutput
[ https://issues.apache.org/jira/browse/PIG-1579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915941#action_12915941 ] Daniel Dai commented on PIG-1579: - Rolled back the change and ran the test many times; all tests pass. It seems some change between r990721 and now (r1002348) fixed this issue. Will roll back the change and close the Jira. Intermittent unit test failure for TestScriptUDF.testPythonScriptUDFNullInputOutput --- Key: PIG-1579 URL: https://issues.apache.org/jira/browse/PIG-1579 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Daniel Dai Assignee: Daniel Dai Fix For: 0.8.0 Attachments: PIG-1579-1.patch Error message: org.apache.pig.backend.executionengine.ExecException: ERROR 0: Error executing function: Traceback (most recent call last): File "<iostream>", line 5, in multStr TypeError: can't multiply sequence by non-int of type 'NoneType' at org.apache.pig.scripting.jython.JythonFunction.exec(JythonFunction.java:107) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:229) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:295) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:346) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:291) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:236) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:231) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53) at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305) at 
org.apache.hadoop.mapred.Child.main(Child.java:170) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1637) Combiner not use because optimizor inserts a foreach between group and algebric function
[ https://issues.apache.org/jira/browse/PIG-1637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915950#action_12915950 ] Daniel Dai commented on PIG-1637: - Yes, it could be improved as per Xuefu's suggestion. Anyway, the current patch solves the combiner-not-used issue, so I will commit this part first and open another Jira to improve it. Also, MergeForEach is a good example for exercising the cloning framework [PIG-1587|https://issues.apache.org/jira/browse/PIG-1587], so it is better to improve it once PIG-1587 is available. Combiner not use because optimizor inserts a foreach between group and algebric function Key: PIG-1637 URL: https://issues.apache.org/jira/browse/PIG-1637 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Daniel Dai Assignee: Daniel Dai Fix For: 0.8.0 Attachments: PIG-1637-1.patch, PIG-1637-2.patch The following script does not use the combiner after the new optimization change. {code} A = load ':INPATH:/pigmix/page_views' using org.apache.pig.test.udf.storefunc.PigPerformanceLoader() as (user, action, timespent, query_term, ip_addr, timestamp, estimated_revenue, page_info, page_links); B = foreach A generate user, (int)timespent as timespent, (double)estimated_revenue as estimated_revenue; C = group B all; D = foreach C generate SUM(B.timespent), AVG(B.estimated_revenue); store D into ':OUTPATH:'; {code} This is because, after the group, the optimizer detects that the group key is not used afterward, so it adds a foreach statement after C. 
This is what it looks like after optimization: {code} A = load ':INPATH:/pigmix/page_views' using org.apache.pig.test.udf.storefunc.PigPerformanceLoader() as (user, action, timespent, query_term, ip_addr, timestamp, estimated_revenue, page_info, page_links); B = foreach A generate user, (int)timespent as timespent, (double)estimated_revenue as estimated_revenue; C = group B all; C1 = foreach C generate B; D = foreach C1 generate SUM(B.timespent), AVG(B.estimated_revenue); store D into ':OUTPATH:'; {code} That cancels the combiner optimization for D. The way to solve the issue is to merge the C1 we inserted with D. Currently, we do not merge these two foreach statements. The reason is that one output of the first foreach (B) is referred to twice in D, and the current rule assumes that after the merge we would need to calculate B twice in D. Actually, C1 only does projection, with no calculation of B. Merging C1 and D will not result in calculating B twice. So C1 and D should be merged. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1651) PIG class loading mishandled
[ https://issues.apache.org/jira/browse/PIG-1651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915959#action_12915959 ] Daniel Dai commented on PIG-1651: - +1 PIG class loading mishandled Key: PIG-1651 URL: https://issues.apache.org/jira/browse/PIG-1651 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Yan Zhou Assignee: Richard Ding Fix For: 0.8.0 Attachments: PIG-1651.patch If zebra.jar is only registered in a Pig script but not in the CLASSPATH, a query using zebra fails, since there appear to be multiple copies of the class loaded into the JVM, causing a static variable set previously not to be seen after an instance of the class is created through reflection. (After the zebra.jar is specified in CLASSPATH, it works fine.) The exception stack is as follows: Backend error message during job submission --- org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Unable to create input splits for: hdfs://hostname/pathto/zebra_dir :: null at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:284) at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:907) at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:801) at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:752) at org.apache.hadoop.mapred.jobcontrol.Job.submit(Job.java:378) at org.apache.hadoop.mapred.jobcontrol.JobControl.startReadyJobs(JobControl.java:247) at org.apache.hadoop.mapred.jobcontrol.JobControl.run(JobControl.java:279) at java.lang.Thread.run(Thread.java:619) Caused by: java.lang.NullPointerException at org.apache.hadoop.zebra.io.ColumnGroup.getNonDataFilePrefix(ColumnGroup.java:123) at org.apache.hadoop.zebra.io.ColumnGroup$CGPathFilter.accept(ColumnGroup.java:2413) at org.apache.hadoop.zebra.mapreduce.TableInputFormat$DummyFileInputFormat$MultiPathFilter.accept(TableInputFormat.java:718) at 
org.apache.hadoop.fs.FileSystem$GlobFilter.accept(FileSystem.java:1084) at org.apache.hadoop.fs.FileSystem.globStatusInternal(FileSystem.java:919) at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:866) at org.apache.hadoop.zebra.mapreduce.TableInputFormat$DummyFileInputFormat.listStatus(TableInputFormat.java:780) at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:246) at org.apache.hadoop.zebra.mapreduce.TableInputFormat.getRowSplits(TableInputFormat.java:863) at org.apache.hadoop.zebra.mapreduce.TableInputFormat.getSplits(TableInputFormat.java:1017) at org.apache.hadoop.zebra.mapreduce.TableInputFormat.getSplits(TableInputFormat.java:961) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:269) ... 7 more -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (PIG-1637) Combiner not use because optimizor inserts a foreach between group and algebric function
[ https://issues.apache.org/jira/browse/PIG-1637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai resolved PIG-1637. - Hadoop Flags: [Reviewed] Resolution: Fixed All tests pass except for TestSortedTableUnion / TestSortedTableUnionMergeJoin for zebra, which already fail and will be addressed by [PIG-1649|https://issues.apache.org/jira/browse/PIG-1649]. Patch committed to both trunk and 0.8 branch. Combiner not use because optimizor inserts a foreach between group and algebric function Key: PIG-1637 URL: https://issues.apache.org/jira/browse/PIG-1637 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Daniel Dai Assignee: Daniel Dai Fix For: 0.8.0 Attachments: PIG-1637-1.patch, PIG-1637-2.patch The following script does not use the combiner after the new optimization change. {code} A = load ':INPATH:/pigmix/page_views' using org.apache.pig.test.udf.storefunc.PigPerformanceLoader() as (user, action, timespent, query_term, ip_addr, timestamp, estimated_revenue, page_info, page_links); B = foreach A generate user, (int)timespent as timespent, (double)estimated_revenue as estimated_revenue; C = group B all; D = foreach C generate SUM(B.timespent), AVG(B.estimated_revenue); store D into ':OUTPATH:'; {code} This is because, after the group, the optimizer detects that the group key is not used afterward, so it adds a foreach statement after C. This is what it looks like after optimization: {code} A = load ':INPATH:/pigmix/page_views' using org.apache.pig.test.udf.storefunc.PigPerformanceLoader() as (user, action, timespent, query_term, ip_addr, timestamp, estimated_revenue, page_info, page_links); B = foreach A generate user, (int)timespent as timespent, (double)estimated_revenue as estimated_revenue; C = group B all; C1 = foreach C generate B; D = foreach C1 generate SUM(B.timespent), AVG(B.estimated_revenue); store D into ':OUTPATH:'; {code} That cancels the combiner optimization for D. The way to solve the issue is to merge the C1 we inserted with D. 
Currently, we do not merge these two foreach statements. The reason is that one output of the first foreach (B) is referred to twice in D, and the current rule assumes that after the merge we would need to calculate B twice in D. Actually, C1 only does projection, with no calculation of B. Merging C1 and D will not result in calculating B twice. So C1 and D should be merged. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1647) Logical simplifier throws a NPE
[ https://issues.apache.org/jira/browse/PIG-1647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915365#action_12915365 ] Daniel Dai commented on PIG-1647: - +1. Please commit. Logical simplifier throws a NPE --- Key: PIG-1647 URL: https://issues.apache.org/jira/browse/PIG-1647 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Yan Zhou Assignee: Yan Zhou Fix For: 0.8.0 Attachments: PIG-1647.patch, PIG-1647.patch A query like: A = load 'd.txt' as (a:chararray, b:long, c:map[], d:chararray, e:chararray); B = filter A by a == 'v' and b == 117L and c#'p1' == 'h' and c#'p2' == 'to' and ((d is not null and d != '') or (e is not null and e != '')); will cause the logical expression simplifier to throw an NPE. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1637) Combiner not use because optimizor inserts a foreach between group and algebric function
[ https://issues.apache.org/jira/browse/PIG-1637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1637: Attachment: PIG-1637-1.patch Combiner not use because optimizor inserts a foreach between group and algebric function Key: PIG-1637 URL: https://issues.apache.org/jira/browse/PIG-1637 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Daniel Dai Assignee: Daniel Dai Fix For: 0.8.0 Attachments: PIG-1637-1.patch The following script does not use the combiner after the new optimization change. {code} A = load ':INPATH:/pigmix/page_views' using org.apache.pig.test.udf.storefunc.PigPerformanceLoader() as (user, action, timespent, query_term, ip_addr, timestamp, estimated_revenue, page_info, page_links); B = foreach A generate user, (int)timespent as timespent, (double)estimated_revenue as estimated_revenue; C = group B all; D = foreach C generate SUM(B.timespent), AVG(B.estimated_revenue); store D into ':OUTPATH:'; {code} This is because, after the group, the optimizer detects that the group key is not used afterward, so it adds a foreach statement after C. This is what it looks like after optimization: {code} A = load ':INPATH:/pigmix/page_views' using org.apache.pig.test.udf.storefunc.PigPerformanceLoader() as (user, action, timespent, query_term, ip_addr, timestamp, estimated_revenue, page_info, page_links); B = foreach A generate user, (int)timespent as timespent, (double)estimated_revenue as estimated_revenue; C = group B all; C1 = foreach C generate B; D = foreach C1 generate SUM(B.timespent), AVG(B.estimated_revenue); store D into ':OUTPATH:'; {code} That cancels the combiner optimization for D. The way to solve the issue is to merge the C1 we inserted with D. Currently, we do not merge these two foreach statements. The reason is that one output of the first foreach (B) is referred to twice in D, and the current rule assumes that after the merge we would need to calculate B twice in D. Actually, C1 only does projection, with no calculation of B. 
Merging C1 and D will not result in calculating B twice. So C1 and D should be merged. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1643) join fails for a query with input having 'load using pigstorage without schema' + 'foreach'
[ https://issues.apache.org/jira/browse/PIG-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915037#action_12915037 ] Daniel Dai commented on PIG-1643: - [exec] +1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 3 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings. All tests pass. join fails for a query with input having 'load using pigstorage without schema' + 'foreach' --- Key: PIG-1643 URL: https://issues.apache.org/jira/browse/PIG-1643 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Thejas M Nair Assignee: Thejas M Nair Fix For: 0.8.0 Attachments: PIG-1643.1.patch, PIG-1643.2.patch, PIG-1643.3.patch, PIG-1643.4.patch {code} l1 = load 'std.txt'; l2 = load 'std.txt'; f1 = foreach l1 generate $0 as abc, $1 as def; -- j = join f1 by $0, l2 by $0 using 'replicated'; -- j = join l2 by $0, f1 by $0 using 'replicated'; j = join l2 by $0, f1 by $0 ; dump j; {code} the error - {code} 2010-09-22 16:24:48,584 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2044: The type null cannot be collected as a Key type {code} The MR plan from explain - {code} #-- # Map Reduce Plan #-- MapReduce node scope-21 Map Plan Union[tuple] - scope-22 | |---j: Local Rearrange[tuple]{bytearray}(false) - scope-11 | | | | | Project[bytearray][0] - scope-12 | | | |---l2: Load(file:///Users/tejas/pig_obyfail/trunk/std.txt:org.apache.pig.builtin.PigStorage) - scope-0 | |---j: Local Rearrange[tuple]{NULL}(false) - scope-13 | | | Project[NULL][0] - scope-14 | |---f1: New For Each(false,false)[bag] - 
scope-6 | | | Project[bytearray][0] - scope-2 | | | Project[bytearray][1] - scope-4 | |---l1: Load(file:///Users/tejas/pig_obyfail/trunk/std.txt:org.apache.pig.builtin.PigStorage) - scope-1 Reduce Plan j: Store(/tmp/x:org.apache.pig.builtin.PigStorage) - scope-18 | |---POJoinPackage(true,true)[tuple] - scope-23 Global sort: false {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (PIG-1643) join fails for a query with input having 'load using pigstorage without schema' + 'foreach'
[ https://issues.apache.org/jira/browse/PIG-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai resolved PIG-1643. - Release Note: PIG-1643.4.patch committed to both trunk and 0.8 branch. Resolution: Fixed join fails for a query with input having 'load using pigstorage without schema' + 'foreach' --- Key: PIG-1643 URL: https://issues.apache.org/jira/browse/PIG-1643 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Thejas M Nair Assignee: Thejas M Nair Fix For: 0.8.0 Attachments: PIG-1643.1.patch, PIG-1643.2.patch, PIG-1643.3.patch, PIG-1643.4.patch {code} l1 = load 'std.txt'; l2 = load 'std.txt'; f1 = foreach l1 generate $0 as abc, $1 as def; -- j = join f1 by $0, l2 by $0 using 'replicated'; -- j = join l2 by $0, f1 by $0 using 'replicated'; j = join l2 by $0, f1 by $0 ; dump j; {code} the error - {code} 2010-09-22 16:24:48,584 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2044: The type null cannot be collected as a Key type {code} The MR plan from explain - {code} #-- # Map Reduce Plan #-- MapReduce node scope-21 Map Plan Union[tuple] - scope-22 | |---j: Local Rearrange[tuple]{bytearray}(false) - scope-11 | | | | | Project[bytearray][0] - scope-12 | | | |---l2: Load(file:///Users/tejas/pig_obyfail/trunk/std.txt:org.apache.pig.builtin.PigStorage) - scope-0 | |---j: Local Rearrange[tuple]{NULL}(false) - scope-13 | | | Project[NULL][0] - scope-14 | |---f1: New For Each(false,false)[bag] - scope-6 | | | Project[bytearray][0] - scope-2 | | | Project[bytearray][1] - scope-4 | |---l1: Load(file:///Users/tejas/pig_obyfail/trunk/std.txt:org.apache.pig.builtin.PigStorage) - scope-1 Reduce Plan j: Store(/tmp/x:org.apache.pig.builtin.PigStorage) - scope-18 | |---POJoinPackage(true,true)[tuple] - scope-23 Global sort: false {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1644) New logical plan: Plan.connect with position is misused in some places
[ https://issues.apache.org/jira/browse/PIG-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1644: Attachment: PIG-1644-4.patch PIG-1644-4.patch fixes findbugs warnings and additional unit test failures. New logical plan: Plan.connect with position is misused in some places -- Key: PIG-1644 URL: https://issues.apache.org/jira/browse/PIG-1644 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.8.0 Reporter: Daniel Dai Assignee: Daniel Dai Fix For: 0.8.0 Attachments: PIG-1644-1.patch, PIG-1644-2.patch, PIG-1644-3.patch, PIG-1644-4.patch When we replace/remove/insert a node, we will use disconnect/connect methods of OperatorPlan. When we disconnect an edge, we shall save the position of the edge in the origination and destination, and use this position when connecting to the new predecessor/successor. Some of the patterns are: Insert a new node: {code} Pair<Integer, Integer> pos = plan.disconnect(pred, succ); plan.connect(pred, pos.first, newnode, 0); plan.connect(newnode, 0, succ, pos.second); {code} Remove a node: {code} Pair<Integer, Integer> pos1 = plan.disconnect(pred, nodeToRemove); Pair<Integer, Integer> pos2 = plan.disconnect(nodeToRemove, succ); plan.connect(pred, pos1.first, succ, pos2.second); {code} Replace a node: {code} Pair<Integer, Integer> pos1 = plan.disconnect(pred, nodeToReplace); Pair<Integer, Integer> pos2 = plan.disconnect(nodeToReplace, succ); plan.connect(pred, pos1.first, newNode, pos1.second); plan.connect(newNode, pos2.first, succ, pos2.second); {code} There are a couple of places where we do not follow this pattern, which results in errors. For example, the following script fails: {code} a = load '1.txt' as (a0, a1, a2, a3); b = foreach a generate a0, a1, a2; store b into 'aaa'; c = order b by a2; d = foreach c generate a2; store d into 'bbb'; {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (PIG-1644) New logical plan: Plan.connect with position is misused in some places
[ https://issues.apache.org/jira/browse/PIG-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai resolved PIG-1644. - Hadoop Flags: [Reviewed] Resolution: Fixed
[exec] +1 overall.
[exec] +1 @author. The patch does not contain any @author tags.
[exec] +1 tests included. The patch appears to include 6 new or modified tests.
[exec] +1 javadoc. The javadoc tool did not generate any warning messages.
[exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
[exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
[exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.
All tests pass. Patch committed to both trunk and 0.8 branch.
[jira] Commented: (PIG-1635) Logical simplifier does not simplify away constants under AND and OR; after simplification the ordering of operands of AND and OR may get changed
[ https://issues.apache.org/jira/browse/PIG-1635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12914662#action_12914662 ] Daniel Dai commented on PIG-1635: - +1, patch looks good. Also, can you review all connect/disconnect usage in ExpressionSimplifier, per [PIG-1644|https://issues.apache.org/jira/browse/PIG-1644]? I see a lot of misuse in other rules. Logical simplifier does not simplify away constants under AND and OR; after simplification the ordering of operands of AND and OR may get changed Key: PIG-1635 URL: https://issues.apache.org/jira/browse/PIG-1635 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.8.0 Reporter: Yan Zhou Assignee: Yan Zhou Priority: Minor Fix For: 0.8.0 Attachments: PIG-1635.patch
b = FILTER a by ((f1 > 1) AND (1 == 1)) or b = FILTER a by ((f1 > 1) OR (1 == 0)) should be simplified to b = FILTER a by f1 > 1;
Regarding the ordering change, an example is b = filter a by ((f1 is not null) AND (f2 is not null)); even without any possible simplification, the expression is changed to b = filter a by ((f2 is not null) AND (f1 is not null)); Even though the ordering change in this case, and probably in most other cases, makes no difference, users might care about the ordering for two reasons: stateful UDFs may be used as operands of AND or OR, and the ordering may be intended by the application designer to maximize the chance of short-circuiting the composite boolean evaluation.
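The requested simplifications are the standard boolean identities. A minimal sketch (a hypothetical helper, not the actual ExpressionSimplifier code; expressions are modeled as plain strings) that folds constants while keeping the surviving operand in its original slot, so the left-to-right order the user wrote is preserved:

```java
// Hypothetical sketch of the simplifications PIG-1635 asks for:
// (E AND true) -> E, (E OR false) -> E, plus constant short-circuits.
// The surviving operand keeps its original position, so the
// evaluation order of stateful UDFs is unchanged.
public class BoolFold {
    // A constant is modeled as the literal string "true" or "false";
    // anything else is an opaque sub-expression.
    static boolean isTrue(String e)  { return "true".equals(e); }
    static boolean isFalse(String e) { return "false".equals(e); }

    public static String foldAnd(String lhs, String rhs) {
        if (isFalse(lhs) || isFalse(rhs)) return "false"; // E AND false -> false
        if (isTrue(lhs)) return rhs;                      // true AND E -> E
        if (isTrue(rhs)) return lhs;                      // E AND true -> E
        return "(" + lhs + " AND " + rhs + ")";           // keep original order
    }

    public static String foldOr(String lhs, String rhs) {
        if (isTrue(lhs) || isTrue(rhs)) return "true";    // E OR true -> true
        if (isFalse(lhs)) return rhs;                     // false OR E -> E
        if (isFalse(rhs)) return lhs;                     // E OR false -> E
        return "(" + lhs + " OR " + rhs + ")";            // keep original order
    }

    public static void main(String[] args) {
        System.out.println(foldAnd("f1 > 1", "true"));  // f1 > 1
        System.out.println(foldOr("f1 > 1", "false"));  // f1 > 1
        // no constants: operands stay in the order the user wrote
        System.out.println(foldAnd("f1 is not null", "f2 is not null"));
    }
}
```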
[jira] Updated: (PIG-1644) New logical plan: Plan.connect with position is misused in some places
[ https://issues.apache.org/jira/browse/PIG-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1644: Attachment: PIG-1644-3.patch Found one bug introduced by the refactoring. Attaching PIG-1644-3.patch with the fix and re-running the tests.
[jira] Commented: (PIG-1635) Logical simplifier does not simplify away constants under AND and OR; after simplification the ordering of operands of AND and OR may get changed
[ https://issues.apache.org/jira/browse/PIG-1635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12914675#action_12914675 ] Daniel Dai commented on PIG-1635: - +1 for commit.
[jira] Reopened: (PIG-1643) join fails for a query with input having 'load using pigstorage without schema' + 'foreach'
[ https://issues.apache.org/jira/browse/PIG-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai reopened PIG-1643: - The following script does not produce the right result after the patch:
{code}
a = load '/grid/2/dev/pigqa/in/singlefile/studenttab10k';
b = foreach a generate *;
store b into '/grid/2/dev/pigqa/out/log/hadoopqa.1285338379/Foreach_2.out';
{code}
join fails for a query with input having 'load using pigstorage without schema' + 'foreach' --- Key: PIG-1643 URL: https://issues.apache.org/jira/browse/PIG-1643 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Thejas M Nair Assignee: Thejas M Nair Fix For: 0.8.0 Attachments: PIG-1643.1.patch, PIG-1643.2.patch
{code}
l1 = load 'std.txt';
l2 = load 'std.txt';
f1 = foreach l1 generate $0 as abc, $1 as def;
-- j = join f1 by $0, l2 by $0 using 'replicated';
-- j = join l2 by $0, f1 by $0 using 'replicated';
j = join l2 by $0, f1 by $0;
dump j;
{code}
the error -
{code}
2010-09-22 16:24:48,584 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2044: The type null cannot be collected as a Key type
{code}
The MR plan from explain -
{code}
#--------------------------------------------------
# Map Reduce Plan
#--------------------------------------------------
MapReduce node scope-21
Map Plan
Union[tuple] - scope-22
|
|---j: Local Rearrange[tuple]{bytearray}(false) - scope-11
|   |   |
|   |   Project[bytearray][0] - scope-12
|   |
|   |---l2: Load(file:///Users/tejas/pig_obyfail/trunk/std.txt:org.apache.pig.builtin.PigStorage) - scope-0
|
|---j: Local Rearrange[tuple]{NULL}(false) - scope-13
    |   |
    |   Project[NULL][0] - scope-14
    |
    |---f1: New For Each(false,false)[bag] - scope-6
        |   |
        |   Project[bytearray][0] - scope-2
        |   |
        |   Project[bytearray][1] - scope-4
        |
        |---l1: Load(file:///Users/tejas/pig_obyfail/trunk/std.txt:org.apache.pig.builtin.PigStorage) - scope-1
Reduce Plan
j: Store(/tmp/x:org.apache.pig.builtin.PigStorage) - scope-18
|
|---POJoinPackage(true,true)[tuple] - scope-23
Global sort: false
{code}
[jira] Updated: (PIG-1643) join fails for a query with input having 'load using pigstorage without schema' + 'foreach'
[ https://issues.apache.org/jira/browse/PIG-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1643: Attachment: PIG-1643.2.patch Attaching a fix.
[jira] Updated: (PIG-1639) New logical plan: PushUpFilter should not push before group/cogroup if filter condition contains UDF
[ https://issues.apache.org/jira/browse/PIG-1639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1639: Status: Resolved (was: Patch Available) Hadoop Flags: [Reviewed] Resolution: Fixed Patch committed to both trunk and 0.8 branch. New logical plan: PushUpFilter should not push before group/cogroup if filter condition contains UDF Key: PIG-1639 URL: https://issues.apache.org/jira/browse/PIG-1639 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.8.0 Reporter: Daniel Dai Assignee: Xuefu Zhang Fix For: 0.8.0 Attachments: jira-1639-1.patch The following script fails:
{code}
a = load 'file' AS (f1, f2, f3);
b = group a by f1;
c = filter b by COUNT(a) > 1;
dump c;
{code}
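The guard this fix amounts to can be sketched as follows (hypothetical Expr class and method names, not the real PushUpFilter rule): walk the filter condition, and refuse to push the filter above the group if any UDF appears, since a UDF such as COUNT(a) operates on the grouped bag, which does not exist before the group:

```java
import java.util.*;

// Hypothetical sketch of the PIG-1639 guard: a filter sitting above a
// group/cogroup may only be pushed below it when its condition
// mentions no UDF.
public class PushGuard {
    static class Expr {
        final String op;            // e.g. ">", "COUNT", "f1", "1"
        final boolean isUDF;
        final List<Expr> children;
        Expr(String op, boolean isUDF, Expr... kids) {
            this.op = op;
            this.isUDF = isUDF;
            this.children = Arrays.asList(kids);
        }
    }

    // Recursively scan the expression tree for any UDF node.
    static boolean containsUDF(Expr e) {
        if (e.isUDF) return true;
        for (Expr c : e.children) {
            if (containsUDF(c)) return true;
        }
        return false;
    }

    static boolean canPushAboveGroup(Expr filterCond) {
        return !containsUDF(filterCond);
    }

    public static void main(String[] args) {
        // c = filter b by COUNT(a) > 1  -- must NOT be pushed
        Expr cond = new Expr(">", false,
                new Expr("COUNT", true, new Expr("a", false)),
                new Expr("1", false));
        System.out.println(canPushAboveGroup(cond)); // false
    }
}
```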
[jira] Updated: (PIG-1643) join fails for a query with input having 'load using pigstorage without schema' + 'foreach'
[ https://issues.apache.org/jira/browse/PIG-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1643: Attachment: PIG-1643.3.patch PIG-1643.3.patch is more general than PIG-1643.2.patch; it solves the null-schema issue for all expressions.
[jira] Commented: (PIG-1643) join fails for a query with input having 'load using pigstorage without schema' + 'foreach'
[ https://issues.apache.org/jira/browse/PIG-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12914126#action_12914126 ] Daniel Dai commented on PIG-1643: - +1 if tests pass.
[jira] Commented: (PIG-1644) New logical plan: Plan.connect with position is misused in some places
[ https://issues.apache.org/jira/browse/PIG-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12914147#action_12914147 ] Daniel Dai commented on PIG-1644: - Yes, I think we can do replace/remove/insert. They should be simple and clear enough to use. Here are the new methods to add to OperatorPlan:
{code}
replace(Operator oldOperator, Operator newOperator)
remove(Operator operatorToRemove) // connect all its successors to its predecessor / all its predecessors to its successor
insertBefore(Operator operatorToInsert, Operator pos) // insert operatorToInsert before pos, connecting all of pos's predecessors to operatorToInsert
insertAfter(Operator operatorToInsert, Operator pos) // insert operatorToInsert after pos, connecting operatorToInsert to all of pos's successors
{code}
How does that sound?
[jira] Updated: (PIG-1639) New logical plan: PushUpFilter should not push before group/cogroup if filter condition contains UDF
[ https://issues.apache.org/jira/browse/PIG-1639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1639: Summary: New logical plan: PushUpFilter should not push before group/cogroup if filter condition contains UDF (was: New logical plan: PushUpFilter should not optimize if filter condition contains UDF)
[jira] Commented: (PIG-1639) New logical plan: PushUpFilter should not push before group/cogroup if filter condition contains UDF
[ https://issues.apache.org/jira/browse/PIG-1639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12914154#action_12914154 ] Daniel Dai commented on PIG-1639: - +1 if all tests pass.
[jira] Commented: (PIG-1644) New logical plan: Plan.connect with position is misused in some places
[ https://issues.apache.org/jira/browse/PIG-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12914317#action_12914317 ] Daniel Dai commented on PIG-1644: - After looking into the existing code, it seems insertBetween is a more useful method. So I want to drop insertBefore/insertAfter and add insertBetween:
{code}
insertBetween(Operator pred, Operator operatorToInsert, Operator succ)
{code}
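A minimal sketch of the proposed insertBetween on a toy plan (a hypothetical stand-in with a single-position disconnect for brevity; the real OperatorPlan.disconnect returns a Pair of positions for both endpoints). The point is that insertBetween reuses the saved edge position on the pred side, so pred's other outgoing edges keep their order:

```java
import java.util.*;

// Hypothetical toy model, not the real OperatorPlan API.
public class InsertBetweenSketch {
    static final Map<String, List<String>> succs = new HashMap<>();

    static List<String> edges(String op) {
        return succs.computeIfAbsent(op, k -> new ArrayList<>());
    }

    static void connect(String from, int pos, String to) {
        edges(from).add(pos, to);
    }

    static int disconnect(String from, String to) {
        int pos = edges(from).indexOf(to);
        edges(from).remove(pos);
        return pos;
    }

    // The proposed insertBetween(pred, operatorToInsert, succ):
    // splice operatorToInsert into the pred -> succ edge, reusing the
    // saved position so pred's sibling edges are not reordered.
    static void insertBetween(String pred, String toInsert, String succ) {
        int pos = disconnect(pred, succ);
        connect(pred, pos, toInsert);
        connect(toInsert, 0, succ);
    }

    public static List<String> demo() {
        succs.clear();
        connect("b", 0, "store_aaa"); // b's first output
        connect("b", 1, "c");         // b's second output: the order-by
        insertBetween("b", "split", "c");
        return edges("b");
    }

    public static void main(String[] args) {
        System.out.println(demo()); // [store_aaa, split]
    }
}
```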
[jira] Updated: (PIG-1644) New logical plan: Plan.connect with position is misused in some places
[ https://issues.apache.org/jira/browse/PIG-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1644: Attachment: PIG-1644-2.patch Attaching the patch with the new methods and a refactoring of the existing code.
[jira] Commented: (PIG-1636) Scalar fail if the scalar variable is generated by limit
[ https://issues.apache.org/jira/browse/PIG-1636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12913714#action_12913714 ] Daniel Dai commented on PIG-1636: - test-patch result:
[exec] +1 overall.
[exec] +1 @author. The patch does not contain any @author tags.
[exec] +1 tests included. The patch appears to include 3 new or modified tests.
[exec] +1 javadoc. The javadoc tool did not generate any warning messages.
[exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
[exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
[exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.
All tests pass. Scalar fail if the scalar variable is generated by limit Key: PIG-1636 URL: https://issues.apache.org/jira/browse/PIG-1636 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.8.0 Reporter: Daniel Dai Assignee: Daniel Dai Fix For: 0.8.0 Attachments: PIG-1636-1.patch The following script fails:
{code}
a = load 'studenttab10k' as (name: chararray, age: int, gpa: float);
b = group a all;
c = foreach b generate SUM(a.age) as total;
c1 = limit c 1;
d = foreach a generate name, age/(double)c1.total as d_sum;
store d into '111';
{code}
The problem is that d holds a reference to c1. The optimizer pushes the limit before the foreach; d still references the limit, and we get the wrong schema for the scalar.
[jira] Resolved: (PIG-1636) Scalar fail if the scalar variable is generated by limit
[ https://issues.apache.org/jira/browse/PIG-1636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai resolved PIG-1636. - Hadoop Flags: [Reviewed] Resolution: Fixed Patch committed to both trunk and 0.8 branch.
[jira] Created: (PIG-1644) New logical plan: Plan.connect with position is misused in some places
New logical plan: Plan.connect with position is misused in some places -- Key: PIG-1644 URL: https://issues.apache.org/jira/browse/PIG-1644 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.8.0 Reporter: Daniel Dai Assignee: Daniel Dai Fix For: 0.8.0
[jira] Updated: (PIG-1644) New logical plan: Plan.connect with position is misused in some places
[ https://issues.apache.org/jira/browse/PIG-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1644: Attachment: (was: PIG-1644-1.patch)
[jira] Updated: (PIG-1644) New logical plan: Plan.connect with position is misused in some places
[ https://issues.apache.org/jira/browse/PIG-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1644: Attachment: PIG-1644-1.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1644) New logical plan: Plan.connect with position is misused in some places
[ https://issues.apache.org/jira/browse/PIG-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1644: Attachment: PIG-1644-1.patch
Attaching a patch that addresses all such places in the new logical plan, except for ExpressionSimplifier. There is some work underway for ExpressionSimplifier ([PIG-1635|https://issues.apache.org/jira/browse/PIG-1635]) that includes some of these changes, and I don't want to conflict with that patch. So after PIG-1635, we may also review the connect/disconnect usage of ExpressionSimplifier.
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
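The positional bookkeeping described in this issue can be sketched against a toy plan. Everything below (MiniPlan, the Pair stand-in, the String node names) is illustrative and hypothetical, not Pig's actual OperatorPlan API; it only shows why the positions returned by disconnect must be reused on reconnect.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy stand-ins (NOT Pig's real classes) for position-sensitive plan edges.
class Pair<A, B> {
    final A first;
    final B second;
    Pair(A a, B b) { first = a; second = b; }
}

class MiniPlan {
    private final Map<String, List<String>> succ = new HashMap<>();
    private final Map<String, List<String>> pred = new HashMap<>();

    private List<String> list(Map<String, List<String>> m, String k) {
        return m.computeIfAbsent(k, x -> new ArrayList<>());
    }

    // Insert `to` at fromPos among from's successors, and `from` at
    // toPos among to's predecessors (edges are ordered lists).
    void connect(String from, int fromPos, String to, int toPos) {
        list(succ, from).add(fromPos, to);
        list(pred, to).add(toPos, from);
    }

    // Remove the edge and return (to's slot in from's successors,
    // from's slot in to's predecessors) so callers can reuse them.
    Pair<Integer, Integer> disconnect(String from, String to) {
        int p1 = list(succ, from).indexOf(to);
        int p2 = list(pred, to).indexOf(from);
        list(succ, from).remove(p1);
        list(pred, to).remove(p2);
        return new Pair<>(p1, p2);
    }

    List<String> successors(String n) { return list(succ, n); }
}

class InsertPatternDemo {
    public static void main(String[] args) {
        MiniPlan plan = new MiniPlan();
        plan.connect("pred", 0, "other", 0);
        plan.connect("pred", 1, "succ", 0); // the edge we will split

        // The "insert a new node" pattern from the issue description:
        Pair<Integer, Integer> pos = plan.disconnect("pred", "succ");
        plan.connect("pred", pos.first, "newnode", 0);
        plan.connect("newnode", 0, "succ", pos.second);

        // newnode landed in succ's old slot, after "other", not at slot 0.
        System.out.println(plan.successors("pred")); // prints [other, newnode]
        if (!plan.successors("pred").equals(Arrays.asList("other", "newnode")))
            throw new AssertionError("position not preserved");
    }
}
```

If the saved positions were ignored and the new node were reconnected at position 0, any operator with multiple inputs or outputs (such as the diamond created by the two stores in the failing script) would end up with its edges in the wrong order.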
[jira] Updated: (PIG-1605) Adding soft link to plan to solve input file dependency
[ https://issues.apache.org/jira/browse/PIG-1605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1605: Attachment: PIG-1605-1.patch Adding soft link to plan to solve input file dependency --- Key: PIG-1605 URL: https://issues.apache.org/jira/browse/PIG-1605 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.8.0 Reporter: Daniel Dai Assignee: Daniel Dai Fix For: 0.8.0 Attachments: PIG-1605-1.patch
In the scalar implementation, we need to deal with implicit dependencies. [PIG-1603|https://issues.apache.org/jira/browse/PIG-1603] tries to solve the problem by adding a LOScalar operator. Here is a different approach: we add a soft link to the plan, and the soft link is only visible to the walkers. By doing this, we can make sure we visit the LOStore which generates the scalar first, and then the LOForEach which uses the scalar. All other parts of the logical plan do not know of the existence of the soft link. The benefits are:
1. The logical plan does not need to deal with LOScalar, which makes the logical plan cleaner.
2. Conceptually, a scalar dependency is different. A regular link represents a data flow in the pipeline. In the scalar case, the dependency means an operator depends on a file generated by the other operator; it's a different type of data dependency.
3. Soft links can solve other dependency problems in the future. If we introduce another UDF dependent on a file generated by another operator, we can use this mechanism to solve it.
4. With soft links, we can use scalars coming from different sources in the same statement, which in my mind is not a rare use case. (eg: D = foreach C generate c0/A.total, c1/B.count; )
Currently, there are two cases where we can use a soft link:
1. scalar dependency, where the ReadScalar UDF uses a file generated by a LOStore
2. store-load dependency, where we load a file which is generated by a store in the same script. This happens in the multi-store case. Currently we solve it with a regular link; it is better to use a soft link.
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1636) Scalar fails if the scalar variable is generated by limit
Scalar fails if the scalar variable is generated by limit Key: PIG-1636 URL: https://issues.apache.org/jira/browse/PIG-1636 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.8.0 Reporter: Daniel Dai Assignee: Daniel Dai Fix For: 0.8.0
The following script fails:
{code}
a = load 'studenttab10k' as (name: chararray, age: int, gpa: float);
b = group a all;
c = foreach b generate SUM(a.age) as total;
c1 = limit c 1;
d = foreach a generate name, age/(double)c1.total as d_sum;
store d into '111';
{code}
The problem is that we have a reference to c1 in d. In the optimizer, we push the limit before the foreach; d still references the limit, and we get the wrong schema for the scalar.
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1637) Combiner not used because optimizer inserts a foreach between group and algebraic function
Combiner not used because optimizer inserts a foreach between group and algebraic function Key: PIG-1637 URL: https://issues.apache.org/jira/browse/PIG-1637 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Daniel Dai Assignee: Daniel Dai Fix For: 0.8.0
The following script does not use the combiner after the new optimization change:
{code}
A = load ':INPATH:/pigmix/page_views' using org.apache.pig.test.udf.storefunc.PigPerformanceLoader() as (user, action, timespent, query_term, ip_addr, timestamp, estimated_revenue, page_info, page_links);
B = foreach A generate user, (int)timespent as timespent, (double)estimated_revenue as estimated_revenue;
C = group B all;
D = foreach C generate SUM(B.timespent), AVG(B.estimated_revenue);
store D into ':OUTPATH:';
{code}
This is because, after the group, the optimizer detects that the group key is not used afterward, so it adds a foreach statement after C. This is how it looks after optimization:
{code}
A = load ':INPATH:/pigmix/page_views' using org.apache.pig.test.udf.storefunc.PigPerformanceLoader() as (user, action, timespent, query_term, ip_addr, timestamp, estimated_revenue, page_info, page_links);
B = foreach A generate user, (int)timespent as timespent, (double)estimated_revenue as estimated_revenue;
C = group B all;
C1 = foreach C generate B;
D = foreach C1 generate SUM(B.timespent), AVG(B.estimated_revenue);
store D into ':OUTPATH:';
{code}
That cancels the combiner optimization for D. The way to solve the issue is to merge the inserted C1 and D. Currently, we do not merge these two foreach statements. The reason is that one output of the first foreach (B) is referred to twice in D, and the current rule assumes that after the merge we would need to calculate B twice in D. Actually, C1 is only doing a projection, with no calculation of B, so merging C1 and D will not result in calculating B twice. Therefore C1 and D should be merged.
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1639) New logical plan: PushUpFilter should not optimize if filter condition contains UDF
New logical plan: PushUpFilter should not optimize if filter condition contains UDF --- Key: PIG-1639 URL: https://issues.apache.org/jira/browse/PIG-1639 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.8.0 Reporter: Daniel Dai Assignee: Daniel Dai Fix For: 0.8.0 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1639) New logical plan: PushUpFilter should not optimize if filter condition contains UDF
[ https://issues.apache.org/jira/browse/PIG-1639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1639: Description: The following script fails:
{code}
a = load 'file' AS (f1, f2, f3);
b = group a by f1;
c = filter b by COUNT(a) > 1;
dump c;
{code}
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1598) Pig gobbles up error messages - Part 2
[ https://issues.apache.org/jira/browse/PIG-1598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1598: Status: Resolved (was: Patch Available) Hadoop Flags: [Reviewed] Resolution: Fixed Patch looks good. Committed to both trunk and 0.8 branch. Pig gobbles up error messages - Part 2 -- Key: PIG-1598 URL: https://issues.apache.org/jira/browse/PIG-1598 Project: Pig Issue Type: Improvement Reporter: Ashutosh Chauhan Assignee: niraj rai Fix For: 0.8.0 Attachments: PIG-1598_0.patch Another case of PIG-1531 . -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1605) Adding soft link to plan to solve input file dependency
[ https://issues.apache.org/jira/browse/PIG-1605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1605: Attachment: PIG-1605-2.patch
PIG-1605-2.patch fixes the findbugs warnings. test-patch result:
[exec] -1 overall.
[exec] +1 @author. The patch does not contain any @author tags.
[exec] +1 tests included. The patch appears to include 6 new or modified tests.
[exec] +1 javadoc. The javadoc tool did not generate any warning messages.
[exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
[exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
[exec] -1 release audit. The applied patch generated 455 release audit warnings (more than the trunk's current 453 warnings).
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (PIG-1605) Adding soft link to plan to solve input file dependency
[ https://issues.apache.org/jira/browse/PIG-1605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai resolved PIG-1605. - Hadoop Flags: [Reviewed] Resolution: Fixed Release audit warning is due to jdiff. No new file added. Patch committed to both trunk and 0.8 branch.
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1608) pig should always include pig-default.properties and pig.properties in the pig.jar
[ https://issues.apache.org/jira/browse/PIG-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12909821#action_12909821 ] Daniel Dai commented on PIG-1608: - Two comments: 1. target buildJar-withouthadoop should also include this change 2. format comment: use spaces instead of tabs. The jar and package targets look good. pig should always include pig-default.properties and pig.properties in the pig.jar -- Key: PIG-1608 URL: https://issues.apache.org/jira/browse/PIG-1608 Project: Pig Issue Type: Bug Reporter: niraj rai Assignee: niraj rai Attachments: PIG-1608_0.patch pig should always include pig-default.properties and pig.properties as a part of the pig.jar file -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1608) pig should always include pig-default.properties and pig.properties in the pig.jar
[ https://issues.apache.org/jira/browse/PIG-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1608: Fix Version/s: 0.9.0 Affects Version/s: 0.8.0 pig should always include pig-default.properties and pig.properties in the pig.jar -- Key: PIG-1608 URL: https://issues.apache.org/jira/browse/PIG-1608 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: niraj rai Assignee: niraj rai Fix For: 0.9.0 Attachments: PIG-1608_0.patch, PIG-1608_1.patch pig should always include pig-default.properties and pig.properties as a part of the pig.jar file -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1608) pig should always include pig-default.properties and pig.properties in the pig.jar
[ https://issues.apache.org/jira/browse/PIG-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1608: Status: Resolved (was: Patch Available) Hadoop Flags: [Reviewed] Resolution: Fixed Patch committed to trunk. Thanks Niraj! pig should always include pig-default.properties and pig.properties in the pig.jar -- Key: PIG-1608 URL: https://issues.apache.org/jira/browse/PIG-1608 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: niraj rai Assignee: niraj rai Fix For: 0.9.0 Attachments: PIG-1608_0.patch, PIG-1608_1.patch pig should always include pig-default.properties and pig.properties as a part of the pig.jar file -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1614) javacc.jar pulled twice from maven repository
javacc.jar pulled twice from maven repository - Key: PIG-1614 URL: https://issues.apache.org/jira/browse/PIG-1614 Project: Pig Issue Type: Bug Components: build Reporter: Daniel Dai Priority: Trivial ant pulls javacc.jar twice from maven. One is javacc.jar, and the other is javacc-4.2.jar. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1608) pig should always include pig-default.properties and pig.properties in the pig.jar
[ https://issues.apache.org/jira/browse/PIG-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12908886#action_12908886 ] Daniel Dai commented on PIG-1608: - pig should include pig-default.properties into pig.jar, but not pig.properties, just like hadoop does for core-default.xml, core-site.xml. pig should always include pig-default.properties and pig.properties in the pig.jar -- Key: PIG-1608 URL: https://issues.apache.org/jira/browse/PIG-1608 Project: Pig Issue Type: Bug Reporter: niraj rai Assignee: niraj rai pig should always include pig-default.properties and pig.properties as a part of the pig.jar file -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1605) Adding soft link to plan to solve input file dependency
[ https://issues.apache.org/jira/browse/PIG-1605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1605: Description: In scalar implementation, we need to deal with implicit dependencies. [PIG-1603|https://issues.apache.org/jira/browse/PIG-1603] is trying to solve the problem by adding a LOScalar operator. Here is a different approach. We will add a soft link to the plan, and soft link is only visible to the walkers. By doing this, we can make sure we visit LOStore which generate scalar first, and then LOForEach which use the scalar. All other part of the logical plan does not know the existence of the soft link. The benefits are: 1. Logical plan do not need to deal with LOScalar, this makes logical plan cleaner 2. Conceptually scalar dependency is different. Regular link represent a data flow in pipeline. In scalar, the dependency means an operator depends on a file generated by the other operator. It's different type of data dependency. 3. Soft link can solve other dependency problem in the future. If we introduce another UDF dependent on a file generated by another operator, we can use this mechanism to solve it. 4. With soft link, we can use scalar come from different sources in the same statement, which in my mind is not a rare use case. (eg: D = foreach C generate c0/A.total, c1/B.count;) Currently, there are two cases we can use soft link: 1. scalar dependency, where ReadScalar UDF will use a file generate by a LOStore 2. store-load dependency, where we will load a file which is generated by a store in the same script. This happens in multi-store case. Currently we solve it by regular link. It is better to use a soft link. was: In scalar implementation, we need to deal with implicit dependencies. [PIG-1603|https://issues.apache.org/jira/browse/PIG-1603] is trying to solve the problem by adding a LOScalar operator. Here is a different approach. 
We will add a soft link to the plan, and soft link is only visible to the walkers. By doing this, we can make sure we visit LOStore which generate scalar first, and then LOForEach which use the scalar. All other part of the logical plan does not know the existence of the soft link. The benefits are: 1. Logical plan do not need to deal with LOScalar, this makes logical plan cleaner 2. Conceptually scalar dependency is different. Regular link represent a data flow in pipeline. In scalar, the dependency means an operator depends on a file generated by the other operator. It's different type of data dependency. 3. Soft link can solve other dependency problem in the future. If we introduce another UDF dependent on a file generated by another operator, we can use this mechanism to solve it. Currently, there are two cases we can use soft link: 1. scalar dependency, where ReadScalar UDF will use a file generate by a LOStore 2. store-load dependency, where we will load a file which is generated by a store in the same script. This happens in multi-store case. Currently we solve it by regular link. It is better to use a soft link. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1605) Adding soft link to plan to solve input file dependency
[ https://issues.apache.org/jira/browse/PIG-1605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12909007#action_12909007 ] Daniel Dai commented on PIG-1605: - Changes are reasonably small. Here is a summary:
1. Add the following methods to the plan (both old and new):
{code}
public void createSoftLink(E from, E to)
public List<E> getSoftLinkPredecessors(E op)
public List<E> getSoftLinkSuccessors(E op)
{code}
2. All walkers need to change. When a walker gets predecessors/successors, it needs to get both soft and regular link predecessors. The changes are straightforward, eg from:
{code}
Collection<O> newSuccessors = mPlan.getSuccessors(suc);
{code}
to:
{code}
Collection<O> newSuccessors = mPlan.getSuccessors(suc);
newSuccessors.addAll(mPlan.getSoftLinkSuccessors(suc));
{code}
3. Change plan utility functions, such as replace, replaceAndAddSucessors, replaceAndAddPredecessors, etc. In the new logical plan there is no change, since we only have minimal utility functions. In the old logical plan there should be some change to make those utility functions aware of soft links, but if we decide not to support the old logical plan going forward, no change is needed; we only need to note that those utility functions do not deal with soft links.
4. Change scalar to use soft links. This includes creating the soft link and maintaining it when doing transforms (migrating to the new plan, translating to the physical plan).
5. Change store-load to use soft links. This is an optional step. Currently we use a regular link; conceptually we should use a soft link. It is OK if we don't do this for now.
Also note that in most cases there is no soft link, and the plan will behave just like before, so this change should be safe enough.
Adding soft link to plan to solve input file dependency --- Key: PIG-1605 URL: https://issues.apache.org/jira/browse/PIG-1605 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.8.0 Reporter: Daniel Dai Assignee: Daniel Dai Fix For: 0.8.0 In the scalar implementation, we need to deal with implicit dependencies. [PIG-1603|https://issues.apache.org/jira/browse/PIG-1603] tries to solve the problem by adding a LOScalar operator. Here is a different approach. We will add a soft link to the plan; the soft link is only visible to the walkers. By doing this, we can make sure we visit the LOStore which generates the scalar first, and then the LOForEach which uses the scalar. All other parts of the logical plan do not know of the existence of the soft link. The benefits are: 1. The logical plan does not need to deal with LOScalar, which makes the logical plan cleaner. 2. Conceptually, a scalar dependency is different: a regular link represents a data flow in the pipeline, while for a scalar the dependency means an operator depends on a file generated by another operator. It is a different type of data dependency. 3. Soft links can solve other dependency problems in the future. If we introduce another UDF that depends on a file generated by another operator, we can use this mechanism to solve it. 4. With soft links, we can use scalars coming from different sources in the same statement, which in my mind is not a rare use case (eg: D = foreach C generate c0/A.total, c1/B.count;). Currently, there are two cases where we can use a soft link: 1. scalar dependency, where the ReadScalars UDF uses a file generated by an LOStore; 2. store-load dependency, where we load a file that is generated by a store in the same script. This happens in the multi-store case. Currently we solve it with a regular link; it is better to use a soft link. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
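The mechanism in the comment above can be sketched in plain Java. This is a minimal, self-contained model (assumed names; not Pig's actual OperatorPlan classes): regular edges carry pipeline data flow, soft edges only record file dependencies, and a walker merges both kinds of successors when ordering its visits, exactly as in the `getSuccessors` + `addAll(getSoftLinkSuccessors(...))` change shown above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy plan with separate regular and soft adjacency maps.
// Only walkers ever look at the soft edges.
class SoftLinkPlan<E> {
    private final Map<E, List<E>> successors = new HashMap<>();
    private final Map<E, List<E>> softSuccessors = new HashMap<>();

    // Regular link: data flows through the pipeline.
    public void connect(E from, E to) {
        successors.computeIfAbsent(from, k -> new ArrayList<>()).add(to);
    }

    // Soft link: "to" depends on a file produced by "from".
    public void createSoftLink(E from, E to) {
        softSuccessors.computeIfAbsent(from, k -> new ArrayList<>()).add(to);
    }

    public List<E> getSuccessors(E op) {
        return successors.getOrDefault(op, new ArrayList<>());
    }

    public List<E> getSoftLinkSuccessors(E op) {
        return softSuccessors.getOrDefault(op, new ArrayList<>());
    }

    // What a walker does: see both kinds of successors, so an LOStore
    // producing a scalar is visited before the LOForEach consuming it.
    public List<E> getAllSuccessors(E op) {
        List<E> all = new ArrayList<>(getSuccessors(op));
        all.addAll(getSoftLinkSuccessors(op));
        return all;
    }
}
```

The rest of the plan code only calls `getSuccessors`, so it never observes the soft edge; only walker ordering changes.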
[jira] Commented: (PIG-1604) 'relation as scalar' does not work with complex types
[ https://issues.apache.org/jira/browse/PIG-1604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12908096#action_12908096 ] Daniel Dai commented on PIG-1604: - +1, patch looks good. 'relation as scalar' does not work with complex types -- Key: PIG-1604 URL: https://issues.apache.org/jira/browse/PIG-1604 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Thejas M Nair Assignee: Thejas M Nair Fix For: 0.8.0 Attachments: PIG-1604.1.patch A statement such as sclr = limit b 1; d = foreach a generate name, age/(double)sclr.mapcol#'it' as some_sum; results in the following parse error: ERROR 1000: Error during parsing. Non-atomic field expected but found atomic field -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1605) Adding soft link to plan to solve input file dependency
Adding soft link to plan to solve input file dependency --- Key: PIG-1605 URL: https://issues.apache.org/jira/browse/PIG-1605 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.8.0 Reporter: Daniel Dai Assignee: Daniel Dai Fix For: 0.8.0 In the scalar implementation, we need to deal with implicit dependencies. [PIG-1603|https://issues.apache.org/jira/browse/PIG-1603] tries to solve the problem by adding a LOScalar operator. Here is a different approach. We will add a soft link to the plan; the soft link is only visible to the walkers. All other parts of the logical plan do not know of the existence of the soft link. The benefits are: 1. The logical plan does not need to deal with LOScalar, which makes the logical plan cleaner. 2. Conceptually, a scalar dependency is different: a regular link represents a data flow in the pipeline, while for a scalar the dependency means an operator depends on a file generated by another operator. It is a different type of data dependency. 3. Soft links can solve other dependency problems in the future. If we introduce another UDF that depends on a file generated by another operator, we can use this mechanism to solve it. Currently, there are two cases where we can use a soft link: 1. scalar dependency, where the ReadScalars UDF uses a file generated by an LOStore; 2. store-load dependency, where we load a file that is generated by a store in the same script. This happens in the multi-store case. Currently we solve it with a regular link; it is better to use a soft link. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1605) Adding soft link to plan to solve input file dependency
[ https://issues.apache.org/jira/browse/PIG-1605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1605: Description: In the scalar implementation, we need to deal with implicit dependencies. [PIG-1603|https://issues.apache.org/jira/browse/PIG-1603] tries to solve the problem by adding a LOScalar operator. Here is a different approach. We will add a soft link to the plan; the soft link is only visible to the walkers. By doing this, we can make sure we visit the LOStore which generates the scalar first, and then the LOForEach which uses the scalar. All other parts of the logical plan do not know of the existence of the soft link. The benefits are: 1. The logical plan does not need to deal with LOScalar, which makes the logical plan cleaner. 2. Conceptually, a scalar dependency is different: a regular link represents a data flow in the pipeline, while for a scalar the dependency means an operator depends on a file generated by another operator. It is a different type of data dependency. 3. Soft links can solve other dependency problems in the future. If we introduce another UDF that depends on a file generated by another operator, we can use this mechanism to solve it. Currently, there are two cases where we can use a soft link: 1. scalar dependency, where the ReadScalars UDF uses a file generated by an LOStore; 2. store-load dependency, where we load a file that is generated by a store in the same script. This happens in the multi-store case. Currently we solve it with a regular link; it is better to use a soft link.
was: In the scalar implementation, we need to deal with implicit dependencies. [PIG-1603|https://issues.apache.org/jira/browse/PIG-1603] tries to solve the problem by adding a LOScalar operator. Here is a different approach. We will add a soft link to the plan; the soft link is only visible to the walkers. All other parts of the logical plan do not know of the existence of the soft link. The benefits are: 1. The logical plan does not need to deal with LOScalar, which makes the logical plan cleaner. 2. Conceptually, a scalar dependency is different: a regular link represents a data flow in the pipeline, while for a scalar the dependency means an operator depends on a file generated by another operator. It is a different type of data dependency. 3. Soft links can solve other dependency problems in the future. If we introduce another UDF that depends on a file generated by another operator, we can use this mechanism to solve it. Currently, there are two cases where we can use a soft link: 1. scalar dependency, where the ReadScalars UDF uses a file generated by an LOStore; 2. store-load dependency, where we load a file that is generated by a store in the same script. This happens in the multi-store case. Currently we solve it with a regular link; it is better to use a soft link.
Adding soft link to plan to solve input file dependency --- Key: PIG-1605 URL: https://issues.apache.org/jira/browse/PIG-1605 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.8.0 Reporter: Daniel Dai Assignee: Daniel Dai Fix For: 0.8.0 In the scalar implementation, we need to deal with implicit dependencies. [PIG-1603|https://issues.apache.org/jira/browse/PIG-1603] tries to solve the problem by adding a LOScalar operator. Here is a different approach. We will add a soft link to the plan; the soft link is only visible to the walkers. By doing this, we can make sure we visit the LOStore which generates the scalar first, and then the LOForEach which uses the scalar. All other parts of the logical plan do not know of the existence of the soft link. The benefits are: 1. The logical plan does not need to deal with LOScalar, which makes the logical plan cleaner. 2. Conceptually, a scalar dependency is different: a regular link represents a data flow in the pipeline, while for a scalar the dependency means an operator depends on a file generated by another operator. It is a different type of data dependency. 3. Soft links can solve other dependency problems in the future. If we introduce another UDF that depends on a file generated by another operator, we can use this mechanism to solve it. Currently, there are two cases where we can use a soft link: 1. scalar dependency, where the ReadScalars UDF uses a file generated by an LOStore; 2. store-load dependency, where we load a file that is generated by a store in the same script. This happens in the multi-store case. Currently we solve it with a regular link; it is better to use a soft link. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1322) Logical Optimizer: change outer join into regular join
[ https://issues.apache.org/jira/browse/PIG-1322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1322: Assignee: Xuefu Zhang (was: Daniel Dai) Fix Version/s: 0.9.0 Logical Optimizer: change outer join into regular join -- Key: PIG-1322 URL: https://issues.apache.org/jira/browse/PIG-1322 Project: Pig Issue Type: Sub-task Components: impl Affects Versions: 0.7.0 Reporter: Daniel Dai Assignee: Xuefu Zhang Fix For: 0.9.0 In some cases, we can change an outer join into a regular join. The benefit is that a regular join is easier to optimize in subsequent optimizations. Example: C = join A by a0 LEFT OUTER, B by b0; D = filter C by b0 > 0; => C = join A by a0, B by b0; D = filter C by b0 > 0; Because of this change, the PushUpFilter rule can further push the filter in front of the regular join, which it otherwise could not. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
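Why the rewrite above is sound can be seen with a toy example in plain Java (assumed data and helper names, not Pig code): a LEFT OUTER join followed by a filter that rejects nulls on the right-hand column produces the same rows as a regular (inner) join followed by the same filter, because the filter discards exactly the null-padded rows the outer join added.

```java
import java.util.ArrayList;
import java.util.List;

class OuterJoinRewrite {
    // Left-outer equi-join: pairs (a, b) with a == b; misses are null-padded
    // on the right, mimicking LEFT OUTER semantics.
    static List<List<Integer>> leftOuterJoin(int[] as, int[] bs) {
        List<List<Integer>> out = new ArrayList<>();
        for (int a : as) {
            boolean matched = false;
            for (int b : bs) {
                if (a == b) { out.add(List.of(a, b)); matched = true; }
            }
            if (!matched) {
                List<Integer> row = new ArrayList<>();
                row.add(a);
                row.add(null); // the outer join pads the right side with null
                out.add(row);
            }
        }
        return out;
    }

    // Regular (inner) equi-join: matching pairs only.
    static List<List<Integer>> innerJoin(int[] as, int[] bs) {
        List<List<Integer>> out = new ArrayList<>();
        for (int a : as)
            for (int b : bs)
                if (a == b) out.add(List.of(a, b));
        return out;
    }

    // "filter C by b0 > 0" -- a null b0 never passes, as in Pig.
    static List<List<Integer>> filterBPositive(List<List<Integer>> rows) {
        List<List<Integer>> out = new ArrayList<>();
        for (List<Integer> r : rows)
            if (r.get(1) != null && r.get(1) > 0) out.add(r);
        return out;
    }
}
```

Because both pipelines end in the same relation, the optimizer is free to replace the outer join with the regular join, after which PushUpFilter can move the filter below the join.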
[jira] Updated: (PIG-1437) [Optimization] Rewrite GroupBy-Foreach-flatten(group) to Distinct
[ https://issues.apache.org/jira/browse/PIG-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1437: Assignee: Xuefu Zhang Fix Version/s: 0.9.0 [Optimization] Rewrite GroupBy-Foreach-flatten(group) to Distinct - Key: PIG-1437 URL: https://issues.apache.org/jira/browse/PIG-1437 Project: Pig Issue Type: Sub-task Components: impl Affects Versions: 0.7.0 Reporter: Ashutosh Chauhan Assignee: Xuefu Zhang Priority: Minor Fix For: 0.9.0 It's possible to rewrite queries like this {code} A = load 'data' as (name,age); B = group A by (name,age); C = foreach B generate group.name, group.age; dump C; {code} or {code} A = load 'data' as (name,age); B = group A by (name,age); C = foreach B generate flatten(group); dump C; {code} to {code} A = load 'data' as (name,age); B = distinct A; dump B; {code} This can only be done if no columns within the bags are referenced subsequently in the script. Since in the Pig-Hadoop world DISTINCT is executed more efficiently than group-by, this would be a huge win. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
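The equivalence this rewrite relies on can be illustrated with plain Java streams (a toy model, not Pig): grouping records by the whole record and then emitting only the group key yields exactly the distinct records, provided nothing later reaches into the grouped bags.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.stream.Collectors;

class DistinctRewrite {
    // Models: B = group A by (name,age); C = foreach B generate flatten(group);
    // LinkedHashMap keeps first-encounter order, like distinct() below.
    static List<String> viaGroupBy(List<String> rows) {
        return new ArrayList<>(rows.stream()
                .collect(Collectors.groupingBy(r -> r,
                        LinkedHashMap::new, Collectors.toList()))
                .keySet());
    }

    // Models: B = distinct A;
    static List<String> viaDistinct(List<String> rows) {
        return rows.stream().distinct().collect(Collectors.toList());
    }
}
```

The group-by version materializes a bag per key that is immediately thrown away, which is the wasted work the proposed rule would eliminate.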
[jira] Resolved: (PIG-1601) Make scalar work for secure hadoop
[ https://issues.apache.org/jira/browse/PIG-1601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai resolved PIG-1601. - Hadoop Flags: [Reviewed] Resolution: Fixed Patch committed to both trunk and 0.8 branch. Make scalar work for secure hadoop -- Key: PIG-1601 URL: https://issues.apache.org/jira/browse/PIG-1601 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.8.0 Reporter: Daniel Dai Assignee: Daniel Dai Fix For: 0.8.0 Attachments: PIG-1601-1.patch Error message: open file 'hdfs://gsbl90890.blue.ygrid.yahoo.com/tmp/temp851711738/tmp727366271'; error = java.io.IOException: Delegation Token can be issued only with kerberos or web authentication at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:4975) at org.apache.hadoop.hdfs.server.namenode.NameNode.getDelegationToken(NameNode.java:432) at sun.reflect.GeneratedMethodAccessor22.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:523) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1301) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1297) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1062) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1295) at org.apache.pig.impl.builtin.ReadScalars.exec(ReadScalars.java:66) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:229) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:313) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(POCast.java:448) at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(POCast.java:441) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.Divide.getNext(Divide.java:72) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:358) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:291) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:236) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:231) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53) at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:638) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:314) at org.apache.hadoop.mapred.Child$4.run(Child.java:217) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1062) at org.apache.hadoop.mapred.Child.main(Child.java:211) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1595) casting relation to scalar- problem with handling of data from non PigStorage loaders
[ https://issues.apache.org/jira/browse/PIG-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12906932#action_12906932 ] Daniel Dai commented on PIG-1595: - +1 for the test failure fix. casting relation to scalar- problem with handling of data from non PigStorage loaders - Key: PIG-1595 URL: https://issues.apache.org/jira/browse/PIG-1595 Project: Pig Issue Type: Bug Reporter: Thejas M Nair Assignee: Thejas M Nair Fix For: 0.8.0 Attachments: PIG-1595.1.patch, PIG-1595.2.patch If load functions that don't follow the same bytearray format as PigStorage for the other supported datatypes, or load functions that don't implement the LoadCaster interface, are used in 'casting relation to scalar' (PIG-1434), the query can fail or produce incorrect results. The root cause of the problem is that there is a real dependency between the ReadScalars udf that returns the scalar value and the LogicalOperator that acts as its input, but the logical plan does not capture this dependency. So in the SchemaResetter visitor used by the optimizer, the order in which schemas are reset and evaluated does not take this into consideration. If the schema of the input LogicalOperator does not get evaluated before the ReadScalars udf, the result type of the ReadScalars udf becomes bytearray. POUserFunc will convert the input to bytearray using new DataByteArray(inp.toString().getBytes()). But this bytearray encoding of the other supported types might not be the same as that of the LoadFunction associated with the column, and that can result in problems. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1178) LogicalPlan and Optimizer are too complex and hard to work with
[ https://issues.apache.org/jira/browse/PIG-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1178: Attachment: PIG-1178-11.patch PIG-1178-11.patch change the layout of explain, error code and comments, etc. No real functional changes. test-patch result: [exec] +1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 11 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings. LogicalPlan and Optimizer are too complex and hard to work with --- Key: PIG-1178 URL: https://issues.apache.org/jira/browse/PIG-1178 Project: Pig Issue Type: Improvement Reporter: Alan Gates Assignee: Daniel Dai Fix For: 0.8.0 Attachments: expressions-2.patch, expressions.patch, lp.patch, lp.patch, PIG-1178-10.patch, PIG-1178-11.patch, PIG-1178-4.patch, PIG-1178-5.patch, PIG-1178-6.patch, PIG-1178-7.patch, PIG-1178-8.patch, PIG-1178-9.patch, pig_1178.patch, pig_1178.patch, PIG_1178.patch, pig_1178_2.patch, pig_1178_3.2.patch, pig_1178_3.3.patch, pig_1178_3.4.patch, pig_1178_3.patch The current implementation of the logical plan and the logical optimizer in Pig has proven to not be easily extensible. Developer feedback has indicated that adding new rules to the optimizer is quite burdensome. In addition, the logical plan has been an area of numerous bugs, many of which have been difficult to fix. Developers also feel that the logical plan is difficult to understand and maintain. 
The root cause for these issues is that a number of design decisions that were made as part of the 0.2 rewrite of the front end have now proven to be sub-optimal. The heart of this proposal is to revisit a number of those proposals and rebuild the logical plan with a simpler design that will make it much easier to maintain the logical plan as well as extend the logical optimizer. See http://wiki.apache.org/pig/PigLogicalPlanOptimizerRewrite for full details. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1178) LogicalPlan and Optimizer are too complex and hard to work with
[ https://issues.apache.org/jira/browse/PIG-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12907061#action_12907061 ] Daniel Dai commented on PIG-1178: - PIG-1178-11.patch committed to both trunk and 0.8 branch. LogicalPlan and Optimizer are too complex and hard to work with --- Key: PIG-1178 URL: https://issues.apache.org/jira/browse/PIG-1178 Project: Pig Issue Type: Improvement Reporter: Alan Gates Assignee: Daniel Dai Fix For: 0.8.0 Attachments: expressions-2.patch, expressions.patch, lp.patch, lp.patch, PIG-1178-10.patch, PIG-1178-11.patch, PIG-1178-4.patch, PIG-1178-5.patch, PIG-1178-6.patch, PIG-1178-7.patch, PIG-1178-8.patch, PIG-1178-9.patch, pig_1178.patch, pig_1178.patch, PIG_1178.patch, pig_1178_2.patch, pig_1178_3.2.patch, pig_1178_3.3.patch, pig_1178_3.4.patch, pig_1178_3.patch The current implementation of the logical plan and the logical optimizer in Pig has proven to not be easily extensible. Developer feedback has indicated that adding new rules to the optimizer is quite burdensome. In addition, the logical plan has been an area of numerous bugs, many of which have been difficult to fix. Developers also feel that the logical plan is difficult to understand and maintain. The root cause for these issues is that a number of design decisions that were made as part of the 0.2 rewrite of the front end have now proven to be sub-optimal. The heart of this proposal is to revisit a number of those proposals and rebuild the logical plan with a simpler design that will make it much easier to maintain the logical plan as well as extend the logical optimizer. See http://wiki.apache.org/pig/PigLogicalPlanOptimizerRewrite for full details. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1178) LogicalPlan and Optimizer are too complex and hard to work with
[ https://issues.apache.org/jira/browse/PIG-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1178: Status: Resolved (was: Patch Available) Hadoop Flags: [Reviewed] Resolution: Fixed LogicalPlan and Optimizer are too complex and hard to work with --- Key: PIG-1178 URL: https://issues.apache.org/jira/browse/PIG-1178 Project: Pig Issue Type: Improvement Reporter: Alan Gates Assignee: Daniel Dai Fix For: 0.8.0 Attachments: expressions-2.patch, expressions.patch, lp.patch, lp.patch, PIG-1178-10.patch, PIG-1178-11.patch, PIG-1178-4.patch, PIG-1178-5.patch, PIG-1178-6.patch, PIG-1178-7.patch, PIG-1178-8.patch, PIG-1178-9.patch, pig_1178.patch, pig_1178.patch, PIG_1178.patch, pig_1178_2.patch, pig_1178_3.2.patch, pig_1178_3.3.patch, pig_1178_3.4.patch, pig_1178_3.patch The current implementation of the logical plan and the logical optimizer in Pig has proven to not be easily extensible. Developer feedback has indicated that adding new rules to the optimizer is quite burdensome. In addition, the logical plan has been an area of numerous bugs, many of which have been difficult to fix. Developers also feel that the logical plan is difficult to understand and maintain. The root cause for these issues is that a number of design decisions that were made as part of the 0.2 rewrite of the front end have now proven to be sub-optimal. The heart of this proposal is to revisit a number of those proposals and rebuild the logical plan with a simpler design that will make it much easier to maintain the logical plan as well as extend the logical optimizer. See http://wiki.apache.org/pig/PigLogicalPlanOptimizerRewrite for full details. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1178) LogicalPlan and Optimizer are too complex and hard to work with
[ https://issues.apache.org/jira/browse/PIG-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12906592#action_12906592 ] Daniel Dai commented on PIG-1178: - Patch PIG-1178-10.patch committed. LogicalPlan and Optimizer are too complex and hard to work with --- Key: PIG-1178 URL: https://issues.apache.org/jira/browse/PIG-1178 Project: Pig Issue Type: Improvement Reporter: Alan Gates Assignee: Daniel Dai Fix For: 0.8.0 Attachments: expressions-2.patch, expressions.patch, lp.patch, lp.patch, PIG-1178-10.patch, PIG-1178-4.patch, PIG-1178-5.patch, PIG-1178-6.patch, PIG-1178-7.patch, PIG-1178-8.patch, PIG-1178-9.patch, pig_1178.patch, pig_1178.patch, PIG_1178.patch, pig_1178_2.patch, pig_1178_3.2.patch, pig_1178_3.3.patch, pig_1178_3.4.patch, pig_1178_3.patch The current implementation of the logical plan and the logical optimizer in Pig has proven to not be easily extensible. Developer feedback has indicated that adding new rules to the optimizer is quite burdensome. In addition, the logical plan has been an area of numerous bugs, many of which have been difficult to fix. Developers also feel that the logical plan is difficult to understand and maintain. The root cause for these issues is that a number of design decisions that were made as part of the 0.2 rewrite of the front end have now proven to be sub-optimal. The heart of this proposal is to revisit a number of those proposals and rebuild the logical plan with a simpler design that will make it much easier to maintain the logical plan as well as extend the logical optimizer. See http://wiki.apache.org/pig/PigLogicalPlanOptimizerRewrite for full details. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (PIG-1594) NullPointerException in new logical planner
[ https://issues.apache.org/jira/browse/PIG-1594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai resolved PIG-1594. - Resolution: Fixed This issue is fixed by PIG-1178-10.patch. NullPointerException in new logical planner --- Key: PIG-1594 URL: https://issues.apache.org/jira/browse/PIG-1594 Project: Pig Issue Type: Bug Reporter: Andrew Hitchcock Assignee: Daniel Dai Fix For: 0.8.0 I've been testing the trunk version of Pig on Elastic MapReduce against our log processing sample application(1). When I try to run the query it throws a NullPointerException and suggests I disable the new logical plan. Disabling it works and the script succeeds. Here is the query I'm trying to run: {code} register file:/home/hadoop/lib/pig/piggybank.jar DEFINE EXTRACT org.apache.pig.piggybank.evaluation.string.EXTRACT(); RAW_LOGS = LOAD '$INPUT' USING TextLoader as (line:chararray); LOGS_BASE= foreach RAW_LOGS generate FLATTEN(EXTRACT(line, '^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] (.+?) (\\S+) (\\S+) ([^]*) ([^]*)')) as (remoteAddr:chararray, remoteLogname:chararray, user:chararray, time:chararray, request:chararray, status:int, bytes_string:chararray, referrer:chararray, browser:chararray); REFERRER_ONLY = FOREACH LOGS_BASE GENERATE referrer; FILTERED = FILTER REFERRER_ONLY BY referrer matches '.*bing.*' OR referrer matches '.*google.*'; SEARCH_TERMS = FOREACH FILTERED GENERATE FLATTEN(EXTRACT(referrer, '.*[\\?]q=([^]+).*')) as terms:chararray; SEARCH_TERMS_FILTERED = FILTER SEARCH_TERMS BY NOT $0 IS NULL; SEARCH_TERMS_COUNT = FOREACH (GROUP SEARCH_TERMS_FILTERED BY $0) GENERATE $0, COUNT($1) as num; SEARCH_TERMS_COUNT_SORTED = LIMIT(ORDER SEARCH_TERMS_COUNT BY num DESC) 50; STORE SEARCH_TERMS_COUNT_SORTED into '$OUTPUT'; {code} And here is the stack trace that results: {code} ERROR 2042: Error in new logical plan. Try -Dpig.usenewlogicalplan=false. org.apache.pig.backend.executionengine.ExecException: ERROR 2042: Error in new logical plan. 
Try -Dpig.usenewlogicalplan=false. at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.compile(HExecutionEngine.java:285) at org.apache.pig.PigServer.compilePp(PigServer.java:1301) at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1154) at org.apache.pig.PigServer.execute(PigServer.java:1148) at org.apache.pig.PigServer.access$100(PigServer.java:123) at org.apache.pig.PigServer$Graph.execute(PigServer.java:1464) at org.apache.pig.PigServer.executeBatchEx(PigServer.java:350) at org.apache.pig.PigServer.executeBatch(PigServer.java:324) at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:111) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:140) at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:90) at org.apache.pig.Main.run(Main.java:491) at org.apache.pig.Main.main(Main.java:107) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:156) Caused by: java.lang.NullPointerException at org.apache.pig.EvalFunc.getSchemaName(EvalFunc.java:76) at org.apache.pig.piggybank.impl.ErrorCatchingBase.outputSchema(ErrorCatchingBase.java:76) at org.apache.pig.newplan.logical.expression.UserFuncExpression.getFieldSchema(UserFuncExpression.java:111) at org.apache.pig.newplan.logical.optimizer.FieldSchemaResetter.execute(SchemaResetter.java:175) at org.apache.pig.newplan.logical.expression.AllSameExpressionVisitor.visit(AllSameExpressionVisitor.java:143) at org.apache.pig.newplan.logical.expression.UserFuncExpression.accept(UserFuncExpression.java:55) at org.apache.pig.newplan.ReverseDependencyOrderWalker.walk(ReverseDependencyOrderWalker.java:69) 
at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:50) at org.apache.pig.newplan.logical.optimizer.SchemaResetter.visit(SchemaResetter.java:87) at org.apache.pig.newplan.logical.relational.LOGenerate.accept(LOGenerate.java:149) at org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:74) at org.apache.pig.newplan.logical.optimizer.SchemaResetter.visit(SchemaResetter.java:76) at
[jira] Updated: (PIG-1575) Complete the migration of optimization rule PushUpFilter including missing test cases
[ https://issues.apache.org/jira/browse/PIG-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1575: Status: Resolved (was: Patch Available) Hadoop Flags: [Reviewed] Resolution: Fixed Complete the migration of optimization rule PushUpFilter including missing test cases - Key: PIG-1575 URL: https://issues.apache.org/jira/browse/PIG-1575 Project: Pig Issue Type: Bug Reporter: Xuefu Zhang Assignee: Xuefu Zhang Fix For: 0.8.0 Attachments: jira-1575-1.patch, jira-1575-2.patch, jira-1575-3.patch, jira-1575-4.patch, jira-1575-5.patch The PushUpFilter optimization rule under the new logical plan only covers a subset of the optimization scenarios handled by the same rule under the old logical plan. For instance, it only considers a filter after a join, while the old optimization also considers other operators such as CoGroup, Union, Cross, etc. The migration of the rule should be completed. Also, the test cases created for the old PushUpFilter weren't migrated to the new logical plan code base; they should also be migrated. (A few have been migrated in JIRA-1574.) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1575) Complete the migration of optimization rule PushUpFilter including missing test cases
[ https://issues.apache.org/jira/browse/PIG-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1575: Attachment: jira-1575-5.patch Patch looks good. Attaching the final patch. test-patch result: [exec] +1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 6 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings. All tests pass. Patch committed to both trunk and 0.8 branch. Complete the migration of optimization rule PushUpFilter including missing test cases - Key: PIG-1575 URL: https://issues.apache.org/jira/browse/PIG-1575 Project: Pig Issue Type: Bug Reporter: Xuefu Zhang Assignee: Xuefu Zhang Fix For: 0.8.0 Attachments: jira-1575-1.patch, jira-1575-2.patch, jira-1575-3.patch, jira-1575-4.patch, jira-1575-5.patch The PushUpFilter optimization rule under the new logical plan only covers a subset of the optimization scenarios handled by the same rule under the old logical plan. For instance, it only considers a filter after a join, while the old optimization also considers other operators such as CoGroup, Union, Cross, etc. The migration of the rule should be completed. Also, the test cases created for the old PushUpFilter weren't migrated to the new logical plan code base; they should also be migrated. (A few have been migrated in JIRA-1574.) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1548) Optimize scalar to consolidate the part file
[ https://issues.apache.org/jira/browse/PIG-1548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12906321#action_12906321 ] Daniel Dai commented on PIG-1548: - Patch breaks TestFRJoin2.testConcatenateJobForScalar3. Commenting out TestFRJoin2.testConcatenateJobForScalar3 temporarily. Optimize scalar to consolidate the part file Key: PIG-1548 URL: https://issues.apache.org/jira/browse/PIG-1548 Project: Pig Issue Type: Improvement Components: impl Reporter: Daniel Dai Assignee: Richard Ding Fix For: 0.8.0 Attachments: PIG-1548.patch, PIG-1548_1.patch The current scalar implementation writes a scalar file onto DFS. When Pig needs the scalar, it opens the DFS file directly. Each scalar file contains more than one part file even though it contains only one record. This puts a huge load on the namenode. We should consolidate the part files before opening them. An optional further step is to put the consolidated file into the distributed cache, which brings down the load on the namenode even more. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
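The consolidation idea above can be sketched with a local-filesystem stand-in. This is only illustrative: `consolidate_part_files` and the `part-*` naming are assumptions, not Pig's actual implementation, and real code would use the Hadoop FileSystem API against DFS.

```python
import os
import shutil


def consolidate_part_files(scalar_dir, merged_name="scalar_merged"):
    """Concatenate all part-* files in scalar_dir into one file.

    A local-filesystem stand-in for the proposed consolidation: instead of
    readers opening every part file (one namenode round trip each), they
    open a single merged file, which could then also go into the
    distributed cache.
    """
    merged_path = os.path.join(scalar_dir, merged_name)
    part_files = sorted(
        f for f in os.listdir(scalar_dir) if f.startswith("part-")
    )
    with open(merged_path, "wb") as out:
        for name in part_files:
            with open(os.path.join(scalar_dir, name), "rb") as src:
                shutil.copyfileobj(src, out)  # append each part in order
    return merged_path
```

Since a scalar relation holds a single record, most part files are empty and the merged file stays tiny, so the copy cost is negligible compared to the namenode load it avoids.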
[jira] Commented: (PIG-1595) casting relation to scalar- problem with handling of data from non PigStorage loaders
[ https://issues.apache.org/jira/browse/PIG-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12906322#action_12906322 ] Daniel Dai commented on PIG-1595: - Patch breaks TestScalarAliases.testScalarErrMultipleRowsInInput. Commenting out TestScalarAliases.testScalarErrMultipleRowsInInput temporarily. casting relation to scalar- problem with handling of data from non PigStorage loaders - Key: PIG-1595 URL: https://issues.apache.org/jira/browse/PIG-1595 Project: Pig Issue Type: Bug Reporter: Thejas M Nair Assignee: Thejas M Nair Fix For: 0.8.0 Attachments: PIG-1595.1.patch If load functions that don't follow the same bytearray format as PigStorage for other supported datatypes, or that don't implement the LoadCaster interface, are used in 'casting relation to scalar' (PIG-1434), the query can fail or produce incorrect results. The root cause of the problem is that there is a real dependency between the ReadScalars udf that returns the scalar value and the LogicalOperator that acts as its input, but the logical plan does not capture this dependency. So in the SchemaResetter visitor used by the optimizer, the order in which the schema is reset and evaluated does not take this into consideration. If the schema of the input LogicalOperator does not get evaluated before the ReadScalars udf, the result type of the ReadScalars udf becomes bytearray. POUserFunc will convert the input to bytearray using 'new DataByteArray(inp.toString().getBytes())'. But this bytearray encoding of other supported types might not be the same as that of the load function associated with the column, and that can result in problems. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
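A toy illustration of the encoding mismatch described above. The loader, its binary byte format, and the function names here are hypothetical; the only behavior taken from the report is that POUserFunc falls back to the text rendering of the value (`toString().getBytes()`).

```python
import struct


def pigstorage_style_bytes(value):
    # Text-style bytearray: the UTF-8 rendering of the value, which is in
    # effect what 'new DataByteArray(inp.toString().getBytes())' produces.
    return str(value).encode("utf-8")


def binary_loader_bytes(value):
    # A hypothetical loader whose bytearray form for an int is a 4-byte
    # big-endian encoding rather than text.
    return struct.pack(">i", value)


def binary_loader_cast_to_int(raw):
    # The LoadCaster-style decode that matches binary_loader_bytes: it
    # expects exactly 4 binary bytes, not a text rendering.
    return struct.unpack(">i", raw)[0]
```

Round-tripping through the matching pair works (`42` comes back as `42`), but handing the caster the text-style bytes `b"42"` fails, because the two bytearray encodings of the same value are incompatible. That is the class of failure the issue describes.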
[jira] Commented: (PIG-1591) pig does not create a log file, if the MR job succeeds but front end fails.
[ https://issues.apache.org/jira/browse/PIG-1591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12905966#action_12905966 ] Daniel Dai commented on PIG-1591: - +1. No unit test needed since it is about an error message. Manually tested and it works. Will commit it shortly. pig does not create a log file, if the MR job succeeds but front end fails. --- Key: PIG-1591 URL: https://issues.apache.org/jira/browse/PIG-1591 Project: Pig Issue Type: Bug Reporter: niraj rai Assignee: niraj rai Attachments: pig_1591.patch When I run this script: A = load 'limit_empty.input_a' as (a1:int); B = load 'limit_empty.input_b' as (b1:int); C =COGROUP A by a1, B by b1; C1 = foreach C { Alim = limit A 1; Blim = limit B 1; generate Alim, Blim; } D1 = FOREACH C1 generate Alim,Blim, (IsEmpty(Alim)? 0:1), (IsEmpty(Blim)? 0:1), COUNT(Alim), COUNT(Blim); dump D1; The MR job succeeds but the pig job fails with the following error: 2010-08-31 13:33:09,960 [main] INFO org.apache.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized 2010-08-31 13:33:09,962 [main] INFO org.apache.pig.impl.io.InterStorage - Pig Internal storage in use 2010-08-31 13:33:09,963 [main] INFO org.apache.pig.impl.io.InterStorage - Pig Internal storage in use 2010-08-31 13:33:09,963 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success! 
2010-08-31 13:33:09,964 [main] INFO org.apache.pig.impl.io.InterStorage - Pig Internal storage in use 2010-08-31 13:33:09,965 [main] INFO org.apache.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized 2010-08-31 13:33:09,969 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1 2010-08-31 13:33:09,969 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1 2010-08-31 13:33:09,973 [main] ERROR org.apache.pig.backend.hadoop.executionengine.HJob - java.lang.ClassCastException: java.lang.Integer cannot be cast to org.apache.pig.data.Tuple Since the MR job succeeded, pig does not create any log file, but it should still create one, giving the cause of the failure. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (PIG-1591) pig does not create a log file, if the MR job succeeds but front end fails.
[ https://issues.apache.org/jira/browse/PIG-1591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai resolved PIG-1591. - Hadoop Flags: [Reviewed] Fix Version/s: 0.8.0 Resolution: Fixed Patch committed to both trunk and 0.8 branch. pig does not create a log file, if the MR job succeeds but front end fails. --- Key: PIG-1591 URL: https://issues.apache.org/jira/browse/PIG-1591 Project: Pig Issue Type: Bug Reporter: niraj rai Assignee: niraj rai Fix For: 0.8.0 Attachments: pig_1591.patch When I run this script: A = load 'limit_empty.input_a' as (a1:int); B = load 'limit_empty.input_b' as (b1:int); C =COGROUP A by a1, B by b1; C1 = foreach C { Alim = limit A 1; Blim = limit B 1; generate Alim, Blim; } D1 = FOREACH C1 generate Alim,Blim, (IsEmpty(Alim)? 0:1), (IsEmpty(Blim)? 0:1), COUNT(Alim), COUNT(Blim); dump D1; The MR job succeeds but the pig job fails with the following error: 2010-08-31 13:33:09,960 [main] INFO org.apache.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized 2010-08-31 13:33:09,962 [main] INFO org.apache.pig.impl.io.InterStorage - Pig Internal storage in use 2010-08-31 13:33:09,963 [main] INFO org.apache.pig.impl.io.InterStorage - Pig Internal storage in use 2010-08-31 13:33:09,963 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success! 
2010-08-31 13:33:09,964 [main] INFO org.apache.pig.impl.io.InterStorage - Pig Internal storage in use 2010-08-31 13:33:09,965 [main] INFO org.apache.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized 2010-08-31 13:33:09,969 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1 2010-08-31 13:33:09,969 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1 2010-08-31 13:33:09,973 [main] ERROR org.apache.pig.backend.hadoop.executionengine.HJob - java.lang.ClassCastException: java.lang.Integer cannot be cast to org.apache.pig.data.Tuple Since the MR job succeeded, pig does not create any log file, but it should still create one, giving the cause of the failure. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1601) Make scalar work for secure hadoop
Make scalar work for secure hadoop -- Key: PIG-1601 URL: https://issues.apache.org/jira/browse/PIG-1601 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.8.0 Reporter: Daniel Dai Assignee: Daniel Dai Fix For: 0.8.0 Attachments: PIG-1601-1.patch Error message: open file 'hdfs://gsbl90890.blue.ygrid.yahoo.com/tmp/temp851711738/tmp727366271'; error = java.io.IOException: Delegation Token can be issued only with kerberos or web authentication at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:4975) at org.apache.hadoop.hdfs.server.namenode.NameNode.getDelegationToken(NameNode.java:432) at sun.reflect.GeneratedMethodAccessor22.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:523) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1301) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1297) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1062) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1295) at org.apache.pig.impl.builtin.ReadScalars.exec(ReadScalars.java:66) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:229) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:313) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(POCast.java:448) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(POCast.java:441) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.Divide.getNext(Divide.java:72) at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:358) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:291) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:236) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:231) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53) at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:638) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:314) at org.apache.hadoop.mapred.Child$4.run(Child.java:217) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1062) at org.apache.hadoop.mapred.Child.main(Child.java:211) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1601) Make scalar work for secure hadoop
[ https://issues.apache.org/jira/browse/PIG-1601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1601: Attachment: PIG-1601-1.patch Make scalar work for secure hadoop -- Key: PIG-1601 URL: https://issues.apache.org/jira/browse/PIG-1601 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.8.0 Reporter: Daniel Dai Assignee: Daniel Dai Fix For: 0.8.0 Attachments: PIG-1601-1.patch Error message: open file 'hdfs://gsbl90890.blue.ygrid.yahoo.com/tmp/temp851711738/tmp727366271'; error = java.io.IOException: Delegation Token can be issued only with kerberos or web authentication at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:4975) at org.apache.hadoop.hdfs.server.namenode.NameNode.getDelegationToken(NameNode.java:432) at sun.reflect.GeneratedMethodAccessor22.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:523) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1301) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1297) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1062) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1295) at org.apache.pig.impl.builtin.ReadScalars.exec(ReadScalars.java:66) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:229) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:313) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(POCast.java:448) at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(POCast.java:441) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.Divide.getNext(Divide.java:72) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:358) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:291) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:236) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:231) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53) at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:638) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:314) at org.apache.hadoop.mapred.Child$4.run(Child.java:217) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1062) at org.apache.hadoop.mapred.Child.main(Child.java:211) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1543) IsEmpty returns the wrong value after using LIMIT
[ https://issues.apache.org/jira/browse/PIG-1543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12905587#action_12905587 ] Daniel Dai commented on PIG-1543: - test-patch result: [exec] +1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 3 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings. All tests pass. IsEmpty returns the wrong value after using LIMIT - Key: PIG-1543 URL: https://issues.apache.org/jira/browse/PIG-1543 Project: Pig Issue Type: Bug Affects Versions: 0.7.0 Reporter: Justin Hu Assignee: Daniel Dai Fix For: 0.8.0 Attachments: PIG-1543-1.patch 1. Two input files: 1a: limit_empty.input_a 1 1 1 1b: limit_empty.input_b 2 2 2. The pig script: limit_empty.pig -- A contains only 1's B contains only 2's A = load 'limit_empty.input_a' as (a1:int); B = load 'limit_empty.input_a' as (b1:int); C =COGROUP A by a1, B by b1; D = FOREACH C generate A, B, (IsEmpty(A)? 0:1), (IsEmpty(B)? 0:1), COUNT(A), COUNT(B); store D into 'limit_empty.output/d'; -- After the script is done, we see the right results: -- {(1),(1),(1)} {} 1 0 3 0 -- {} {(2),(2)} 0 1 0 2 C1 = foreach C { Alim = limit A 1; Blim = limit B 1; generate Alim, Blim; } D1 = FOREACH C1 generate Alim,Blim, (IsEmpty(Alim)? 0:1), (IsEmpty(Blim)? 0:1), COUNT(Alim), COUNT(Blim); store D1 into 'limit_empty.output/d1'; -- After the script is done, we see the unexpected results: -- {(1)} {}1 1 1 0 -- {} {(2)} 1 1 0 1 dump D; dump D1; 3. Run the script and redirect the stdout (2 dumps) to a file. 
There are two issues: The major one: IsEmpty() returns FALSE for an empty bag in limit_empty.output/d1/*, while IsEmpty() returns correctly in limit_empty.output/d/*. The difference is that in one case LIMIT was applied before using IsEmpty(). The minor one: The redirected output only contains the first dump: ({(1),(1),(1)},{},1,0,3L,0L) ({},{(2),(2)},0,1,0L,2L) We expect two more lines like: ({(1)},{},1,1,1L,0L) ({},{(2)},1,1,0L,1L) Besides, there is an error that says: [main] ERROR org.apache.pig.backend.hadoop.executionengine.HJob - java.lang.ClassCastException: java.lang.Integer cannot be cast to org.apache.pig.data.Tuple -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
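The expected IsEmpty-after-LIMIT behavior can be modeled in a few lines. This is a toy model of the intended semantics, not Pig's implementation; `is_empty` and `limit` are illustrative names, with bags modeled as Python lists of tuples.

```python
def is_empty(bag):
    # Intended IsEmpty semantics: true exactly when the bag holds no
    # tuples, regardless of how the bag was produced (including after a
    # LIMIT inside a nested foreach).
    return len(bag) == 0


def limit(bag, n):
    # Toy model of LIMIT inside a nested foreach: keep at most n tuples.
    return bag[:n]
```

With A holding three tuples and B empty, the `(IsEmpty(X)? 0:1)` flags after LIMIT should be 1 and 0 respectively, matching the correct d output; the reported bug is that the d1 pipeline yields 1 for the empty bag as well.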
[jira] Updated: (PIG-1587) Cloning utility functions for new logical plan
[ https://issues.apache.org/jira/browse/PIG-1587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1587: Description: We sometimes need to copy a logical operator/plan when writing an optimization rule. Currently, copying an operator/plan is awkward. We need to write some utilities to facilitate this process. Swati contributed PIG-1510, but we feel it still cannot address most use cases. I propose to add some more utilities to the new logical plan: all LogicalExpressions: {code} copy(LogicalExpressionPlan newPlan, boolean keepUid); {code} * Do a shallow copy of the logical expression operator (except for fieldSchema, uidOnlySchema, ProjectExpression.attachedRelationalOp) * Set the plan to newPlan * If keepUid is true, further copy uidOnlyFieldSchema all LogicalRelationalOperators: {code} copy(LogicalPlan newPlan, boolean keepUid); {code} * Do a shallow copy of the logical relational operator (except for schema, uid related fields) * Set the plan to newPlan * If the operator has an inner plan/expression plan, copy the whole inner plan with the same keepUid flag (In particular, LOInnerLoad will copy its inner project with the same keepUid flag) * If keepUid is true, further copy uid related fields (LOUnion.uidMapping, LOCogroup.groupKeyUidOnlySchema, LOCogroup.generatedInputUids) LogicalExpressionPlan.java {code} LogicalExpressionPlan copy(LogicalRelationalOperator attachedRelationalOp, boolean keepUid); LogicalExpressionPlan copyAbove(LogicalExpression leave, LogicalRelationalOperator attachedRelationalOp, boolean keepUid); LogicalExpressionPlan copyBelow(LogicalExpression root, LogicalRelationalOperator attachedRelationalOp, boolean keepUid); {code} * Create a new logical expression plan and copy the expression operators along with their connections, with the same keepUid flag * Set all ProjectExpression.attachedRelationalOp to the attachedRelationalOp parameter {code} Pair<List<Operator>, List<Operator>> merge(LogicalExpressionPlan plan, 
LogicalRelationalOperator attachedRelationalOp); {code} * Merge plan into the current logical expression plan as an independent tree * attachedRelationalOp is the destination operator the new logical expression plan is attached to * return the sources/sinks of this independent tree LogicalPlan.java {code} LogicalPlan copy(LOForEach foreach, boolean keepUid); LogicalPlan copyAbove(LogicalRelationalOperator leave, LOForEach foreach, boolean keepUid); LogicalPlan copyBelow(LogicalRelationalOperator root, LOForEach foreach, boolean keepUid); {code} * Main use case is to copy the inner plan of a ForEach * Create a new logical plan and copy the relational operators along with their connections * Copy all expression plans inside the relational operators, setting plan and attachedRelationalOp properly * If the plan is a ForEach inner plan, the foreach param is the destination ForEach operator; otherwise, pass null {code} Pair<List<Operator>, List<Operator>> merge(LogicalPlan plan, LOForEach foreach); {code} * Merge plan into the current logical plan as an independent tree * foreach is the destination LOForEach if the destination plan is a ForEach inner plan; otherwise, pass null * return the sources/sinks of this independent tree was: We sometimes need to copy a logical operator/plan when writing an optimization rule. Currently, copying an operator/plan is awkward. We need to write some utilities to facilitate this process. Swati contributed PIG-1510, but we feel it still cannot address most use cases. 
I propose to add some more utilities to the new logical plan: all LogicalExpressions: {code} copy(LogicalExpressionPlan newPlan, boolean keepUid); {code} * Do a shallow copy of the logical expression operator (except for fieldSchema, uidOnlySchema, ProjectExpression.attachedRelationalOp) * Set the plan to newPlan * If keepUid is true, further copy uidOnlyFieldSchema all LogicalRelationalOperators: {code} copy(LogicalPlan newPlan, boolean keepUid); {code} * Do a shallow copy of the logical relational operator (except for schema, uid related fields) * Set the plan to newPlan * If the operator has an inner plan/expression plan, copy the whole inner plan with the same keepUid flag (In particular, LOInnerLoad will copy its inner project with the same keepUid flag) * If keepUid is true, further copy uid related fields (LOUnion.uidMapping, LOCogroup.groupKeyUidOnlySchema, LOCogroup.generatedInputUids) LogicalExpressionPlan.java {code} LogicalExpressionPlan copy(LogicalRelationalOperator attachedRelationalOp, boolean keepUid); {code} * Copy the expression operators along with their connections, with the same keepUid flag * Set all ProjectExpression.attachedRelationalOp to the attachedRelationalOp parameter {code} List<Operator> merge(LogicalExpressionPlan plan); {code} * Merge plan into the current logical expression plan as an independent tree * return the sources of this independent tree LogicalPlan.java {code} LogicalPlan copy(boolean keepUid); {code} * Main use
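The copy-with-rebinding contract proposed above can be sketched generically. This is a toy Python model, not the actual Java API: `LogicalOp` and `copy_to` are illustrative names standing in for the proposed copy(newPlan, keepUid) methods.

```python
import copy


class LogicalOp:
    """Toy stand-in for a logical operator: a name, an owning plan, a uid."""

    def __init__(self, name, plan, uid=None):
        self.name, self.plan, self.uid = name, plan, uid

    def copy_to(self, new_plan, keep_uid):
        # Shallow copy rebound to new_plan; uid survives only when
        # requested, mirroring the proposed copy(newPlan, keepUid)
        # contract where schema/uid fields are excluded by default.
        cloned = copy.copy(self)
        cloned.plan = new_plan
        if not keep_uid:
            cloned.uid = None
        return cloned
```

The key property the proposal describes is preserved here: the clone belongs to the destination plan, the original is untouched, and uid-related state is carried over only when keepUid is true.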
[jira] Updated: (PIG-1543) IsEmpty returns the wrong value after using LIMIT
[ https://issues.apache.org/jira/browse/PIG-1543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1543: Status: Patch Available (was: Open) IsEmpty returns the wrong value after using LIMIT - Key: PIG-1543 URL: https://issues.apache.org/jira/browse/PIG-1543 Project: Pig Issue Type: Bug Affects Versions: 0.7.0 Reporter: Justin Hu Assignee: Daniel Dai Fix For: 0.8.0 Attachments: PIG-1543-1.patch 1. Two input files: 1a: limit_empty.input_a 1 1 1 1b: limit_empty.input_b 2 2 2. The pig script: limit_empty.pig -- A contains only 1's B contains only 2's A = load 'limit_empty.input_a' as (a1:int); B = load 'limit_empty.input_a' as (b1:int); C =COGROUP A by a1, B by b1; D = FOREACH C generate A, B, (IsEmpty(A)? 0:1), (IsEmpty(B)? 0:1), COUNT(A), COUNT(B); store D into 'limit_empty.output/d'; -- After the script is done, we see the right results: -- {(1),(1),(1)} {} 1 0 3 0 -- {} {(2),(2)} 0 1 0 2 C1 = foreach C { Alim = limit A 1; Blim = limit B 1; generate Alim, Blim; } D1 = FOREACH C1 generate Alim,Blim, (IsEmpty(Alim)? 0:1), (IsEmpty(Blim)? 0:1), COUNT(Alim), COUNT(Blim); store D1 into 'limit_empty.output/d1'; -- After the script is done, we see the unexpected results: -- {(1)} {}1 1 1 0 -- {} {(2)} 1 1 0 1 dump D; dump D1; 3. Run the script and redirect the stdout (2 dumps) to a file. There are two issues: The major one: IsEmpty() returns FALSE for an empty bag in limit_empty.output/d1/*, while IsEmpty() returns correctly in limit_empty.output/d/*. The difference is that in one case LIMIT was applied before using IsEmpty(). 
The minor one: The redirected output only contains the first dump: ({(1),(1),(1)},{},1,0,3L,0L) ({},{(2),(2)},0,1,0L,2L) We expect two more lines like: ({(1)},{},1,1,1L,0L) ({},{(2)},1,1,0L,1L) Besides, there is an error that says: [main] ERROR org.apache.pig.backend.hadoop.executionengine.HJob - java.lang.ClassCastException: java.lang.Integer cannot be cast to org.apache.pig.data.Tuple -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1572) change default datatype when relations are used as scalar to bytearray
[ https://issues.apache.org/jira/browse/PIG-1572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12905293#action_12905293 ] Daniel Dai commented on PIG-1572: - Patch looks good. One minor doubt: when we migrate to the new logical plan, UserFuncExpression already has the necessary cast inserted, so it seems we do not need to change the new logical plan's UserFuncExpression.getFieldSchema(), am I right? change default datatype when relations are used as scalar to bytearray -- Key: PIG-1572 URL: https://issues.apache.org/jira/browse/PIG-1572 Project: Pig Issue Type: Bug Reporter: Thejas M Nair Assignee: Thejas M Nair Fix For: 0.8.0 Attachments: PIG-1572.1.patch, PIG-1572.2.patch When relations are cast to scalar, the current default type is chararray. This is inconsistent with the behavior in the rest of Pig Latin. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1583) piggybank unit test TestLookupInFiles is broken
[ https://issues.apache.org/jira/browse/PIG-1583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1583: Status: Resolved (was: Patch Available) Hadoop Flags: [Reviewed] Resolution: Fixed Patch committed to both trunk and 0.8 branch. piggybank unit test TestLookupInFiles is broken --- Key: PIG-1583 URL: https://issues.apache.org/jira/browse/PIG-1583 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.8.0 Reporter: Daniel Dai Assignee: Daniel Dai Fix For: 0.8.0 Attachments: PIG-1583-1.patch Error message: 10/08/31 09:32:12 INFO mapred.TaskInProgress: Error from attempt_20100831093139211_0001_m_00_3: org.apache.pig.backend.executionengine.ExecException: ERROR 2078: Caught error from UDF: org.apache.pig.piggybank.evaluation.string.LookupInFiles [LookupInFiles : Cannot open file one] at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:262) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:283) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:355) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:291) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:236) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:231) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53) at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305) at org.apache.hadoop.mapred.Child.main(Child.java:170) Caused by: java.io.IOException: LookupInFiles : Cannot open file one at 
org.apache.pig.piggybank.evaluation.string.LookupInFiles.init(LookupInFiles.java:92) at org.apache.pig.piggybank.evaluation.string.LookupInFiles.exec(LookupInFiles.java:115) at org.apache.pig.piggybank.evaluation.string.LookupInFiles.exec(LookupInFiles.java:49) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:229) ... 10 more Caused by: java.io.IOException: hdfs://localhost:47453/user/hadoopqa/one does not exist at org.apache.pig.impl.io.FileLocalizer.openDFSFile(FileLocalizer.java:224) at org.apache.pig.impl.io.FileLocalizer.openDFSFile(FileLocalizer.java:172) at org.apache.pig.piggybank.evaluation.string.LookupInFiles.init(LookupInFiles.java:89) ... 13 more -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1583) piggybank unit test TestLookupInFiles is broken
piggybank unit test TestLookupInFiles is broken --- Key: PIG-1583 URL: https://issues.apache.org/jira/browse/PIG-1583 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.8.0 Reporter: Daniel Dai Assignee: Daniel Dai Fix For: 0.8.0 Attachments: PIG-1583-1.patch Error message: 10/08/31 09:32:12 INFO mapred.TaskInProgress: Error from attempt_20100831093139211_0001_m_00_3: org.apache.pig.backend.executionengine.ExecException: ERROR 2078: Caught error from UDF: org.apache.pig.piggybank.evaluation.string.LookupInFiles [LookupInFiles : Cannot open file one] at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:262) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:283) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:355) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:291) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:236) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:231) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53) at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305) at org.apache.hadoop.mapred.Child.main(Child.java:170) Caused by: java.io.IOException: LookupInFiles : Cannot open file one at org.apache.pig.piggybank.evaluation.string.LookupInFiles.init(LookupInFiles.java:92) at org.apache.pig.piggybank.evaluation.string.LookupInFiles.exec(LookupInFiles.java:115) at org.apache.pig.piggybank.evaluation.string.LookupInFiles.exec(LookupInFiles.java:49) at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:229) ... 10 more Caused by: java.io.IOException: hdfs://localhost:47453/user/hadoopqa/one does not exist at org.apache.pig.impl.io.FileLocalizer.openDFSFile(FileLocalizer.java:224) at org.apache.pig.impl.io.FileLocalizer.openDFSFile(FileLocalizer.java:172) at org.apache.pig.piggybank.evaluation.string.LookupInFiles.init(LookupInFiles.java:89) ... 13 more -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1583) piggybank unit test TestLookupInFiles is broken
[ https://issues.apache.org/jira/browse/PIG-1583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1583: Attachment: (was: PIG-1583-1.patch) piggybank unit test TestLookupInFiles is broken --- Key: PIG-1583 URL: https://issues.apache.org/jira/browse/PIG-1583 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.8.0 Reporter: Daniel Dai Assignee: Daniel Dai Fix For: 0.8.0 Attachments: PIG-1583-1.patch Error message: 10/08/31 09:32:12 INFO mapred.TaskInProgress: Error from attempt_20100831093139211_0001_m_00_3: org.apache.pig.backend.executionengine.ExecException: ERROR 2078: Caught error from UDF: org.apache.pig.piggybank.evaluation.string.LookupInFiles [LookupInFiles : Cannot open file one] at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:262) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:283) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:355) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:291) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:236) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:231) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53) at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305) at org.apache.hadoop.mapred.Child.main(Child.java:170) Caused by: java.io.IOException: LookupInFiles : Cannot open file one at org.apache.pig.piggybank.evaluation.string.LookupInFiles.init(LookupInFiles.java:92) at 
org.apache.pig.piggybank.evaluation.string.LookupInFiles.exec(LookupInFiles.java:115) at org.apache.pig.piggybank.evaluation.string.LookupInFiles.exec(LookupInFiles.java:49) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:229) ... 10 more Caused by: java.io.IOException: hdfs://localhost:47453/user/hadoopqa/one does not exist at org.apache.pig.impl.io.FileLocalizer.openDFSFile(FileLocalizer.java:224) at org.apache.pig.impl.io.FileLocalizer.openDFSFile(FileLocalizer.java:172) at org.apache.pig.piggybank.evaluation.string.LookupInFiles.init(LookupInFiles.java:89) ... 13 more -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1583) piggybank unit test TestLookupInFiles is broken
[ https://issues.apache.org/jira/browse/PIG-1583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1583: Attachment: PIG-1583-1.patch piggybank unit test TestLookupInFiles is broken --- Key: PIG-1583 URL: https://issues.apache.org/jira/browse/PIG-1583 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.8.0 Reporter: Daniel Dai Assignee: Daniel Dai Fix For: 0.8.0 Attachments: PIG-1583-1.patch Error message: 10/08/31 09:32:12 INFO mapred.TaskInProgress: Error from attempt_20100831093139211_0001_m_00_3: org.apache.pig.backend.executionengine.ExecException: ERROR 2078: Caught error from UDF: org.apache.pig.piggybank.evaluation.string.LookupInFiles [LookupInFiles : Cannot open file one] at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:262) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:283) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:355) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:291) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:236) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:231) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53) at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305) at org.apache.hadoop.mapred.Child.main(Child.java:170) Caused by: java.io.IOException: LookupInFiles : Cannot open file one at org.apache.pig.piggybank.evaluation.string.LookupInFiles.init(LookupInFiles.java:92) at 
org.apache.pig.piggybank.evaluation.string.LookupInFiles.exec(LookupInFiles.java:115) at org.apache.pig.piggybank.evaluation.string.LookupInFiles.exec(LookupInFiles.java:49) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:229) ... 10 more Caused by: java.io.IOException: hdfs://localhost:47453/user/hadoopqa/one does not exist at org.apache.pig.impl.io.FileLocalizer.openDFSFile(FileLocalizer.java:224) at org.apache.pig.impl.io.FileLocalizer.openDFSFile(FileLocalizer.java:172) at org.apache.pig.piggybank.evaluation.string.LookupInFiles.init(LookupInFiles.java:89) ... 13 more -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1587) Cloning utility functions for new logical plan
Cloning utility functions for new logical plan -- Key: PIG-1587 URL: https://issues.apache.org/jira/browse/PIG-1587 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.8.0 Reporter: Daniel Dai Fix For: 0.9.0 We sometimes need to copy a logical operator/plan when writing an optimization rule. Currently, copying an operator/plan is awkward. We need to write some utilities to facilitate this process. Swati contributed PIG-1510, but we feel it still cannot address most use cases. I propose to add some more utilities to the new logical plan: all LogicalExpressions: {code} copy(LogicalExpressionPlan newPlan, boolean keepUid); {code} * Do a shallow copy of the logical expression operator (except for fieldSchema, uidOnlySchema, ProjectExpression.attachedRelationalOp) * Set the plan to newPlan * If keepUid is true, further copy uidOnlyFieldSchema all LogicalRelationalOperators: {code} copy(LogicalPlan newPlan, boolean keepUid); {code} * Do a shallow copy of the logical relational operator (except for schema and uid-related fields) * Set the plan to newPlan * If the operator has an inner plan/expression plan, copy the whole inner plan with the same keepUid flag (in particular, LOInnerLoad will copy its inner project with the same keepUid flag) * If keepUid is true, further copy the uid-related fields (LOUnion.uidMapping, LOCogroup.groupKeyUidOnlySchema, LOCogroup.generatedInputUids) LogicalExpressionPlan.java {code} LogicalExpressionPlan copy(LogicalRelationalOperator attachedRelationalOp, boolean keepUid); {code} * Copy expression operators along with connections, with the same keepUid flag * Set all ProjectExpression.attachedRelationalOp to the attachedRelationalOp parameter {code} List<Operator> merge(LogicalExpressionPlan plan); {code} * Merge plan into the current logical expression plan as an independent tree * Return the sources of this independent tree LogicalPlan.java {code} LogicalPlan copy(boolean keepUid); {code} * The main use case is to copy the inner plan of ForEach * Copy all relational operators along with connections * Copy all expression plans inside the relational operators, setting plan and attachedRelationalOp properly {code} List<Operator> merge(LogicalPlan plan); {code} * Merge plan into the current logical plan as an independent tree * Return the sources of this independent tree -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
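The copy semantics proposed above (shallow copy, reattach to a new plan, optionally preserve uid state) can be modeled with a minimal Python sketch; the class and field names below are illustrative stand-ins, not Pig's actual types.

```python
import copy as copy_mod

class ToyOperator:
    """Illustrative stand-in for a logical operator: payload plus uid state."""
    def __init__(self, name, plan=None, uid_schema=None):
        self.name = name
        self.plan = plan              # owning plan, like attachedRelationalOp
        self.uid_schema = uid_schema  # like uidOnlyFieldSchema

    def copy(self, new_plan, keep_uid):
        # Shallow copy: share payload references, but reattach to new_plan.
        clone = copy_mod.copy(self)
        clone.plan = new_plan
        if not keep_uid:
            clone.uid_schema = None   # drop uid state unless keepUid is true
        return clone

plan_a, plan_b = object(), object()
op = ToyOperator("LOFilter", plan_a, uid_schema=("f1", 42))
clone = op.copy(plan_b, keep_uid=True)
```

Note that the flag only controls whether uid state survives; the clone is always reattached to the target plan, mirroring the bullet points above.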
[jira] Updated: (PIG-1574) Optimization rule PushUpFilter causes filter to be pushed up out joins
[ https://issues.apache.org/jira/browse/PIG-1574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1574: Status: Resolved (was: Patch Available) Hadoop Flags: [Reviewed] Resolution: Fixed test-patch result: jira-1574-1.patch [exec] +1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 3 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings. This patch does not push a filter before a join if the join is an outer join. Actually we can push the filter to the outer side of the join. I assume it will be addressed in PIG-1575. Patch jira-1574-1.patch committed. Thanks Xuefu! Optimization rule PushUpFilter causes filter to be pushed up out joins -- Key: PIG-1574 URL: https://issues.apache.org/jira/browse/PIG-1574 Project: Pig Issue Type: Bug Reporter: Xuefu Zhang Assignee: Xuefu Zhang Fix For: 0.8.0 Attachments: jira-1574-1.patch The PushUpFilter optimization rule in the new logical plan moves the filter up to one of the join branches. It does this aggressively by finding an operator that has all the projection UIDs. However, it didn't consider that the found operator might be another join. If that join is outer, then we cannot simply move the filter to one of its branches. As an example, the following script will be erroneously optimized: A = load 'myfile' as (d1:int); B = load 'anotherfile' as (d2:int); C = join A by d1 full outer, B by d2; D = load 'xxx' as (d3:int); E = join C by d1, D by d3; F = filter E by d1 > 5; G = store F into 'dummy'; -- This message is automatically generated by JIRA. 
- You can reply to this email to add a comment to the issue online.
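Why the filter cannot simply be moved into a branch of a full outer join can be seen with a small Python model (hypothetical single-column relations, not Pig code): null-padded rows behave differently depending on whether the filter runs before or after the join.

```python
def full_outer_join(left, right):
    """Full outer join of two single-column relations on value;
    non-matching rows are padded with None (null)."""
    out = [(l, l) if l in set(right) else (l, None) for l in left]
    out += [(None, r) for r in right if r not in set(left)]
    return out

def keep_gt5(rows, idx):
    # Pig semantics: a comparison against null is not true, so None is dropped.
    return [t for t in rows if t[idx] is not None and t[idx] > 5]

A, B = [3, 7], [3, 9]
correct = keep_gt5(full_outer_join(A, B), 0)          # filter stays after the join
wrong = full_outer_join([a for a in A if a > 5], B)   # filter pushed into A's branch
```

Filtering after the join yields only (7, None), but pushing the condition into A's branch first also surfaces (None, 3), because A's row 3 no longer matches B's row 3.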
[jira] Updated: (PIG-1568) Optimization rule FilterAboveForeach is too restrictive and doesn't handle project * correctly
[ https://issues.apache.org/jira/browse/PIG-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1568: Status: Resolved (was: Patch Available) Hadoop Flags: [Reviewed] Resolution: Fixed test-patch result: [exec] +1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 6 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings. Patch committed. Thanks Xuefu! Optimization rule FilterAboveForeach is too restrictive and doesn't handle project * correctly -- Key: PIG-1568 URL: https://issues.apache.org/jira/browse/PIG-1568 Project: Pig Issue Type: Bug Reporter: Xuefu Zhang Assignee: Xuefu Zhang Fix For: 0.8.0 Attachments: jira-1568-1.patch, jira-1568-1.patch The FilterAboveForeach rule optimizes the plan by pushing a filter above the preceding foreach operator. However, during code review, two major problems were found: 1. The current implementation assumes that if no projection is found in the filter condition then all columns from the foreach are projected. This issue prevents the following optimization: A = LOAD 'file.txt' AS (a(u,v), b, c); B = FOREACH A GENERATE $0, b; C = FILTER B BY 8 > 5; STORE C INTO 'empty'; 2. The current implementation doesn't handle the * projection, which projects all columns. As a result, it wasn't able to optimize the following: A = LOAD 'file.txt' AS (a(u,v), b, c); B = FOREACH A GENERATE $0, b; C = FILTER B BY Identity.class.getName(*) > 5; STORE C INTO 'empty'; -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
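Both gaps boil down to how the set of columns projected by the filter condition is computed: an empty set means the condition touches no columns at all (and is trivially pushable), while * must expand to every column. A hypothetical Python sketch of that computation, not Pig's implementation:

```python
def projected_columns(condition_refs, total_cols):
    """Return the set of input column indexes a filter condition references.
    condition_refs is a flat list: ints are column projections, '*' is a
    project-everything marker, any other entry is a constant or operator."""
    cols = set()
    for ref in condition_refs:
        if ref == "*":
            cols |= set(range(total_cols))  # '*' really means all columns
        elif isinstance(ref, int):
            cols.add(ref)
        # constants (e.g. the 8 and 5 in a constant-only condition) add nothing
    return cols

no_cols = projected_columns(["const", "const"], 2)  # constant-only condition
star = projected_columns(["*"], 2)                  # UDF over *
```

An empty result should not be read as "all columns are needed", which is the mistaken assumption behind problem 1.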
[jira] Created: (PIG-1579) Intermittent unit test failure for TestScriptUDF.testPythonScriptUDFNullInputOutput
Intermittent unit test failure for TestScriptUDF.testPythonScriptUDFNullInputOutput --- Key: PIG-1579 URL: https://issues.apache.org/jira/browse/PIG-1579 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Daniel Dai Fix For: 0.8.0 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1579) Intermittent unit test failure for TestScriptUDF.testPythonScriptUDFNullInputOutput
[ https://issues.apache.org/jira/browse/PIG-1579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1579: Attachment: PIG-1579-1.patch Attaching a fix. However, this fix is shallow and may need an in-depth look. Committing the temporary fix and leaving the Jira open. Intermittent unit test failure for TestScriptUDF.testPythonScriptUDFNullInputOutput --- Key: PIG-1579 URL: https://issues.apache.org/jira/browse/PIG-1579 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Daniel Dai Fix For: 0.8.0 Attachments: PIG-1579-1.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1579) Intermittent unit test failure for TestScriptUDF.testPythonScriptUDFNullInputOutput
[ https://issues.apache.org/jira/browse/PIG-1579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1579: Description: Error message: org.apache.pig.backend.executionengine.ExecException: ERROR 0: Error executing function: Traceback (most recent call last): File iostream, line 5, in multStr TypeError: can't multiply sequence by non-int of type 'NoneType' at org.apache.pig.scripting.jython.JythonFunction.exec(JythonFunction.java:107) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:229) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:295) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:346) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:291) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:236) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:231) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53) at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305) at org.apache.hadoop.mapred.Child.main(Child.java:170) Intermittent unit test failure for TestScriptUDF.testPythonScriptUDFNullInputOutput --- Key: PIG-1579 URL: https://issues.apache.org/jira/browse/PIG-1579 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Daniel Dai Fix For: 0.8.0 Attachments: PIG-1579-1.patch Error message: org.apache.pig.backend.executionengine.ExecException: ERROR 0: Error executing function: Traceback (most recent call last): File iostream, line 5, in multStr TypeError: can't multiply sequence by 
non-int of type 'NoneType' at org.apache.pig.scripting.jython.JythonFunction.exec(JythonFunction.java:107) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:229) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:295) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:346) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:291) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:236) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:231) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53) at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305) at org.apache.hadoop.mapred.Child.main(Child.java:170) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
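The traceback's root cause is plain Python/Jython behavior: multiplying a string by None raises exactly this TypeError. A sketch reproducing it, with a hypothetical mult_str (guessed from the traceback's multStr) that guards nulls the way the test expects:

```python
def mult_str(s, num):
    """Hypothetical UDF body: repeat s num times, passing nulls through
    instead of letting 'str * None' raise."""
    if s is None or num is None:
        return None  # mirror Pig's null-in/null-out semantics
    return s * num

# The failing operation from the traceback:
try:
    "ab" * None
    err = ""
except TypeError as e:
    err = str(e)  # "can't multiply sequence by non-int of type 'NoneType'"
```

With the guard in place, a null input produces a null output instead of an ExecException from the Jython bridge.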
[jira] Resolved: (PIG-365) Map side optimization for Limit (top k case)
[ https://issues.apache.org/jira/browse/PIG-365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai resolved PIG-365. Resolution: Won't Fix Map side optimization for Limit (top k case) Key: PIG-365 URL: https://issues.apache.org/jira/browse/PIG-365 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.2.0 Reporter: Daniel Dai Assignee: Daniel Dai Priority: Minor In map side, only collect top k records to improve performance -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-365) Map side optimization for Limit (top k case)
[ https://issues.apache.org/jira/browse/PIG-365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12903996#action_12903996 ] Daniel Dai commented on PIG-365: Hi, Gianmarco, yes, you are right. This is a quite old Jira and it is no longer applicable. I will close this Jira. A more recent limit optimization we are still looking at is [PIG-1270|https://issues.apache.org/jira/browse/PIG-1270]. Map side optimization for Limit (top k case) Key: PIG-365 URL: https://issues.apache.org/jira/browse/PIG-365 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.2.0 Reporter: Daniel Dai Assignee: Daniel Dai Priority: Minor In map side, only collect top k records to improve performance -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
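The proposed map-side optimization amounts to a bounded min-heap: each map task retains at most k records, so only k rows per task reach the shuffle. A minimal Python sketch (illustrative; not the PIG-1270 design):

```python
import heapq

def map_side_top_k(records, k):
    """Stream records, retaining only the k largest seen so far,
    so the map task emits at most k records to the shuffle."""
    heap = []
    for r in records:
        if len(heap) < k:
            heapq.heappush(heap, r)
        elif r > heap[0]:
            heapq.heapreplace(heap, r)   # evict the current smallest
    return sorted(heap, reverse=True)
```

Each map task would emit only its k survivors, and the reduce side then takes the global top k from at most k records per map.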
[jira] Commented: (PIG-1178) LogicalPlan and Optimizer are too complex and hard to work with
[ https://issues.apache.org/jira/browse/PIG-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12903803#action_12903803 ] Daniel Dai commented on PIG-1178: - test-patch result for PIG-1178-8: [exec] +1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 3 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings. Patch committed. LogicalPlan and Optimizer are too complex and hard to work with --- Key: PIG-1178 URL: https://issues.apache.org/jira/browse/PIG-1178 Project: Pig Issue Type: Improvement Reporter: Alan Gates Assignee: Daniel Dai Fix For: 0.8.0 Attachments: expressions-2.patch, expressions.patch, lp.patch, lp.patch, PIG-1178-4.patch, PIG-1178-5.patch, PIG-1178-6.patch, PIG-1178-7.patch, PIG-1178-8.patch, pig_1178.patch, pig_1178.patch, PIG_1178.patch, pig_1178_2.patch, pig_1178_3.2.patch, pig_1178_3.3.patch, pig_1178_3.4.patch, pig_1178_3.patch The current implementation of the logical plan and the logical optimizer in Pig has proven to not be easily extensible. Developer feedback has indicated that adding new rules to the optimizer is quite burdensome. In addition, the logical plan has been an area of numerous bugs, many of which have been difficult to fix. Developers also feel that the logical plan is difficult to understand and maintain. The root cause for these issues is that a number of design decisions that were made as part of the 0.2 rewrite of the front end have now proven to be sub-optimal. 
The heart of this proposal is to revisit a number of those proposals and rebuild the logical plan with a simpler design that will make it much easier to maintain the logical plan as well as extend the logical optimizer. See http://wiki.apache.org/pig/PigLogicalPlanOptimizerRewrite for full details. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-506) Does pig need a NATIVE keyword?
[ https://issues.apache.org/jira/browse/PIG-506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12903263#action_12903263 ] Daniel Dai commented on PIG-506: Patch looks good. One minor comment, PlanHelper.LoadStoreFinder may better be PlanHelper.LoadStoreNativeFinder. Does pig need a NATIVE keyword? --- Key: PIG-506 URL: https://issues.apache.org/jira/browse/PIG-506 Project: Pig Issue Type: New Feature Components: impl Reporter: Alan Gates Assignee: Aniket Mokashi Priority: Minor Fix For: 0.8.0 Attachments: NativeImplInitial.patch, NativeMapReduceFinale1.patch, NativeMapReduceFinale2.patch, NativeMapReduceFinale3.patch, PIG-506.2.patch, PIG-506.3.patch, PIG-506.patch, TestWordCount.jar Assume a user had a job that broke easily into three pieces. Further assume that pieces one and three were easily expressible in pig, but that piece two needed to be written in map reduce for whatever reason (performance, something that pig could not easily express, legacy job that was too important to change, etc.). Today the user would either have to use map reduce for the entire job or manually handle the stitching together of pig and map reduce jobs. What if instead pig provided a NATIVE keyword that would allow the script to pass off the data stream to the underlying system (in this case map reduce). The semantics of NATIVE would vary by underlying system. In the map reduce case, we would assume that this indicated a collection of one or more fully contained map reduce jobs, so that pig would store the data, invoke the map reduce jobs, and then read the resulting data to continue. It might look something like this: {code} A = load 'myfile'; X = load 'myotherfile'; B = group A by $0; C = foreach B generate group, myudf(B); D = native (jar=mymr.jar, infile=frompig outfile=topig); E = join D by $0, X by $0; ... 
{code} This differs from streaming in that it allows the user to insert an arbitrary amount of native processing, whereas streaming allows the insertion of one binary. It also differs in that, for streaming, data is piped directly into and out of the binary as part of the pig pipeline. Here the pipeline would be broken, data written to disk, and the native block invoked, then data read back from disk. Another alternative is to say this is unnecessary because the user can do the coordination from java, using the PigServer interface to run pig and calling the map reduce job explicitly. The advantages of the native keyword are that the user need not worry about coordination between the jobs; pig will take care of it. Also, the user can make use of existing java applications without being a java programmer. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1512) PlanPrinter does not print LOJoin operator in the new logical optimization framework
[ https://issues.apache.org/jira/browse/PIG-1512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1512: Status: Resolved (was: Patch Available) Resolution: Fixed This is already fixed in the latest code. Thanks Swati! PlanPrinter does not print LOJoin operator in the new logical optimization framework Key: PIG-1512 URL: https://issues.apache.org/jira/browse/PIG-1512 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Swati Jain Assignee: Swati Jain Fix For: 0.8.0 Attachments: printJoin.patch PlanPrinter does not print LOJoin relational operator. As such, the LOJoin operator would not get printed when we do an explain. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1321) Logical Optimizer: Merge cascading foreach
[ https://issues.apache.org/jira/browse/PIG-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1321: Status: Resolved (was: Patch Available) Hadoop Flags: [Reviewed] Resolution: Fixed Patch committed. Thanks Xuefu! Logical Optimizer: Merge cascading foreach -- Key: PIG-1321 URL: https://issues.apache.org/jira/browse/PIG-1321 Project: Pig Issue Type: Sub-task Components: impl Affects Versions: 0.7.0 Reporter: Daniel Dai Assignee: Xuefu Zhang Fix For: 0.8.0 Attachments: jira-1321-2.patch, jira-1321-3.patch, pig-1321.patch We can merge consecutive foreach statement. Eg: b = foreach a generate a0#'key1' as b0, a0#'key2' as b1, a1; c = foreach b generate b0#'kk1', b0#'kk2', b1, a1; = c = foreach a generate a0#'key1'#'kk1', a0#'key1'#'kk2', a0#'key2', a1; -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1321) Logical Optimizer: Merge cascading foreach
[ https://issues.apache.org/jira/browse/PIG-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1321: Attachment: jira-1321-3.patch Repost the pre-conditions: 1. Two consecutive foreach statements. 2. The second foreach statement has a simple inner plan in which the only statement is a GENERATE statement. In other words, the second foreach statement must be something like FOREACH A GENERATE 3. The first foreach statement cannot contain flatten due to its complexity 4. No output of the first foreach is referred to more than once in the second foreach, eg: B = foreach ; C = foreach B generate $0, $1, $0 will not be merged. The reason is that if we merged, $0 would be calculated twice, which defeats the benefit of merging. All tests pass. test-patch result: [exec] +1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 3 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings. Logical Optimizer: Merge cascading foreach -- Key: PIG-1321 URL: https://issues.apache.org/jira/browse/PIG-1321 Project: Pig Issue Type: Sub-task Components: impl Affects Versions: 0.7.0 Reporter: Daniel Dai Assignee: Xuefu Zhang Fix For: 0.8.0 Attachments: jira-1321-2.patch, jira-1321-3.patch, pig-1321.patch We can merge consecutive foreach statements. Eg: b = foreach a generate a0#'key1' as b0, a0#'key2' as b1, a1; c = foreach b generate b0#'kk1', b0#'kk2', b1, a1; = c = foreach a generate a0#'key1'#'kk1', a0#'key1'#'kk2', a0#'key2', a1; -- This message is automatically generated by JIRA. 
- You can reply to this email to add a comment to the issue online.
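The merge in the hash-lookup example is expression substitution: each input the second foreach reads is replaced by the first foreach's defining expression. A toy Python model (expressions as functions over records; names illustrative):

```python
def compose(first, second):
    """Merge two consecutive foreach projections: substitute the first
    foreach's output expressions into the second's inputs."""
    return [lambda rec, f=f: f(tuple(g(rec) for g in first)) for f in second]

# b = foreach a generate a0#'key1' as b0, a0#'key2' as b1, a1;
first = [lambda a: a[0]['key1'], lambda a: a[0]['key2'], lambda a: a[1]]
# c = foreach b generate b0#'kk1', b0#'kk2', b1, a1;
second = [lambda b: b[0]['kk1'], lambda b: b[0]['kk2'], lambda b: b[1],
          lambda b: b[2]]

merged = compose(first, second)
rec = ({'key1': {'kk1': 1, 'kk2': 2}, 'key2': 3}, 'x')
row = [f(rec) for f in merged]
```

Because substitution duplicates the defining expression wherever it is referenced, a second foreach that reads the same b-column twice would compute that expression twice after merging, which is exactly what pre-condition 4 in the comment above rules out.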
[jira] Updated: (PIG-1515) Migrate logical optimization rule: PushDownForeachFlatten
[ https://issues.apache.org/jira/browse/PIG-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1515: Attachment: jira-1515-2.patch All tests pass. test-patch result: [exec] +1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 3 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings. Migrate logical optimization rule: PushDownForeachFlatten - Key: PIG-1515 URL: https://issues.apache.org/jira/browse/PIG-1515 Project: Pig Issue Type: Sub-task Components: impl Affects Versions: 0.7.0 Reporter: Daniel Dai Assignee: Xuefu Zhang Fix For: 0.8.0 Attachments: jira-1515-1.patch, jira-1515-2.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1515) Migrate logical optimization rule: PushDownForeachFlatten
[ https://issues.apache.org/jira/browse/PIG-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1515: Status: Resolved (was: Patch Available) Hadoop Flags: [Reviewed] Resolution: Fixed Patch committed. Thanks Xuefu! Migrate logical optimization rule: PushDownForeachFlatten - Key: PIG-1515 URL: https://issues.apache.org/jira/browse/PIG-1515 Project: Pig Issue Type: Sub-task Components: impl Affects Versions: 0.7.0 Reporter: Daniel Dai Assignee: Xuefu Zhang Fix For: 0.8.0 Attachments: jira-1515-1.patch, jira-1515-2.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1178) LogicalPlan and Optimizer are too complex and hard to work with
[ https://issues.apache.org/jira/browse/PIG-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1178: Attachment: PIG-1178-8.patch PIG-1178-8.patch fix TestPruneColumn.testMapKey3 LogicalPlan and Optimizer are too complex and hard to work with --- Key: PIG-1178 URL: https://issues.apache.org/jira/browse/PIG-1178 Project: Pig Issue Type: Improvement Reporter: Alan Gates Assignee: Daniel Dai Fix For: 0.8.0 Attachments: expressions-2.patch, expressions.patch, lp.patch, lp.patch, PIG-1178-4.patch, PIG-1178-5.patch, PIG-1178-6.patch, PIG-1178-7.patch, PIG-1178-8.patch, pig_1178.patch, pig_1178.patch, PIG_1178.patch, pig_1178_2.patch, pig_1178_3.2.patch, pig_1178_3.3.patch, pig_1178_3.4.patch, pig_1178_3.patch The current implementation of the logical plan and the logical optimizer in Pig has proven to not be easily extensible. Developer feedback has indicated that adding new rules to the optimizer is quite burdensome. In addition, the logical plan has been an area of numerous bugs, many of which have been difficult to fix. Developers also feel that the logical plan is difficult to understand and maintain. The root cause for these issues is that a number of design decisions that were made as part of the 0.2 rewrite of the front end have now proven to be sub-optimal. The heart of this proposal is to revisit a number of those proposals and rebuild the logical plan with a simpler design that will make it much easier to maintain the logical plan as well as extend the logical optimizer. See http://wiki.apache.org/pig/PigLogicalPlanOptimizerRewrite for full details. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1514) Migrate logical optimization rule: OpLimitOptimizer
[ https://issues.apache.org/jira/browse/PIG-1514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1514: Status: Resolved (was: Patch Available) Hadoop Flags: [Reviewed] Resolution: Fixed Did a combined test-patch with PIG-1497: [exec] -1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 80 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] -1 release audit. The applied patch generated 443 release audit warnings (more than the trunk's current 433 warnings). All new source code has license header except for test benchmarks (new-optlimitplan*.dot) Patch committed. Thanks Xuefu! Migrate logical optimization rule: OpLimitOptimizer --- Key: PIG-1514 URL: https://issues.apache.org/jira/browse/PIG-1514 Project: Pig Issue Type: Sub-task Components: impl Affects Versions: 0.7.0 Reporter: Daniel Dai Assignee: Xuefu Zhang Fix For: 0.8.0 Attachments: jira-1514-0.patch, jira-1514-1.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1497) Mandatory rule PartitionFilterOptimizer
[ https://issues.apache.org/jira/browse/PIG-1497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1497: Status: Resolved (was: Patch Available) Hadoop Flags: [Reviewed] Resolution: Fixed Did a combined test-patch with PIG-1514: [exec] -1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 80 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] -1 release audit. The applied patch generated 443 release audit warnings (more than the trunk's current 433 warnings). All new source code has the license header. Patch committed. Thanks Xuefu! Mandatory rule PartitionFilterOptimizer --- Key: PIG-1497 URL: https://issues.apache.org/jira/browse/PIG-1497 Project: Pig Issue Type: Sub-task Components: impl Affects Versions: 0.8.0 Reporter: Daniel Dai Assignee: Xuefu Zhang Fix For: 0.8.0 Attachments: jira-1497-0.patch Need to migrate PartitionFilterOptimizer to the new logical optimizer. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1514) Migrate logical optimization rule: OpLimitOptimizer
[ https://issues.apache.org/jira/browse/PIG-1514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1514: Status: Patch Available (was: Open) Migrate logical optimization rule: OpLimitOptimizer --- Key: PIG-1514 URL: https://issues.apache.org/jira/browse/PIG-1514 Project: Pig Issue Type: Sub-task Components: impl Affects Versions: 0.7.0 Reporter: Daniel Dai Assignee: Xuefu Zhang Fix For: 0.8.0 Attachments: jira-1514-0.patch, jira-1514-1.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.