[jira] [Commented] (PIG-4449) Optimize the case of Order by + Limit in nested foreach
    [ https://issues.apache.org/jira/browse/PIG-4449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16018164#comment-16018164 ]

Rohini Palaniswamy commented on PIG-4449:
-----------------------------------------

PIG-5211 has addressed this problem. The current implementation in PIG-5211 adds to a PriorityQueue and then removes an element if the queue exceeds the size limit. Leaving this jira open to address further optimizations that I had thought of for this issue.

1) Change to using https://google.github.io/guava/releases/11.0/api/docs/com/google/common/collect/TreeMultiset.html . Internally it is a Map that keeps a count of duplicate entries. That should save space. It also allows peeking at the first and last entries. So after reaching the limit we can check whether the new element is greater than the last entry in the case of an ascending sort, or less than the smallest entry in the case of a descending sort, and avoid adding it in the first place.
2) Use https://docs.oracle.com/javase/7/docs/api/java/util/TreeSet.html in case it is a DISTINCT + ORDER BY + LIMIT.
3) Add support for spill. Have seen cases where people do LIMIT 10.

> Optimize the case of Order by + Limit in nested foreach
> -------------------------------------------------------
>
>                 Key: PIG-4449
>                 URL: https://issues.apache.org/jira/browse/PIG-4449
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Rohini Palaniswamy
>              Labels: Performance
>
> This is one of the very frequently used patterns
> {code}
> grouped_data_set = group data_set by id;
> capped_data_set = foreach grouped_data_set
> {
>   ordered = order joined_data_set by timestamp desc;
>   capped = limit ordered $num;
>   generate flatten(capped);
> };
> {code}
> But this performs very poorly, with a lot of spills, when there are millions
> of rows for a key in the group by. This can be easily optimized by pushing
> the limit into the InternalSortedBag and maintaining only $num records at any
> time, which avoids memory pressure.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
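The first optimization in the comment above (a sorted, duplicate-counting structure plus a peek at the current worst entry before inserting) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the PIG-5211 code or a proposed patch: `BoundedSortedCounter` is a hypothetical class name, and a plain `java.util.TreeMap` from element to count stands in for Guava's `TreeMultiset` so the example has no external dependency. It also assumes the comparator covers every field that distinguishes two records, since elements that compare as equal are collapsed into a single counted entry.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/**
 * Hypothetical sketch: keeps only the first `limit` elements in comparator
 * order, storing duplicates as counts (a stand-in for Guava's TreeMultiset).
 */
final class BoundedSortedCounter<T> {
    private final Comparator<? super T> order;
    private final TreeMap<T, Integer> counts;
    private final int limit;
    private int size = 0;

    BoundedSortedCounter(Comparator<? super T> order, int limit) {
        this.order = order;
        this.counts = new TreeMap<>(order);
        this.limit = limit;
    }

    void add(T element) {
        if (limit <= 0) {
            return;
        }
        if (size == limit) {
            // Peek at the current worst (last) entry; if the new element
            // would sort at or after it, skip the insert entirely.
            T worst = counts.lastKey();
            if (order.compare(element, worst) >= 0) {
                return;
            }
            // Otherwise make room by dropping one copy of the worst entry.
            int worstCount = counts.get(worst);
            if (worstCount == 1) {
                counts.remove(worst);
            } else {
                counts.put(worst, worstCount - 1);
            }
            size--;
        }
        counts.merge(element, 1, Integer::sum);
        size++;
    }

    /** Expands the counted entries back into a sorted list. */
    List<T> toList() {
        List<T> out = new ArrayList<>(size);
        for (Map.Entry<T, Integer> entry : counts.entrySet()) {
            for (int i = 0; i < entry.getValue(); i++) {
                out.add(entry.getKey());
            }
        }
        return out;
    }
}
```

For a descending sort, as in the quoted script, the caller would pass a reversed comparator, so that "last" is the smallest retained value. Spill support (item 3) is not covered by this sketch.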
[jira] [Commented] (PIG-5157) Upgrade to Spark 2.0
    [ https://issues.apache.org/jira/browse/PIG-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16017893#comment-16017893 ]

Jeff Zhang commented on PIG-5157:
---------------------------------

A lot of users are still using Spark 1.x, as Spark 2 is incompatible with Spark 1.x, and I don't think Spark 1.x will be dropped any time soon. So I think we should still support Spark 1.x.

Actually, I would suggest making Spark 1.x the only supported version for Pig on Spark, because Pig on Spark is already behind schedule and lots of people are looking forward to it. Adding support for Spark 2 would take more time and effort and may bring in some issues, so I would suggest supporting only Spark 1.x in the first release of Pig on Spark. For users, it is transparent and it is easy to upgrade from Spark 1 to Spark 2. Support for Spark 2 could be added in the next release, maybe along with changing from the RDD API to the DataFrame API.

> Upgrade to Spark 2.0
> --------------------
>
>                 Key: PIG-5157
>                 URL: https://issues.apache.org/jira/browse/PIG-5157
>             Project: Pig
>          Issue Type: Improvement
>          Components: spark
>            Reporter: Nandor Kollar
>            Assignee: Nandor Kollar
>             Fix For: spark-branch
>
>
> Upgrade to Spark 2.0 (or latest)
[jira] [Commented] (PIG-5157) Upgrade to Spark 2.0
    [ https://issues.apache.org/jira/browse/PIG-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16017712#comment-16017712 ]

Rohini Palaniswamy commented on PIG-5157:
-----------------------------------------

I am fine with supporting Spark 2.x only or supporting both versions. This depends on two things:

1) How well Spark 2 is adopted and how many distributions or users are still on Spark 1.x.
2) When the Spark community is planning to deprecate or EOL Spark 1.x.

[~szita] and [~nkollar] might be able to make a better call based on their users.

[~zjffdu], do you know when support for Spark 1.x will be dropped by the Spark community?

> Upgrade to Spark 2.0
> --------------------
>
>                 Key: PIG-5157
>                 URL: https://issues.apache.org/jira/browse/PIG-5157
>             Project: Pig
>          Issue Type: Improvement
>          Components: spark
>            Reporter: Nandor Kollar
>            Assignee: Nandor Kollar
>             Fix For: spark-branch
>
>
> Upgrade to Spark 2.0 (or latest)
[jira] [Updated] (PIG-5236) json simple jar not included automatically while trying to load multiple schema in pig using avro
    [ https://issues.apache.org/jira/browse/PIG-5236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Satish Subhashrao Saley updated PIG-5236:
-----------------------------------------
    Attachment: PIG-5236-2.patch

> json simple jar not included automatically while trying to load multiple schema in pig using avro
> -------------------------------------------------------------------------------------------------
>
>                 Key: PIG-5236
>                 URL: https://issues.apache.org/jira/browse/PIG-5236
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Satish Subhashrao Saley
>            Assignee: Satish Subhashrao Saley
>            Priority: Minor
>         Attachments: PIG-5236-1.patch, PIG-5236-2.patch
>
>
> It would be good to include the json simple jar by default, similar to
> joda-time or pig.jar.
[jira] [Commented] (PIG-5157) Upgrade to Spark 2.0
    [ https://issues.apache.org/jira/browse/PIG-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16017009#comment-16017009 ]

Nandor Kollar commented on PIG-5157:
------------------------------------

We can use reflection, or we can also use shims instead.

> Upgrade to Spark 2.0
> --------------------
>
>                 Key: PIG-5157
>                 URL: https://issues.apache.org/jira/browse/PIG-5157
>             Project: Pig
>          Issue Type: Improvement
>          Components: spark
>            Reporter: Nandor Kollar
>            Assignee: Nandor Kollar
>             Fix For: spark-branch
>
>
> Upgrade to Spark 2.0 (or latest)
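The reflection/shim idea mentioned in the comment above could look roughly like the sketch below. Everything in it is hypothetical and for illustration only: `SparkShim`, `Spark1Shim`, `Spark2Shim` and `ShimLoader` are invented names, not classes from the Pig spark branch. The only library fact relied on is that `Class.forName` can probe for a class without initializing it; a real loader might probe for a class that exists only in Spark 2 (for example `org.apache.spark.sql.SparkSession`, which was introduced in Spark 2.0), and that choice of probe is an assumption here.

```java
/** Hypothetical version-neutral interface that the rest of the code would call. */
interface SparkShim {
    String describe();
}

/** Hypothetical stub; a real one would wrap Spark 1.x specific calls. */
final class Spark1Shim implements SparkShim {
    @Override
    public String describe() {
        return "spark1";
    }
}

/** Hypothetical stub; a real one would wrap Spark 2.x specific calls. */
final class Spark2Shim implements SparkShim {
    @Override
    public String describe() {
        return "spark2";
    }
}

final class ShimLoader {
    private ShimLoader() {
    }

    /** Returns true if the named class can be found, without initializing it. */
    static boolean isClassPresent(String className) {
        try {
            Class.forName(className, false, ShimLoader.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    /**
     * Picks a shim based on whether the probe class is on the classpath.
     * In real code the shims would reference version-specific Spark classes,
     * so they would typically be instantiated by name as well, to avoid
     * linking against the version that is absent.
     */
    static SparkShim load(String probeClass) {
        return isClassPresent(probeClass) ? new Spark2Shim() : new Spark1Shim();
    }
}
```

A caller would use something like `ShimLoader.load("org.apache.spark.sql.SparkSession")` once at startup and route all version-sensitive calls through the returned `SparkShim`.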
[jira] [Commented] (PIG-5157) Upgrade to Spark 2.0
    [ https://issues.apache.org/jira/browse/PIG-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16017007#comment-16017007 ]

liyunzhang_intel commented on PIG-5157:
---------------------------------------

[~rohini], [~xuefuz], [~zjffdu]: Should we support only Spark 2, or both Spark 1.6 and Spark 2? We may be able to use reflection to support both versions (still under investigation). Please give us your opinion. In my view, we would not support Spark 1.6 once we upgrade to Spark 2.0.

> Upgrade to Spark 2.0
> --------------------
>
>                 Key: PIG-5157
>                 URL: https://issues.apache.org/jira/browse/PIG-5157
>             Project: Pig
>          Issue Type: Improvement
>          Components: spark
>            Reporter: Nandor Kollar
>            Assignee: Nandor Kollar
>             Fix For: spark-branch
>
>
> Upgrade to Spark 2.0 (or latest)
[jira] [Commented] (PIG-5207) BugFix e2e tests fail on spark
    [ https://issues.apache.org/jira/browse/PIG-5207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16016999#comment-16016999 ]

liyunzhang_intel commented on PIG-5207:
---------------------------------------

[~rohini]: could you spend some time reviewing the modification to PhysicalPlan.java?

> BugFix e2e tests fail on spark
> ------------------------------
>
>                 Key: PIG-5207
>                 URL: https://issues.apache.org/jira/browse/PIG-5207
>             Project: Pig
>          Issue Type: Sub-task
>          Components: spark
>            Reporter: Adam Szita
>            Assignee: Adam Szita
>             Fix For: spark-branch
>
>         Attachments: PIG-5207.0.patch, PIG-5207.1.patch
>
>
> Observed ClassCastException in BugFix 1 and 2 test cases. The exception is
> thrown from a UDF: COR.Final
[jira] Subscription: PIG patch available
Issue Subscription
Filter: PIG patch available (37 issues)

Subscriber: pigdaily

Key         Summary
PIG-5236    json simple jar not included automatically while trying to load multiple schema in pig using avro
            https://issues.apache.org/jira/browse/PIG-5236
PIG-5225    Several unit tests are not annotated with @Test
            https://issues.apache.org/jira/browse/PIG-5225
PIG-5207    BugFix e2e tests fail on spark
            https://issues.apache.org/jira/browse/PIG-5207
PIG-5194    HiveUDF fails with Spark exec type
            https://issues.apache.org/jira/browse/PIG-5194
PIG-5185    Job name show "DefaultJobName" when running a Python script
            https://issues.apache.org/jira/browse/PIG-5185
PIG-5184    set command to view value of a variable
            https://issues.apache.org/jira/browse/PIG-5184
PIG-5160    SchemaTupleFrontend.java is not thread safe, cause PigServer thrown NPE in multithread env
            https://issues.apache.org/jira/browse/PIG-5160
PIG-5115    Builtin AvroStorage generates incorrect avro schema when the same pig field name appears in the alias
            https://issues.apache.org/jira/browse/PIG-5115
PIG-5106    Optimize when mapreduce.input.fileinputformat.input.dir.recursive set to true
            https://issues.apache.org/jira/browse/PIG-5106
PIG-5081    Can not run pig on spark source code distribution
            https://issues.apache.org/jira/browse/PIG-5081
PIG-5080    Support store alias as spark table
            https://issues.apache.org/jira/browse/PIG-5080
PIG-5057    IndexOutOfBoundsException when pig reducer processOnePackageOutput
            https://issues.apache.org/jira/browse/PIG-5057
PIG-5029    Optimize sort case when data is skewed
            https://issues.apache.org/jira/browse/PIG-5029
PIG-4926    Modify the content of start.xml for spark mode
            https://issues.apache.org/jira/browse/PIG-4926
PIG-4913    Reduce jython function initiation during compilation
            https://issues.apache.org/jira/browse/PIG-4913
PIG-4849    pig on tez will cause tez-ui to crash,because the content from timeline server is too long.
            https://issues.apache.org/jira/browse/PIG-4849
PIG-4750    REPLACE_MULTI should compile Pattern once and reuse it
            https://issues.apache.org/jira/browse/PIG-4750
PIG-4748    DateTimeWritable forgets Chronology
            https://issues.apache.org/jira/browse/PIG-4748
PIG-4745    DataBag should protect content of passed list of tuples
            https://issues.apache.org/jira/browse/PIG-4745
PIG-4684    Exception should be changed to warning when job diagnostics cannot be fetched
            https://issues.apache.org/jira/browse/PIG-4684
PIG-4656    Improve String serialization and comparator performance in BinInterSedes
            https://issues.apache.org/jira/browse/PIG-4656
PIG-4598    Allow user defined plan optimizer rules
            https://issues.apache.org/jira/browse/PIG-4598
PIG-4551    Partition filter is not pushed down in case of SPLIT
            https://issues.apache.org/jira/browse/PIG-4551
PIG-4539    New PigUnit
            https://issues.apache.org/jira/browse/PIG-4539
PIG-4515    org.apache.pig.builtin.Distinct throws ClassCastException
            https://issues.apache.org/jira/browse/PIG-4515
PIG-4323    PackageConverter hanging in Spark
            https://issues.apache.org/jira/browse/PIG-4323
PIG-4313    StackOverflowError in LIMIT operation on Spark
            https://issues.apache.org/jira/browse/PIG-4313
PIG-4251    Pig on Storm
            https://issues.apache.org/jira/browse/PIG-4251
PIG-4002    Disable combiner when map-side aggregation is used
            https://issues.apache.org/jira/browse/PIG-4002
PIG-3952    PigStorage accepts '-tagSplit' to return full split information
            https://issues.apache.org/jira/browse/PIG-3952
PIG-3911    Define unique fields with @OutputSchema
            https://issues.apache.org/jira/browse/PIG-3911
PIG-3877    Getting Geo Latitude/Longitude from Address Lines
            https://issues.apache.org/jira/browse/PIG-3877
PIG-3873    Geo distance calculation using Haversine
            https://issues.apache.org/jira/browse/PIG-3873
PIG-3864    ToDate(userstring, format, timezone) computes DateTime with strange handling of Daylight Saving Time with location based timezones
            https://issues.apache.org/jira/browse/PIG-3864
PIG-3668    COR built-in function when atleast one of the coefficient values is NaN
            https://issues.apache.org/jira/browse/PIG-3668
PIG-3587    add functionality for rolling over dates
            https://issues.apache.org/jira/browse/PIG-3587
PIG-1804    Alow Jython function to implement Algebraic and/or Accumulator interfaces
            https://issues.apache.org/jira/browse/PIG-1804

You may edit this subscription at:
https://issues.apache.org/jira/secure/FilterSubscription!default.jspa?subId=16328=12322384