[jira] [Commented] (PIG-4449) Optimize the case of Order by + Limit in nested foreach

2017-05-19 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16018164#comment-16018164
 ] 

Rohini Palaniswamy commented on PIG-4449:
-

PIG-5211 has addressed this problem. The current implementation in PIG-5211 
adds to a PriorityQueue and then removes an element if the queue exceeds the 
size limit. I am leaving this jira open to address further optimizations that 
I had thought of for this issue.
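
For readers unfamiliar with the add-then-evict pattern described above, here is 
a minimal sketch of it. It is not the PIG-5211 code: the class name 
AddThenEvictTopN and its method are hypothetical, and it only assumes a 
java.util.PriorityQueue ordered so that its head is the element that sorts last.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical helper for illustration only, not Pig's actual class.
final class AddThenEvictTopN {
    // Returns the first `limit` elements of `input` according to `order`.
    static <T> List<T> topN(Iterable<T> input, int limit, Comparator<T> order) {
        // Reverse the comparator so the heap's head is the element that sorts LAST.
        PriorityQueue<T> heap = new PriorityQueue<>(limit + 1, order.reversed());
        for (T item : input) {
            heap.add(item);          // always add first ...
            if (heap.size() > limit) {
                heap.poll();         // ... then evict the worst element if over the limit
            }
        }
        List<T> result = new ArrayList<>(heap);
        result.sort(order);          // heap iteration order is unspecified, so sort at the end
        return result;
    }
}
```

Memory stays bounded by the limit, but every element still pays for one heap 
insertion, which is what the optimizations below try to avoid.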

1) Change to using 
https://google.github.io/guava/releases/11.0/api/docs/com/google/common/collect/TreeMultiset.html
 . Internally it is a Map which keeps a count of duplicate entries, which 
should save space. It also allows peeking at the first and last entries. So 
after reaching the limit we can check whether the new element to be added is 
greater than the last entry in case of ascending sort, or less than the 
smallest entry in case of descending sort, and avoid adding it in the first 
place.
2) Use https://docs.oracle.com/javase/7/docs/api/java/util/TreeSet.html in case 
it is a DISTINCT + ORDER BY + LIMIT.
3) Add support for spill. I have seen cases where people do LIMIT 10.
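
The peek-before-add idea in 1) could look roughly like the sketch below. To 
keep it self-contained it uses java.util.TreeMap with counts as a stand-in for 
Guava's TreeMultiset; the class name BoundedSortedCounts and its methods are 
hypothetical. Descending sort is handled by passing a reversed comparator, so 
"last entry" always means the entry that sorts last in the requested order. 
Like a multiset, it assumes that elements the comparator treats as equal are 
interchangeable.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: a bounded sorted bag that stores each distinct value once with a count.
final class BoundedSortedCounts<T> {
    private final Comparator<T> order;
    private final TreeMap<T, Integer> counts;
    private final int limit;
    private int size = 0;   // total number of elements, duplicates included

    BoundedSortedCounts(int limit, Comparator<T> order) {
        this.order = order;
        this.counts = new TreeMap<>(order);
        this.limit = limit;
    }

    void add(T item) {
        if (limit <= 0) {
            return;
        }
        if (size == limit) {
            T last = counts.lastKey();
            // Peek: an item sorting at or after the current last entry cannot make the cut.
            if (order.compare(item, last) >= 0) {
                return;
            }
            // Otherwise evict one occurrence of the last entry to make room.
            int c = counts.get(last);
            if (c == 1) {
                counts.remove(last);
            } else {
                counts.put(last, c - 1);
            }
            size--;
        }
        counts.merge(item, 1, Integer::sum);
        size++;
    }

    // Expands the counts back into a list in sorted order.
    List<T> toSortedList() {
        List<T> out = new ArrayList<>(size);
        for (Map.Entry<T, Integer> e : counts.entrySet()) {
            for (int i = 0; i < e.getValue(); i++) {
                out.add(e.getKey());
            }
        }
        return out;
    }
}
```

For the DISTINCT case in 2), the same check against the last entry would apply 
to a TreeSet, with no counts needed.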

> Optimize the case of Order by + Limit in nested foreach
> ---
>
> Key: PIG-4449
> URL: https://issues.apache.org/jira/browse/PIG-4449
> Project: Pig
>  Issue Type: Improvement
>Reporter: Rohini Palaniswamy
>  Labels: Performance
>
> This is one of the very frequently used patterns
> {code}
> grouped_data_set = group data_set by id;
> capped_data_set = foreach grouped_data_set
> {
>   ordered = order data_set by timestamp desc;
>   capped = limit ordered $num;
>   generate flatten(capped);
> };
> {code}
> But this performs very poorly when there are millions of rows for a key in 
> the group-by, with a lot of spills. This can be easily optimized by pushing 
> the limit into the InternalSortedBag and maintaining only $num records at 
> any time, which avoids memory pressure.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (PIG-5157) Upgrade to Spark 2.0

2017-05-19 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16017893#comment-16017893
 ] 

Jeff Zhang commented on PIG-5157:
-

A lot of users are still using Spark 1.x, as Spark 2 is incompatible with 
Spark 1.x, and I don't think Spark 1.x will be dropped any time soon. So I 
think we should still support Spark 1.x. Actually, I would suggest using 
Spark 1.x as the only supported version for Pig on Spark. Pig on Spark is 
already behind schedule, and lots of people are looking forward to it. Adding 
support for Spark 2 would take more time and effort and may bring in some 
issues, so I would suggest supporting only Spark 1.x in the first release of 
Pig on Spark. For users, it is transparent and it is easy to upgrade from 
Spark 1 to Spark 2.

Support for Spark 2 could be added in the next release, maybe along with 
changing from the RDD API to the DataFrame API.

> Upgrade to Spark 2.0
> 
>
> Key: PIG-5157
> URL: https://issues.apache.org/jira/browse/PIG-5157
> Project: Pig
>  Issue Type: Improvement
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: spark-branch
>
>
> Upgrade to Spark 2.0 (or latest)





[jira] [Commented] (PIG-5157) Upgrade to Spark 2.0

2017-05-19 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16017712#comment-16017712
 ] 

Rohini Palaniswamy commented on PIG-5157:
-

I am fine with supporting Spark 2.x or supporting both versions.

This depends on two things:
1) How well Spark 2 is adopted and how many distributions or users are 
still on Spark 1.x.
2) When the Spark community is planning to deprecate or EOL Spark 1.x.

[~szita] and [~nkollar] might be able to make a better call based on their 
users. 

[~zjffdu], 
   Do you know when support for Spark 1.x will be dropped by the Spark 
community?

> Upgrade to Spark 2.0
> 
>
> Key: PIG-5157
> URL: https://issues.apache.org/jira/browse/PIG-5157
> Project: Pig
>  Issue Type: Improvement
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: spark-branch
>
>
> Upgrade to Spark 2.0 (or latest)





[jira] [Updated] (PIG-5236) json simple jar not included automatically while trying to load multiple schema in pig using avro

2017-05-19 Thread Satish Subhashrao Saley (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Satish Subhashrao Saley updated PIG-5236:
-
Attachment: PIG-5236-2.patch

> json simple jar not included automatically while trying to load multiple 
> schema in pig using avro
> -
>
> Key: PIG-5236
> URL: https://issues.apache.org/jira/browse/PIG-5236
> Project: Pig
>  Issue Type: Bug
>Reporter: Satish Subhashrao Saley
>Assignee: Satish Subhashrao Saley
>Priority: Minor
> Attachments: PIG-5236-1.patch, PIG-5236-2.patch
>
>
> It would be good to include json simple jar by default similar to joda-time 
> or pig.jar





[jira] [Commented] (PIG-5157) Upgrade to Spark 2.0

2017-05-19 Thread Nandor Kollar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16017009#comment-16017009
 ] 

Nandor Kollar commented on PIG-5157:


We can use reflection, or we can also use shims instead.
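
As a rough illustration of how reflection and shims can combine, the sketch 
below picks one of two shim implementations depending on whether a marker 
class can be loaded at runtime. All names here (VersionShim, ShimLoader, the 
marker class passed in) are hypothetical and are not taken from Pig or Spark; 
which class would distinguish Spark 1.x from Spark 2 is left open.

```java
import java.util.function.Supplier;

// Hypothetical shim interface: one implementation per supported version.
interface VersionShim {
    String describe();
}

// Hypothetical loader that chooses a shim by probing the classpath via reflection.
final class ShimLoader {
    // Returns true if the named class can be loaded, without initializing it.
    static boolean isPresent(String className) {
        try {
            Class.forName(className, false, ShimLoader.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    // Uses `ifPresent` when the marker class exists on the classpath, `otherwise` if not.
    static VersionShim select(String markerClass,
                              Supplier<VersionShim> ifPresent,
                              Supplier<VersionShim> otherwise) {
        return isPresent(markerClass) ? ifPresent.get() : otherwise.get();
    }
}
```

The suppliers keep each version-specific implementation from being 
instantiated unless it is actually selected.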

> Upgrade to Spark 2.0
> 
>
> Key: PIG-5157
> URL: https://issues.apache.org/jira/browse/PIG-5157
> Project: Pig
>  Issue Type: Improvement
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: spark-branch
>
>
> Upgrade to Spark 2.0 (or latest)





[jira] [Commented] (PIG-5157) Upgrade to Spark 2.0

2017-05-19 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16017007#comment-16017007
 ] 

liyunzhang_intel commented on PIG-5157:
---

[~rohini],[~xuefuz],[~zjffdu]: Should we support only Spark 2, or support both 
Spark 1.6 and Spark 2? We may be able to use reflection to support both 
versions (still under investigation). Please give us your opinion. In my 
view, we should not support Spark 1.6 if we upgrade to Spark 2.0.

> Upgrade to Spark 2.0
> 
>
> Key: PIG-5157
> URL: https://issues.apache.org/jira/browse/PIG-5157
> Project: Pig
>  Issue Type: Improvement
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: spark-branch
>
>
> Upgrade to Spark 2.0 (or latest)





[jira] [Commented] (PIG-5207) BugFix e2e tests fail on spark

2017-05-19 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16016999#comment-16016999
 ] 

liyunzhang_intel commented on PIG-5207:
---

[~rohini]: could you spend some time reviewing the modification to 
PhysicalPlan.java?

> BugFix e2e tests fail on spark
> --
>
> Key: PIG-5207
> URL: https://issues.apache.org/jira/browse/PIG-5207
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Adam Szita
>Assignee: Adam Szita
> Fix For: spark-branch
>
> Attachments: PIG-5207.0.patch, PIG-5207.1.patch
>
>
> Observed ClassCastException in BugFix 1 and 2 test cases. The exception is 
> thrown from a UDF: COR.Final





[jira] Subscription: PIG patch available

2017-05-19 Thread jira
Issue Subscription
Filter: PIG patch available (37 issues)

Subscriber: pigdaily

Key       Summary
PIG-5236  json simple jar not included automatically while trying to load multiple schema in pig using avro
https://issues.apache.org/jira/browse/PIG-5236
PIG-5225  Several unit tests are not annotated with @Test
https://issues.apache.org/jira/browse/PIG-5225
PIG-5207  BugFix e2e tests fail on spark
https://issues.apache.org/jira/browse/PIG-5207
PIG-5194  HiveUDF fails with Spark exec type
https://issues.apache.org/jira/browse/PIG-5194
PIG-5185  Job name show "DefaultJobName" when running a Python script
https://issues.apache.org/jira/browse/PIG-5185
PIG-5184  set command to view value of a variable
https://issues.apache.org/jira/browse/PIG-5184
PIG-5160  SchemaTupleFrontend.java is not thread safe, cause PigServer thrown NPE in multithread env
https://issues.apache.org/jira/browse/PIG-5160
PIG-5115  Builtin AvroStorage generates incorrect avro schema when the same pig field name appears in the alias
https://issues.apache.org/jira/browse/PIG-5115
PIG-5106  Optimize when mapreduce.input.fileinputformat.input.dir.recursive set to true
https://issues.apache.org/jira/browse/PIG-5106
PIG-5081  Can not run pig on spark source code distribution
https://issues.apache.org/jira/browse/PIG-5081
PIG-5080  Support store alias as spark table
https://issues.apache.org/jira/browse/PIG-5080
PIG-5057  IndexOutOfBoundsException when pig reducer processOnePackageOutput
https://issues.apache.org/jira/browse/PIG-5057
PIG-5029  Optimize sort case when data is skewed
https://issues.apache.org/jira/browse/PIG-5029
PIG-4926  Modify the content of start.xml for spark mode
https://issues.apache.org/jira/browse/PIG-4926
PIG-4913  Reduce jython function initiation during compilation
https://issues.apache.org/jira/browse/PIG-4913
PIG-4849  pig on tez will cause tez-ui to crash,because the content from timeline server is too long.
https://issues.apache.org/jira/browse/PIG-4849
PIG-4750  REPLACE_MULTI should compile Pattern once and reuse it
https://issues.apache.org/jira/browse/PIG-4750
PIG-4748  DateTimeWritable forgets Chronology
https://issues.apache.org/jira/browse/PIG-4748
PIG-4745  DataBag should protect content of passed list of tuples
https://issues.apache.org/jira/browse/PIG-4745
PIG-4684  Exception should be changed to warning when job diagnostics cannot be fetched
https://issues.apache.org/jira/browse/PIG-4684
PIG-4656  Improve String serialization and comparator performance in BinInterSedes
https://issues.apache.org/jira/browse/PIG-4656
PIG-4598  Allow user defined plan optimizer rules
https://issues.apache.org/jira/browse/PIG-4598
PIG-4551  Partition filter is not pushed down in case of SPLIT
https://issues.apache.org/jira/browse/PIG-4551
PIG-4539  New PigUnit
https://issues.apache.org/jira/browse/PIG-4539
PIG-4515  org.apache.pig.builtin.Distinct throws ClassCastException
https://issues.apache.org/jira/browse/PIG-4515
PIG-4323  PackageConverter hanging in Spark
https://issues.apache.org/jira/browse/PIG-4323
PIG-4313  StackOverflowError in LIMIT operation on Spark
https://issues.apache.org/jira/browse/PIG-4313
PIG-4251  Pig on Storm
https://issues.apache.org/jira/browse/PIG-4251
PIG-4002  Disable combiner when map-side aggregation is used
https://issues.apache.org/jira/browse/PIG-4002
PIG-3952  PigStorage accepts '-tagSplit' to return full split information
https://issues.apache.org/jira/browse/PIG-3952
PIG-3911  Define unique fields with @OutputSchema
https://issues.apache.org/jira/browse/PIG-3911
PIG-3877  Getting Geo Latitude/Longitude from Address Lines
https://issues.apache.org/jira/browse/PIG-3877
PIG-3873  Geo distance calculation using Haversine
https://issues.apache.org/jira/browse/PIG-3873
PIG-3864  ToDate(userstring, format, timezone) computes DateTime with strange handling of Daylight Saving Time with location based timezones
https://issues.apache.org/jira/browse/PIG-3864
PIG-3668  COR built-in function when atleast one of the coefficient values is NaN
https://issues.apache.org/jira/browse/PIG-3668
PIG-3587  add functionality for rolling over dates
https://issues.apache.org/jira/browse/PIG-3587
PIG-1804  Alow Jython function to implement Algebraic and/or Accumulator interfaces
https://issues.apache.org/jira/browse/PIG-1804

You may edit this subscription at:
https://issues.apache.org/jira/secure/FilterSubscription!default.jspa?subId=16328=12322384