Re: can spark take advantage of ordered data?

2017-03-10 Thread Jonathan Coveney
While I was at Two Sigma I ended up implementing something similar to what Koert described... you can check it out here: https://github.com/twosigma/flint/blob/master/src/main/scala/com/twosigma/flint/rdd/OrderedRDD.scala. They've built a lot more on top of this (including support for dataframes et

Re: can spark take advantage of ordered data?

2017-03-10 Thread Yong Zhang
ert to confirm. Yong From: sourabh chaki Sent: Friday, March 10, 2017 9:03 AM To: Imran Rashid Cc: Jonathan Coveney; user@spark.apache.org Subject: Re: can spark take advantage of ordered data? My use case is also quite similar. I have 2 feeds. One 3TB and anothe

Re: can spark take advantage of ordered data?

2017-03-10 Thread Koert Kuipers
this shouldn't be too hard. adding something to spark-sorted or to the dataframe/dataset logical plan that says "trust me, i am already partitioned and sorted" seems doable. however you most likely need a custom hash partitioner, and you have to be careful to read the data in without file splitting
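
For illustration only, a minimal plain-Python sketch of what "already partitioned and sorted" buys you: if both sides of a join arrive sorted by key within matching partitions, each pair of partitions can be joined in one streaming pass with no shuffle or re-sort. This is not Spark or spark-sorted code; the name merge_join and the (key, value) tuple layout are assumptions made for the example.

```python
from typing import Any, Iterable, Iterator, Tuple

Row = Tuple[Any, Any]  # hypothetical layout: (key, value)


def merge_join(left: Iterable[Row], right: Iterable[Row]) -> Iterator[Tuple[Any, Tuple[Any, Any]]]:
    """Inner join of two inputs that are each ALREADY sorted by key.

    Assumption: both inputs are sorted ascending by key. Nothing here
    checks that; unsorted input silently yields wrong results.
    """
    left_it, right_it = iter(left), iter(right)
    l = next(left_it, None)
    r = next(right_it, None)
    while l is not None and r is not None:
        if l[0] < r[0]:
            l = next(left_it, None)      # left key too small, advance left
        elif l[0] > r[0]:
            r = next(right_it, None)     # right key too small, advance right
        else:
            key = l[0]
            # Buffer only the right-side values for the current key.
            right_values = []
            while r is not None and r[0] == key:
                right_values.append(r[1])
                r = next(right_it, None)
            # Pair every left row with this key against the buffered values.
            while l is not None and l[0] == key:
                for rv in right_values:
                    yield (key, (l[1], rv))
                l = next(left_it, None)


big = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
small = [(2, "x"), (3, "y"), (4, "z"), (4, "w")]
joined = list(merge_join(big, small))
```

Only the values for one key on one side are held in memory at a time, which is why skipping file splitting matters: a split in the middle of a partition would break the sorted, co-partitioned assumption the join relies on.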

Re: can spark take advantage of ordered data?

2017-03-10 Thread sourabh chaki
My use case is also quite similar. I have 2 feeds. One 3TB and another 100GB. Both the feeds are generated by hadoop reduce operation and partitioned by hadoop hashpartitioner. 3TB feed has 10K partitions whereas 100GB file has 200 partitions. Now when I do a join between these two feeds using spa
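
A rough sketch of why the two layouts described here may already line up, under stated assumptions. Hadoop's HashPartitioner assigns a record to partition (hashCode & Integer.MAX_VALUE) % numPartitions. Because 200 divides 10000, a key in partition p of the 10K-partition feed should sit in partition p % 200 of the 200-partition feed, provided both jobs hashed the key the same way. The code below assumes keys hash like java.lang.String (Hadoop Text keys use a different byte-based hash), and the function names are made up for the example.

```python
def java_string_hash(s: str) -> int:
    """Emulate java.lang.String.hashCode() as a signed 32-bit int."""
    h = 0
    data = s.encode("utf-16-be")  # Java hashes UTF-16 code units
    for i in range(0, len(data), 2):
        unit = (data[i] << 8) | data[i + 1]
        h = (31 * h + unit) & 0xFFFFFFFF  # wrap like Java int arithmetic
    return h - (1 << 32) if h >= (1 << 31) else h


def hadoop_partition(hash_code: int, num_partitions: int) -> int:
    """Partition index as computed by Hadoop's HashPartitioner."""
    return (hash_code & 0x7FFFFFFF) % num_partitions


# Hypothetical keys, only to exercise the relationship between the layouts.
keys = ["user-%d" % i for i in range(1000)]
aligned = all(
    hadoop_partition(java_string_hash(k), 10000) % 200
    == hadoop_partition(java_string_hash(k), 200)
    for k in keys
)
```

If the real keys are hashed differently, or the partition counts do not divide evenly, this mapping does not hold and a custom partitioner would have to reproduce the original hash exactly.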

Re: can spark take advantage of ordered data?

2015-03-11 Thread Imran Rashid
Hi Jonathan, you might be interested in https://issues.apache.org/jira/browse/SPARK-3655 (not yet available) and https://github.com/tresata/spark-sorted (not part of spark, but it is available right now). Hopefully that's what you are looking for. To the best of my knowledge that covers what is a

RE: can spark take advantage of ordered data?

2015-03-11 Thread java8964
RangePartitioner? At least for join, you can implement your own partitioner, to utilize the sorted data. Just my 2 cents. Date: Wed, 11 Mar 2015 17:38:04 -0400 Subject: can spark take advantage of ordered data? From: jcove...@gmail.com To: User@spark.apache.org Hello all, I am wondering if spark