While I was at Two Sigma I ended up implementing something similar to what
Koert described... you can check it out here:
https://github.com/twosigma/flint/blob/master/src/main/scala/com/twosigma/flint/rdd/OrderedRDD.scala.
They've built a lot more on top of this (including support for dataframes
et
ert to confirm.
Yong
From: sourabh chaki
Sent: Friday, March 10, 2017 9:03 AM
To: Imran Rashid
Cc: Jonathan Coveney; user@spark.apache.org
Subject: Re: can spark take advantage of ordered data?
My use case is also quite similar. I have 2 feeds. One 3TB and anothe
This shouldn't be too hard. Adding something to spark-sorted or to the
dataframe/dataset logical plan that says "trust me, I am already
partitioned and sorted" seems doable. However, you most likely need a custom
hash partitioner, and you have to be careful to read the data in without
file splitting.
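
To illustrate why a custom hash partitioner is probably needed: as far as I
know, Hadoop's HashPartitioner clears the sign bit of the key's hash code
before taking the remainder, while Spark's HashPartitioner takes a
non-negative modulo of the signed hash code. The two agree for non-negative
hash codes but can disagree for negative ones. Below is a minimal pure-Python
sketch of that arithmetic. The function names are hypothetical, and it assumes
the keys are Java Strings hashed with String.hashCode(); other key types
(Hadoop Text, for example) hash differently.

```python
def to_int32(x: int) -> int:
    """Wrap a Python int to a signed 32-bit value, like Java int arithmetic."""
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x >= 0x80000000 else x


def java_string_hash(s: str) -> int:
    """Mimic java.lang.String.hashCode() (assumes characters in the BMP)."""
    h = 0
    for ch in s:
        h = to_int32(31 * h + ord(ch))
    return h


def hadoop_partition(hash_code: int, num_partitions: int) -> int:
    """Hadoop-style index: clear the sign bit, then take the remainder."""
    return (hash_code & 0x7FFFFFFF) % num_partitions


def spark_partition(hash_code: int, num_partitions: int) -> int:
    """Spark-style index: non-negative modulo of the signed hash code.

    Python's % already returns a non-negative result for a positive modulus.
    """
    return hash_code % num_partitions


# Non-negative hash codes land in the same partition under both schemes.
h = java_string_hash("hello")
same = hadoop_partition(h, 200) == spark_partition(h, 200)

# A negative hash code (here -1) does not.
differs = hadoop_partition(-1, 200) != spark_partition(-1, 200)
```

So a Spark partitioner that is meant to say "trust me, this is already
partitioned" for Hadoop-written data would have to reproduce the Hadoop-style
index, not the Spark default.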
My use case is also quite similar. I have 2 feeds, one 3TB and another
100GB. Both feeds are generated by a Hadoop reduce operation and
partitioned by the Hadoop HashPartitioner. The 3TB feed has 10K partitions,
whereas the 100GB feed has 200 partitions.
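
One property of these numbers is worth spelling out. Because 200 divides
10,000, a key that sits in partition p of the 10K-partition feed must sit in
partition p % 200 of the 200-partition feed, provided both feeds were hashed
on the same key with the same hash function (an assumption, not something
stated above). A small sketch of that check, with hypothetical names:

```python
import random

BIG_PARTITIONS = 10_000   # partitions in the 3TB feed
SMALL_PARTITIONS = 200    # partitions in the 100GB feed


def partition(hash_code: int, num_partitions: int) -> int:
    """Hadoop-style index: clear the sign bit, then take the remainder."""
    return (hash_code & 0x7FFFFFFF) % num_partitions


def small_partition_for(big_partition_index: int) -> int:
    """Which small-feed partition can hold keys from this big-feed partition."""
    return big_partition_index % SMALL_PARTITIONS


# Check the claim on random signed 32-bit hash codes (fixed seed).
rng = random.Random(0)
all_aligned = all(
    small_partition_for(partition(h, BIG_PARTITIONS)) == partition(h, SMALL_PARTITIONS)
    for h in (rng.randint(-2**31, 2**31 - 1) for _ in range(10_000))
)

# Each small partition lines up with this many big partitions.
big_per_small = BIG_PARTITIONS // SMALL_PARTITIONS
```

Under that assumption, each of the 200 small partitions only ever needs to be
matched against 50 specific partitions of the large feed.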
Now when I do a join between these two feeds using spa
Hi Jonathan,
you might be interested in https://issues.apache.org/jira/browse/SPARK-3655
(not yet available) and https://github.com/tresata/spark-sorted (not part
of spark, but it is available right now). Hopefully that's what you are
looking for. To the best of my knowledge that covers what is a
RangePartitioner?
At least for joins, you can implement your own partitioner to utilize the
sorted data.
Just my 2 cents.
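
For what "utilize the sorted data" could look like inside a single pair of
co-located partitions, here is a minimal sketch of a sort-merge inner join in
plain Python. It is not Spark's API and not how spark-sorted or Flint
implement it; merge_join is a hypothetical name, and both inputs are assumed
to be already sorted by key.

```python
from itertools import groupby
from operator import itemgetter


def merge_join(left, right):
    """Inner-join two iterables of (key, value) pairs, each sorted by key.

    Walks both inputs once and buffers only the right-hand values of the key
    currently being matched.
    """
    left_groups = groupby(left, key=itemgetter(0))
    right_groups = groupby(right, key=itemgetter(0))
    done = (None, None)
    lk, lg = next(left_groups, done)
    rk, rg = next(right_groups, done)
    while lg is not None and rg is not None:
        if lk < rk:
            lk, lg = next(left_groups, done)      # left key has no match
        elif lk > rk:
            rk, rg = next(right_groups, done)     # right key has no match
        else:
            right_values = [v for _, v in rg]     # buffer one key's values
            for _, lv in lg:
                for rv in right_values:
                    yield lk, (lv, rv)
            lk, lg = next(left_groups, done)
            rk, rg = next(right_groups, done)


left = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
right = [(2, "x"), (3, "y"), (4, "z"), (4, "w")]
joined = list(merge_join(left, right))
```

The point of the sketch is that neither side has to be re-sorted or loaded
into a hash table when the ordering can be trusted.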
Date: Wed, 11 Mar 2015 17:38:04 -0400
Subject: can spark take advantage of ordered data?
From: jcove...@gmail.com
To: User@spark.apache.org
Hello all,
I am wondering if spark