Christopher, can you expand on the co-partitioning support?

We have a number of Spark SQL tables (saved in Parquet format) that can
all be considered to share a common hash key.  Our analytics team wants to
do frequent joins across these different datasets on that key.  It stands
to reason that if the data for each key were co-located on the same server
across 'tables', shuffles could be minimized and performance could
ultimately be much better.
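
For illustration, here is roughly the shape we are after on the Spark side
(the table names, paths, and key extraction below are made up). If both
RDDs are partitioned with the same HashPartitioner before the join, the
join itself runs without a shuffle:

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object CopartitionedJoin {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("copartitioned-join"))

    // Use the same partitioner (same hash, same partition count) on both sides.
    val byKey = new HashPartitioner(128)

    // Stand-ins for two of our tables, keyed on the shared hash key.
    val events = sc.textFile("hdfs:///warehouse/events")
      .map(line => (line.split(',')(0), line))
      .partitionBy(byKey)

    val profiles = sc.textFile("hdfs:///warehouse/profiles")
      .map(line => (line.split(',')(0), line))
      .partitionBy(byKey)

    // Both RDDs share a partitioner, so the join is shuffle-free: the rows
    // for any given key already sit in the same partition number on each side.
    val joined = events.join(profiles)
    println(joined.count())

    sc.stop()
  }
}

The catch is that this only avoids the shuffle at join time; nothing pins
the matching partitions of the two tables to the same servers across runs,
which is where the HDFS side of this would come in.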

From reading the HDFS issue I posted earlier, the way is being paved for
implementing this type of behavior, though I believe there are still a lot
of complications to work through.
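
If I'm reading that issue right, the hook it adds is a "favored nodes"
hint on file creation, so whatever writes the tables could pass the same
favored datanodes for the files holding a given key range. A rough,
untested sketch (the host names, ports, and paths are made up, and this
assumes the favored-nodes create() overload from that issue):

import java.net.InetSocketAddress
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.fs.permission.FsPermission
import org.apache.hadoop.hdfs.DistributedFileSystem

val conf = new Configuration()
val fs = new Path("hdfs:///warehouse").getFileSystem(conf)
  .asInstanceOf[DistributedFileSystem]

// Ask the namenode to place this file's blocks on the same datanodes we
// used for the matching file of the other table.
val favored = Array(new InetSocketAddress("datanode-17", 50010),
                    new InetSocketAddress("datanode-23", 50010))

val out = fs.create(new Path("/warehouse/events/part-00042"),
  FsPermission.getFileDefault, true, 4096, 3.toShort,
  128L * 1024 * 1024, null, favored)
out.close()

Keeping that placement stable as the cluster rebalances, and getting
anything above the raw FileSystem API (Spark included) to pass the hint,
are the kinds of complications I mean.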


On Tue, Aug 26, 2014 at 10:40 AM, Christopher Nguyen <c...@adatao.com> wrote:

> Gary, do you mean Spark and HDFS separately, or Spark's use of HDFS?
>
> If the former, Spark does support copartitioning.
>
> If the latter, that's an HDFS concern outside of Spark's scope. On that
> note, Hadoop does also make attempts to co-locate data, e.g., rack
> awareness. I'm sure the paper makes useful contributions for its set of
> use cases.
>
> Sent while mobile. Pls excuse typos etc.
> On Aug 26, 2014 5:21 AM, "Gary Malouf" <malouf.g...@gmail.com> wrote:
>
>> It appears support for this type of control over block placement is going
>> out in the next version of HDFS:
>> https://issues.apache.org/jira/browse/HDFS-2576
>>
>>
>> On Tue, Aug 26, 2014 at 7:43 AM, Gary Malouf <malouf.g...@gmail.com>
>> wrote:
>>
>> > One of my colleagues has been questioning me as to why Spark/HDFS
>> > makes no attempt to co-locate related data blocks.  He pointed to
>> > this 2011 paper on the CoHadoop research and the performance
>> > improvements it yielded for MapReduce jobs:
>> > http://www.vldb.org/pvldb/vol4/p575-eltabakh.pdf
>> >
>> > Would leveraging these ideas for writing data from Spark make sense/be
>> > worthwhile?
