Hi Michael,

I think once that work lands in HDFS, it will be great to expose the
functionality via Spark.  This is worth pursuing because it could deliver
order-of-magnitude performance improvements in cases where people need to
join data.
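
If I'm reading HDFS-2576 right, it adds a create() overload on
DistributedFileSystem that takes a list of "favored" datanodes, which is the
hook a Spark writer would need. A rough sketch of what using it might look
like once it ships (the paths, hosts, and sizes below are made up, and the
exact signature may differ by Hadoop version):

import java.net.InetSocketAddress
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.fs.permission.FsPermission
import org.apache.hadoop.hdfs.DistributedFileSystem

val conf = new Configuration()
val fs = new Path("hdfs://namenode:8020/").getFileSystem(conf)
  .asInstanceOf[DistributedFileSystem]

// Ask HDFS to place this file's replicas on the same datanodes that
// already hold the related table, so a join can read both locally.
// (Hostnames and port are hypothetical; favored nodes are a hint to the
// namenode, not a guarantee.)
val favoredNodes = Array(
  new InetSocketAddress("datanode-1", 50010),
  new InetSocketAddress("datanode-2", 50010))

val out = fs.create(
  new Path("/warehouse/events/part-00000"),  // hypothetical path
  FsPermission.getFileDefault,               // default permissions
  true,                                      // overwrite
  4096,                                      // buffer size
  3.toShort,                                 // replication
  128L * 1024 * 1024,                        // block size
  null,                                      // no progress callback
  favoredNodes)
out.close()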

The second item would also be very interesting; it could yield significant
performance boosts on its own.
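
In the meantime, for anyone following the thread, the RDD-level
copartitioning Christopher mentioned already gets you most of that second
item today. A minimal spark-shell sketch (the dataset names, key values, and
partition count are made up):

import org.apache.spark.HashPartitioner
import org.apache.spark.SparkContext._  // pair-RDD implicits (Spark 1.x)

val partitioner = new HashPartitioner(8)

// Hash-partition both datasets by the shared key once and cache them.
// After partitionBy, each RDD carries the partitioner in rdd.partitioner.
val users  = sc.parallelize(Seq((1, "alice"), (2, "bob")))
  .partitionBy(partitioner).cache()
val events = sc.parallelize(Seq((1, "click"), (2, "view")))
  .partitionBy(partitioner).cache()

// Both sides report the same partitioner, so the join is planned with
// narrow dependencies: partition i of the result reads only partition i
// of each parent, and no shuffle stage is scheduled for the join itself.
val joined = users.join(events)
joined.collect().foreach(println)

The up-front partitionBy is of course still a shuffle; the payoff comes when
the partitioned data is reused across many joins, which is exactly our
analytics team's access pattern.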

Best,

Gary


On Tue, Aug 26, 2014 at 6:50 PM, Michael Armbrust <mich...@databricks.com>
wrote:

> It seems like there are two things here:
>  - Co-locating blocks with the same keys to avoid network transfer.
>  - Leveraging partitioning information to avoid a shuffle when data is
> already partitioned correctly (even if those partitions aren't yet on the
> same machine).
>
> The former seems more complicated and probably requires the Hadoop support
> you linked to.  However, the latter might be easier as there is
> already a framework for reasoning about partitioning and the need to
> shuffle in the Spark SQL planner.
>
>
> On Tue, Aug 26, 2014 at 8:37 AM, Gary Malouf <malouf.g...@gmail.com>
> wrote:
>
>> Christopher, can you expand on the co-partitioning support?
>>
>> We have a number of Spark SQL tables (saved in Parquet format) that can
>> all be considered to share a common hash key.  Our analytics team wants
>> to do frequent joins across these different data-sets based on this key.
>> It makes sense that if the data for each key across 'tables' were
>> co-located on the same server, shuffles could be minimized and ultimately
>> performance could be much better.
>>
>> From reading the HDFS issue I posted before, the way is being paved for
>> implementing this type of behavior, though I believe there are a lot of
>> complications in making it work.
>>
>>
>> On Tue, Aug 26, 2014 at 10:40 AM, Christopher Nguyen <c...@adatao.com>
>> wrote:
>>
>> > Gary, do you mean Spark and HDFS separately, or Spark's use of HDFS?
>> >
>> > If the former, Spark does support copartitioning.
>> >
>> > If the latter, it's an HDFS concern that's outside Spark's scope. On
>> > that note, Hadoop does also make attempts to co-locate data, e.g., rack
>> > awareness. I'm sure the paper makes useful contributions for its set of
>> > use cases.
>> >
>> > Sent while mobile. Pls excuse typos etc.
>> > On Aug 26, 2014 5:21 AM, "Gary Malouf" <malouf.g...@gmail.com> wrote:
>> >
>> >> It appears support for this type of control over block placement is
>> >> going out in the next version of HDFS:
>> >> https://issues.apache.org/jira/browse/HDFS-2576
>> >>
>> >>
>> >> On Tue, Aug 26, 2014 at 7:43 AM, Gary Malouf <malouf.g...@gmail.com>
>> >> wrote:
>> >>
>> >> > One of my colleagues has been questioning me as to why Spark/HDFS
>> >> > makes no attempt to co-locate related data blocks.  He pointed to
>> >> > this paper: http://www.vldb.org/pvldb/vol4/p575-eltabakh.pdf from
>> >> > 2011 on the CoHadoop research and the performance improvements it
>> >> > yielded for Map/Reduce jobs.
>> >> >
>> >> > Would leveraging these ideas for writing data from Spark make
>> >> > sense/be worthwhile?
>> >> >
>> >> >
>> >> >
>> >>
>> >
>>
>
>
