Christopher, can you expand on the co-partitioning support? We have a number of Spark SQL tables (saved in Parquet format) that can all be considered to share a common hash key. Our analytics team wants to do frequent joins across these datasets on that key. It stands to reason that if the data for each key were co-located on the same server across 'tables', shuffles could be minimized and performance could ultimately be much better.
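Concretely, is something like the following the pattern you have in mind at the RDD level? Just a minimal sketch from a spark-shell session; the paths, field layout, and partition count are invented for illustration:

import org.apache.spark.HashPartitioner
import org.apache.spark.SparkContext._

// Two datasets keyed by the shared hash key (paths and fields are
// hypothetical).
val users = sc.textFile("hdfs:///data/users")
  .map(_.split('\t')).map(f => (f(0), f(1)))
val events = sc.textFile("hdfs:///data/events")
  .map(_.split('\t')).map(f => (f(0), f(1)))

// Partition both RDDs with the same partitioner and cache the results;
// a join of copartitioned RDDs is a narrow dependency, so no shuffle.
val part = new HashPartitioner(64)
val usersByKey  = users.partitionBy(part).cache()
val eventsByKey = events.partitionBy(part).cache()

val joined = usersByKey.join(eventsByKey)
joined.take(10).foreach(println)

If I understand correctly, though, this only helps within a single application while the partitioned RDDs stay cached; it says nothing about where HDFS physically places the Parquet blocks, which is the part the JIRA below seems to address.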
From reading the HDFS issue I posted before, the way is being paved for
implementing this type of behavior, though I believe there are a lot of
complications in making it work.

On Tue, Aug 26, 2014 at 10:40 AM, Christopher Nguyen <c...@adatao.com> wrote:

> Gary, do you mean Spark and HDFS separately, or Spark's use of HDFS?
>
> If the former, Spark does support copartitioning.
>
> If the latter, it's an HDFS scope that's outside of Spark. On that note,
> Hadoop does also make attempts to collocate data, e.g., rack awareness. I'm
> sure the paper makes useful contributions for its set of use cases.
>
> Sent while mobile. Pls excuse typos etc.
>
> On Aug 26, 2014 5:21 AM, "Gary Malouf" <malouf.g...@gmail.com> wrote:
>
>> It appears support for this type of control over block placement is going
>> out in the next version of HDFS:
>> https://issues.apache.org/jira/browse/HDFS-2576
>>
>> On Tue, Aug 26, 2014 at 7:43 AM, Gary Malouf <malouf.g...@gmail.com>
>> wrote:
>>
>> > One of my colleagues has been questioning me as to why Spark/HDFS makes
>> > no attempts to try to co-locate related data blocks. He pointed to this
>> > paper: http://www.vldb.org/pvldb/vol4/p575-eltabakh.pdf from 2011 on
>> > the CoHadoop research and the performance improvements it yielded for
>> > Map/Reduce jobs.
>> >
>> > Would leveraging these ideas for writing data from Spark make sense/be
>> > worthwhile?