[ https://issues.apache.org/jira/browse/SPARK-12394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16345598#comment-16345598 ]
Thomas Bünger commented on SPARK-12394:
---------------------------------------

Any news on this issue? Is it really fixed? I also can't find a corresponding pull request.

> Support writing out pre-hash-partitioned data and exploit that in join
> optimizations to avoid shuffle (i.e. bucketing in Hive)
> ------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-12394
>                 URL: https://issues.apache.org/jira/browse/SPARK-12394
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>            Reporter: Reynold Xin
>            Assignee: Nong Li
>            Priority: Major
>             Fix For: 2.0.0
>
>         Attachments: BucketedTables.pdf
>
>
> In many cases users know ahead of time the columns that they will be joining
> or aggregating on. Ideally they should be able to leverage this information
> and pre-shuffle the data so that subsequent queries do not require a shuffle.
> Hive supports this functionality by allowing the user to define buckets,
> which are hash partitions of the data based on some key.
> - Allow the user to specify a set of columns when caching or writing out data
> - Allow the user to specify some parallelism
> - Shuffle the data when writing / caching such that it is distributed by these
> columns
> - When planning/executing a query, use this distribution to avoid another
> shuffle when reading, assuming the join or aggregation is compatible with the
> columns specified
> - Should work with existing save modes: append, overwrite, etc.
> - Should work at least with all Hadoop FS data sources
> - Should work with any data source when caching
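
For readers unfamiliar with bucketing, the idea in the quoted description can be sketched in plain Python. This is only an illustration under stated assumptions, not Spark's or Hive's implementation: the names bucket_id, write_bucketed and bucketed_join are hypothetical, and crc32 stands in for whatever hash function a real engine uses. Rows are hash-partitioned by key once at write time, so a later join on the same key with the same bucket count can match bucket i against bucket i without redistributing any rows.

```python
import zlib
from collections import defaultdict


def bucket_id(key, num_buckets):
    # Deterministic hash of the key (crc32 is a stand-in, chosen only
    # because it is stable across runs, unlike Python's built-in hash()).
    return zlib.crc32(repr(key).encode("utf-8")) % num_buckets


def write_bucketed(rows, key_index, num_buckets):
    # "Shuffle at write time": every row goes to the bucket chosen by its key.
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[bucket_id(row[key_index], num_buckets)].append(row)
    return buckets


def bucketed_join(left_buckets, right_buckets, left_key=0, right_key=0):
    # The layouts are only compatible when both sides use the same
    # bucket count (and, implicitly, the same hash and key).
    if len(left_buckets) != len(right_buckets):
        raise ValueError("bucket counts differ; a re-shuffle would be needed")
    joined = []
    # Equal keys always share a bucket number, so each pair of buckets
    # can be joined independently with a local hash join.
    for left_rows, right_rows in zip(left_buckets, right_buckets):
        index = defaultdict(list)
        for r in right_rows:
            index[r[right_key]].append(r)
        for l in left_rows:
            for r in index.get(l[left_key], []):
                joined.append(l + r)
    return joined


orders = [(1, "book"), (2, "pen"), (1, "lamp"), (3, "desk")]
users = [(1, "ann"), (2, "bob"), (4, "cy")]

order_buckets = write_bucketed(orders, key_index=0, num_buckets=4)
user_buckets = write_bucketed(users, key_index=0, num_buckets=4)
joined = bucketed_join(order_buckets, user_buckets)
```

The bucket-count check mirrors the "assuming the join or aggregation is compatible with the columns specified" condition in the description: if the two sides were written with different keys or bucket counts, the pre-shuffled layout cannot be reused.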