[jira] [Commented] (SPARK-11410) Add a DataFrame API that provides functionality similar to HiveQL's DISTRIBUTE BY

Yin Huai (JIRA) Sun, 13 Dec 2015 14:38:09 -0800

    [ 
https://issues.apache.org/jira/browse/SPARK-11410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15055198#comment-15055198
 ]


Yin Huai commented on SPARK-11410:
----------------------------------

Oh, I see. This is the table partitioning mechanism. If you use partitionBy 
before writing this table, we will know that the table is partitioned by 
column {{column}} and can skip unnecessary partitions when scanning the table. 

This JIRA is actually for a different feature, which lets users control how 
data is shuffled by using the hash value of the given columns.
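
To make the difference between the two mechanisms concrete, here is a minimal 
conceptual sketch in plain Python. It does not use Spark and is not Spark's 
implementation. The function names (table_partition, hash_distribute), the 
sample rows, and the use of CRC32 as the hash are all assumptions made for 
illustration only.

```python
import zlib
from collections import defaultdict


def table_partition(rows, column):
    """Group rows by the literal value of `column`.

    Mimics table partitioning: one bucket (think: one directory) per
    distinct value, so a reader can skip buckets it does not need.
    """
    parts = defaultdict(list)
    for row in rows:
        parts[row[column]].append(row)
    return dict(parts)


def hash_distribute(rows, columns, num_partitions):
    """Assign each row to one of `num_partitions` by hashing `columns`.

    Mimics a DISTRIBUTE BY style shuffle: rows with equal key values
    always land in the same partition, but one partition may hold
    several distinct keys. CRC32 is a stand-in hash (assumption).
    """
    parts = [[] for _ in range(num_partitions)]
    for row in rows:
        key = "\x00".join(str(row[c]) for c in columns).encode("utf-8")
        parts[zlib.crc32(key) % num_partitions].append(row)
    return parts


# Hypothetical sample data.
rows = [
    {"country": "US", "user": "a"},
    {"country": "DE", "user": "b"},
    {"country": "US", "user": "c"},
    {"country": "FR", "user": "d"},
    {"country": "DE", "user": "e"},
]

by_value = table_partition(rows, "country")       # one bucket per country
by_hash = hash_distribute(rows, ["country"], 2)   # two shuffle partitions
```

In the first case the number of buckets follows the data (one per distinct 
value). In the second case the caller fixes the number of partitions, and the 
hash only guarantees that rows with the same key end up together.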

> Add a DataFrame API that provides functionality similar to HiveQL's 
> DISTRIBUTE BY
> ---------------------------------------------------------------------------------
>
>                 Key: SPARK-11410
>                 URL: https://issues.apache.org/jira/browse/SPARK-11410
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.5.1
>            Reporter: Nong Li
>            Assignee: Nong Li
>             Fix For: 1.6.0
>
>
> DISTRIBUTE BY allows the user to control the partitioning and ordering of a 
> data set which can be very useful for some applications.



