[ 
https://issues.apache.org/jira/browse/SPARK-14044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bob Tiernay updated SPARK-14044:
--------------------------------
    Description: 
It would be very useful to allow the disabling of this block of code within 
{{DynamicPartitionWriterContainer#writeRows}} at runtime:

https://github.com/apache/spark/blob/8ef3399aff04bf8b7ab294c0f55bcf195995842b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/WriterContainer.scala#L379-L418

The use case is that an upstream {{groupBy}} has already sorted a great many 
fine grained groups which are the target of the {{partitionBy}}. This 
{{partitionBy}} shares the same keys as the {{groupBy}}. Currently, we can't 
even get Spark to succeed due to the sort step and data skew in the partitions. 
In general, this would make more efficient use of cluster resources.

For this to work, there needs to be a way to communicate between the 
{{groupBy}} and the {{partitionBy}} by way of some runtime configuration.

  was:
It would be very useful to allow the disabling of this block of code within 
{{DynamicPartitionWriterContainer#writeRows}} at runtime:

https://github.com/apache/spark/blob/8ef3399aff04bf8b7ab294c0f55bcf195995842b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/WriterContainer.scala#L379-L418

The use case is that an upstream {{groupBy}} has already sorted a great many 
fine grained groups which are the target of the {{partitionBy}}. Currently, we 
can't even get Spark to succeed due to the sort step and data skew in the 
partitions. In general, this would make more efficient use of cluster resources.

For this to work, there needs to be a way to communicate between the 
{{groupBy}} and the {{partitionBy}} by way of some runtime configuration.


> Allow configuration of DynamicPartitionWriterContainer#writeRows to bypass 
> sort step
> ------------------------------------------------------------------------------------
>
>                 Key: SPARK-14044
>                 URL: https://issues.apache.org/jira/browse/SPARK-14044
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 1.6.1
>            Reporter: Bob Tiernay
>
> It would be very useful to allow the disabling of this block of code within 
> {{DynamicPartitionWriterContainer#writeRows}} at runtime:
> https://github.com/apache/spark/blob/8ef3399aff04bf8b7ab294c0f55bcf195995842b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/WriterContainer.scala#L379-L418
> The use case is that an upstream {{groupBy}} has already sorted a great many 
> fine grained groups which are the target of the {{partitionBy}}. This 
> {{partitionBy}} shares the same keys as the {{groupBy}}. Currently, we can't 
> even get Spark to succeed due to the sort step and data skew in the 
> partitions. In general, this would make more efficient use of cluster 
> resources.
> For this to work, there needs to be a way to communicate between the 
> {{groupBy}} and the {{partitionBy}} by way of some runtime configuration.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to