Mark Hamstra created SPARK-18631:
------------------------------------
Summary: Avoid making data skew worse in ExchangeCoordinator
Key: SPARK-18631
URL: https://issues.apache.org/jira/browse/SPARK-18631
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.0.2, 1.6.3, 2.1.0
Reporter: Mark Hamstra
Assignee: Mark Hamstra
The partition-resizing logic in the ExchangeCoordinator does not start a new
partition until the targetPostShuffleInputSize is equalled or exceeded. This
can make data skew problems worse: a number of small partitions can first be
combined, as long as their combined size remains smaller than the
targetPostShuffleInputSize, and a large, data-skewed partition can then be
added to that same group, making it even bigger than it already was.
It's fairly simple to change the logic to create a new partition if adding a
new piece would exceed the targetPostShuffleInputSize, instead of only creating
a new partition after the targetPostShuffleInputSize has already been exceeded.
This results in a few more partitions being created by the
ExchangeCoordinator, but data skew problems are at least not made worse even
though they are not made any better.
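
The difference between the two rules can be illustrated with a minimal sketch.
This is not the ExchangeCoordinator code itself: it assumes a flat list of
pre-shuffle partition sizes, and the function names coalesce_current and
coalesce_proposed, along with the example sizes, are hypothetical and chosen
only for illustration.

```python
from typing import List


def coalesce_current(sizes: List[int], target: int) -> List[List[int]]:
    """Rule described above as the existing behavior: keep adding pieces and
    close the partition only once its size has reached or exceeded target."""
    partitions, current, total = [], [], 0
    for size in sizes:
        current.append(size)
        total += size
        if total >= target:
            partitions.append(current)
            current, total = [], 0
    if current:  # flush the trailing, under-target group
        partitions.append(current)
    return partitions


def coalesce_proposed(sizes: List[int], target: int) -> List[List[int]]:
    """Rule described above as the proposed behavior: close the partition
    before adding a piece that would push it over target."""
    partitions, current, total = [], [], 0
    for size in sizes:
        if current and total + size > target:
            partitions.append(current)
            current, total = [], 0
        current.append(size)
        total += size
    if current:
        partitions.append(current)
    return partitions


# Hypothetical sizes: three small pieces followed by one skewed piece.
sizes = [10, 10, 10, 500, 20, 20]
target = 64

print(coalesce_current(sizes, target))   # [[10, 10, 10, 500], [20, 20]]
print(coalesce_proposed(sizes, target))  # [[10, 10, 10], [500], [20, 20]]
```

Under the first rule the skewed piece grows from 500 to 530; under the second
it stays at 500, at the cost of one extra partition.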