GitHub user cloud-fan opened a pull request:
https://github.com/apache/spark/pull/21101
[SPARK-23989][SQL] exchange should copy data before non-serialized shuffle
## What changes were proposed in this pull request?
In Spark SQL, operators usually reuse a single `UnsafeRow` instance across rows,
so the data must be copied wherever non-serialized objects are buffered.
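
To illustrate why buffering a reused row without copying is a problem, here is a minimal sketch. `MutableRow` and `RowReuseSketch` are hypothetical stand-ins invented for this example; they are not Spark classes and only mimic the reuse pattern of `UnsafeRow`.

```scala
// Hypothetical stand-in for a reused row object; not Spark's UnsafeRow.
object RowReuseSketch {
  final class MutableRow(var value: Int) {
    def copy(): MutableRow = new MutableRow(value)
  }

  // Returns an iterator that mutates and hands back the same instance
  // for every element, as many Spark SQL operators do.
  def rows(n: Int): Iterator[MutableRow] = {
    val reused = new MutableRow(0)
    Iterator.range(0, n).map { i => reused.value = i; reused }
  }

  // Buffering the references: every buffered entry is the same object,
  // so all of them show the last value written.
  val bufferedWithoutCopy: Vector[Int] = rows(3).toVector.map(_.value)

  // Copying each row before it is buffered preserves the values.
  val bufferedWithCopy: Vector[Int] = rows(3).map(_.copy()).toVector.map(_.value)
}
```

In this sketch `bufferedWithoutCopy` is `Vector(2, 2, 2)` while `bufferedWithCopy` is `Vector(0, 1, 2)`.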
Shuffle may buffer objects when it uses neither the bypass merge shuffle
nor the unsafe shuffle.
`ShuffleExchangeExec.needToCopyObjectsBeforeShuffle` misses one case:
if `spark.sql.shuffle.partitions` is large enough, the unsafe shuffle cannot be
used and Spark falls back to the non-serialized shuffle.
This bug is very hard to hit, since users are unlikely to set such a large
number of partitions for a Spark SQL exchange.
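
The decision described above can be modeled roughly as follows. This is a simplified, self-contained sketch, not the code in this patch: `ShuffleCopySketch`, `choosePath`, and `needToCopy` are hypothetical names, the bypass threshold of 200 is assumed to be the default of `spark.shuffle.sort.bypassMergeThreshold`, and the partition limit of 2^24 for the unsafe (serialized) shuffle is likewise an assumption about the sort shuffle manager. It also assumes the exchange has no map-side combine.

```scala
// Simplified model of the shuffle path choice; not Spark's actual source.
object ShuffleCopySketch {
  // Assumed default of spark.shuffle.sort.bypassMergeThreshold.
  val BypassMergeThreshold: Int = 200
  // Assumed maximum partition count for the unsafe (serialized) shuffle.
  val MaxPartitionsForSerializedShuffle: Int = 1 << 24

  sealed trait ShufflePath
  case object BypassMergeSort extends ShufflePath
  case object SerializedSort extends ShufflePath
  case object NonSerializedSort extends ShufflePath

  def choosePath(numPartitions: Int, serializerSupportsRelocation: Boolean): ShufflePath =
    if (numPartitions <= BypassMergeThreshold) {
      BypassMergeSort // records are written out immediately, nothing is buffered
    } else if (serializerSupportsRelocation &&
        numPartitions <= MaxPartitionsForSerializedShuffle) {
      SerializedSort // records are serialized before they are buffered
    } else {
      NonSerializedSort // object references are buffered: the missed case
    }

  // Rows must be copied only when the shuffle buffers non-serialized objects.
  def needToCopy(numPartitions: Int, serializerSupportsRelocation: Boolean): Boolean =
    choosePath(numPartitions, serializerSupportsRelocation) == NonSerializedSort
}
```

Under these assumptions, `ShuffleCopySketch.needToCopy((1 << 24) + 1, true)` is `true` even though the serializer supports relocation, which is the case this pull request addresses.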
TODO: test
## How was this patch tested?
todo.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/cloud-fan/spark shuffle
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/21101.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #21101
----
commit 40b2c5ca196427c1391e4abe60cb89dc36cbea77
Author: Wenchen Fan <wenchen@...>
Date: 2018-04-18T14:37:29Z
SQL exchange should copy data before non-serialized shuffle
----