Yun Zhao created SPARK-7630:
-------------------------------
Summary: When running TeraSort with test data that has more than 200
partitions, the sorted output can't be validated by TeraValidate.
Key: SPARK-7630
URL: https://issues.apache.org/jira/browse/SPARK-7630
Project: Spark
Issue Type: Bug
Components: Shuffle
Affects Versions: 1.3.1
Reporter: Yun Zhao
I took the TeraSort source from
'https://github.com/ehiggs/spark-terasort/tree/master/src/main/scala/com/github/ehiggs/spark/terasort'.
The main code is below:
val dataset = sc.newAPIHadoopFile[Array[Byte], Array[Byte], TeraInputFormat](inputFile)
val sorted = dataset.partitionBy(new TeraSortPartitioner(dataset.partitions.size)).sortByKey()
sorted.saveAsNewAPIHadoopFile[TeraOutputFormat](outputFile)
Every time TeraSort finishes, I use TeraValidate to check whether the result
is sorted correctly, and sometimes it is not. Specifically, if the number of
the dataset's partitions is larger than 200, the result is not sorted
correctly; if it is less than or equal to 200, the result is sorted correctly.
After thinking about this for a long time, I set 'spark.shuffle.manager' to
'hash' (its default value is 'sort'). Then, no matter how many partitions the
dataset has, the result is sorted correctly.
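For reference, the workaround can be applied when building the context. A
minimal sketch (the app name and config style are illustrative, not taken
from my actual job):

// Workaround sketch: use the hash-based shuffle manager instead of the
// default sort-based one ('spark.shuffle.manager' defaults to 'sort').
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("TeraSort")
  .set("spark.shuffle.manager", "hash")
val sc = new SparkContext(conf)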
I studied the source code and found that '200' is related to the parameter
'spark.shuffle.sort.bypassMergeThreshold', whose default value is 200. I then
set 'spark.shuffle.sort.bypassMergeThreshold' to 400, and as expected, when
the number of the dataset's partitions is between 200 and 400, the result is
sorted correctly.
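From my reading, the sort-based shuffle appears to bypass merge-sorting only
when the number of reduce partitions is at most the threshold, which matches
the behavior I observed. A minimal sketch of that decision (the function name
is mine, not Spark's; the real check lives in Spark's sort-shuffle code):

// Hypothetical sketch of the bypass decision as I understand it.
def shouldBypassMergeSort(numPartitions: Int, bypassMergeThreshold: Int): Boolean =
  numPartitions <= bypassMergeThreshold

// With the default threshold of 200:
//   shouldBypassMergeSort(200, 200)  -> true  (result sorted correctly)
//   shouldBypassMergeSort(201, 200)  -> false (result not sorted correctly)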
However, I still don't know how to solve this bug.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]