[
https://issues.apache.org/jira/browse/SPARK-7630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sean Owen resolved SPARK-7630.
------------------------------
Resolution: Invalid
I don't believe this is a question about Spark itself, or at least, has not
been narrowed down enough to show the problem.
> Going with Terasort,when test data's partitions is larger than 200,the sorted
> data cann't be passed by TeraValidate.
> --------------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-7630
> URL: https://issues.apache.org/jira/browse/SPARK-7630
> Project: Spark
> Issue Type: Bug
> Components: Shuffle
> Affects Versions: 1.3.1
> Reporter: Yun Zhao
>
> I take source of Terasort from
> 'https://github.com/ehiggs/spark-terasort/tree/master/src/main/scala/com/github/ehiggs/spark/terasort'.The
> main code is below.
> "val dataset = sc.newAPIHadoopFile[Array[Byte], Array[Byte],
> TeraInputFormat](inputFile)
> val sorted = dataset.partitionBy(new
> TeraSortPartitioner(dataset.partitions.size)).sortByKey()
> sorted.saveAsNewAPIHadoopFile[TeraOutputFormat](outputFile)"
> Every time I validate the result if it's sorted correctly by TeraValidate
> when terasort is finished. I find sometimes it's not sorted correctly. Indeed
> I find if the number of dataset's partitions is a larger than 200,the result
> is not sorted correctly, and if less than or equal to 200,the result is
> sorted correctly.
> I think for a long time, I set 'spark.shuffle.manager' to 'hash', which
> default value is 'sort'. Then no matter what the number of dataset's
> partitions is, the result is sorted correctly.
> I study on source code, and I find '200' is related to this parameter
> 'spark.shuffle.sort.bypassMergeThreshold' which default value is 200. I then
> set 'spark.shuffle.sort.bypassMergeThreshold' to 400, and it's expected that
> when the number of dataset's partitions is between 200 and 400, the result is
> sorted correctly.
> Howerer,I still don't know how to solve this bug.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]