[jira] [Resolved] (SPARK-7630) Going with Terasort, when test data's partitions is larger than 200, the sorted data can't be passed by TeraValidate.

Sean Owen (JIRA) Thu, 14 May 2015 02:02:41 -0700

     [ 
https://issues.apache.org/jira/browse/SPARK-7630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Sean Owen resolved SPARK-7630.
------------------------------
    Resolution: Invalid

I don't believe this is a question about Spark itself, or at least it has not 
been narrowed down enough to show the problem.

> Going with Terasort, when test data's partitions is larger than 200, the sorted 
> data can't be passed by TeraValidate.
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-7630
>                 URL: https://issues.apache.org/jira/browse/SPARK-7630
>             Project: Spark
>          Issue Type: Bug
>          Components: Shuffle
>    Affects Versions: 1.3.1
>            Reporter: Yun Zhao
>
> I took the Terasort source from 
> 'https://github.com/ehiggs/spark-terasort/tree/master/src/main/scala/com/github/ehiggs/spark/terasort'.
> The main code is below:
>     val dataset = sc.newAPIHadoopFile[Array[Byte], Array[Byte], TeraInputFormat](inputFile)
>     val sorted = dataset.partitionBy(new TeraSortPartitioner(dataset.partitions.size)).sortByKey()
>     sorted.saveAsNewAPIHadoopFile[TeraOutputFormat](outputFile)
> Each time terasort finishes, I use TeraValidate to check whether the result 
> is sorted correctly, and sometimes it is not. Specifically, if the dataset 
> has more than 200 partitions, the result is not sorted correctly; if it has 
> 200 or fewer, the result is sorted correctly.
> After thinking about this for a long time, I set 'spark.shuffle.manager' to 
> 'hash' (its default value is 'sort'). With that setting, the result is 
> sorted correctly no matter how many partitions the dataset has.
> Studying the source code, I found that '200' is related to the parameter 
> 'spark.shuffle.sort.bypassMergeThreshold', whose default value is 200. I then 
> set 'spark.shuffle.sort.bypassMergeThreshold' to 400 and, as expected, the 
> result is sorted correctly when the number of partitions is between 200 and 
> 400.
> However, I still don't know how to solve this bug.
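
For reference, the two settings the report describes could be applied roughly 
as in the sketch below. This is only an illustration, assuming Spark 1.3.x 
(where 'spark.shuffle.manager' accepts 'hash' or 'sort'); the application name 
is hypothetical, and nothing here identifies or fixes the underlying cause.

```scala
import org.apache.spark.SparkConf

// Setting 1 from the report: switch from the default sort-based shuffle
// to the hash-based shuffle.
val hashShuffleConf = new SparkConf()
  .setAppName("terasort-shuffle-settings") // hypothetical application name
  .set("spark.shuffle.manager", "hash")

// Setting 2 from the report: keep the sort-based shuffle, but raise the
// bypass-merge threshold from its default of 200 to 400.
val raisedThresholdConf = new SparkConf()
  .setAppName("terasort-shuffle-settings") // hypothetical application name
  .set("spark.shuffle.manager", "sort")
  .set("spark.shuffle.sort.bypassMergeThreshold", "400")
```

Either configuration would be passed to the SparkContext used for the job. 
The same keys could also be supplied with '--conf' on spark-submit.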



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
