Yun Zhao created SPARK-7630:
-------------------------------

             Summary: With TeraSort, when the test data has more than 200 
partitions, the sorted output fails TeraValidate.
                 Key: SPARK-7630
                 URL: https://issues.apache.org/jira/browse/SPARK-7630
             Project: Spark
          Issue Type: Bug
          Components: Shuffle
    Affects Versions: 1.3.1
            Reporter: Yun Zhao


I took the TeraSort source from 
https://github.com/ehiggs/spark-terasort/tree/master/src/main/scala/com/github/ehiggs/spark/terasort. 
The main code is below:
    val dataset = sc.newAPIHadoopFile[Array[Byte], Array[Byte], TeraInputFormat](inputFile)
    val sorted = dataset.partitionBy(new TeraSortPartitioner(dataset.partitions.size)).sortByKey()
    sorted.saveAsNewAPIHadoopFile[TeraOutputFormat](outputFile)

Each time TeraSort finishes, I validate the result with TeraValidate to check 
that it is sorted correctly, and sometimes it is not. Specifically, if the 
dataset has more than 200 partitions, the result is not sorted correctly; if it 
has 200 or fewer, the result is sorted correctly.

After thinking about this for a long time, I set 'spark.shuffle.manager' to 
'hash' (the default is 'sort'). With that setting, the result is sorted 
correctly no matter how many partitions the dataset has.
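The workaround above can be sketched as follows, assuming a standard SparkConf 
setup (the app name and variable names are illustrative, not from the original 
TeraSort driver):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Workaround sketch: force the hash-based shuffle manager instead of the
// default sort-based one, which avoids the bypass-merge code path entirely.
val conf = new SparkConf()
  .setAppName("TeraSort")
  .set("spark.shuffle.manager", "hash") // default in Spark 1.3.1 is "sort"
val sc = new SparkContext(conf)
```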

I studied the source code and found that '200' corresponds to the parameter 
'spark.shuffle.sort.bypassMergeThreshold', whose default value is 200. I then 
set 'spark.shuffle.sort.bypassMergeThreshold' to 400, and as expected, the 
result is sorted correctly when the number of the dataset's partitions is 
between 200 and 400.
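The threshold experiment can be reproduced with a configuration sketch like the 
one below (illustrative only; raising the threshold just moves the boundary, so 
partition counts above the new value still produce incorrectly sorted output):

```scala
import org.apache.spark.SparkConf

// Experiment sketch: raise the bypass-merge threshold so that jobs with
// 201-400 partitions no longer take the bypass-merge shuffle path.
val conf = new SparkConf()
  .setAppName("TeraSort")
  .set("spark.shuffle.sort.bypassMergeThreshold", "400") // default is 200
```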

However, I still don't know how to fix this bug.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
