Github user srowen commented on a diff in the pull request:
https://github.com/apache/spark/pull/21638#discussion_r214685953
--- Diff:
core/src/main/scala/org/apache/spark/input/PortableDataStream.scala ---
@@ -47,7 +47,7 @@ private[spark] abstract class StreamFileInputFormat[T]
   def setMinPartitions(sc: SparkContext, context: JobContext,
       minPartitions: Int) {
     val defaultMaxSplitBytes =
       sc.getConf.get(config.FILES_MAX_PARTITION_BYTES)
     val openCostInBytes = sc.getConf.get(config.FILES_OPEN_COST_IN_BYTES)
-    val defaultParallelism = sc.defaultParallelism
+    val defaultParallelism = Math.max(sc.defaultParallelism, minPartitions)
--- End diff ---
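For context, the change above makes `minPartitions` a lower bound on the parallelism that feeds the split-size computation. A toy standalone calculation (made-up numbers; the formula paraphrases the arithmetic in the surrounding method body, so treat it as a sketch rather than the exact code):

```scala
// Sketch of how flooring defaultParallelism at minPartitions changes the
// max split size that setMinPartitions hands to CombineFileInputFormat.
object SplitSizeMath {
  def maxSplitSize(
      totalBytes: Long,
      defaultParallelism: Int,
      minPartitions: Int,
      defaultMaxSplitBytes: Long = 128L * 1024 * 1024,
      openCostInBytes: Long = 0L): Long = {
    // The patched line: minPartitions now lower-bounds the parallelism.
    val parallelism = Math.max(defaultParallelism, minPartitions)
    val bytesPerCore = totalBytes / parallelism
    Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))
  }

  def main(args: Array[String]): Unit = {
    // 512 bytes of input on one core with minPartitions = 8:
    // without the floor the split size is 512 (one partition);
    // with it, 64 bytes per split, i.e. room for 8 partitions.
    println(maxSplitSize(totalBytes = 512, defaultParallelism = 1, minPartitions = 8)) // 64
    println(maxSplitSize(totalBytes = 512, defaultParallelism = 1, minPartitions = 1)) // 512
  }
}
```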
I think it's hard to test, technically, because `setMinPartitions` is only
a hint. In the case of `binaryFiles` we know it will put a hard limit on the
number of partitions, but that isn't true of other implementations. We can
still write a simple test for all of these; it just may be asserting behavior
that could change in Hadoop in the future, though I strongly doubt it would.
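Something along these lines would do it (a standalone sketch, not the PR's actual test; the file sizes, the zeroed `spark.files.openCostInBytes`, and the `local[1]` master are assumptions chosen to make the split arithmetic come out exact):

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

// Checks that binaryFiles honors minPartitions once defaultParallelism is
// floored at it. With open cost 0 and one core, maxSplitSize becomes
// totalBytes / minPartitions, so equally sized files should split into
// exactly minPartitions partitions. Note this pins down current Hadoop
// CombineFileInputFormat behavior, which could change.
object BinaryFilesMinPartitionsCheck {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local[1]")                   // defaultParallelism = 1
      .setAppName("binaryFiles-minPartitions")
      .set("spark.files.openCostInBytes", "0") // make the split math exact
    val sc = new SparkContext(conf)
    try {
      // Eight equally sized 64-byte files in a temp directory.
      val dir = Files.createTempDirectory("binfiles")
      (0 until 8).foreach { i =>
        Files.write(dir.resolve(s"part-0000$i"), Array.fill(64)('x'.toByte))
      }
      for (p <- Seq(1, 2, 8)) {
        val n = sc.binaryFiles(dir.toString, minPartitions = p).getNumPartitions
        assert(n == p, s"expected $p partitions, got $n")
      }
      println("OK")
    } finally {
      sc.stop()
    }
  }
}
```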
---