Github user MaxGekk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21638#discussion_r198120457
  
    --- Diff: core/src/main/scala/org/apache/spark/input/PortableDataStream.scala ---
    @@ -45,7 +45,8 @@ private[spark] abstract class StreamFileInputFormat[T]
         * which is set through setMaxSplitSize
         */
       def setMinPartitions(sc: SparkContext, context: JobContext, minPartitions: Int) {
    -    val defaultMaxSplitBytes = sc.getConf.get(config.FILES_MAX_PARTITION_BYTES)
    +    val defaultMaxSplitBytes = Math.max(
    +      sc.getConf.get(config.FILES_MAX_PARTITION_BYTES), minPartitions)
         val openCostInBytes = sc.getConf.get(config.FILES_OPEN_COST_IN_BYTES)
         val defaultParallelism = sc.defaultParallelism
    --- End diff ---
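    For context: the `defaultMaxSplitBytes` computed above is typically
    combined with the open cost and the bytes-per-core to produce the final
    split size. A minimal sketch of that downstream step, assuming Spark's
    usual file-split formula (not quoted from this PR; `totalBytes` is a
    hypothetical total input size):

        // Sketch: how a max split size is usually derived from the three
        // values read in setMinPartitions. Raising defaultMaxSplitBytes can
        // only raise this cap, i.e. allow larger splits and fewer partitions.
        def sketchMaxSplitSize(
            defaultMaxSplitBytes: Long,
            openCostInBytes: Long,
            defaultParallelism: Int,
            totalBytes: Long): Long = {
          val bytesPerCore = totalBytes / defaultParallelism
          Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))
        }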
    
    Could you describe the use case in which `minPartitions` needs to be
    taken into account? By default, `FILES_MAX_PARTITION_BYTES` is 128MB.
    Let's say it is set as low as 1000, and `minPartitions` equals 10 000.
    What is the reason for setting the max size of splits in **bytes** to
    the minimum **number** of partitions? Why should a bigger number of
    partitions require a bigger split size? Could you add more details to
    the PR description, please?
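    To make the units mismatch concrete, here is a minimal, self-contained
    sketch plugging in the numbers above (1000 bytes and 10 000 partitions
    are the hypothetical values from this comment, not real configuration):

        object SplitSizeUnitsSketch {
          def main(args: Array[String]): Unit = {
            val filesMaxPartitionBytes = 1000L // FILES_MAX_PARTITION_BYTES, in bytes
            val minPartitions = 10000          // a count of partitions, not bytes

            // The patched expression compares a byte size against a partition count:
            val defaultMaxSplitBytes = Math.max(filesMaxPartitionBytes, minPartitions)

            // Prints 10000: asking for more partitions has raised the split-size
            // cap, which tends to yield fewer, larger partitions -- the opposite
            // of what a "minimum number of partitions" suggests.
            println(s"defaultMaxSplitBytes = $defaultMaxSplitBytes")
          }
        }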

