Github user fidato13 commented on the issue:

    https://github.com/apache/spark/pull/15327
  
     /**
       * Create an RDD for non-bucketed reads.
       * The bucketed variant of this function is [[createBucketedReadRDD]].
       *
       * @param readFile a function to read each (part of a) file.
       * @param selectedPartitions Hive-style partitions that are part of the read.
       * @param fsRelation [[HadoopFsRelation]] associated with the read.
       */
      private def createNonBucketedReadRDD(
          readFile: (PartitionedFile) => Iterator[InternalRow],
          selectedPartitions: Seq[Partition],
          fsRelation: HadoopFsRelation): RDD[InternalRow] = {
        val defaultMaxSplitBytes =
          fsRelation.sparkSession.sessionState.conf.filesMaxPartitionBytes
        val openCostInBytes = fsRelation.sparkSession.sessionState.conf.filesOpenCostInBytes
        val defaultParallelism = fsRelation.sparkSession.sparkContext.defaultParallelism
        val totalBytes = selectedPartitions.flatMap(_.files.map(_.getLen + openCostInBytes)).sum
        val bytesPerCore = totalBytes / defaultParallelism

        val maxSplitBytes = Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))
    
    
    ================ This is the calculation currently done in Spark SQL, which takes openCostInBytes into account to avoid creating a large number of partitions.
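    For illustration, here is a minimal standalone sketch of the same arithmetic, outside of Spark. The config values (128 MB for maxPartitionBytes, 4 MB for openCostInBytes, a default parallelism of 8) and the file sizes are assumptions chosen just to show how the formula behaves; only the formula itself comes from the snippet above.

        // Minimal sketch of the maxSplitBytes calculation with assumed values.
        object MaxSplitBytesExample {
          def main(args: Array[String]): Unit = {
            // Assumed session configs (not read from a real SparkSession):
            // spark.sql.files.maxPartitionBytes = 128 MB,
            // spark.sql.files.openCostInBytes = 4 MB, 8-way default parallelism.
            val defaultMaxSplitBytes = 128L * 1024 * 1024
            val openCostInBytes      = 4L * 1024 * 1024
            val defaultParallelism   = 8

            // Hypothetical file sizes (bytes) across the selected partitions.
            val fileSizes = Seq(200L * 1024 * 1024, 50L * 1024 * 1024, 1L * 1024 * 1024)

            // Each file is padded by openCostInBytes, so many tiny files still
            // contribute a non-trivial amount to the total.
            val totalBytes   = fileSizes.map(_ + openCostInBytes).sum
            val bytesPerCore = totalBytes / defaultParallelism

            // maxSplitBytes is clamped to [openCostInBytes, defaultMaxSplitBytes].
            val maxSplitBytes = Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))
            println(s"maxSplitBytes = $maxSplitBytes")
          }
        }

    Because openCostInBytes acts as the lower bound of that clamp, no split can be smaller than the cost of opening a file, which is what keeps the planner from producing a huge number of tiny read partitions.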

