GitHub user fidato13 commented on the issue:
      /**
       * Create an RDD for non-bucketed reads.
       * The bucketed variant of this function is [[createBucketedReadRDD]].
       *
       * @param readFile a function to read each (part of a) file.
       * @param selectedPartitions Hive-style partitions that are part of the read.
       * @param fsRelation [[HadoopFsRelation]] associated with the read.
       */
      private def createNonBucketedReadRDD(
          readFile: (PartitionedFile) => Iterator[InternalRow],
          selectedPartitions: Seq[Partition],
          fsRelation: HadoopFsRelation): RDD[InternalRow] = {
        val defaultMaxSplitBytes = fsRelation.sparkSession.sessionState.conf.filesMaxPartitionBytes
        val openCostInBytes = fsRelation.sparkSession.sessionState.conf.filesOpenCostInBytes
        val defaultParallelism = fsRelation.sparkSession.sparkContext.defaultParallelism
        val totalBytes = selectedPartitions.flatMap(_.files.map(_.getLen + openCostInBytes)).sum
        val bytesPerCore = totalBytes / defaultParallelism
        val maxSplitBytes = Math.min(defaultMaxSplitBytes, 
Math.max(openCostInBytes, bytesPerCore))
    This is the calculation Spark SQL currently performs: it factors openCostInBytes
into the split size so that reading many small files does not create an excessive
number of partitions.
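    To make the effect of the formula concrete, here is a small, self-contained
Scala sketch of the same arithmetic. The object name MaxSplitBytesExample and the
workload (100 files of 2 MB each, 8 cores) are hypothetical; the 128 MB and 4 MB
values match the documented defaults of spark.sql.files.maxPartitionBytes and
spark.sql.files.openCostInBytes.

    object MaxSplitBytesExample {
      // Mirrors the formula above: pad each file with openCostInBytes,
      // spread the padded total across the default parallelism, then
      // clamp between openCostInBytes and defaultMaxSplitBytes.
      def maxSplitBytes(
          defaultMaxSplitBytes: Long,
          openCostInBytes: Long,
          fileSizes: Seq[Long],
          defaultParallelism: Int): Long = {
        val totalBytes = fileSizes.map(_ + openCostInBytes).sum
        val bytesPerCore = totalBytes / defaultParallelism
        Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))
      }

      def main(args: Array[String]): Unit = {
        val mb = 1024L * 1024L
        // Hypothetical workload: 100 files of 2 MB each on 8 cores.
        val result = maxSplitBytes(
          defaultMaxSplitBytes = 128 * mb, // spark.sql.files.maxPartitionBytes default
          openCostInBytes = 4 * mb,        // spark.sql.files.openCostInBytes default
          fileSizes = Seq.fill(100)(2 * mb),
          defaultParallelism = 8)
        println(s"maxSplitBytes = ${result / mb} MB") // prints 75 MB
      }
    }

    With these inputs, totalBytes is 100 x (2 MB + 4 MB) = 600 MB, bytesPerCore is
75 MB, and maxSplitBytes drops from the 128 MB cap to 75 MB, so several small
files are packed into each partition instead of each file becoming its own split.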
