Github user fidato13 commented on the issue:

    https://github.com/apache/spark/pull/15327
  
    @rxin Yes, it makes perfect sense not to create a partition per file. 
    Looking at the code in PortableDataStream.setMinPartitions:
    val maxSplitSize = math.ceil(totalLen / math.max(minPartitions, 1.0)).toLong
    
    Unless the user specifies minPartitions, the default would most likely 
always create just two partitions, irrespective of the size and number of 
the files.
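    
    To make the arithmetic concrete, here is a minimal standalone sketch of 
the formula quoted above. It assumes the default minPartitions is 2 (my 
reading of sc.defaultMinPartitions, which I believe is 
math.min(defaultParallelism, 2)) and that files pack perfectly into splits. 
The object name, the estimatedPartitions helper and the 10 GiB input size are 
hypothetical and are not part of Spark.

```scala
object SplitSizeSketch {
  // Same expression as the quoted line: total input bytes divided by the
  // requested minimum partition count, rounded up.
  def maxSplitSize(totalLen: Long, minPartitions: Int): Long =
    math.ceil(totalLen / math.max(minPartitions, 1.0)).toLong

  // Hypothetical helper: rough partition count, assuming the input can be
  // packed perfectly into splits of at most maxSplitSize bytes.
  def estimatedPartitions(totalLen: Long, minPartitions: Int): Long = {
    val split = maxSplitSize(totalLen, minPartitions)
    if (split == 0L) 0L else math.ceil(totalLen.toDouble / split).toLong
  }

  def main(args: Array[String]): Unit = {
    val totalLen = 10L * 1024 * 1024 * 1024 // 10 GiB of input (assumed)

    // Assumed default of minPartitions = 2: two very large splits.
    println(s"split=${maxSplitSize(totalLen, 2)} " +
      s"partitions=${estimatedPartitions(totalLen, 2)}")

    // Explicit minPartitions = 8: smaller splits, more partitions.
    println(s"split=${maxSplitSize(totalLen, 8)} " +
      s"partitions=${estimatedPartitions(totalLen, 8)}")
  }
}
```

    Under these assumptions the split size scales with the total input size, 
so the estimated partition count stays at roughly minPartitions no matter how 
many files there are.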
    
    The change in this pull request makes the partition count consistent with 
other RDD types. For example, textFile and binaryFiles would present 
the same number of partitions.
    
    Your concern about not creating one partition per small file certainly 
remains the top priority. Could that be handled as a follow-up improvement 
in the near future?


