Github user fidato13 commented on the issue:
https://github.com/apache/spark/pull/15327
@rxin Yes, it makes perfect sense to not create a partition per file.
Looking at the code in PortableDataStream.setMinPartitions:
val maxSplitSize = math.ceil(totalLen / math.max(minPartitions, 1.0)).toLong
Unless the user specifies minPartitions, the default value (defaultMinPartitions,
which is at most 2) will most likely produce just two partitions, regardless of
the size and number of the files.
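
To illustrate the arithmetic, here is a minimal standalone sketch. It is not
Spark's actual class: the object name SplitSizeSketch and the helper
approxSplits are hypothetical, and approxSplits assumes files pack perfectly
into splits of maxSplitSize, which ignores file boundaries and locality. It
also assumes the default minPartitions works out to 2.

```scala
// Illustrative sketch only; names here are hypothetical, not Spark APIs.
object SplitSizeSketch {
  // Same formula as the line quoted above.
  // totalLen is the combined length in bytes of all input files.
  def maxSplitSize(totalLen: Long, minPartitions: Int): Long =
    math.ceil(totalLen / math.max(minPartitions, 1.0)).toLong

  // Rough split count under the assumption of perfect packing:
  // ceiling of totalLen / maxSplitSize, computed in integer arithmetic.
  def approxSplits(totalLen: Long, minPartitions: Int): Long = {
    val size = maxSplitSize(totalLen, minPartitions)
    if (size == 0L) 0L else (totalLen + size - 1) / size
  }
}
```

Under these assumptions, a minPartitions of 2 gives about two splits whether
the input totals 1 KB or 10 GB, because maxSplitSize scales with totalLen.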
The change in this pull request makes the partition count consistent with
other RDD types. For example, textFile and binaryFiles would yield the same
number of partitions.
Your concern about not creating one partition per small file should certainly
remain the top priority. Could that be handled as a follow-up improvement in
the near future?