Github user koertkuipers commented on the pull request:
https://github.com/apache/spark/pull/11509#issuecomment-193877543
i believe the need to pass all files along (e.g. inputFiles:
Array[FileStatus]) instead of just the input paths came from the need to cache
them, so that things looked snappy on s3, which has slow metadata operations.
however it is not realistic to pass along all files for real datasets,
since they can easily contain 100k+ files (and some people on the mailing
list reported using millions of files).
because of this inputFiles param we now need driver programs with 16G of
heap or larger (before 1G was enough), and even then it doesn't always work on
very large datasets. i would hate to see inputFiles make it into the spark 2.0
api instead of just inputPaths.
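to illustrate the concern: a minimal sketch (not Spark's actual code) contrasting the two API shapes. FileMeta here is a hypothetical stand-in for Hadoop's FileStatus — it carries per-file metadata on top of the path, and a real FileStatus holds even more fields, so the per-entry driver footprint is larger still.

```scala
// Hypothetical stand-in for Hadoop's FileStatus: path plus per-file metadata.
case class FileMeta(path: String, length: Long, modificationTime: Long,
                    blockSize: Long, owner: String, group: String)

object ListingFootprint {
  val numFiles = 100000 // the comment above cites 100k+ files, sometimes millions

  // Path-only listing: one short String per file held on the driver.
  val inputPaths: Array[String] =
    Array.tabulate(numFiles)(i => s"s3://bucket/table/part-$i")

  // Full-status listing: every path plus its metadata, all cached on the
  // driver — this is what pushes the driver heap requirement up as the
  // file count grows.
  val inputFiles: Array[FileMeta] = inputPaths.map { p =>
    FileMeta(p, 128L * 1024 * 1024, 0L, 64L * 1024 * 1024, "hadoop", "hadoop")
  }

  def main(args: Array[String]): Unit =
    println(s"paths held: ${inputPaths.length}, statuses held: ${inputFiles.length}")
}
```

with path-only listing the driver keeps one string per file; with full statuses it keeps a whole object graph per file, which is why the heap blows up at 100k+ files even though the information per entry seems small.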