Github user koertkuipers commented on the pull request:
https://github.com/apache/spark/pull/11509#issuecomment-193921325
If it did, then it was not always in the APIs, I think? I remember the APIs
having `paths: Seq[String]` instead of `files: Seq[FileStatus]`. By explicitly
requiring the user to list all files in the API, you make it impossible not
to list them, even when it turns out not to be necessary. For 1MM files,
that's no joke.
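To illustrate the difference, here is a rough sketch of the two API shapes
(the signatures and names are illustrative, not Spark's actual ones):

```scala
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

object ApiShapes {
  // Path-based shape: the caller hands over directories or globs, and the
  // framework decides whether and when to enumerate the files underneath.
  def loadFromPaths(paths: Seq[String]): Unit = ???

  // FileStatus-based shape: the caller must already have listed every file,
  // so a full driver-side listing is unavoidable, whether it is needed or not.
  def loadFromFiles(files: Seq[FileStatus]): Unit = ???

  // Producing the Seq[FileStatus] is itself the expensive part: for ~1MM
  // files this means many namenode RPCs, all materialized in driver memory.
  def listAll(fs: FileSystem, dir: Path): Seq[FileStatus] =
    fs.listStatus(dir).toSeq
}
```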
I found it relatively straightforward to revert to `paths: Seq[String]` once
I ripped out the cache, modified partition discovery, and disabled some of
the data size estimation, so I more or less assumed the listing wasn't used
anywhere else. But I might have missed split planning.
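For what it's worth, split planning is exactly the kind of consumer that
wants `FileStatus` rather than a bare path, since it needs per-file sizes.
A minimal hypothetical sketch (not Spark's actual planner):

```scala
import org.apache.hadoop.fs.FileStatus

// Hypothetical sketch of size-based split planning: each file is cut into
// (path, offset, length) ranges of at most maxSplitBytes. The per-file
// length comes from FileStatus, which a bare path string cannot provide.
// (Zero-length files simply yield no splits.)
def planSplits(files: Seq[FileStatus], maxSplitBytes: Long): Seq[(String, Long, Long)] =
  files.flatMap { f =>
    (0L until f.getLen by maxSplitBytes).map { offset =>
      (f.getPath.toString, offset, math.min(maxSplitBytes, f.getLen - offset))
    }
  }
```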
On Tue, Mar 8, 2016 at 1:26 PM, Michael Armbrust <[email protected]>
wrote:
> @koertkuipers <https://github.com/koertkuipers> improving the efficiency
> of working with large files was certainly a goal in this refactoring and
> this API is definitely not done yet. That said, I'm not really sure that
> the correct thing to do is to avoid listing all of the files at the
> driver. AFAIK, every version of Spark SQL has done this listing during
> split planning, even before we added a caching layer.