GitHub user koertkuipers commented on the pull request:

    https://github.com/apache/spark/pull/11509#issuecomment-193921325
  
    If it did, then it was not always exposed in the APIs, I think? I
    remember the APIs having paths: Seq[String] instead of files:
    Seq[FileStatus]. By explicitly requiring the user to list all files in
    the API, you make it impossible not to, even when it turns out not to
    be necessary. For 1 million files that's no joke.
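    To make the contrast concrete, here is a minimal sketch of the two API
    shapes being discussed. The trait names below are made up for
    illustration; only the paths: Seq[String] and files: Seq[FileStatus]
    signatures come from this thread.

        import org.apache.hadoop.fs.FileStatus

        // Old-style shape: the source receives root paths and can defer
        // (or skip) enumerating the files underneath them.
        trait PathBasedRelation {
          def paths: Seq[String]
        }

        // New-style shape: the caller must list every file up front,
        // which is expensive on the driver for ~1 million files.
        trait FileBasedRelation {
          def files: Seq[FileStatus]
        }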
    
    I found it was relatively straightforward to revert back to paths:
    Seq[String] once I ripped out the cache, modified partition discovery,
    and disabled some kind of data size estimation. So I more or less
    assumed the file listing wasn't used anywhere else, but I might have
    missed split planning.
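    For what deferring the listing might look like, here is a rough sketch
    (LazyFileIndex is a hypothetical name; a real implementation would need
    recursive listing, error handling, and parallelism):

        import org.apache.hadoop.conf.Configuration
        import org.apache.hadoop.fs.{FileStatus, Path}

        // Keep only the root paths; enumerate files lazily, at the point
        // where they are actually needed (e.g. split planning), instead
        // of eagerly at construction time.
        class LazyFileIndex(paths: Seq[String], hadoopConf: Configuration) {
          lazy val allFiles: Seq[FileStatus] = paths.flatMap { p =>
            val path = new Path(p)
            val fs = path.getFileSystem(hadoopConf)
            fs.listStatus(path).toSeq // non-recursive, for brevity
          }
        }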
    
    
    On Tue, Mar 8, 2016 at 1:26 PM, Michael Armbrust <[email protected]>
    wrote:
    
    > @koertkuipers <https://github.com/koertkuipers> improving the efficiency
    > of working with large files was certainly a goal in this refactoring and
    > this API is definitely not done yet. That said, I'm not really sure that
    > the correct thing to do is to avoid listing all of the files at the driver.
    > Every version of Spark SQL has done this listing AFAIK during split
    > planning even before we added a caching layer.
    >
    > —
    > Reply to this email directly or view it on GitHub
    > <https://github.com/apache/spark/pull/11509#issuecomment-193902037>.
    >


