steveloughran commented on issue #25899: [SPARK-29089][SQL] Parallelize blocking FileSystem calls in DataSource#checkAndGlobPathIfNecessary
URL: https://github.com/apache/spark/pull/25899#issuecomment-585908704

LGTM.

There's one thing to be aware of: `Path.getFileSystem(conf)` maps to `FileSystem.get(path.toUri(), conf)`, which looks in a shared cache for an existing instance (good), then instantiates one if there isn't one for that user + URI. If that FS instance takes a while to start up (i.e. `FileSystem.initialize()` is slow), multiple threads will all end up creating instances, and all but one are discarded afterwards.

Hence https://github.com/apache/hadoop/pull/1838, which removes some network IO in `S3AFileSystem.initialize()` by giving you the option of skipping the check that the bucket exists.

Does that mean there's anything wrong with this PR? No, only that performance is best if the relevant FS instances have already been preloaded into the FS cache. And people implementing filesystem connectors should do a better job at low-latency instantiation, even if it means async network startup threads and moving the blocking to the first FS API call instead.
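To make the cache behaviour concrete, here is a minimal sketch of the race in plain Java. It is a toy model, not Hadoop's actual cache code: `SlowFs`, `Cache`, `countCreations` and the key string are all hypothetical names, and the 100 ms sleep is an assumed stand-in for a slow `initialize()`. The point is only that a lookup which misses, creates the instance outside any lock, then keeps whichever instance was registered first will do redundant slow work under concurrency, while a preloaded cache will not.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

public class FsCacheRaceSketch {

    /** Hypothetical stand-in for a filesystem whose startup does slow network IO. */
    static final class SlowFs {
        SlowFs(AtomicInteger created) throws InterruptedException {
            Thread.sleep(100);          // assumed cost of a slow initialize()
            created.incrementAndGet();  // count every instance that gets built
        }
    }

    /** Toy cache: check, create outside any lock, then keep the first one registered. */
    static final class Cache {
        final Map<String, SlowFs> instances = new ConcurrentHashMap<>();
        final AtomicInteger created = new AtomicInteger();

        SlowFs get(String key) throws InterruptedException {
            SlowFs cached = instances.get(key);
            if (cached != null) {
                return cached;                        // cache hit: no startup cost
            }
            SlowFs fresh = new SlowFs(created);       // slow, and may run in many threads at once
            SlowFs winner = instances.putIfAbsent(key, fresh);
            return winner != null ? winner : fresh;   // losing instances are simply dropped
        }
    }

    /** Runs `threads` concurrent lookups of one key and returns how many instances were built. */
    static int countCreations(int threads, boolean preload) throws Exception {
        Cache cache = new Cache();
        String key = "user@s3a://example-bucket";     // hypothetical user + URI cache key
        if (preload) {
            cache.get(key);                           // warm the cache before going parallel
        }
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<SlowFs>> tasks = new ArrayList<>();
            for (int i = 0; i < threads; i++) {
                tasks.add(() -> cache.get(key));
            }
            SlowFs first = null;
            for (Future<SlowFs> f : pool.invokeAll(tasks)) {
                SlowFs fs = f.get();
                if (first == null) {
                    first = fs;
                }
                if (fs != first) {
                    throw new IllegalStateException("callers saw different instances");
                }
            }
        } finally {
            pool.shutdown();
        }
        return cache.created.get();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("cold cache, instances built: " + countCreations(8, false));
        System.out.println("preloaded cache, instances built: " + countCreations(8, true));
    }
}
```

With a preloaded cache exactly one instance is ever built. With a cold cache the count depends on thread timing and can be as high as the number of threads, which is the wasted startup work described above.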
