steveloughran commented on issue #25899: [SPARK-29089][SQL] Parallelize 
blocking FileSystem calls in DataSource#checkAndGlobPathIfNecessary
URL: https://github.com/apache/spark/pull/25899#issuecomment-585908704
 
 
   LGTM,
   
   
   There's one thing to be aware of, which is that `Path.getFileSystem(conf)` maps to 
`FileSystem.get(path.toUri(), conf)`, which looks in a shared cache for 
existing instances (good), then instantiates one if there isn't one for that 
user + URI. And if that FS instance takes a while to start up (i.e. 
`FileSystem.initialize()` is slow), then multiple threads will all end up 
trying to create instances, then discard all but one afterwards. 
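
   To make that race concrete, here is a minimal sketch in plain Java. It does not use Hadoop at all: `SlowFs`, `FsCacheRaceSketch`, the 200 ms delay and the `"s3a://bucket"` key are all hypothetical stand-ins, and the check-then-create-then-`putIfAbsent` shape is only my simplification of the caching behaviour described above, not the actual `FileSystem.Cache` code.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

class FsCacheRaceSketch {
    /** Hypothetical stand-in for a filesystem client with a slow initialize(). */
    static final class SlowFs {
        SlowFs(AtomicInteger created, long initMillis) throws InterruptedException {
            created.incrementAndGet();
            Thread.sleep(initMillis); // simulated network IO during startup
        }
    }

    final ConcurrentHashMap<String, SlowFs> cache = new ConcurrentHashMap<>();
    final AtomicInteger created = new AtomicInteger();

    /** Look in the cache; on a miss, build an instance outside any lock. */
    SlowFs get(String key) throws InterruptedException {
        SlowFs cached = cache.get(key);
        if (cached != null) {
            return cached;
        }
        SlowFs fresh = new SlowFs(created, 200);       // every racing thread pays this cost
        SlowFs winner = cache.putIfAbsent(key, fresh); // only the first put is kept
        return winner != null ? winner : fresh;        // losers discard their instance
    }

    /** Runs concurrent lookups; returns how many distinct instances callers received. */
    int lookupConcurrently(int threads) {
        Set<SlowFs> seen = ConcurrentHashMap.newKeySet();
        CountDownLatch start = new CountDownLatch(1);
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                try {
                    start.await(); // release all workers at the same moment
                    seen.add(get("s3a://bucket"));
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            workers[i].start();
        }
        start.countDown();
        try {
            for (Thread t : workers) {
                t.join();
            }
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return seen.size();
    }

    public static void main(String[] args) {
        FsCacheRaceSketch sketch = new FsCacheRaceSketch();
        int distinct = sketch.lookupConcurrently(8);
        System.out.println("instances created: " + sketch.created.get());
        System.out.println("distinct instances returned: " + distinct);
    }
}
```

   Every caller ends up with the same instance, but the created count can be as high as the number of threads, depending on scheduling. That gap is the wasted startup work.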
   
   Hence https://github.com/apache/hadoop/pull/1838, which removes some network 
IO in `S3AFileSystem.initialize()` by giving you the option of skipping the 
check that the bucket exists.
   
   Does that mean there's anything wrong with this PR? No, only that 
performance is best if the relevant FS instances have already been preloaded 
into the FS cache. And those people implementing filesystem connectors should 
do a better job at low-latency instantiation, even if it means async network 
startup threads and moving the blocking to the first FS API call instead.
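
   As a rough illustration of the async startup idea, here is a sketch in plain Java. `LazyStartConnectorSketch`, its `describe` method and the fake 200 ms handshake are hypothetical names invented for this example; no real connector is claimed to work this way.

```java
import java.util.concurrent.CompletableFuture;

class LazyStartConnectorSketch {
    // Holds the result of the slow startup work once it completes.
    private final CompletableFuture<String> session;

    LazyStartConnectorSketch(String bucket) {
        // Return immediately; the slow handshake runs on a background thread.
        this.session = CompletableFuture.supplyAsync(() -> {
            try {
                Thread.sleep(200); // simulated network IO
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return "session-for-" + bucket;
        });
    }

    /** First API call: blocks here, not in the constructor, until startup finishes. */
    String describe(String path) {
        return session.join() + ":" + path;
    }

    public static void main(String[] args) {
        long t0 = System.nanoTime();
        LazyStartConnectorSketch fs = new LazyStartConnectorSketch("bucket");
        long constructMs = (System.nanoTime() - t0) / 1_000_000;
        String result = fs.describe("/data");
        long totalMs = (System.nanoTime() - t0) / 1_000_000;
        System.out.println("constructed in ~" + constructMs + " ms");
        System.out.println("first call returned " + result + " after ~" + totalMs + " ms");
    }
}
```

   With this shape, threads that race to instantiate the connector lose very little, because construction is cheap and the wait only happens on the first real call.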
   
   
   
   
