[GitHub] [spark] rrusso2007 removed a comment on issue #24679: [SPARK-27807][SQL] Parallel resolve leaf statuses InMemoryFileIndex

GitBox Wed, 22 May 2019 12:10:15 -0700

rrusso2007 removed a comment on issue #24679: [SPARK-27807][SQL] Parallel 
resolve leaf statuses InMemoryFileIndex
URL: https://github.com/apache/spark/pull/24679#issuecomment-494906133
 
 
   In my pull request #24672 I am switching to a method that returns 
LocatedFileStatus, which makes these lookups unnecessary. For HDFS 
specifically, the optimization in this pull request would then not be needed. 
If we still want an optimization like this one for other file systems, we 
should first filter out the entries that are already LocatedFileStatus instead 
of detecting them inside the parallel step. If every entry is already located, 
there is no need to create a parallel collection to resolve them. 
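
   A minimal sketch of the "filter first" idea, for illustration only. The 
Status and LocatedStatus classes below are hypothetical stand-ins for Hadoop's 
FileStatus and LocatedFileStatus so that the example is self-contained; the 
class and method names are assumptions and are not taken from either pull 
request.

```java
import java.util.ArrayList;
import java.util.List;

final class LocatedFirst {
    // Hypothetical stand-in for org.apache.hadoop.fs.FileStatus.
    static class Status {
        final String path;
        Status(String path) { this.path = path; }
    }

    // Hypothetical stand-in for org.apache.hadoop.fs.LocatedFileStatus,
    // i.e. a status that already carries its block locations.
    static class LocatedStatus extends Status {
        LocatedStatus(String path) { super(path); }
    }

    /** Returns only the statuses that still need a block-location lookup. */
    static List<Status> needingResolution(List<Status> statuses) {
        List<Status> missing = new ArrayList<>();
        for (Status s : statuses) {
            if (!(s instanceof LocatedStatus)) {
                missing.add(s);
            }
        }
        return missing;
    }

    /** True when every status is already located, so no parallel step is needed. */
    static boolean canSkipParallelResolve(List<Status> statuses) {
        return needingResolution(statuses).isEmpty();
    }
}
```

   With this shape, the cheap instanceof check runs sequentially, and any 
parallel resolution would only be set up for the (possibly empty) remainder.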
   
   On another note, when resolving multiple paths in parallel, the existing 
code uses a Spark job for parallelization rather than a parallel collection. 
A parallel collection like this risks unexpectedly launching many threads in 
the driver, whereas a Spark job offloads the work to the executors, which 
have allocated cores.
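
   To make the driver-side concern concrete, here is a rough sketch of one way 
to cap the number of threads if the work did stay in the driver. This is not 
what the comment proposes (it points to offloading to executors instead), and 
the class name, method name, and maxThreads parameter are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

final class BoundedDriverResolve {
    /**
     * Applies `resolve` to every input using at most `maxThreads` threads,
     * returning results in the same order as the inputs.
     */
    static <T, R> List<R> resolveAll(List<T> inputs, Function<T, R> resolve, int maxThreads) {
        // A fixed-size pool puts a hard upper bound on driver threads.
        ExecutorService pool = Executors.newFixedThreadPool(maxThreads);
        try {
            List<Future<R>> futures = new ArrayList<>();
            for (T input : inputs) {
                Callable<R> task = () -> resolve.apply(input);
                futures.add(pool.submit(task));
            }
            List<R> results = new ArrayList<>();
            for (Future<R> f : futures) {
                results.add(f.get()); // blocks until this task has finished
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("resolution failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

   A cap like this limits the thread count but still spends driver cores on 
the lookups, which is the trade-off the paragraph above is pointing at.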


