[GitHub] [spark] guanziyue opened a new pull request, #37498: [SPARK-40058][CORE] Avoid filter file path twice in HadoopFSUtils

GitBox Fri, 12 Aug 2022 09:40:40 -0700


guanziyue opened a new pull request, #37498:
URL: https://github.com/apache/spark/pull/37498


   ### What changes were proposed in this pull request?
   Refactor the path filter logic in HadoopFSUtils so that the same filter is 
not applied to a file more than once. The method listLeafFiles calls itself 
recursively, and the filter runs in a single thread on the driver over all 
files, so repeated filtering becomes a performance problem when the filter 
logic is expensive. 
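
   As a rough illustration of the idea (not the actual Spark code), the sketch below models a recursive leaf-file listing over a hypothetical in-memory tree. The names `list_leaf_files`, `counting_filter`, and the tree layout are assumptions made for this example. Each path is passed to the filter once, when it is first seen, and the results of recursive calls are not filtered again.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical in-memory tree: a dict is a directory, None marks a file.
Tree = Dict[str, Optional[dict]]


def list_leaf_files(tree: Tree, prefix: str,
                    accept: Callable[[str], bool]) -> List[str]:
    """Recursively list files, calling `accept` exactly once per path."""
    leaves: List[str] = []
    for name, child in tree.items():
        path = f"{prefix}/{name}"
        # Filter each entry as soon as it is first seen.
        if not accept(path):
            continue
        if child is None:
            leaves.append(path)
        else:
            # The recursive result is already filtered, so it is
            # appended without a second accept() call.
            leaves.extend(list_leaf_files(child, path, accept))
    return leaves


calls: List[str] = []


def counting_filter(path: str) -> bool:
    """Record every invocation and reject names starting with '_'."""
    calls.append(path)
    return not path.rsplit("/", 1)[-1].startswith("_")


tree: Tree = {"a": {"b": {"part-0": None, "_SUCCESS": None}}, "c.txt": None}
files = list_leaf_files(tree, "", counting_filter)
```

In this toy model, a version that also re-filtered the combined result at every recursion level would call the filter on a deeply nested file once per level, which is the kind of repeated work the description above refers to.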
   
   
   ### Why are the changes needed?
   Apply the filter to each FileStatus only once, when it is first encountered.
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   No tests were added because the change is simple.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


