srowen commented on issue #24237: [SPARK-27319][SQL] Filter out dir based on 
PathFilter before listing them
URL: https://github.com/apache/spark/pull/24237#issuecomment-478616785
 
 
   Hm, I feel like I am still missing something about the implementation. 
@adrian-ionescu could I ask you to look at the logic here? I think you 
implemented a lot of the code in question.
   
   There is some value in applying the filter to dirs as listLeafFiles / 
bulkListLeafFiles recurses through the tree, because a filter might be written 
to match intermediate directories and could prune them before they are listed. 
I'm worried, though, that a filter like `.endsWith(".tmp")`, intended for leaf 
files, might match dirs instead. But otherwise is this a good optimization?
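
   To make the worry concrete, here is a minimal sketch, not the Spark code. 
It assumes the filter is an exclusion (reject paths ending in ".tmp") and uses 
a hypothetical in-memory tree plus a plain `Predicate<String>` as stand-ins for 
the file system and Hadoop's `PathFilter`. The class and method names are made 
up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

class DirFilterConcern {
    // Hypothetical tree: directory path -> children. Paths that are not keys are leaf files.
    static final Map<String, List<String>> TREE = Map.of(
        "/data", List.of("/data/staging.tmp", "/data/part-0.parquet"),
        "/data/staging.tmp", List.of("/data/staging.tmp/part-1.parquet"));

    // Recursive listing that applies the filter to every child, directories included,
    // before descending. This is the behavior under discussion, in simplified form.
    static List<String> listLeafFiles(String dir, Predicate<String> accept) {
        List<String> leaves = new ArrayList<>();
        for (String child : TREE.getOrDefault(dir, List.of())) {
            if (!accept.test(child)) {
                continue; // rejected: a directory is pruned together with its whole subtree
            }
            if (TREE.containsKey(child)) {
                leaves.addAll(listLeafFiles(child, accept));
            } else {
                leaves.add(child);
            }
        }
        return leaves;
    }

    public static void main(String[] args) {
        // Filter written with leaf files in mind: skip anything ending in ".tmp".
        Predicate<String> notTmp = p -> !p.endsWith(".tmp");
        System.out.println(listLeafFiles("/data", notTmp));
    }
}
```

   With this filter the dir `/data/staging.tmp` is pruned, so 
`/data/staging.tmp/part-1.parquet` never shows up even though the file itself 
would pass the filter.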
   
   While a user can filter out top-level dirs they don't actually want to 
examine, I think there's a decent point here about more complex filters on 
nested intermediate dirs.
   
   If we're worried about patterns intended for leaf file paths matching 
intermediate directories, maybe it's possible to filter after listing dirs, 
but check the filter against the dir path with "/" appended if it doesn't 
already end in "/".
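
   Roughly what I have in mind, as a sketch on plain strings rather than 
Hadoop `Path` objects. `asDirPath` and `accept` are hypothetical helpers, and 
the `Predicate<String>` again stands in for a `PathFilter`. Whether a real 
filter would ever see the trailing "/" depends on how the path is built, so 
treat this as an illustration of the idea only.

```java
import java.util.function.Predicate;

class TrailingSlashFilter {
    // Append "/" to a directory path unless it already ends with one.
    static String asDirPath(String path) {
        return path.endsWith("/") ? path : path + "/";
    }

    // Directories are tested with a trailing "/", leaf files are tested as-is.
    static boolean accept(String path, boolean isDirectory, Predicate<String> filter) {
        return filter.test(isDirectory ? asDirPath(path) : path);
    }

    public static void main(String[] args) {
        Predicate<String> notTmp = p -> !p.endsWith(".tmp");
        // The dir is checked as "/data/staging.tmp/", so the suffix test no longer hits it.
        System.out.println(accept("/data/staging.tmp", true, notTmp));
        // A leaf file with the same suffix is still rejected.
        System.out.println(accept("/data/part-2.tmp", false, notTmp));
    }
}
```

   A filter that really does want to match dirs could then test for a suffix 
like ".tmp/" explicitly.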

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
