srowen commented on issue #24237: [SPARK-27319][SQL] Filter out dir based on PathFilter before listing them
URL: https://github.com/apache/spark/pull/24237#issuecomment-478616785

Hm, I feel like I am still missing something about the implementation. @adrian-ionescu could I ask you to look at the logic here? I think you implemented a lot of the code in question.

There is some value in filtering out dirs as listLeafFiles / bulkListLeafFiles recurses through the tree, because a filter might try to match intermediate directories and can filter them. I'm worried that a filter like `.endsWith(".tmp")` might match dirs instead of leaf files. But otherwise is this a good optimization? While a user can filter out top-level dirs they don't actually want to examine, I think there's a decent point here about more complex filters on nested intermediate dirs.

If we're worried about matching intermediate directories for patterns intended to match leaf file paths, maybe it's possible to filter after listing dirs, but check the filter against the dir path plus "/" at the end if it doesn't already end in "/".
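
The closing suggestion above (test directories against their path plus a trailing "/") can be illustrated with a small sketch. This is not Spark's or Hadoop's code: it is plain Python over an in-memory tree, and `list_leaf_files`, `with_trailing_slash`, `slash_dirs`, `example_tree` and `example_filter` are all hypothetical names made up for the illustration. The filter is modeled as a function from a path string to a boolean, where `True` means "keep".

```python
from typing import Any, Callable, Dict, List

# A path filter: returns True to keep a path, False to drop it (assumed convention).
PathPredicate = Callable[[str], bool]


def with_trailing_slash(dir_path: str) -> str:
    """Append "/" to a directory path unless it already ends with one."""
    return dir_path if dir_path.endswith("/") else dir_path + "/"


def list_leaf_files(tree: Dict[str, Any], base: str, accept: PathPredicate,
                    slash_dirs: bool = True) -> List[str]:
    """Recursively list accepted leaf files under `base`.

    `tree` maps entry names to either None (a file) or a nested dict (a directory).
    With slash_dirs=True, directories are tested as "<path>/", so a pattern
    written for leaf files (e.g. a ".tmp" suffix) does not match them by accident.
    """
    leaves: List[str] = []
    for name, child in sorted(tree.items()):
        path = f"{base}/{name}"
        if child is None:
            # Leaf file: test the plain path.
            if accept(path):
                leaves.append(path)
        else:
            # Directory: optionally test "<path>/" and prune the subtree if rejected.
            probe = with_trailing_slash(path) if slash_dirs else path
            if accept(probe):
                leaves.extend(list_leaf_files(child, path, accept, slash_dirs))
    return leaves


# Hypothetical layout: a directory whose name happens to end in ".tmp",
# and a "_temporary" directory that should be skipped entirely.
example_tree = {
    "data": {
        "part-0.parquet": None,
        "part-1.tmp": None,
        "staging.tmp": {"part-2.parquet": None},
        "_temporary": {"part-3.parquet": None},
    }
}


def example_filter(path: str) -> bool:
    # Drop ".tmp" leaf files, and drop any directory named "_temporary".
    return not path.endswith(".tmp") and not path.endswith("/_temporary/")


print(list_leaf_files(example_tree, "", example_filter))
print(list_leaf_files(example_tree, "", example_filter, slash_dirs=False))
```

With the trailing slash, the `staging.tmp` directory is still traversed while `part-1.tmp` is dropped, and a filter that wants to target directories can do so explicitly by matching on the "/" suffix. Without it, the `.tmp` pattern prunes `staging.tmp` as a side effect, which is the concern raised in the comment.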
