srowen commented on a change in pull request #24237: [SPARK-27319][SQL] Filter out dir based on PathFilter before listing them
URL: https://github.com/apache/spark/pull/24237#discussion_r270670227
 
 

 ##########
 File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala
 ##########
 @@ -167,36 +167,39 @@ object InMemoryFileIndex extends Logging {
       hadoopConf: Configuration,
       filter: PathFilter,
       sparkSession: SparkSession): Seq[(Path, Seq[FileStatus])] = {
+    // Filter out the directory before listing it leaf files
+    val filteredPaths = paths.filter(filter.accept(_))
 
     // Short-circuits parallel listing when serial listing is likely to be faster.
-    if (paths.size <= sparkSession.sessionState.conf.parallelPartitionDiscoveryThreshold) {
-      return paths.map { path =>
+    if (filteredPaths.size <= sparkSession.sessionState.conf.parallelPartitionDiscoveryThreshold) {
 
 Review comment:
   I don't think debug info helps if it makes the behavior wrong.
   
   The filter is intended to match on the leaf file paths only, right? I'd also 
not expect that the caller passes top-level paths that are meant to be filtered 
out. Any recursive calls to this method would have already applied the filter, 
right?
   
   I think I'm more wondering whether this filter would ever do anything. The 
paths are like "/my/data/path", and the filter comes from the user setting 
`mapreduce.input.pathFilter.class`; I'm not sure how often, or how, that is 
used. It would typically match types of files by extension or something, I 
guess, or match intermediate directories, which would do nothing if applied 
here.
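
   To make the concern concrete, here is a minimal sketch in plain Scala. It does not use Hadoop or Spark: `SimplePathFilter` is a hypothetical stand-in for `org.apache.hadoop.fs.PathFilter`, and `parquetOnly`, `skipTmpDirs`, and the file names are made up for illustration. It only shows how two plausible user filters treat a top-level path versus leaf file paths.

```scala
// Hypothetical stand-in for org.apache.hadoop.fs.PathFilter, using String paths
// so the sketch runs without Hadoop on the classpath.
trait SimplePathFilter {
  def accept(path: String): Boolean
}

object FilterSketch {
  // Assumed user filter #1: accept files by extension.
  val parquetOnly: SimplePathFilter = new SimplePathFilter {
    def accept(path: String): Boolean = path.endsWith(".parquet")
  }

  // Assumed user filter #2: reject anything under an intermediate "_tmp" directory.
  val skipTmpDirs: SimplePathFilter = new SimplePathFilter {
    def accept(path: String): Boolean = !path.split("/").contains("_tmp")
  }

  // A top-level path as a caller might pass it, and illustrative leaf files below it.
  val rootPaths: Seq[String] = Seq("/my/data/path")
  val leafFiles: Seq[String] = Seq(
    "/my/data/path/part-0.parquet",
    "/my/data/path/_tmp/part-1.parquet",
    "/my/data/path/notes.txt"
  )

  def main(args: Array[String]): Unit = {
    // Applied to the top-level path, the extension filter rejects the directory itself.
    println(rootPaths.filter(parquetOnly.accept))
    // Applied to the top-level path, the intermediate-directory filter changes nothing.
    println(rootPaths.filter(skipTmpDirs.accept))
    // Applied to leaf files, both filters select what the user presumably meant.
    println(leafFiles.filter(parquetOnly.accept))
    println(leafFiles.filter(skipTmpDirs.accept))
  }
}
```

   Under these assumptions, filtering the top-level paths is either a no-op or drops the whole input, which is why filtering at the caller level seems like the safer place for a "subset of dirs" use case.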
   
   Is the use case just setting a filter to apply an operation to a subset of 
dirs? Why not filter them out at the caller level?
