ConeyLiu commented on issue #24237: [SPARK-27319][SQL] Filter out dir based on 
PathFilter before listing them
URL: https://github.com/apache/spark/pull/24237#issuecomment-477830653
 
 
   Hi @srowen @HyukjinKwon, thanks for the review. Before this patch, we had to 
list all files under the directories and only then filter them based on the 
`PathFilter`. A Spark job is triggered for the listing if the number of 
directories is very large. 
   
   > how much roughly does it improve the perf?
   
   I didn't measure the perf gain precisely. I ran into this in the following 
scenario:
   I have a lot of data stored in HDFS in the following format:
   ```
   hdfs://name:port/root-dir/timestamp=2019-01-01/***
   hdfs://name:port/root-dir/timestamp=2019-01-02/***
   ....
   hdfs://name:port/root-dir/timestamp=2019-03-04/***
   ```
   And sometimes I only need to read some of the directories, so I set a 
PathFilter in the conf. However, Spark still needs to trigger a job to list all 
of them, even when only one directory is needed. That is the case I added in the UT.
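
To make the difference concrete, here is a rough, self-contained sketch of the two orderings. It is not Spark's or Hadoop's code: the in-memory `TREE`, the `accept` function, and both helper functions are hypothetical stand-ins, and the sketch assumes the filter accepts a directory whenever it accepts the files under it.

```python
# Hypothetical in-memory stand-in for the HDFS layout above: directory -> files.
TREE = {
    "root-dir/timestamp=2019-01-01": ["part-0", "part-1"],
    "root-dir/timestamp=2019-01-02": ["part-0", "part-1"],
    "root-dir/timestamp=2019-03-04": ["part-0", "part-1"],
}


def accept(path):
    """Hypothetical path filter: keep only paths under one timestamp directory."""
    return "timestamp=2019-01-02" in path


def list_then_filter(tree, accept):
    """Old ordering: list every directory, then filter the listed files."""
    listed_dirs = 0
    kept = []
    for directory, files in tree.items():
        listed_dirs += 1  # every directory is listed, even unwanted ones
        for name in files:
            path = f"{directory}/{name}"
            if accept(path):
                kept.append(path)
    return kept, listed_dirs


def filter_then_list(tree, accept):
    """New ordering: drop rejected directories first, list only the rest."""
    listed_dirs = 0
    kept = []
    for directory, files in tree.items():
        if not accept(directory):
            continue  # rejected directory is never listed
        listed_dirs += 1
        for name in files:
            path = f"{directory}/{name}"
            if accept(path):
                kept.append(path)
    return kept, listed_dirs
```

Both functions return the same files, but the second one lists only the directories that pass the filter, which is why the number of listings no longer grows with the total number of directories.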

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

