ConeyLiu commented on issue #24237: [SPARK-27319][SQL] Filter out dir based on PathFilter before listing them
URL: https://github.com/apache/spark/pull/24237#issuecomment-477830653

Hi @srowen @HyukjinKwon, thanks for the review.

Before this patch, we need to list all files under the directories and then filter them based on the `PathFilter`. A Spark job is triggered for the listing if the number of directories is very large.

> how much roughly does it improve the perf?

I didn't measure the perf gain. I found this issue through the following case: I have a lot of data stored in HDFS in the following layout:

```
hdfs://name:port/root-dir/timestamp=2019-01-01/***
hdfs://name:port/root-dir/timestamp=2019-01-02/***
....
hdfs://name:port/root-dir/timestamp=2019-03-04/***
```

Sometimes I only need to read some of the directories, so I add a PathFilter in the conf. However, Spark still needs to trigger a job to list all of them, even if only one directory is needed. That is the scenario I added in the unit test.
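To make the difference concrete, here is a toy model of the two orderings described above. It is only a sketch under stated assumptions, not Spark's or Hadoop's actual code: the in-memory `TREE`, the `accept` predicate, and the helpers `list_dir`, `list_then_filter` and `filter_then_list` are all hypothetical names made up for illustration. It counts how many directory listings each ordering performs.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical in-memory stand-in for the HDFS layout above: directory -> file names.
TREE: Dict[str, List[str]] = {
    "root-dir/timestamp=2019-01-01": ["part-0", "part-1"],
    "root-dir/timestamp=2019-01-02": ["part-0"],
    "root-dir/timestamp=2019-03-04": ["part-0", "part-1", "part-2"],
}


def accept(path: str) -> bool:
    """Hypothetical path filter: keep only paths under one timestamp directory."""
    return "timestamp=2019-01-02" in path


def list_dir(directory: str) -> List[str]:
    """Return the full paths of the files in one directory (one listing call)."""
    return [f"{directory}/{name}" for name in TREE[directory]]


def list_then_filter(
    dirs: List[str], keep: Callable[[str], bool]
) -> Tuple[List[str], int]:
    """List every directory first, then filter the resulting file paths."""
    calls = 0
    files: List[str] = []
    for d in dirs:
        calls += 1  # every directory is listed, wanted or not
        files.extend(p for p in list_dir(d) if keep(p))
    return files, calls


def filter_then_list(
    dirs: List[str], keep: Callable[[str], bool]
) -> Tuple[List[str], int]:
    """Filter the directories first, then list only the ones that pass."""
    calls = 0
    files: List[str] = []
    for d in dirs:
        if not keep(d):
            continue  # rejected directories are never listed
        calls += 1
        files.extend(p for p in list_dir(d) if keep(p))
    return files, calls


if __name__ == "__main__":
    all_dirs = sorted(TREE)
    print(list_then_filter(all_dirs, accept))
    print(filter_then_list(all_dirs, accept))
```

In this model both orderings return the same files, but the second one issues a listing call only for the directories that pass the filter, which is the saving I am after when only one of many directories is needed.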
