umehrot2 edited a comment on pull request #2651: URL: https://github.com/apache/hudi/pull/2651#issuecomment-808606053
@pengzhiwei2018 I was testing Hudi without this patch via Spark SQL, and I am a little confused. With Spark SQL, partition pruning already seems to work seamlessly for Hudi. Just start Spark SQL with:

```
spark-sql \
  --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
  --conf "spark.hadoop.mapreduce.input.pathFilter.class=org.apache.hudi.hadoop.HoodieROTablePathFilter" \
  --jars /usr/lib/hudi/hudi-spark-bundle.jar
```

Spark is able to get the partition schema from the catalog using `CatalogFileIndex` and do the partition pruning.

So is the partition pruning support we are adding here meant for datasource-based queries? I think pruning should already have worked via the Spark datasource for Hive-style partitioned tables too, because Spark tries to identify partition columns from the path, but I am not sure why it does not.

I want to understand clearly what we gain when this PR gets merged. Is it partition pruning for Spark datasource queries?
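To make the "partition columns from the path" point concrete, here is a minimal toy sketch in plain Python of how Hive-style `key=value` directory names can be turned into partition values and then used to skip directories. It is not Spark's or Hudi's actual implementation; the helper names (`parse_partition_values`, `prune_paths`) and the table layout under `/tmp/hudi_table` are hypothetical and exist only for illustration.

```python
from typing import Dict, List


def parse_partition_values(path: str) -> Dict[str, str]:
    """Extract key=value directory segments from a Hive-style partition path."""
    values: Dict[str, str] = {}
    for segment in path.strip("/").split("/"):
        key, sep, value = segment.partition("=")
        # Only segments of the form key=value count as partition columns.
        if sep and key and value:
            values[key] = value
    return values


def prune_paths(paths: List[str], filters: Dict[str, str]) -> List[str]:
    """Keep only the paths whose partition values match every equality filter."""
    kept = []
    for path in paths:
        values = parse_partition_values(path)
        if all(values.get(col) == val for col, val in filters.items()):
            kept.append(path)
    return kept


# Hypothetical table layout, for illustration only.
paths = [
    "/tmp/hudi_table/dt=2021-03-25/region=us",
    "/tmp/hudi_table/dt=2021-03-26/region=us",
    "/tmp/hudi_table/dt=2021-03-26/region=eu",
]

# Equivalent in spirit to: WHERE dt = '2021-03-26'
selected = prune_paths(paths, {"dt": "2021-03-26"})
```

With a layout like this, only the two `dt=2021-03-26` directories would need to be listed and read. If the table is not written with Hive-style partition paths, there are no `key=value` segments to parse, so this kind of path-based inference has nothing to work with.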
