rkkalluri commented on issue #6068: URL: https://github.com/apache/hudi/issues/6068#issuecomment-1179538626
If you check the SQL tab of the Spark UI, or the explain plan for both statements, you will see that partition pruning happens in both cases and only files from one partition are read to satisfy your query. What differs is the planning stage: the Spark catalog first needs to know all the partitions that exist so that it can filter or prune them. That is exactly the problem Hudi metadata solves. With it, planning does not have to list all partitions, hence the increased performance for you.

<img width="478" alt="Screen Shot 2022-07-09 at 7 48 35 AM" src="https://user-images.githubusercontent.com/3401900/178106590-953c045c-9490-480d-ab4e-033789085672.png">
<img width="478" alt="Screen Shot 2022-07-09 at 7 47 45 AM" src="https://user-images.githubusercontent.com/3401900/178106592-fe0243f3-02c1-47f0-b88e-10bf1da270f9.png">

With the metadata enabled:

    >>> spark.read.format("hudi").option("hoodie.metadata.enable","true").load(basePath).filter("part=1").explain()
    == Physical Plan ==
    *(1) ColumnarToRow
    +- FileScan parquet [_hoodie_commit_time#134,_hoodie_commit_seqno#135,_hoodie_record_key#136,_hoodie_partition_path#137,_hoodie_file_name#138,id#139L,combine#140L,part#141L] Batched: true, DataFilters: [], Format: Parquet, Location: HoodieFileIndex(1 paths)[/tmp/test_table], PartitionFilters: [(part#141L = 1)], PushedFilters: [], ReadSchema: struct<_hoodie_commit_time:string,_hoodie_commit_seqno:string,_hoodie_record_key:string,_hoodie_p...

With the metadata disabled:

    scala> spark.read.format("hudi").option("hoodie.metadata.enable","false").load(basePath).filter("part=1").explain(false)
    == Physical Plan ==
    *(1) ColumnarToRow
    +- FileScan parquet [_hoodie_commit_time#65,_hoodie_commit_seqno#66,_hoodie_record_key#67,_hoodie_partition_path#68,_hoodie_file_name#69,id#70L,combine#71L,part#72L] Batched: true, DataFilters: [], Format: Parquet, Location: HoodieFileIndex(1 paths)[/tmp/test_table], PartitionFilters: [(part#72L = 1)], PushedFilters: [], ReadSchema: struct<_hoodie_commit_time:string,_hoodie_commit_seqno:string,_hoodie_record_key:string,_hoodie_p...

Both plans show the same `PartitionFilters`, so the scan itself is pruned identically in either case.
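Since the two physical plans are identical, the difference has to be measured in planning time rather than read off the plan. Here is a rough sketch of how that could be timed from a PySpark shell. The helper name `time_planning` and its parameters are made up for illustration and are not part of Hudi or Spark. It also assumes that the partition listing has happened by the time `explain()` returns, which may not hold on every Hudi version.

```python
import time


def time_planning(spark, base_path, metadata_enabled, partition_filter="part=1"):
    """Return the seconds spent building and planning the filtered read.

    Hypothetical helper: `spark` is an existing SparkSession and `base_path`
    is the Hudi table location. No table data is scanned here.
    """
    start = time.perf_counter()
    df = (
        spark.read.format("hudi")
        # Same option as in the comment above; the value is passed as "true"/"false".
        .option("hoodie.metadata.enable", str(metadata_enabled).lower())
        .load(base_path)
        .filter(partition_filter)
    )
    # explain() forces Spark to produce the physical plan without running the scan.
    # Assumption: partition listing is done during load() or planning.
    df.explain()
    return time.perf_counter() - start
```

For example, `time_planning(spark, basePath, True)` and `time_planning(spark, basePath, False)` can be compared on a table with many partitions. Repeating each call a few times is advisable, because the first run also pays for JVM warm-up. If a given version defers the listing until execution, timing an action such as `count()` instead would probably be more representative.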
