hudi-bot opened a new issue, #16326:
URL: https://github.com/apache/hudi/issues/16326
"Test Call rollback_to_instant Procedure with refreshTable"
The test fails if a projection is added to the query plan. It does not
currently fail because the projection is not applied for non-partitioned
tables. Adding the projection prevents the RDD from being cached.
Query plans:
Without the projection, caching works:
{code:java}
== Parsed Logical Plan ==
'Project ['id]
+- SubqueryAlias spark_catalog.default.h0
   +- Relation default.h0[_hoodie_commit_time#547,_hoodie_commit_seqno#548,_hoodie_record_key#549,_hoodie_partition_path#550,_hoodie_file_name#551,id#552,name#553,price#554,ts#555L] parquet

== Analyzed Logical Plan ==
id: int
Project [id#552]
+- SubqueryAlias spark_catalog.default.h0
   +- Relation default.h0[_hoodie_commit_time#547,_hoodie_commit_seqno#548,_hoodie_record_key#549,_hoodie_partition_path#550,_hoodie_file_name#551,id#552,name#553,price#554,ts#555L] parquet

== Optimized Logical Plan ==
InMemoryRelation [id#552], StorageLevel(disk, memory, deserialized, 1 replicas)
   +- *(1) ColumnarToRow
      +- FileScan parquet default.h0[id#552] Batched: true, DataFilters: [], Format: Parquet, Location: HoodieFileIndex(1 paths)[file:/private/var/folders/d0/l7mfhzl1661byhh3mbyg5fv00000gn/T/spark-87b3..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:int>

== Physical Plan ==
InMemoryTableScan [id#552]
   +- InMemoryRelation [id#552], StorageLevel(disk, memory, deserialized, 1 replicas)
         +- *(1) ColumnarToRow
            +- FileScan parquet default.h0[id#552] Batched: true, DataFilters: [], Format: Parquet, Location: HoodieFileIndex(1 paths)[file:/private/var/folders/d0/l7mfhzl1661byhh3mbyg5fv00000gn/T/spark-87b3..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:int>
{code}
With the projection, no caching:
{code:java}
== Parsed Logical Plan ==
'Project ['id]
+- SubqueryAlias spark_catalog.default.h0
   +- Relation default.h0[_hoodie_commit_time#539,_hoodie_commit_seqno#540,_hoodie_record_key#541,_hoodie_partition_path#542,_hoodie_file_name#543,id#544,name#545,price#546,ts#547L] parquet

== Analyzed Logical Plan ==
id: int
Project [id#544]
+- SubqueryAlias spark_catalog.default.h0
   +- Relation default.h0[_hoodie_commit_time#539,_hoodie_commit_seqno#540,_hoodie_record_key#541,_hoodie_partition_path#542,_hoodie_file_name#543,id#544,name#545,price#546,ts#547L] parquet

== Optimized Logical Plan ==
Project [id#544]
+- Relation default.h0[_hoodie_commit_time#539,_hoodie_commit_seqno#540,_hoodie_record_key#541,_hoodie_partition_path#542,_hoodie_file_name#543,id#544,name#545,price#546,ts#547L] parquet

== Physical Plan ==
*(1) ColumnarToRow
+- FileScan parquet default.h0[id#544] Batched: true, DataFilters: [], Format: Parquet, Location: HoodieFileIndex(1 paths)[file:/private/var/folders/d0/l7mfhzl1661byhh3mbyg5fv00000gn/T/spark-8c60..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:int>
{code}
## JIRA info
- Link: https://issues.apache.org/jira/browse/HUDI-7162
- Type: Bug
- Epic: https://issues.apache.org/jira/browse/HUDI-6568
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]