Jonathan Vexler created HUDI-7162:
-------------------------------------
Summary: RDDs don't cache in some situations with new filegroup
reader + new parquet file format
Key: HUDI-7162
URL: https://issues.apache.org/jira/browse/HUDI-7162
Project: Apache Hudi
Issue Type: Bug
Components: spark, spark-sql
Reporter: Jonathan Vexler
"Test Call rollback_to_instant Procedure with refreshTable"
fails if a projection is added to the query plan. The test does not currently
fail because we do not apply the projection for non-partitioned tables. Adding
the projection prevents the RDD from being cached.
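The scenario can be sketched roughly as below. This is a hedged repro sketch, not the actual test code: it assumes a running SparkSession with the Hudi Spark bundle on the classpath; the table name h0 and column names match the plans below, but the DDL and values are illustrative.
{code:scala}
// Illustrative only: non-partitioned Hudi table, as described above.
spark.sql("create table h0 (id int, name string, price double, ts long) using hudi")
spark.sql("insert into h0 values (1, 'a1', 10.0, 1000)")

// Cache the table, then query it with a projection.
spark.sql("cache table h0")
spark.sql("select id from h0").explain(true)

// Expected when caching works: the optimized plan contains an
// InMemoryRelation / InMemoryTableScan (first plan below).
// Observed with the extra Project node: the optimizer does not match the
// cached plan and falls back to a plain FileScan (second plan below).
{code}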
Query plans:
Without projection, caching works:
{code:java}
== Parsed Logical Plan ==
'Project ['id]
+- SubqueryAlias spark_catalog.default.h0
   +- Relation default.h0[_hoodie_commit_time#547,_hoodie_commit_seqno#548,_hoodie_record_key#549,_hoodie_partition_path#550,_hoodie_file_name#551,id#552,name#553,price#554,ts#555L] parquet

== Analyzed Logical Plan ==
id: int
Project [id#552]
+- SubqueryAlias spark_catalog.default.h0
   +- Relation default.h0[_hoodie_commit_time#547,_hoodie_commit_seqno#548,_hoodie_record_key#549,_hoodie_partition_path#550,_hoodie_file_name#551,id#552,name#553,price#554,ts#555L] parquet

== Optimized Logical Plan ==
InMemoryRelation [id#552], StorageLevel(disk, memory, deserialized, 1 replicas)
   +- *(1) ColumnarToRow
      +- FileScan parquet default.h0[id#552] Batched: true, DataFilters: [], Format: Parquet, Location: HoodieFileIndex(1 paths)[file:/private/var/folders/d0/l7mfhzl1661byhh3mbyg5fv00000gn/T/spark-87b3..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:int>

== Physical Plan ==
InMemoryTableScan [id#552]
   +- InMemoryRelation [id#552], StorageLevel(disk, memory, deserialized, 1 replicas)
         +- *(1) ColumnarToRow
            +- FileScan parquet default.h0[id#552] Batched: true, DataFilters: [], Format: Parquet, Location: HoodieFileIndex(1 paths)[file:/private/var/folders/d0/l7mfhzl1661byhh3mbyg5fv00000gn/T/spark-87b3..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:int>
{code}
With projection, no caching:
{code:java}
== Parsed Logical Plan ==
'Project ['id]
+- SubqueryAlias spark_catalog.default.h0
   +- Relation default.h0[_hoodie_commit_time#539,_hoodie_commit_seqno#540,_hoodie_record_key#541,_hoodie_partition_path#542,_hoodie_file_name#543,id#544,name#545,price#546,ts#547L] parquet

== Analyzed Logical Plan ==
id: int
Project [id#544]
+- SubqueryAlias spark_catalog.default.h0
   +- Relation default.h0[_hoodie_commit_time#539,_hoodie_commit_seqno#540,_hoodie_record_key#541,_hoodie_partition_path#542,_hoodie_file_name#543,id#544,name#545,price#546,ts#547L] parquet

== Optimized Logical Plan ==
Project [id#544]
+- Relation default.h0[_hoodie_commit_time#539,_hoodie_commit_seqno#540,_hoodie_record_key#541,_hoodie_partition_path#542,_hoodie_file_name#543,id#544,name#545,price#546,ts#547L] parquet

== Physical Plan ==
*(1) ColumnarToRow
+- FileScan parquet default.h0[id#544] Batched: true, DataFilters: [], Format: Parquet, Location: HoodieFileIndex(1 paths)[file:/private/var/folders/d0/l7mfhzl1661byhh3mbyg5fv00000gn/T/spark-8c60..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:int>
{code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)