[jira] [Updated] (IMPALA-12861) File formats are confused when Iceberg tables has mixed formats

Peter Rozsa (Jira) Wed, 29 Jan 2025 05:25:19 -0800


     [ 
https://issues.apache.org/jira/browse/IMPALA-12861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Peter Rozsa updated IMPALA-12861:
---------------------------------
    Fix Version/s: Impala 4.5.0

> File formats are confused when Iceberg tables has mixed formats
> ---------------------------------------------------------------
>
>                 Key: IMPALA-12861
>                 URL: https://issues.apache.org/jira/browse/IMPALA-12861
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Frontend
>    Affects Versions: Impala 4.3.0
>            Reporter: Gabor Kaszab
>            Assignee: Peter Rozsa
>            Priority: Major
>              Labels: impala-iceberg
>             Fix For: Impala 4.5.0
>
>         Attachments: multi_file_table_crash
>
>
> *Repro steps:*
> create table mixed_ice (i int, year int) partitioned by spec (year) stored as iceberg tblproperties('format-version'='2');
>  
> 1) populate one partition with Impala (parquet)
> insert into mixed_ice values (1, 2024), (2, 2024);
>  
> 2) change the write format:
> alter table mixed_ice set tblproperties ('write.format.default'='orc');
>  
> 3) populate another partition with Hive (orc)
> insert into mixed_ice values (1, 2025), (2, 2025), (3, 2025);
>  
> 4) then query just the parquet partition:
> explain select * from mixed_ice where year = 2024;
> {code:java}
> | F01:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1                                    |
> | Per-Host Resources: mem-estimate=4.02MB mem-reservation=4.00MB thread-reservation=1      |
> |   PLAN-ROOT SINK                                                                         |
> |   |  output exprs: default.mixed_ice.i, default.mixed_ice.year                           |
> |   |  mem-estimate=4.00MB mem-reservation=4.00MB spill-buffer=2.00MB thread-reservation=0 |
> |   |                                                                                      |
> |   01:EXCHANGE [UNPARTITIONED]                                                            |
> |      mem-estimate=16.00KB mem-reservation=0B thread-reservation=0                        |
> |      tuple-ids=0 row-size=8B cardinality=2                                               |
> |      in pipelines: 00(GETNEXT)                                                           |
> |                                                                                          |
> | F00:PLAN FRAGMENT [RANDOM] hosts=1 instances=1                                           |
> | Per-Host Resources: mem-estimate=64.05MB mem-reservation=32.00KB thread-reservation=2    |
> |   DATASTREAM SINK [FRAGMENT=F01, EXCHANGE=01, UNPARTITIONED]                             |
> |   |  mem-estimate=48.00KB mem-reservation=0B thread-reservation=0                        |
> |   00:SCAN HDFS [default.mixed_ice, RANDOM]                                               |
> |      HDFS partitions=1/1 files=1 size=602B                                               |
> |      Iceberg snapshot id: 4964066258730898133                                            |
> |      skipped Iceberg predicates: `year` = CAST(2024 AS INT)                              |
> |      stored statistics:                                                                  |
> |        table: rows=5 size=945B                                                           |
> |        columns: unavailable                                                              |
> |      extrapolated-rows=disabled max-scan-range-rows=5                                    |
> |      file formats: [ORC, PARQUET]                                                        |
> |      mem-estimate=64.00MB mem-reservation=32.00KB thread-reservation=1                   |
> |      tuple-ids=0 row-size=8B cardinality=2                                               |
> |      in pipelines: 00(GETNEXT)                                                           |
> +------------------------------------------------------------------------------------------+
>  {code}
> Note the "file formats: [ORC, PARQUET]" line, even though this query only 
> reads a Parquet file.
>  
> *Some analysis:*
> When IcebergScanNode [is 
> created|https://github.com/apache/impala/blob/cc63757c10cdf70e511596c0ded7d20674af2c4b/fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java#L129]
>  it holds the correct information about file formats (Parquet).
> Later on, the parent class, HdfsScanNode, also tries to populate the file 
> formats [here|#L513].
>  
> It uses what 
> [getSampledOrRawPartitions()|https://github.com/apache/impala/blob/cc63757c10cdf70e511596c0ded7d20674af2c4b/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java#L431]
>  returns. In this use case 'sampledPartitions_' is null, so it returns 
> 'partitions_'.
>  
> Apparently, this 'partitions_' member holds the partition with the ORC file, 
> so it adds ORC to fileFormats_. Unfortunately, getSampledOrRawPartitions() 
> is called in multiple locations within HdfsScanNode, and each call returns 
> the wrong partition.
> *Next steps:*
> Check what other issues getSampledOrRawPartitions() can cause with tables 
> that contain multiple file formats. Also check if we can populate 
> 'partitions_' properly.
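
The mismatch described in the analysis can be sketched with a small standalone model. This is only an illustration, not Impala code: the class MixedFormatSketch, the types DataFile and Partition, and both methods are hypothetical names. It also assumes that the partition entry carries a single declared format (ORC here) regardless of which files the query selected, which is one possible reading of the analysis rather than a confirmed detail.

```java
import java.util.Collections;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

// Illustrative model only; none of these types are Impala classes.
class MixedFormatSketch {
  enum FileFormat { ORC, PARQUET }

  // Hypothetical stand-in for a data file chosen by Iceberg scan planning.
  static final class DataFile {
    final String path;
    final FileFormat format;
    DataFile(String path, FileFormat format) {
      this.path = path;
      this.format = format;
    }
  }

  // Hypothetical stand-in for an entry of 'partitions_'. Assumption: it has
  // one declared format that does not depend on the selected files.
  static final class Partition {
    final FileFormat declaredFormat;
    Partition(FileFormat declaredFormat) {
      this.declaredFormat = declaredFormat;
    }
  }

  // File-level pass: collect only the formats of files the query will read.
  static Set<FileFormat> formatsFromSelectedFiles(List<DataFile> selected) {
    Set<FileFormat> formats = EnumSet.noneOf(FileFormat.class);
    for (DataFile f : selected) {
      formats.add(f.format);
    }
    return formats;
  }

  // Partition-level pass layered on top: adds each partition's declared
  // format to whatever the file-level pass already found.
  static Set<FileFormat> addFormatsFromPartitions(
      Set<FileFormat> fromFiles, List<Partition> partitions) {
    Set<FileFormat> formats = EnumSet.noneOf(FileFormat.class);
    formats.addAll(fromFiles);
    for (Partition p : partitions) {
      formats.add(p.declaredFormat);
    }
    return formats;
  }

  public static void main(String[] args) {
    // Only the Parquet file of the year=2024 partition is selected.
    List<DataFile> selected = Collections.singletonList(
        new DataFile("year=2024/data.parquet", FileFormat.PARQUET));
    // The partition list still reports ORC.
    List<Partition> partitions =
        Collections.singletonList(new Partition(FileFormat.ORC));

    Set<FileFormat> fromFiles = formatsFromSelectedFiles(selected);
    Set<FileFormat> merged = addFormatsFromPartitions(fromFiles, partitions);
    System.out.println("from selected files:  " + fromFiles); // [PARQUET]
    System.out.println("after partition pass: " + merged);    // [ORC, PARQUET]
  }
}
```

In this model the file-level result matches what the query actually reads, while the partition-level pass reintroduces ORC. That mirrors the "file formats: [ORC, PARQUET]" line in the plan, under the stated assumption.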



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

