Gabor Kaszab created IMPALA-12861:
-------------------------------------
Summary: File formats are confused when Iceberg tables has mixed
formats
Key: IMPALA-12861
URL: https://issues.apache.org/jira/browse/IMPALA-12861
Project: IMPALA
Issue Type: Bug
Components: Frontend
Affects Versions: Impala 4.3.0
Reporter: Gabor Kaszab
*Repro steps:*
create table mixed_ice (i int, year int) partitioned by spec (year) stored as
iceberg tblproperties('format-version'='2');
1) populate one partition with Impala (parquet)
insert into mixed_ice values (1, 2024), (2, 2024);
2) change the write format:
alter table mixed_ice set tblproperties ('write.format.default'='orc');
3) populate another partition with Hive (orc)
insert into mixed_ice values (1, 2025), (2, 2025), (3, 2025);
4) then query just the parquet partition:
explain select * from mixed_ice where year = 2024;
{code:java}
| F01:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1
| Per-Host Resources: mem-estimate=4.02MB mem-reservation=4.00MB thread-reservation=1
| PLAN-ROOT SINK
| | output exprs: default.mixed_ice.i, default.mixed_ice.year
| | mem-estimate=4.00MB mem-reservation=4.00MB spill-buffer=2.00MB thread-reservation=0
| |
| 01:EXCHANGE [UNPARTITIONED]
| mem-estimate=16.00KB mem-reservation=0B thread-reservation=0
| tuple-ids=0 row-size=8B cardinality=2
| in pipelines: 00(GETNEXT)
|
| F00:PLAN FRAGMENT [RANDOM] hosts=1 instances=1
| Per-Host Resources: mem-estimate=64.05MB mem-reservation=32.00KB thread-reservation=2
| DATASTREAM SINK [FRAGMENT=F01, EXCHANGE=01, UNPARTITIONED]
| | mem-estimate=48.00KB mem-reservation=0B thread-reservation=0
| 00:SCAN HDFS [default.mixed_ice, RANDOM]
| HDFS partitions=1/1 files=1 size=602B
| Iceberg snapshot id: 4964066258730898133
| skipped Iceberg predicates: `year` = CAST(2024 AS INT)
| stored statistics:
| table: rows=5 size=945B
| columns: unavailable
| extrapolated-rows=disabled max-scan-range-rows=5
| file formats: [ORC, PARQUET]
| mem-estimate=64.00MB mem-reservation=32.00KB thread-reservation=1
| tuple-ids=0 row-size=8B cardinality=2
| in pipelines: 00(GETNEXT)
+------------------------------------------------------------------------------------------+
{code}
Note the "file formats: [ORC, PARQUET]" line: ORC is listed even though this
query reads only a single Parquet file.
*Some analysis:*
When IcebergScanNode [is
created|https://github.com/apache/impala/blob/cc63757c10cdf70e511596c0ded7d20674af2c4b/fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java#L129]
it holds the correct information about file formats (Parquet).
Later on, the parent class, HdfsScanNode, also tries to populate the file formats
[here|https://github.com/apache/impala/blob/cc63757c10cdf70e511596c0ded7d20674af2c4b/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java#L513].
It uses what
[getSampledOrRawPartition()|https://github.com/apache/impala/blob/cc63757c10cdf70e511596c0ded7d20674af2c4b/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java#L431]
returns. In this use case 'sampledPartitions_' is null, so it returns
'partitions_'.
Apparently, this 'partitions_' member holds the partition with the ORC file, so
ORC is added to fileFormats_ as well.
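
The behaviour described above can be illustrated with a small standalone sketch. This is not Impala's code: the class MixedFormatSketch, its nested types, and both methods are hypothetical stand-ins, and the assumption that the single partition object reports ORC is taken from the analysis above rather than verified against the planner. It only shows how a second pass that adds partition-level formats can widen a set that was already correct.

```java
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

// Hypothetical, simplified model of the analysis above. None of these
// types or methods are Impala's real classes or APIs.
public class MixedFormatSketch {
  enum FileFormat { ORC, PARQUET }

  // Stand-in for a data file selected by Iceberg planning.
  static final class DataFile {
    final String path;
    final FileFormat format;
    DataFile(String path, FileFormat format) {
      this.path = path;
      this.format = format;
    }
  }

  // Stand-in for a partition object that carries one file format.
  static final class Partition {
    final FileFormat format;
    Partition(FileFormat format) { this.format = format; }
  }

  // First pass (subclass in this model): formats of the files actually scanned.
  static Set<FileFormat> formatsFromSelectedFiles(List<DataFile> files) {
    Set<FileFormat> formats = EnumSet.noneOf(FileFormat.class);
    for (DataFile f : files) formats.add(f.format);
    return formats;
  }

  // Second pass (parent class in this model): falls back to the raw
  // partitions when no sampled partitions exist, then adds their formats
  // on top of what the first pass collected.
  static Set<FileFormat> addPartitionFormats(Set<FileFormat> fromFiles,
      List<Partition> sampledPartitions, List<Partition> partitions) {
    List<Partition> source =
        (sampledPartitions != null) ? sampledPartitions : partitions;
    Set<FileFormat> formats = EnumSet.noneOf(FileFormat.class);
    formats.addAll(fromFiles);
    for (Partition p : source) formats.add(p.format);
    return formats;
  }

  public static void main(String[] args) {
    // Only the Parquet file survives the year = 2024 predicate (made-up path).
    List<DataFile> selected =
        List.of(new DataFile("year=2024/data.parq", FileFormat.PARQUET));
    // Assumption: the partition object reports ORC, as described above.
    List<Partition> partitions = List.of(new Partition(FileFormat.ORC));

    Set<FileFormat> first = formatsFromSelectedFiles(selected);
    Set<FileFormat> second = addPartitionFormats(first, null, partitions);
    System.out.println("after first pass:  " + first);
    System.out.println("after second pass: " + second);
  }
}
```

In this model the first pass alone yields only PARQUET, and ORC appears only once the partition-level pass runs with a null sampled list.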