dramaticlly opened a new issue, #6567:
URL: https://github.com/apache/iceberg/issues/6567
### Apache Iceberg version
None
### Query engine
None
### Please describe the bug 🐞
I really like the new table scan feature released in the latest PyIceberg
0.2.1 release, thanks @Fokko. It works great when the row filter uses the
partition column, but not as expected when the expression references other
columns. `scan.plan_files()` should return only the Parquet files that can
satisfy the predicate in the row filter, but it returns all of them instead.
Here are my repro steps. I created a simple table, `hongyue_zhang.mls23`, to
start with:
### schema
```ddl
Create Table
------------------------------------------------------------------------------------------
CREATE TABLE iceberg.hongyue_zhang.mls23 (
id bigint NOT NULL,
data varchar,
ts date
)
WITH (
format = 'PARQUET',
location = 's3a://warehouse-default/warehouse/hongyue_zhang.db/mls23',
partitioning = ARRAY['ts']
)
(1 row)
```
### Setup
The table has 2 partitions and 198 records in total. For simplicity, each
write produced its own Parquet file (one record per file).
```
partition | record_count | file_count | total_size | data
-----------------+--------------+------------+------------+-------------------------------------------------------------------------------------
{ts=2023-01-04} | 99 | 99 | 115300 | {id={min=1, max=1, null_count=0}, data={min=b, max=bbbbbbbbbbbbbbbc, null_count=0}}
{ts=2023-01-05} | 99 | 99 | 115303 | {id={min=0, max=0, null_count=0}, data={min=a, max=aaaaaaaaaaaaaaab, null_count=0}}
(2 rows)
```
### Python code
```python
from pyiceberg.catalog import load_catalog
from pyiceberg.expressions import And, EqualTo

catalog = load_catalog("prod")
table = catalog.load_table("hongyue_zhang.mls23")
table.location()

# Filter on the partition column only.
scan1 = table.scan(row_filter=EqualTo("ts", "2023-01-04"))
yesterday_files = [task.file.file_path for task in scan1.plan_files()]
print(len(yesterday_files))
# Expected 99, actual 99: the Parquet files of a single partition.

# Filter on a non-partition column only.
scan2 = table.scan(row_filter=EqualTo("data", "a"))
a_files = [task.file.file_path for task in scan2.plan_files()]
print(len(a_files))
# Expected 1, actual 198: all Parquet files are returned.

# Filter on both the partition column and a non-partition column.
scan3 = table.scan(
    row_filter=And(EqualTo("ts", "2023-01-04"), EqualTo("data", "a"))
)
yesterday_and_a_files = [task.file.file_path for task in scan3.plan_files()]
print(len(yesterday_and_a_files))
# Expected 1, actual 99: the row filter applies the first expression (on the
# partition column ts) but not the second expression (on data).
```
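The expected count of 1 for the `data = 'a'` filter assumes that files are pruned using the per-file lower and upper bounds of the `data` column, in addition to the partition value. Below is a minimal sketch of that check in plain Python. `FileStats` and `may_contain_equal` are hypothetical names, the file paths are made up for illustration, and this is not PyIceberg's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    """Hypothetical stand-in for one data file's bounds on the `data` column."""
    path: str
    lower: str  # minimum `data` value stored in the file
    upper: str  # maximum `data` value stored in the file


def may_contain_equal(stats: FileStats, value: str) -> bool:
    """Return True if the file could hold a row where data == value.

    A file can be skipped when the value falls outside [lower, upper].
    """
    return stats.lower <= value <= stats.upper


# One record per file, so lower == upper for each file (illustrative values).
files = [
    FileStats("ts=2023-01-05/f1.parquet", "a", "a"),
    FileStats("ts=2023-01-05/f2.parquet", "aa", "aa"),
    FileStats("ts=2023-01-05/f3.parquet", "aaaaaaaaaaaaaaab", "aaaaaaaaaaaaaaab"),
    FileStats("ts=2023-01-04/f4.parquet", "b", "b"),
]

matching = [f.path for f in files if may_contain_equal(f, "a")]
print(matching)
```

With bounds like these, only the file whose range includes `"a"` survives, which is the behavior I expected from `plan_files()`.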
For validation, I also ran a query with a similar condition in Spark, and it
returns 1 file as expected:
```spark
val result = spark.sql("select id, data, input_file_name(), ts from iceberg.hongyue_zhang.mls23 where data = 'a'")
+---+----+-----------------------------------------------------------------------------------------------------------------------------------------------------+----------+
|id |data|input_file_name() |ts |
+---+----+-----------------------------------------------------------------------------------------------------------------------------------------------------+----------+
|0 |a |s3a://warehouse-default/warehouse/hongyue_zhang.db/mls23/data/ts=2023-01-05/00000-6-5573682f-d72c-4a68-a08f-8fe4dbca8581-00001.parquet |2023-01-05|
+---+----+-----------------------------------------------------------------------------------------------------------------------------------------------------+----------+
```
I can't seem to figure out why it fails, and I'm happy to contribute a fix if
anyone can share some insight.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]