[GitHub] [iceberg] dramaticlly opened a new issue, #6567: pyiceberg table scan problem with row filter set to non-partition columns

GitBox Wed, 11 Jan 2023 11:12:10 -0800


dramaticlly opened a new issue, #6567:
URL: https://github.com/apache/iceberg/issues/6567


   ### Apache Iceberg version
   
   None
   
   ### Query engine
   
   None
   
   ### Please describe the bug 🐞
   
   I really like the new table scan feature released in pyiceberg 0.2.1, thanks @Fokko. It works well when the row filter uses the partition column, but not when the expression references other columns. I expected `scan.plan_files()` to return only the Parquet files that can satisfy the row filter predicate, but it returns all files instead.
   
   Here are my repro steps. I created a simple table, `hongyue_zhang.mls23`, to start with.
   
   ### schema
   ```ddl
                                          Create Table
   ------------------------------------------------------------------------------------------
    CREATE TABLE iceberg.hongyue_zhang.mls23 (
       id bigint NOT NULL,
       data varchar,
       ts date
    )
    WITH (
       format = 'PARQUET',
       location = 's3a://warehouse-default/warehouse/hongyue_zhang.db/mls23',
       partitioning = ARRAY['ts']
    )
   (1 row)
   ```
   
   ### Setup
   The table has 2 partitions and 198 records in total. For the sake of simplicity, each write produced its own Parquet file.
   ```
       partition    | record_count | file_count | total_size |                                        data
   -----------------+--------------+------------+------------+-------------------------------------------------------------------------------------
    {ts=2023-01-04} |           99 |         99 |     115300 | {id={min=1, max=1, null_count=0}, data={min=b, max=bbbbbbbbbbbbbbbc, null_count=0}}
    {ts=2023-01-05} |           99 |         99 |     115303 | {id={min=0, max=0, null_count=0}, data={min=a, max=aaaaaaaaaaaaaaab, null_count=0}}
   (2 rows)
   ```
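
For reference, here is a minimal sketch of the kind of min/max pruning I expected for an equality predicate on a non-partition column. It is not pyiceberg's implementation: the `FileStats` class, the helper names, the file names, and the per-file bounds are all hypothetical, and it only assumes that each data file carries a lower and upper bound for the `data` column.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FileStats:
    """Hypothetical per-file column bounds for the `data` column."""
    path: str
    lower: str
    upper: str


def might_contain_eq(stats: FileStats, value: str) -> bool:
    # A file can only hold rows with data == value if value lies
    # inside the file's [lower, upper] range for that column.
    return stats.lower <= value <= stats.upper


def prune(files: List[FileStats], value: str) -> List[str]:
    # Keep only the files whose bounds cannot rule out the value.
    return [f.path for f in files if might_contain_eq(f, value)]


# Hypothetical files with one record each, so lower == upper.
files = [
    FileStats("f1.parquet", "a", "a"),
    FileStats("f2.parquet", "aa", "aa"),
    FileStats("f3.parquet", "b", "b"),
]

print(prune(files, "a"))
```

With one record per file, only the file whose bounds include `a` survives, which is why I expected a single file for `EqualTo("data", "a")`.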
   
   ### Python code
   ```python
   from pyiceberg.catalog import load_catalog
   from pyiceberg.expressions import And, EqualTo
   
   catalog = load_catalog("prod")
   table = catalog.load_table("hongyue_zhang.mls23")
   table.location()
   
   scan1 = table.scan(
       row_filter=EqualTo("ts", "2023-01-04"))
   yesterday_files = [task.file.file_path for task in scan1.plan_files()]
   print(len(yesterday_files))
   # expect 99 and actual is 99 parquet files for single partition
   
   scan2 = table.scan(
       row_filter=EqualTo("data", "a"))
   a_files = [task.file.file_path for task in scan2.plan_files()]
   print(len(a_files))
   # expected 1, but I am seeing 198 instead, which means all parquet files are returned
   
   scan3 = table.scan(
       row_filter=And(EqualTo("ts", "2023-01-04"), EqualTo("data", "a")))
   yesterday_and_a_files = [task.file.file_path for task in scan3.plan_files()]
   print(len(yesterday_and_a_files))
   # expected 1, but I am seeing 99, which means the row filter applies the 1st
   # expression on the partition column ts but not the 2nd expression on data
   ```
   
   For validation, I also ran a similar query in Spark, and it returns 1 file as expected.
   ```spark
   val result = spark.sql("select id, data, input_file_name(), ts from iceberg.hongyue_zhang.mls23 where data = 'a'")

   +---+----+--------------------------------------------------------------------------------------------------------------------------------------+----------+
   |id |data|input_file_name()                                                                                                                     |ts        |
   +---+----+--------------------------------------------------------------------------------------------------------------------------------------+----------+
   |0  |a   |s3a://warehouse-default/warehouse/hongyue_zhang.db/mls23/data/ts=2023-01-05/00000-6-5573682f-d72c-4a68-a08f-8fe4dbca8581-00001.parquet|2023-01-05|
   +---+----+--------------------------------------------------------------------------------------------------------------------------------------+----------+
   ```
   
   
   I can't seem to figure out why this fails, and I'm happy to contribute a fix if anyone can share insights.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

