alamb commented on issue #7925:
URL:
https://github.com/apache/arrow-datafusion/issues/7925#issuecomment-1934370530
I have been trying to make a reproducer for this issue, but I cannot seem to reproduce it.
```sql
-- Create a parquet file with two distinct row groups
copy (
values
(arrow_cast('2024-12-19', 'Timestamp(Nanosecond, Some("UTC"))')),
(arrow_cast('2024-12-20', 'Timestamp(Nanosecond, Some("UTC"))'))
)
to '/tmp/example.parquet'
(
'MAX_ROW_GROUP_SIZE' 1,
DATA_PAGESIZE_LIMIT 1
);
+-------+
| count |
+-------+
| 2 |
+-------+
1 row in set. Query took 0.016 seconds.
```
This shows the file has two row groups, each with a UTC timestamp column:
```sql
❯ describe '/tmp/example.parquet';
+-------------+------------------------------------+-------------+
| column_name | data_type | is_nullable |
+-------------+------------------------------------+-------------+
| column1 | Timestamp(Nanosecond, Some("UTC")) | YES |
+-------------+------------------------------------+-------------+
1 row in set. Query took 0.002 seconds.
❯ select * from parquet_metadata('/tmp/example.parquet');
-- (23-column output transposed here for readability)
| column                  | row group 0                  | row group 1                  |
|-------------------------|------------------------------|------------------------------|
| filename                | /tmp/example.parquet         | /tmp/example.parquet         |
| row_group_id            | 0                            | 1                            |
| row_group_num_rows      | 1                            | 1                            |
| row_group_num_columns   | 1                            | 1                            |
| row_group_bytes         | 71                           | 71                           |
| column_id               | 0                            | 0                            |
| file_offset             | 93                           | 258                          |
| num_values              | 1                            | 1                            |
| path_in_schema          | "column1"                    | "column1"                    |
| type                    | INT64                        | INT64                        |
| stats_min               | 1734566400000000000          | 1734652800000000000          |
| stats_max               | 1734566400000000000          | 1734652800000000000          |
| stats_null_count        | 0                            | 0                            |
| stats_distinct_count    |                              |                              |
| stats_min_value         | 1734566400000000000          | 1734652800000000000          |
| stats_max_value         | 1734566400000000000          | 1734652800000000000          |
| compression             | ZSTD(ZstdLevel(1))           | ZSTD(ZstdLevel(1))           |
| encodings               | [PLAIN, RLE, RLE_DICTIONARY] | [PLAIN, RLE, RLE_DICTIONARY] |
| index_page_offset       |                              |                              |
| dictionary_page_offset  | 4                            | 169                          |
| data_page_offset        | 35                           | 200                          |
| total_compressed_size   | 89                           | 89                           |
| total_uncompressed_size | 71                           | 71                           |
```
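Those INT64 statistics are epoch nanoseconds. As a quick sanity check (a Python sketch of mine, not DataFusion output), they decode back to the two inserted days:

```python
from datetime import datetime, timezone

# Row-group min/max statistics from parquet_metadata, as epoch nanoseconds (INT64).
stats = [1734566400000000000, 1734652800000000000]

for ns in stats:
    # Integer-divide nanoseconds down to seconds, then render as a UTC date.
    dt = datetime.fromtimestamp(ns // 1_000_000_000, tz=timezone.utc)
    print(ns, "->", dt.date())  # 2024-12-19 and 2024-12-20
```

So each row group covers exactly one of the two distinct days, which is what makes statistics-based pruning possible here.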
Now, when I create a table that requires type coercion (in this case defining `column1` as `Timestamp(Nanosecond, None)`), any query I run from the SQL layer shows row groups are pruned as expected (`row_groups_pruned_statistics=1` in the output of `explain analyze`):
```sql
❯ create external table utc_table(column1 timestamp) stored as parquet
location '/tmp/example.parquet';
0 rows in set. Query took 0.001 seconds.
```
(Note that the predicate `column1='2024-12-19'::timestamp` compares `column1` to a `Timestamp(Nanosecond, None)` literal, so it requires coercion.)
```sql
❯ explain analyze select * from utc_table where column1='2024-12-19'::timestamp;
-- (plan output reflowed; one line per operator)
Plan with Metrics:
  CoalesceBatchesExec: target_batch_size=8192, metrics=[output_rows=1, elapsed_compute=3.302µs]
    FilterExec: column1@0 = 1734566400000000000, metrics=[output_rows=1, elapsed_compute=46.39µs]
      RepartitionExec: partitioning=RoundRobinBatch(16), input_partitions=1, metrics=[fetch_time=1.081666ms, repart_time=1ns, send_time=2.792µs]
        ParquetExec: file_groups={1 group: [[tmp/example.parquet]]}, projection=[column1], predicate=column1@0 = 1734566400000000000, pruning_predicate=column1_min@0 <= 1734566400000000000 AND 1734566400000000000 <= column1_max@1, required_guarantees=[column1 in (1734566400000000000)], metrics=[output_rows=1, elapsed_compute=1ns, file_open_errors=0, predicate_evaluation_errors=0, page_index_rows_filtered=0, row_groups_pruned_statistics=1, num_predicate_creation_errors=0, bytes_scanned=172, pushdown_rows_filtered=0, row_groups_pruned_bloom_filter=0, file_scan_errors=0, time_elapsed_scanning_until_data=369.125µs, pushdown_eval_time=2ns, time_elapsed_opening=678.666µs, time_elapsed_scanning_total=389.041µs, time_elapsed_processing=585.499µs, page_index_eval_time=19.251µs]
1 row in set. Query took 0.012 seconds.
```
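The pruning predicate in the plan (`column1_min@0 <= v AND v <= column1_max@1`) can be simulated against the row-group statistics from `parquet_metadata`. A stdlib Python sketch (the helper name is mine, not DataFusion's actual code):

```python
# Simulate row-group pruning for the predicate column1 = v:
# a row group can only match if column1_min <= v <= column1_max.
# (Hypothetical helper for illustration; not DataFusion's implementation.)
def may_contain(stats_min: int, stats_max: int, v: int) -> bool:
    return stats_min <= v <= stats_max

# (min, max) statistics per row group, from parquet_metadata above.
row_groups = [
    (1734566400000000000, 1734566400000000000),  # row group 0: 2024-12-19
    (1734652800000000000, 1734652800000000000),  # row group 1: 2024-12-20
]

v = 1734566400000000000  # '2024-12-19'::timestamp as epoch nanoseconds
kept = [i for i, (lo, hi) in enumerate(row_groups) if may_contain(lo, hi, v)]
pruned = len(row_groups) - len(kept)
print(kept, pruned)  # row group 0 is kept; 1 group is pruned
```

This matches `row_groups_pruned_statistics=1` in the metrics: only row group 0 can contain the value, and row group 1 is skipped.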
The same result comes from querying the file directly:
```sql
❯ explain analyze select * from '/tmp/example.parquet' where column1='2024-12-19'::timestamp;
-- (plan output reflowed; one line per operator)
Plan with Metrics:
  CoalesceBatchesExec: target_batch_size=8192, metrics=[output_rows=1, elapsed_compute=4.042µs]
    FilterExec: column1@0 = 1734566400000000000, metrics=[output_rows=1, elapsed_compute=69µs]
      ParquetExec: file_groups={1 group: [[tmp/example.parquet]]}, projection=[column1], predicate=column1@0 = 1734566400000000000, pruning_predicate=column1_min@0 <= 1734566400000000000 AND 1734566400000000000 <= column1_max@1, required_guarantees=[column1 in (1734566400000000000)], metrics=[output_rows=1, elapsed_compute=1ns, file_open_errors=0, predicate_evaluation_errors=0, page_index_rows_filtered=0, row_groups_pruned_statistics=1, num_predicate_creation_errors=0, bytes_scanned=172, pushdown_rows_filtered=0, row_groups_pruned_bloom_filter=0, file_scan_errors=0, time_elapsed_scanning_until_data=510.375µs, pushdown_eval_time=2ns, time_elapsed_opening=736µs, time_elapsed_scanning_total=596.125µs, time_elapsed_processing=554.584µs, page_index_eval_time=19.126µs]
1 row in set. Query took 0.015 seconds.
```
```
Maybe the problem has something to do with constructing the `ParquetExec` directly, as we do in InfluxDB.