alamb commented on issue #7925:
URL:
https://github.com/apache/arrow-datafusion/issues/7925#issuecomment-1934370530
I have been trying to make a reproducer for this issue, but I cannot seem to reproduce it.
```sql
-- Create a parquet file with two distinct row groups
copy (
values
(arrow_cast('2024-12-19', 'Timestamp(Nanosecond, Some("UTC"))')),
(arrow_cast('2024-12-20', 'Timestamp(Nanosecond, Some("UTC"))'))
)
to '/tmp/example.parquet'
(
'MAX_ROW_GROUP_SIZE' 1,
DATA_PAGESIZE_LIMIT 1
);
+-------+
| count |
+-------+
| 2 |
+-------+
1 row in set. Query took 0.016 seconds.
```
This shows the file has two row groups, each with a UTC timestamp column:
```sql
❯ describe '/tmp/example.parquet';
+-------------+------------------------------------+-------------+
| column_name | data_type | is_nullable |
+-------------+------------------------------------+-------------+
| column1 | Timestamp(Nanosecond, Some("UTC")) | YES |
+-------------+------------------------------------+-------------+
1 row in set. Query took 0.002 seconds.
❯ select * from parquet_metadata('/tmp/example.parquet');
-- (23-column output transposed here for readability)
| column                  | row group 0                  | row group 1                  |
|-------------------------|------------------------------|------------------------------|
| filename                | /tmp/example.parquet         | /tmp/example.parquet         |
| row_group_id            | 0                            | 1                            |
| row_group_num_rows      | 1                            | 1                            |
| row_group_num_columns   | 1                            | 1                            |
| row_group_bytes         | 71                           | 71                           |
| column_id               | 0                            | 0                            |
| file_offset             | 93                           | 258                          |
| num_values              | 1                            | 1                            |
| path_in_schema          | "column1"                    | "column1"                    |
| type                    | INT64                        | INT64                        |
| stats_min               | 1734566400000000000          | 1734652800000000000          |
| stats_max               | 1734566400000000000          | 1734652800000000000          |
| stats_null_count        | 0                            | 0                            |
| stats_distinct_count    |                              |                              |
| stats_min_value         | 1734566400000000000          | 1734652800000000000          |
| stats_max_value         | 1734566400000000000          | 1734652800000000000          |
| compression             | ZSTD(ZstdLevel(1))           | ZSTD(ZstdLevel(1))           |
| encodings               | [PLAIN, RLE, RLE_DICTIONARY] | [PLAIN, RLE, RLE_DICTIONARY] |
| index_page_offset       |                              |                              |
| dictionary_page_offset  | 4                            | 169                          |
| data_page_offset        | 35                           | 200                          |
| total_compressed_size   | 89                           | 89                           |
| total_uncompressed_size | 71                           | 71                           |
```
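Those INT64 statistics are epoch nanoseconds. As a quick sanity check (a Python sketch of mine, not DataFusion output), they decode back to the two inserted days:

```python
from datetime import datetime, timezone

# Row-group min/max statistics from parquet_metadata, as epoch nanoseconds (INT64).
stats = [1734566400000000000, 1734652800000000000]

for ns in stats:
    # Integer-divide nanoseconds down to seconds, then render as a UTC date.
    dt = datetime.fromtimestamp(ns // 1_000_000_000, tz=timezone.utc)
    print(ns, "->", dt.date())  # 2024-12-19 and 2024-12-20
```

So each row group covers exactly one of the two distinct days, which is what makes statistics-based pruning possible here.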
Now, when I create a table that requires type coercion (in this case defining `column1` as `Timestamp(Nanosecond, None)`), any query I run from the SQL layer shows row groups are pruned as expected (`row_groups_pruned_statistics=1` in the output of `explain analyze`):
```sql
❯ create external table utc_table(column1 timestamp) stored as parquet
location '/tmp/example.parquet';
0 rows in set. Query took 0.001 seconds.
```
(Note that the predicate `column1='2024-12-19'::timestamp` compares `column1` to a `Timestamp(Nanosecond, None)` literal, so it requires coercion.)
```sql
❯ explain analyze select * from utc_table where column1='2024-12-19'::timestamp;
-- (plan output reflowed; one line per operator)
Plan with Metrics:
  CoalesceBatchesExec: target_batch_size=8192, metrics=[output_rows=1, elapsed_compute=3.302µs]
    FilterExec: column1@0 = 1734566400000000000, metrics=[output_rows=1, elapsed_compute=46.39µs]
      RepartitionExec: partitioning=RoundRobinBatch(16), input_partitions=1, metrics=[fetch_time=1.081666ms, repart_time=1ns, send_time=2.792µs]
        ParquetExec: file_groups={1 group: [[tmp/example.parquet]]}, projection=[column1], predicate=column1@0 = 1734566400000000000, pruning_predicate=column1_min@0 <= 1734566400000000000 AND 1734566400000000000 <= column1_max@1, required_guarantees=[column1 in (1734566400000000000)], metrics=[output_rows=1, elapsed_compute=1ns, file_open_errors=0, predicate_evaluation_errors=0, page_index_rows_filtered=0, row_groups_pruned_statistics=1, num_predicate_creation_errors=0, bytes_scanned=172, pushdown_rows_filtered=0, row_groups_pruned_bloom_filter=0, file_scan_errors=0, time_elapsed_scanning_until_data=369.125µs, pushdown_eval_time=2ns, time_elapsed_opening=678.666µs, time_elapsed_scanning_total=389.041µs, time_elapsed_processing=585.499µs, page_index_eval_time=19.251µs]
1 row in set. Query took 0.012 seconds.
```
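The pruning predicate in the plan (`column1_min@0 <= v AND v <= column1_max@1`) can be simulated against the row-group statistics from `parquet_metadata`. A stdlib Python sketch (the helper name is mine, not DataFusion's actual code):

```python
# Simulate row-group pruning for the predicate column1 = v:
# a row group can only match if column1_min <= v <= column1_max.
# (Hypothetical helper for illustration; not DataFusion's implementation.)
def may_contain(stats_min: int, stats_max: int, v: int) -> bool:
    return stats_min <= v <= stats_max

# (min, max) statistics per row group, from parquet_metadata above.
row_groups = [
    (1734566400000000000, 1734566400000000000),  # row group 0: 2024-12-19
    (1734652800000000000, 1734652800000000000),  # row group 1: 2024-12-20
]

v = 1734566400000000000  # '2024-12-19'::timestamp as epoch nanoseconds
kept = [i for i, (lo, hi) in enumerate(row_groups) if may_contain(lo, hi, v)]
pruned = len(row_groups) - len(kept)
print(kept, pruned)  # row group 0 is kept; 1 group is pruned
```

This matches `row_groups_pruned_statistics=1` in the metrics: only row group 0 can contain the value, and row group 1 is skipped.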
The same result comes from querying the file directly:
```sql
❯ explain analyze select * from '/tmp/example.parquet' where column1='2024-12-19'::timestamp;
-- (plan output reflowed; one line per operator)
Plan with Metrics:
  CoalesceBatchesExec: target_batch_size=8192, metrics=[output_rows=1, elapsed_compute=4.042µs]
    FilterExec: column1@0 = 1734566400000000000, metrics=[output_rows=1, elapsed_compute=69µs]
      ParquetExec: file_groups={1 group: [[tmp/example.parquet]]}, projection=[column1], predicate=column1@0 = 1734566400000000000, pruning_predicate=column1_min@0 <= 1734566400000000000 AND 1734566400000000000 <= column1_max@1, required_guarantees=[column1 in (1734566400000000000)], metrics=[output_rows=1, elapsed_compute=1ns, file_open_errors=0, predicate_evaluation_errors=0, page_index_rows_filtered=0, row_groups_pruned_statistics=1, num_predicate_creation_errors=0, bytes_scanned=172, pushdown_rows_filtered=0, row_groups_pruned_bloom_filter=0, file_scan_errors=0, time_elapsed_scanning_until_data=510.375µs, pushdown_eval_time=2ns, time_elapsed_opening=736µs, time_elapsed_scanning_total=596.125µs, time_elapsed_processing=554.584µs, page_index_eval_time=19.126µs]
1 row in set. Query took 0.015 seconds.
```
```
Maybe the problem has something to do with constructing the `ParquetExec` directly, as we do in InfluxDB.