GitHub user devoxi closed a discussion: How to improve Parquet reading
performance?
Hello!
Over the last couple of days we've been experimenting with DataFusion (31.0)
and comparing its performance with our existing ClickHouse setup. To do so, we
exported a 5GB Parquet dataset and, by building some UDFs, managed to
replicate some of our queries.
In the end we are running a single, fairly simple query over the same Parquet
dataset on the same Mac with both DataFusion and ClickHouse. ClickHouse
consistently answers in about 700ms, while DataFusion takes about 1.2s.
I've tried multiple settings, verified that our UDF was not the cause, and
checked that ClickHouse was not serving from a cache, but I couldn't make
DataFusion any faster. According to EXPLAIN ANALYZE, the time is being spent
in the Parquet scan phase.
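The settings we tried were along these lines, shown as `SET` statements as you would run them in datafusion-cli (we are not sure these are the right knobs, which is partly why we're asking):

```sql
-- Evaluate the pushed-down filter while decoding Parquet instead of afterwards
SET datafusion.execution.parquet.pushdown_filters = true;
SET datafusion.execution.parquet.reorder_filters = true;
-- Parallelism and batch sizing (8 assumes an 8-core machine)
SET datafusion.execution.target_partitions = 8;
SET datafusion.execution.batch_size = 8192;
```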
I have to confess that we are beginners in Rust and we might have missed
something, hence this message.
Here is the EXPLAIN ANALYZE of our query, if it can help:
```
Plan with Metrics | ProjectionExec: expr=[SUM(my_table.sign)@0 as tcount, SUM(my_udf(my_table.nested_field.array_column,List([custom_string])) * my_table.sign)@1 as _ccount_1], metrics=[output_rows=1, elapsed_compute=584ns]
  AggregateExec: mode=Final, gby=[], aggr=[SUM(my_table.sign)@0 as tcount, SUM(my_udf(my_table.nested_field.array_column,List([custom_string])) * my_table.sign)], metrics=[output_rows=1, elapsed_compute=48.459µs]
    CoalescePartitionsExec, metrics=[output_rows=20, elapsed_compute=5.708µs]
      AggregateExec: mode=Partial, gby=[], aggr=[SUM(my_table.sign)@0 as tcount, SUM(my_udf(my_table.nested_field.array_column,List([custom_string])) * my_table.sign)], metrics=[output_rows=20, elapsed_compute=4.769958246s]
        ProjectionExec: expr=[sign@0 as sign, nested_field.array_column@3 as nested_field.array_column], metrics=[output_rows=6185527, elapsed_compute=396.238µs]
          CoalesceBatchesExec: target_batch_size=8192, metrics=[output_rows=6185527, elapsed_compute=168.795215ms]
            FilterExec: int_column@1 = 2 AND (CAST(string_column@2 AS Utf8) !~* (.*.|)(word1|word2|word3|word4).*), metrics=[output_rows=6185527, elapsed_compute=652.683132ms]
              ParquetExec: file_groups={20 groups: [[path/to/parquet/my_dataset.parquet:0..262363404], [path/to/parquet/my_dataset.parquet:262363404..524726808], [path/to/parquet/my_dataset.parquet:524726808..787090212], [path/to/parquet/my_dataset.parquet:787090212..1049453616], [path/to/parquet/my_dataset.parquet:1049453616..1311817020], ...]}, projection=[sign, int_column, string_column, nested_field.array_column], predicate=int_column@5 = 2 AND (CAST(string_column@79 AS Utf8) !~* (.*.|)(word1|word2|word3|word4).*), pruning_predicate=int_column_min@0 <= 2 AND 2 <= int_column_max@1, metrics=[output_rows=6185527, elapsed_compute=20ns, page_index_rows_filtered=0, predicate_evaluation_errors=0, file_scan_errors=0, row_groups_pruned=13, num_predicate_creation_errors=0, file_open_errors=0, pushdown_rows_filtered=2880077, bytes_scanned=311061152, time_elapsed_processing=5.335386071s, pushdown_eval_time=781.00424ms, time_elapsed_scanning_until_data=1.472284126s, time_elapsed_opening=3.032638581s, time_elapsed_scanning_total=18.9360584s, page_index_eval_time=1.318µs]
```
We also noticed in some other queries that with multiple Parquet files,
performance relative to ClickHouse was much worse than with a single Parquet
file.
So is there anything we might have missed, some generally known setting or
practice that could explain this performance gap?
Thanks for your help!
GitHub link: https://github.com/apache/datafusion/discussions/7737