Dandandan commented on issue #1329: URL: https://github.com/apache/arrow-datafusion/issues/1329#issuecomment-973015663
Hey @pdet, thanks for opening this issue! I am personally very interested in whether there are components within the ecosystem that could be shared between engines such as DuckDB, DataFusion, and the pyarrow library: query optimization, re-using UDFs, executing certain parts in one engine or the other, etc.

For the timings, let me play a bit with your example and I'll come back to it. At first glance, two things come to mind when looking at the code:

* It looks like DuckDB doesn't need to load all columns into memory, whereas `register_record_batches` pre-loads the data (all columns) into memory.
* DataFusion can benefit from more parallelism for Parquet reading (see the notes on question 2) by splitting your data into *n* equal partitions/files.

Looking forward to the blog post!

> 1. It seems to me that DataFusion is not consuming streaming arrow objects (i.e., not fully materialized objects), is that correct?

I am not 100% certain what you mean here. `register_record_batches` (see PR https://github.com/apache/arrow-datafusion/pull/981) currently uses `MemTable`, which keeps the data as batches in memory (in a fully materialized list). We don't have a streaming table provider as of now that consumes a stream/iterator of Arrow batches. It would also be interesting to see whether we can integrate somehow with an Arrow `Table` instead.

You can instead register tables backed by files at a given location (by using `register_parquet`). This streams the data from Parquet (or CSV/JSON/Avro) from disk.

> 2. Is DataFusion already running in parallel?

Yes, plans are optimized to use as much parallelism as possible. Scans, aggregates, joins, filters, etc. are parallelized (based on the number of available cores by default).
I added some optimizations in this area in earlier versions: https://medium.com/@danilheres/increasing-the-level-of-parallelism-in-datafusion-4-0-d2a15b5a2093

For the TPC-H queries, parallelism depends in large part on the number of Parquet files you have as input, as one file currently maps to one IO task (similar to Spark). Many of the TPC-H queries are IO- and Parquet-heavy, so this can have a large impact. For now I suggest splitting/partitioning the data into roughly as many files as there are cores available on the machine, which is very common in Spark workloads. In the future we might want to revisit this to exploit more parallelism, e.g. when the number of files is small. @jorgecarleitao has done some research on this in his `parquet2` and `arrow2` repositories.

> 3. Is there a specific batch size that is more attractive to the DataFusion engine?

It depends mostly on the query: simple counts or sums favor a bigger batch size, while other parts might benefit from a not-too-big batch size. Based on some earlier benchmarking, the default batch size in DataFusion is set to 8192. Interested if you have different findings here :).

For the bug, let's try to minimize that example and find the cause of the error!
