Dandandan commented on issue #1329:
URL: https://github.com/apache/arrow-datafusion/issues/1329#issuecomment-973015663


   Hey @pdet 
   
   Thanks for opening this issue! I am personally very interested in whether 
there are components within the ecosystem that could be shared between engines 
such as DuckDB and DataFusion and the pyarrow library. I am thinking about 
query optimization, re-using UDFs, executing certain parts of a query in one 
engine or the other, etc.
   
   For the timings - let me play a bit with your example and I'll come back to 
it. At first glance, these two things come to mind when looking at the code:
   
   * It looks like DuckDB doesn't need to load all columns in memory whereas 
`register_record_batches` pre-loads the data (all columns) in memory.
   * DataFusion can benefit from more parallelism for Parquet reading (See 
notes on question 2.) by splitting your data into *n* equal partitions / files.
   
   Looking forward to the blog post!
   
   > 1. It seems to me that DataFusion is not consuming streaming arrow objects 
(i.e., not fully materialized objects), is that correct?
   
   I am not 100% certain what you mean here. `register_record_batches` (see PR 
https://github.com/apache/arrow-datafusion/pull/981) currently uses `MemTable`, 
which keeps the data as batches in memory (in a fully materialized list). We 
don't currently have a streaming table provider that consumes a stream / 
iterator of Arrow batches. It would also be interesting to see if we can 
somehow integrate with an Arrow `Table` instead.
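   To illustrate the distinction (a plain-Python sketch, not the actual `MemTable` implementation): a materialized source builds and holds every batch up front, while a streaming source would yield batches one at a time.

```python
from typing import Iterator, List

def materialized_source(rows: List[int], batch_size: int) -> List[List[int]]:
    # Fully materialized: every batch is built and held in memory at once,
    # analogous to how MemTable stores a list of record batches.
    return [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]

def streaming_source(rows: List[int], batch_size: int) -> Iterator[List[int]]:
    # Streaming: each batch is produced on demand, so only one batch
    # needs to be resident in memory at a time.
    for i in range(0, len(rows), batch_size):
        yield rows[i:i + batch_size]

data = list(range(10))
eager = materialized_source(data, 4)  # all three batches exist in memory now
lazy = streaming_source(data, 4)      # nothing has been computed yet
first = next(lazy)                    # only the first batch is built
```

   A streaming table provider would look like the second shape: consumers pull batches as needed instead of receiving one fully materialized list.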
   
   You can also register tables backed by files at a given location (by using 
`register_parquet` instead). This streams the data from Parquet (or 
CSV/JSON/Avro) files on disk.
   
   > 2. Is DataFusion already running in parallel?
   
   Yes - plans are optimized to use as much parallelism as possible. Scans, 
aggregates, joins, filters, etc. are parallelized (based on the number of 
available cores by default). I added some optimizations to this in earlier 
versions: 
https://medium.com/@danilheres/increasing-the-level-of-parallelism-in-datafusion-4-0-d2a15b5a2093
   
   For TPC-H queries, parallelism depends in large part on the number of 
Parquet files you have as input, as 1 file currently maps to 1 IO task (similar 
to Spark). Many of the TPC-H queries are IO- and Parquet-heavy, so that can 
have a large impact. For now I suggest splitting/partitioning the data into 
roughly as many files as there are cores available on the machine - something 
very common in Spark workloads.
   In the future we might want to revisit this to get more parallelism even 
when the number of files is small. @jorgecarleitao has done some research on 
this in his `parquet2` and `arrow2` repositories.
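   As a rough heuristic (a stdlib-only sketch, not DataFusion code - the partitioning itself would be done by whatever tool writes your Parquet files): pick the number of output files from the core count, then spread the rows evenly across them so each file maps to one IO task.

```python
import os

def split_into_partitions(rows, num_partitions):
    # Distribute rows round-robin so each partition (a future Parquet file)
    # ends up with a near-equal share of the data.
    partitions = [[] for _ in range(num_partitions)]
    for i, row in enumerate(rows):
        partitions[i % num_partitions].append(row)
    return partitions

# Target roughly one file per available core, as suggested above.
num_files = os.cpu_count() or 4
parts = split_into_partitions(list(range(1000)), num_files)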
   
   > 3. Is there a specific batch size that is more attractive to the 
DataFusion engine?
   
   It depends mostly on the query - simple counts or sums favor a bigger batch 
size, while other parts might benefit from a not-too-big batch size. Based on 
some earlier benchmarking, the default batch size in DataFusion is set to 8192. 
Interested if you have different findings here :).
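   For reference, here is a small sketch (plain Python, illustrative only) of how a larger chunk of rows gets sliced to the 8192-row default:

```python
DEFAULT_BATCH_SIZE = 8192  # DataFusion's default, per the benchmarking above

def rebatch(num_rows: int, batch_size: int = DEFAULT_BATCH_SIZE):
    # Return the row count of each batch after slicing `num_rows` rows
    # into batches of at most `batch_size` rows each.
    sizes = []
    remaining = num_rows
    while remaining > 0:
        take = min(batch_size, remaining)
        sizes.append(take)
        remaining -= take
    return sizes

rebatch(20000)  # -> [8192, 8192, 3616]
```

   A bigger batch size amortizes per-batch overhead (good for simple aggregates), while smaller batches keep intermediate buffers more cache-friendly.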
   
   For the bug - let's try to minimize that example and find the cause of the 
error!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

