Dandandan commented on issue #1329: URL: https://github.com/apache/arrow-datafusion/issues/1329#issuecomment-973015663
Hey @pdet, thanks for opening this issue! I am personally very interested in whether there are components within the ecosystem that could be shared between engines such as DuckDB, DataFusion, and the pyarrow library: query optimization, re-using UDFs, executing certain parts in one engine or the other, etc.

For the timings, let me play a bit with your example and I'll come back to it. At first glance, two things come to mind when looking at the code:

* It looks like DuckDB doesn't need to load all columns into memory, whereas `register_record_batches` pre-loads the data (all columns) into memory.
* DataFusion can benefit from more parallelism for Parquet reading (see the notes on question 2) by splitting your data into *n* equal partitions/files.

Looking forward to the blog post!

> 1. It seems to me that DataFusion is not consuming streaming arrow objects (i.e., not fully materialized objects), is that correct?

I am not 100% certain what you mean here. `register_record_batches` (see PR https://github.com/apache/arrow-datafusion/pull/981) currently uses `MemTable`, which keeps the data as batches in memory (in a fully materialized list). We don't have a streaming table provider as of now that consumes a stream/iterator of Arrow batches. It would also be interesting to see whether we can integrate somehow with an Arrow `Table` instead.

You can instead register tables backed by files at a given location (by using `register_parquet`). This streams the data from Parquet (or CSV/JSON/Avro) from disk.

> 2. Is DataFusion already running in parallel?

Yes, plans are optimized to use as much parallelism as possible. Scans, aggregates, joins, filters, etc. are parallelized (based on the number of available cores by default).
I added some optimizations in this area in earlier versions: https://medium.com/@danilheres/increasing-the-level-of-parallelism-in-datafusion-4-0-d2a15b5a2093

For the TPC-H queries, parallelism depends in large part on the number of Parquet files you have as input, as one file currently maps to one IO task (similar to Spark). Many of the TPC-H queries are IO- and Parquet-heavy, so this can have a large impact. For now I suggest splitting/partitioning the data into roughly as many files as there are cores available on the machine, which is very common in Spark workloads. In the future we might want to revisit this to exploit more parallelism, e.g. when the number of files is small. @jorgecarleitao has done some research on this in his `parquet2` and `arrow2` repositories.

> 3. Is there a specific batch size that is more attractive to the DataFusion engine?

It depends mostly on the query: simple counts or sums favor a bigger batch size, while other parts might benefit from a not-too-big batch size. Based on some earlier benchmarking, the default batch size in DataFusion is set to 8192. Interested if you have different findings here :).

For the bug, let's try to minimize that example and find the cause of the error!
