pdet commented on issue #1329: URL: https://github.com/apache/arrow-datafusion/issues/1329#issuecomment-973929764
Hey guys, thanks for the quick answers and suggestions :-) Also, one thing that I really enjoyed about your project is how easy it is to use: it literally took me a couple of minutes to get a TPC-H query running. Coming from database-server-land, this always amazes me!

@Dandandan

> Hey @pdet
>
> Thanks for opening this issue! I am personally very interested if there are components within the ecosystem that could be shared between engines such as DuckDB and DataFusion and the pyarrow library. I am thinking about query optimization, re-using UDFs, executing certain parts in one engine or the other, etc.

For sure, I think this is one of the most interesting aspects of being able to zero-copy an in-memory data storage format (exciting stuff :-)). I'm still considering showcasing "executing certain parts in one engine or the other" in the blog post, I just have to craft a realistic example for that!

> For the timings - let me play a bit with your example - I'll come back to it. At first glance, those two things come to mind when looking at the code:
>
> * It looks like DuckDB doesn't need to load all columns in memory whereas `register_record_batches` pre-loads the data (all columns) in memory.
> * DataFusion can benefit from more parallelism for Parquet reading (see notes on question 2) by splitting your data into _n_ equal partitions / files.

So, as far as I understand, Arrow has two main types of objects: materialized ones (e.g., Arrow Tables, RecordBatches) and streamed ones (e.g., RecordBatchReaders). DuckDB can consume/produce both of these and, quite similarly to DataFusion, it automatically parallelizes on the number of batches (if I got that right). In the code snippet I sent, we are consuming Arrow Tables, precisely to start from a point where everything is already materialized in memory. From a streaming perspective, yes, we wouldn't need to load unnecessary columns/row groups.

> Looking forward to the blog post!
>
> > 1. It seems to me that DataFusion is not consuming streaming arrow objects (i.e., not fully materialized objects), is that correct?
>
> I am not 100% certain what you mean here. `register_record_batches` (see PR #981) currently uses `MemTable` which keeps the data as batches in memory (in a fully materialized list). We don't have a streaming table provider as of now which consumes a stream / iterator of Arrow batches. Also would be interesting to see if we can integrate somehow with an Arrow `Table` instead.

Yes, my question was again related to record batch readers. You answered it perfectly!

@jorgecarleitao Thanks for the suggestions! The main point of my post is to use DuckDB as an execution engine for Arrow, so basically I want to compare it with other libraries that have Python/R bindings, can run SQL, and integrate with Arrow Tables / RecordBatches / RecordBatchReaders. I'll check if any of those fit these criteria as well! :-) I'm also familiar with the H2OAI benchmark; I was focusing on TPC-H mainly because of the current blog post examples, but maybe I should have another look at H2OAI as well.

@alamb Thanks for the list! I will surely have a look. Just as a clarification, I'm not planning on doing a benchmark comparison with DataFusion; I just want to describe a bit which other options out there can query Arrow objects, and how mature/fast/easy-to-use they are.

Thanks again for all of your answers!
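To make the materialized-vs-streamed distinction concrete, here is a minimal Python sketch (not the exact snippet I sent; the `lineitem.parquet` file and the queries are just placeholders) of DuckDB consuming a fully materialized `pyarrow.Table` and producing either a materialized Arrow Table or a streamed `RecordBatchReader`:

```python
import duckdb
import pyarrow.parquet as pq

con = duckdb.connect()

# Materialized input: a pyarrow.Table that is already fully loaded in memory.
lineitem = pq.read_table("lineitem.parquet")
con.register("lineitem", lineitem)

# Materialized output: the whole result comes back as an Arrow Table.
tbl = con.execute("""
    SELECT l_returnflag, l_linestatus, sum(l_extendedprice) AS revenue
    FROM lineitem
    GROUP BY l_returnflag, l_linestatus
""").fetch_arrow_table()

# Streamed output: fetch_record_batch() hands back a RecordBatchReader, so the
# consumer pulls batches incrementally instead of materializing the full result.
reader = con.execute(
    "SELECT l_orderkey, l_extendedprice FROM lineitem"
).fetch_record_batch()
for batch in reader:
    ...  # process one RecordBatch at a time
```

Registering a `RecordBatchReader` instead of a `Table` should work the same way (that is the streaming case, where unneeded columns / row groups never have to be loaded), though the exact behaviour may depend on the client version.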
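For the DataFusion side of the comparison, a rough sketch of what `register_record_batches` looks like from Python, assuming the `ExecutionContext` API of the bindings at the time (names may have shifted between releases). The explicit split into _n_ partitions is one way to hand DataFusion more parallelism, per the second bullet above:

```python
import datafusion
import pyarrow.parquet as pq

ctx = datafusion.ExecutionContext()

# register_record_batches backs the table with a MemTable, i.e. the batches
# (all columns) stay fully materialized in memory.
table = pq.read_table("lineitem.parquet")
batches = table.to_batches()

# The outer list is a list of partitions; splitting the batches into n roughly
# equal partitions gives the engine independent units it can scan in parallel.
n = 4
partitions = [batches[i::n] for i in range(n)]
ctx.register_record_batches("lineitem", partitions)

result = ctx.sql(
    "SELECT l_returnflag, count(*) FROM lineitem GROUP BY l_returnflag"
).collect()  # a list of Arrow RecordBatches
```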
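And as for "executing certain parts in one engine or the other", a toy version of the hand-off (definitely not the realistic example yet, and the filter/aggregate split is arbitrary) could be DuckDB doing a first pass and DataFusion consuming the resulting Arrow batches, which are shared via Arrow rather than copied:

```python
import duckdb
import datafusion
import pyarrow.parquet as pq

# Step 1: DuckDB does a first pass (here, a filter) and returns Arrow.
con = duckdb.connect()
con.register("lineitem", pq.read_table("lineitem.parquet"))
filtered = con.execute(
    "SELECT * FROM lineitem WHERE l_shipdate <= DATE '1998-09-02'"
).fetch_arrow_table()

# Step 2: DataFusion picks up the Arrow batches and runs the aggregation.
ctx = datafusion.ExecutionContext()
ctx.register_record_batches("filtered", [filtered.to_batches()])
result = ctx.sql(
    "SELECT l_returnflag, sum(l_extendedprice) FROM filtered GROUP BY l_returnflag"
).collect()
```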
