pdet commented on issue #1329:
URL: https://github.com/apache/arrow-datafusion/issues/1329#issuecomment-973929764


   Hey guys, thanks for the quick answers and suggestions :-)
   Also, one thing I really enjoyed about your project is how easy it is to use: it literally took me a couple of minutes to get a TPC-H query running. Coming from database-server-land, this always amazes me!
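   Just to make that concrete, something along the lines of the sketch below is roughly all it took on my side (the file path and the exact shape of the Python API are from memory, so please read it as an assumption rather than a tested snippet):

```python
# Rough sketch: a TPC-H Q1-style aggregation through DataFusion's Python
# bindings. The Parquet path is a placeholder for a local TPC-H lineitem file.
import datafusion

ctx = datafusion.ExecutionContext()
ctx.register_parquet("lineitem", "tpch/lineitem.parquet")

batches = ctx.sql("""
    SELECT l_returnflag,
           COUNT(*)             AS cnt,
           SUM(l_extendedprice) AS total_price
    FROM lineitem
    GROUP BY l_returnflag
""").collect()
print(batches)
```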
   
   @Dandandan 
   > Hey @pdet
   > 
   > Thanks for opening this issue! I am personally very interested if there 
are components within the ecosystem that could be shared between engines such 
as DuckDB and DataFusion and the pyarrow library. I am thinking about query 
optimization, re-using UDFs, executing certain parts in one engine or the 
other, etc.
   
   For sure, I think this is one of the most interesting aspects of being able to share an in-memory data format with zero copies (exciting stuff :-)). I'm still considering showcasing "executing certain parts in one engine or the other" in the blog post; I just have to craft a realistic example for that!
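   To give an idea of the kind of thing I have in mind (purely a sketch; the engine split, paths, and column names are assumptions, not the example I'll end up using): DuckDB could do the scan and filter, hand the result over as Arrow batches, and DataFusion could finish the aggregation on the same in-memory data.

```python
# Sketch: run part of the query in DuckDB, hand off Arrow batches, and finish
# in DataFusion. Paths, columns, and the split point are made up for illustration.
import duckdb
import datafusion

con = duckdb.connect()
# DuckDB scans the Parquet file and applies the filter, returning an Arrow Table.
filtered = con.execute(
    "SELECT l_extendedprice, l_discount "
    "FROM 'tpch/lineitem.parquet' WHERE l_quantity < 24"
).arrow()

# DataFusion consumes the same batches (a single partition here) and aggregates.
ctx = datafusion.ExecutionContext()
ctx.register_record_batches("filtered", [filtered.to_batches()])
result = ctx.sql(
    "SELECT SUM(l_extendedprice * l_discount) AS revenue FROM filtered"
).collect()
print(result)
```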
   
   > For the timings - let me play a bit with your example - I'll come back to 
it. At first glance, those two things come to mind when looking at the code:
   > 
   > * It looks like DuckDB doesn't need to load all columns in memory whereas 
`register_record_batches` pre-loads the data (all columns) in memory.
   > * DataFusion can benefit from more parallelism for Parquet reading (See 
notes on question 2.) by splitting your data into _n_ equal partitions / files.
   
   So, as far as I understand, Arrow has two main types of objects: materialized ones (e.g., Arrow Tables, RecordBatches) and streamed ones (e.g., Record Batch Readers). DuckDB can consume/produce both of these and, quite similarly to DataFusion, it automatically parallelizes on the number of batches (if I got that right). In the code snippet I sent, we are consuming Arrow Tables precisely to start from a point where everything is already materialized in memory. From a streaming perspective, yes, we wouldn't need to load unnecessary columns/row groups.
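   For reference, this is roughly what those two shapes look like on the DuckDB side; the registration calls are the ones I remember from the Python client, so take the exact API (and the placeholder paths/columns) as assumptions:

```python
# Sketch: DuckDB consuming a materialized Arrow Table vs. a streamed
# RecordBatchReader. The Parquet path and column names are placeholders.
import duckdb
import pyarrow as pa
import pyarrow.dataset as ds

con = duckdb.connect()

# Materialized: an Arrow Table fully held in memory.
table = pa.table({"a": [1, 2, 3], "b": [10.0, 20.0, 30.0]})
con.register("materialized", table)
print(con.execute("SELECT SUM(b) FROM materialized").fetchall())

# Streamed: a RecordBatchReader over a Parquet dataset, so columns / row groups
# the query never touches don't have to be loaded up front.
reader = ds.dataset("tpch/lineitem.parquet").scanner(columns=["l_quantity"]).to_reader()
con.register("streamed", reader)
print(con.execute("SELECT AVG(l_quantity) FROM streamed").fetchall())
```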
   
   > Looking forward to the blog post!
   > 
   > > 1. It seems to me that DataFusion is not consuming streaming arrow 
objects (i.e., not fully materialized objects), is that correct?
   > 
   > I am not 100% certain what you mean here. `register_record_batches` (see 
PR #981) currently uses `MemTable` which keeps the data as batches in memory 
(in a fully materialized list). We don't have a streaming table provider as of 
now which consumes a stream / iterator of Arrow batches. Also would be 
interesting to see if we can integrate somehow with an Arrow `Table` instead.

   Yes, my question was again related to record batch readers. You answered it perfectly!
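   (For completeness, this is how I picture the materialized path together with the partitioning trick from the earlier bullet; the partition count and the way the batches are split are just assumptions for illustration:)

```python
# Sketch: register_record_batches keeps everything in memory (MemTable), and
# splitting the batches over several partitions gives DataFusion more parallelism.
import pyarrow as pa
import datafusion

table = pa.table({"x": list(range(1_000_000))})
batches = table.to_batches(max_chunksize=100_000)

# One partition is one unit of parallelism; spread the batches over 4 partitions.
n_partitions = 4
partitions = [batches[i::n_partitions] for i in range(n_partitions)]

ctx = datafusion.ExecutionContext()
ctx.register_record_batches("t", partitions)
print(ctx.sql("SELECT COUNT(*) AS cnt, SUM(x) AS total FROM t").collect())
```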
   
   
   
   @jorgecarleitao 
   Thanks for the suggestions! The main point of my post is to use DuckDB as an execution engine for Arrow, so basically I want to compare it with other libraries that have Python/R bindings, can run SQL, and integrate with Arrow Tables / Record Batches / Record Batch Readers. I'll check if any of those fit these criteria as well! :-)
   I'm also familiar with the H2OAI benchmark; I was focusing on TPC-H mainly because of the current blog post examples, but maybe I should have another look at using H2OAI as well.
   
   
   @alamb 
   Thanks for the list! Will surely have a look.
   
   Just as a clarification, I'm not planning to do a benchmark comparison with DataFusion as an engine; I just want to describe a bit which other options out there can query Arrow objects, and how mature/fast/easy to use they are.
   
   Thanks again for all of your answers!
   
   

