Re: [I] Row groups are read out of order or with completely different values [datafusion]

via GitHub Thu, 06 Jun 2024 15:05:44 -0700


alamb commented on issue #10572:
URL: https://github.com/apache/datafusion/issues/10572#issuecomment-2153478083


   Hi @twitu  -- thanks for this
   
   Some comments:
   
   I took a quick peek at 
https://github.com/nautechsystems/nautilus_experiments/tree/efficient-query 
and it looks like it reads only a single batch out: 
https://github.com/nautechsystems/nautilus_experiments/blob/a4ceb950de3b4bbc43ec82b64ee1495d077f5116/src/bin/single_row_group.rs#L45
   
   This means you are running the equivalent of `SELECT ... LIMIT 4000` or 
something similar.
   
   > It seems like order=false and repartition=false seems to be the holy grail 
of performant, low-memory footprint streaming in sorted order.
   
   I would expect those settings to give the lowest latency (time to first 
batch).
   
   > However, As you can see, even after specifying the sort order of the file 
the ORDER BY query still loads the whole file and does some kind of sort 
operation. 
   
   I would expect that it loads the first row group of each file and begins 
merging them together (e.g. if you did an `EXPLAIN ...` on your SQL you would 
see a `SortPreservingMerge` in the physical plan).
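
   To illustrate what I mean by merging, here is a minimal std-only sketch of 
a streaming k-way merge. It is not DataFusion's actual implementation: 
`SortedMerge` is a hypothetical name, and plain `i64` iterators stand in for 
sorted file streams. The point is that only one pending value per input is 
buffered, so the first output is available without reading any input to the 
end.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Hypothetical stand-in for a sort-preserving merge over already-sorted inputs.
struct SortedMerge<I: Iterator<Item = i64>> {
    inputs: Vec<I>,
    /// Min-heap holding at most one pending (value, input index) per input.
    heap: BinaryHeap<Reverse<(i64, usize)>>,
}

impl<I: Iterator<Item = i64>> SortedMerge<I> {
    fn new(mut inputs: Vec<I>) -> Self {
        let mut heap = BinaryHeap::new();
        // Prime the heap with the first value of each input.
        for (idx, input) in inputs.iter_mut().enumerate() {
            if let Some(value) = input.next() {
                heap.push(Reverse((value, idx)));
            }
        }
        SortedMerge { inputs, heap }
    }
}

impl<I: Iterator<Item = i64>> Iterator for SortedMerge<I> {
    type Item = i64;

    fn next(&mut self) -> Option<i64> {
        // Emit the smallest pending value, then refill from the same input.
        let Reverse((value, idx)) = self.heap.pop()?;
        if let Some(next) = self.inputs[idx].next() {
            self.heap.push(Reverse((next, idx)));
        }
        Some(value)
    }
}
```

   Calling `.take(n)` on such a merge pulls only as many values from each 
input as are needed to produce the first `n` outputs, which is why I would not 
expect a full sort of the whole file.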
   
   @suremarc, @matthewmturner, and others have been working on optimizing a 
similar case -- see https://github.com/apache/datafusion/issues/7490 for 
example. We have other items tracked in 
https://github.com/apache/datafusion/issues/10313 (notably we have a way to 
avoid opening all the files if we know the data is already sorted and doesn't 
overlap: https://github.com/apache/datafusion/issues/10316).
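
   As a rough sketch of the "sorted and doesn't overlap" idea (an 
illustration only, not the code from that issue): assuming we know the min and 
max of the sort column for each file, we can check whether the ranges are 
disjoint and, if so, read the files one after another instead of merging them. 
`FileRange` and `non_overlapping_order` are hypothetical names.

```rust
/// Hypothetical per-file statistics: min and max of the sort column.
#[derive(Debug, Clone, Copy, PartialEq)]
struct FileRange {
    min: i64,
    max: i64,
}

/// Returns the files ordered by `min` if no two ranges overlap, so they can be
/// read back to back; returns `None` if a merge is still needed.
fn non_overlapping_order(mut files: Vec<FileRange>) -> Option<Vec<FileRange>> {
    files.sort_by_key(|f| f.min);
    // Each file must end no later than the next one begins (ties are fine).
    let disjoint = files.windows(2).all(|pair| pair[0].max <= pair[1].min);
    if disjoint {
        Some(files)
    } else {
        None
    }
}
```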
   
   cc @NGA-TRAN 

