Re: [I] Row groups are read out of order or with completely different values [datafusion]

via GitHub Thu, 06 Jun 2024 23:57:13 -0700


twitu commented on issue #10572:
URL: https://github.com/apache/datafusion/issues/10572#issuecomment-2154211869


   
   I also added a query explanation; here are the results.
   
   ```
       plan: Analyze
         Sort: data.ts_init ASC NULLS LAST
           Projection: data.bid, data.ask, data.bid_size, data.ask_size, data.ts_event, data.ts_init
             TableScan: data,
   ```
   ```
       plan: Analyze
         Projection: data.bid, data.ask, data.bid_size, data.ask_size, data.ts_event, data.ts_init
           TableScan: data,
   ```
   
   You'll see that the queries with `ORDER BY` have a Sort node in the plan. It's not clear to me why the plan still contains a sort even though I specify the sort order in the configuration. I hope the optimizations you've mentioned will take this into account.
   
   ```
           file_sort_order: vec![vec![Expr::Sort(Sort {
               expr: Box::new(col("ts_init")),
               asc: true,
               nulls_first: true,
           })]],
   ```
   
   You'll see in the experiments repo that I've implemented binaries for both reading a single row group and reading the whole file. The queries are the same; the only difference is how the resulting stream is consumed. I don't think this is equivalent to adding a `LIMIT` clause, because as far as the query is concerned I'm reading the whole file. It is only that the consumer decides to stop after reading one row group.
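   
   To illustrate the distinction, here is a minimal std-only Rust sketch of pull-based consumption. It is not DataFusion code: `Batch`, `lazy_row_groups` and `consume` are hypothetical names, each batch stands in for one row group, and a real reader may buffer or prefetch ahead of the consumer.
   
```rust
use std::cell::Cell;
use std::rc::Rc;

/// Hypothetical stand-in for a record batch: one row group's `ts_init` values.
type Batch = Vec<u64>;

/// Lazily yields one batch per row group. The shared counter records how many
/// row groups were actually "decoded", i.e. pulled by the consumer.
fn lazy_row_groups(total: usize, decoded: Rc<Cell<usize>>) -> impl Iterator<Item = Batch> {
    (0..total).map(move |i| {
        decoded.set(decoded.get() + 1);
        let start = (i as u64) * 3;
        vec![start, start + 1, start + 2]
    })
}

/// Consumes the stream. The producer is identical in every call; only the
/// consumer decides whether to stop early (`stop_after`) or read everything.
/// Returns (rows consumed, row groups decoded).
fn consume(total: usize, stop_after: Option<usize>) -> (usize, usize) {
    let decoded = Rc::new(Cell::new(0));
    let mut rows = 0;
    let mut seen = 0;
    for batch in lazy_row_groups(total, Rc::clone(&decoded)) {
        rows += batch.len();
        seen += 1;
        if stop_after == Some(seen) {
            break; // the consumer walks away; nothing further is pulled
        }
    }
    (rows, decoded.get())
}
```
   
   Under these assumptions, `consume(4, Some(1))` decodes only one of the four row groups while `consume(4, None)` decodes all four, even though the producer is the same in both calls.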
   
   > > It seems like order=false and repartition=false seems to be the holy grail of performant, low-memory footprint streaming in sorted order.
   
   > I would expect those settings will be the lowest latency (time to first batch)
   
   Surprisingly, these settings also give the lowest latency and memory footprint when reading the full file, as shown in the table above.
   
   If you need an additional contributor on any of the above-mentioned issues, I'm happy to help :smile:
   


