alamb commented on PR #2648:
URL: 
https://github.com/apache/arrow-datafusion/pull/2648#issuecomment-1147323320

   I agree that plan level JIT is a great idea, thank you @waynexia  for 
writing up the document as well as this PR. I am sorry it took so long to 
review it. 
   
   TLDR:
   1. I think JIT'ing Exprs 
(https://github.com/apache/arrow-datafusion/issues/2122)  is a required step 
for fully JIT'ing plans
   2. datafusion-contrib is likely a good place for this kind of work until it 
is ready -- I already feel like we have several partly done features (JIT 
exprs, scheduler) in the core. However, given most of this PR is in the `jit` 
crate I am not opposed to adding it too,
   3. I would love to see some performance experiments showing the effect of 
this work
   
   I am not sure if you have seen the following paper, but it gives a good 
treatment on the various tradeoffs between vectorized and JIT's compilation of 
query plans and I think it is quite relevant to this discussion:  
https://db.in.tum.de/~kersten/vectorization_vs_compilation.pdf?lang=de 
   
   Here is the canonical plan I think of that benefits from JIT'ing:
   
   ```text
               ▲                                      ▲               
               │                                      │               
               │                                      │               
   ┌───────────────────────┐          ┌──────────────────────────────┐
   │     HashAggregate     │          │        Compiled Node         │
   │        gby: x         │          │         JIT'ed code:         │
   │     agg: SUM(y+5)     │          │                              │
   └───────────────────────┘          │    if x > 5 and y != 10:     │
               ▲                      │      hash (x)                │
               │                      │      probe hash table        │
               │                      │      update SUM              │
   ┌───────────────────────┐          │                              │
   │        Filter:        │          └──────────────────────────────┘
   │       x > 5 AND       │                          ▲               
   │        y != 10        │                          │               
   └───────────────────────┘                          │               
               ▲                      ┌──────────────────────────────┐
               │                      │             Scan             │
               │                      └──────────────────────────────┘
   ┌───────────────────────┐                                          
   │         Scan          │                                          
   └───────────────────────┘                                          
   ```
   
   In this case, the code to filter and update the hashtable are compiled 
together so that input rows from the data source directly update the hash table 
without ever leaving registers. I believe this kind of plan can be made *super* 
fast and results state of the art in terms of query performance.
   
   The idea of JIT'ing multiple plan nodes together is a necessary part of this 
for sure, but one of the last ones. The first step is fully JIT'ing the expr 
evaluation.
   
   I really like the idea of using DataFusion's extensibility model to develop 
/ prototype this approach in a datafusion-contrib or other repo until it is 
mature enough to bring into the core codebase. Though to be honest, perhaps the 
same approach could/should be used for the new scheduler @tustvold  is working 
on too 🤔 
   
   In terms of the cache invalidation issues, I think 
https://github.com/apache/arrow-datafusion/issues/2199 will help the current 
vectorized approach to minimize the number of times a batch
   
   I will also try to update 
https://github.com/apache/arrow-datafusion/issues/2122 with some more specifics
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to