alamb commented on PR #2648: URL: https://github.com/apache/arrow-datafusion/pull/2648#issuecomment-1147323320
I agree that plan level JIT is a great idea, thank you @waynexia for writing up the document as well as this PR. I am sorry it took so long to review it. TLDR:

1. I think JIT'ing Exprs (https://github.com/apache/arrow-datafusion/issues/2122) is a required step for fully JIT'ing plans.
2. datafusion-contrib is likely a good place for this kind of work until it is ready -- I already feel like we have several partly done features (JIT exprs, scheduler) in the core. However, given most of this PR is in the `jit` crate, I am not opposed to adding it too.
3. I would love to see some performance experiments showing the effect of this work.

I am not sure if you have seen the following paper, but it gives a good treatment of the various tradeoffs between vectorized and JIT compilation of query plans, and I think it is quite relevant to this discussion: https://db.in.tum.de/~kersten/vectorization_vs_compilation.pdf?lang=de

Here is the canonical plan I think of that benefits from JIT'ing:

```text
            ▲                                ▲
            │                                │
            │                                │
┌───────────────────────┐      ┌──────────────────────────────┐
│    HashAggregate      │      │        Compiled Node         │
│       gby: x          │      │        JIT'ed code:          │
│     agg: SUM(y+5)     │      │                              │
└───────────────────────┘      │    if x > 5 and y != 10:     │
            ▲                  │        hash (x)              │
            │                  │        probe hash table      │
            │                  │        update SUM            │
┌───────────────────────┐      │                              │
│       Filter:         │      └──────────────────────────────┘
│       x > 5 AND       │                     ▲
│       y != 10         │                     │
└───────────────────────┘                     │
            ▲                  ┌──────────────────────────────┐
            │                  │            Scan              │
            │                  └──────────────────────────────┘
┌───────────────────────┐
│        Scan           │
└───────────────────────┘
```

In this case, the code to filter and to update the hash table is compiled together, so that input rows from the data source directly update the hash table without ever leaving registers. I believe this kind of plan can be made *super* fast, with results that are state of the art in terms of query performance. The idea of JIT'ing multiple plan nodes together is a necessary part of this for sure, but one of the last ones.
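To make the "rows never leave registers" point concrete, here is a hedged sketch in plain Rust (not the actual `jit` crate output; the function name and types are hypothetical) of the loop body a compiled node like the one above could emit, with the filter, hash probe, and SUM update fused into a single pass over the input:

```rust
use std::collections::HashMap;

/// Hypothetical illustration of the fused operator body: the filter
/// (x > 5 AND y != 10) and the aggregate (gby: x, agg: SUM(y+5)) run
/// in one pass, so each qualifying row updates the hash table directly
/// instead of being materialized into an intermediate filtered batch.
fn fused_filter_aggregate(xs: &[i64], ys: &[i64]) -> HashMap<i64, i64> {
    let mut groups: HashMap<i64, i64> = HashMap::new();
    for (&x, &y) in xs.iter().zip(ys.iter()) {
        // Filter: x > 5 AND y != 10
        if x > 5 && y != 10 {
            // Probe hash table on x, update SUM(y + 5)
            *groups.entry(x).or_insert(0) += y + 5;
        }
    }
    groups
}
```

A vectorized engine would instead evaluate the filter over the whole batch, materialize the surviving rows, and then feed them to the aggregate; the fused form keeps each row in registers between the two steps, which is the tradeoff the paper above quantifies.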
The first step is fully JIT'ing the expr evaluation. I really like the idea of using DataFusion's extensibility model to develop / prototype this approach in datafusion-contrib or another repo until it is mature enough to bring into the core codebase. Though to be honest, perhaps the same approach could/should be used for the new scheduler @tustvold is working on too 🤔

In terms of the cache invalidation issues, I think https://github.com/apache/arrow-datafusion/issues/2199 will help the current vectorized approach to minimize the number of times a batch

I will also try to update https://github.com/apache/arrow-datafusion/issues/2122 with some more specifics.
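As a hedged illustration of what "JIT'ing the expr evaluation" buys (this is a toy closure-compilation sketch, not the Cranelift-based `jit` crate, and the `Expr` type here is invented for the example): an expression like `y + 5` can be lowered once into a single callable, instead of being interpreted node-by-node for every row:

```rust
// Toy expression "compiler": lowers a tiny expression tree into one
// boxed closure up front, so per-row evaluation does no tree walking.
enum Expr {
    Col(usize),                // column index into the input row
    Lit(i64),                  // literal value
    Add(Box<Expr>, Box<Expr>), // left + right
}

fn compile(expr: &Expr) -> Box<dyn Fn(&[i64]) -> i64> {
    match expr {
        Expr::Col(i) => {
            let i = *i;
            Box::new(move |row| row[i])
        }
        Expr::Lit(v) => {
            let v = *v;
            Box::new(move |_| v)
        }
        Expr::Add(l, r) => {
            let (l, r) = (compile(l), compile(r));
            Box::new(move |row| l(row) + r(row))
        }
    }
}
```

A real JIT goes further and emits native code rather than nested closures, but the structure is the same: pay the compilation cost once per query, not once per row.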
