That approach makes a lot of sense. That said, it’s not as radical as they make it sound. The Volcano execution model went out a long time ago. Here’s the history from my career:

* When I was at Oracle in ’95, they used an improved version of Volcano in which “next” took a callback that was invoked with a few dozen rows before “next” returned.
* At SQLstream (2004) and LucidDB, operators worked on ~64KB buffers of rows serialized into a cache-efficient format. We used neither the “pull” approach (driven by a consuming thread) nor the “push” approach (driven by a producing thread); instead, a scheduler invoked operators that had work to do, and tried to invoke operators in sequence so that the data was still in cache.
* Following MonetDB and X100, every DB engine moved to SIMD-friendly data structures. The ones initially written for the Java heap (Hive and, yes, Spark) eventually followed suit.
* Drill makes extensive use of generated Java code, even for UDFs, carefully generated so that HotSpot can optimize it.
* More and more of the Java-based engines are moving to off-heap memory. It has many benefits, but I hear that HotSpot is not as good at optimizing accesses to off-heap memory as it is at accessing, say, a Java long[].
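The batched “next” described above can be sketched in a few lines. This is an illustrative toy, not code from any of the engines mentioned: the interface and class names are invented, and rows are plain ints. It contrasts the classic Volcano contract (one virtual call per row) with a batched contract (one call delivers a few dozen rows), which is where the overhead savings come from.

```java
import java.util.function.IntPredicate;

// Hypothetical sketch: Volcano row-at-a-time vs. batched iteration.
public class IterationModels {

  /** Volcano-style operator: one virtual call per row. */
  interface RowOperator {
    /** Next row, or -1 when exhausted (rows are plain ints here). */
    int next();
  }

  /** Batched operator: each call delivers up to a batch of rows. */
  interface BatchOperator {
    /** Fills {@code out}, returns the row count; 0 means exhausted. */
    int nextBatch(int[] out);
  }

  static RowOperator rowFilter(int[] data, IntPredicate p) {
    return new RowOperator() {
      int i = 0;
      public int next() {
        while (i < data.length) {
          int v = data[i++];
          if (p.test(v)) return v;
        }
        return -1;
      }
    };
  }

  static BatchOperator batchFilter(int[] data, IntPredicate p) {
    return new BatchOperator() {
      int i = 0;
      public int nextBatch(int[] out) {
        int n = 0;
        // Tight loop over a contiguous array: cache-friendly, and the
        // predicate test is the only per-row work inside the batch.
        while (i < data.length && n < out.length) {
          int v = data[i++];
          if (p.test(v)) out[n++] = v;
        }
        return n;
      }
    };
  }

  public static void main(String[] args) {
    int[] data = new int[1000];
    for (int i = 0; i < data.length; i++) data[i] = i;

    // Row-at-a-time: one virtual call per input row.
    RowOperator rows = rowFilter(data, v -> v % 2 == 0);
    int rowCount = 0;
    for (int v = rows.next(); v >= 0; v = rows.next()) rowCount++;

    // Batched: the same 500 result rows cross only 8 calls of 64 rows.
    BatchOperator batches = batchFilter(data, v -> v % 2 == 0);
    int[] buf = new int[64];
    int batchCount = 0, calls = 0;
    for (int n = batches.nextBatch(buf); n > 0; n = batches.nextBatch(buf)) {
      batchCount += n;
      calls++;
    }
    System.out.println(rowCount + " " + batchCount + " " + calls);
  }
}
```

Both operators produce the same 500 rows; the batched one amortizes the per-call overhead across 64 rows at a time, which is the same idea (at a smaller scale) as the 64KB-buffer design above.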
Following this trend, and looking to the future, these would be my architectural recommendations:

* Don’t write your own engine!
* Translate your queries to an algebra (e.g. Calcite).
* Have that algebra translate to a high-performance engine (e.g. Drill, Spark, Hive or Flink).
* Use an efficient memory format (e.g. Arrow) so that engines can efficiently exchange data.

Julian

> On May 27, 2016, at 3:15 AM, Albert <[email protected]> wrote:
>
> I was reading the article (and references) on the speedup gains in Spark 2:
> <https://databricks.com/blog/2016/05/23/apache-spark-as-a-compiler-joining-a-billion-rows-per-second-on-a-laptop.html>
>
> The main idea is that the generated physical code should now be data-centric
> instead of operator-centric, and preserve data locality.
>
> I am thinking maybe this applies to Calcite as well. In terms of switching
> to the data-centric approach, what could Calcite do and gain?
>
> Quote:
> The Future: Whole-stage Code Generation
>
> From the above observation, a natural next step for us was to explore the
> possibility of automatically generating this *handwritten* code at runtime,
> which we are calling “whole-stage code generation.” This idea is inspired
> by Thomas Neumann’s seminal VLDB 2011 paper, *Efficiently Compiling
> Efficient Query Plans for Modern Hardware*
> <http://www.vldb.org/pvldb/vol4/p539-neumann.pdf>. For more details on the
> paper, Adrian Colyer has coordinated with us to publish a review on The
> Morning Paper blog
> <http://blog.acolyer.org/2016/05/23/efficiently-compiling-efficient-query-plans-for-modern-hardware>
> today.
>
> The goal is to leverage whole-stage code generation so the engine can
> achieve the performance of hand-written code, yet provide the functionality
> of a general-purpose engine.
> Rather than relying on operators for processing data at runtime, these
> operators together generate code at runtime and collapse each fragment of
> the query, where possible, into a single function and execute that
> generated code instead.
>
> --
> ~~~~~~~~~~~~~~~
> no mistakes
> ~~~~~~~~~~~~~~~~~~
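To make the quoted idea concrete, here is a hand-written sketch of the before/after shapes (the query and all names are invented for illustration; a real engine would generate and compile the fused form at runtime rather than write it by hand). The query is SELECT COUNT(*) FROM t WHERE x > 100: first as a chain of operators, each row crossing virtual calls, then collapsed into the single function that whole-stage code generation would produce.

```java
// Illustrative sketch of whole-stage code generation, hand-written.
public class WholeStageSketch {

  /** Minimal Volcano-style iterator; -1 signals end of input. */
  interface Iter { long next(); }

  /** Operator-at-a-time: scan -> filter -> count, with a virtual
   * call per row at each operator boundary. */
  static long countViaOperators(long[] t) {
    Iter scan = new Iter() {
      int i = 0;
      public long next() { return i < t.length ? t[i++] : -1; }
    };
    Iter filter = new Iter() {
      public long next() {
        long v;
        while ((v = scan.next()) != -1) {
          if (v > 100) return v;   // WHERE x > 100
        }
        return -1;
      }
    };
    long count = 0;
    while (filter.next() != -1) count++;
    return count;
  }

  /** The fused form: scan, filter and count collapsed into one loop
   * with no virtual calls -- a shape the JIT can optimize well. */
  static long countFused(long[] t) {
    long count = 0;
    for (long v : t) {
      if (v > 100) count++;
    }
    return count;
  }

  public static void main(String[] args) {
    long[] t = new long[1000];
    for (int i = 0; i < t.length; i++) t[i] = i;
    System.out.println(countViaOperators(t) + " " + countFused(t));
  }
}
```

The two functions compute the same result; the point is that the fused loop keeps the current row in registers and preserves data locality, which is exactly what the blog post means by "data-centric instead of operator-centric."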
