In Drill, record batches flow continuously between the different fragments only in the absence of blocking operators, that is, operators such as join, sort, and aggregate that require the entire data set before they can produce any output, and those operators are almost always present in BI cases.
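To make that distinction concrete, here is a minimal sketch of streaming versus blocking operators. The interfaces below are invented for illustration only; they are not Drill's actual operator API:

    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;

    interface RecordBatch {}   // stand-in for one columnar batch of records

    // A streaming operator (filter, project, ...) can emit output for each
    // input batch as it arrives, so batches keep flowing through the fragment.
    interface StreamingOperator {
        RecordBatch next(RecordBatch input);
    }

    // A blocking operator (sort, hash aggregate, ...) must consume its whole
    // input before it can emit anything, which stalls the pipeline right there.
    class BlockingSort {
        private final List<RecordBatch> buffered = new ArrayList<>();

        void consume(RecordBatch input) {
            buffered.add(input);              // buffer the entire input first
        }

        Iterator<RecordBatch> emit() {
            // a real sort would order the buffered data before emitting
            return buffered.iterator();
        }
    }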
The size of a packet (RecordBatch) is not necessarily small or large; it is all relative. It may be small compared to the entire data set or result set, but it is large compared to a single record. In some cases, the entire data set is a single record batch. In Drill, a foreman creates an execution plan, serializes it, and sends it to the drillbits for execution, so Drill also needs to deliver "code" to the workers (drillbits).

Thank you,
Vlad

On 2018/08/07 08:06:24, Joel Pfaff <[email protected]> wrote:
> Hello,
>
> There is one important difference between the two execution models: how
> the data flows inside the clusters.
>
> In Drill, the complete chain of operators is instantiated at the very
> beginning of the query, and the data constantly flows between the
> different fragments as small packets (named RecordBatch).
> Whereas in Spark, every query is decomposed into stages and tasks, and
> no data is exchanged before the end of a task. This can cause a Spark
> cluster to be mostly idle because a subset of tasks takes longer than
> the others, and this would prevent tasks from the next stage from being
> scheduled. So in terms of data flow, the Spark model is more: "execute
> all these tasks, then do all the exchanges, then execute all the next
> tasks".
> When scheduling a task to a worker, the code is serialized and sent to
> the worker to execute, and this step creates some additional latency in
> the process.
>
> This difference allows Drill to have much better interactivity,
> especially for shorter queries with multiple wide operators (sort,
> group by, ...).
> Spark's execution model has other advantages in terms of elasticity and
> resiliency, since it can support the addition of executors, or survive
> the loss of some executors, all of that in the middle of a query.
>
> Regards,
> Joel
>
> On Mon, Aug 6, 2018 at 8:36 AM 丁乔毅(智乔) <[email protected]> wrote:
>
> > Thanks Paul, good to know the design principles of the Drill query
> > execution process model.
> > I am very new to Drill, please bear with me.
> >
> > One more question.
> > As you mentioned, schema-free processing is the key feature that gives
> > Drill an advantage over Spark. Is there any performance consideration
> > behind this design beyond the techniques of dynamic codegen and
> > vectorized computation?
> >
> > Regards,
> > Qiaoyi
> >
> > ------------------------------------------------------------------
> > From: Paul Rogers <[email protected]>
> > Sent: Saturday, August 4, 2018, 02:27
> > To: dev <[email protected]>
> > Subject: Re: Is Drill query execution processing model just the same
> > idea with the Spark whole-stage codegen improvement
> >
> > Hi Qiaoyi,
> > As you noted, Drill and Spark have similar models -- but with important
> > differences.
> > Drill is schema-on-read (also called "schema-less"). In particular,
> > this means that Drill does not know the schema of the data until the
> > first row (actually "record batch") arrives at each operator. Once
> > Drill sees that first batch, it has a data schema, and can generate
> > the corresponding code; but only for that one operator.
> > The above process repeats up the fragment ("fragment" is Drill's term
> > for a Spark stage).
> > I believe that Spark requires (or at least allows) the user to define
> > a schema up front. This is particularly true for the more modern data
> > frame APIs.
> > Do you think the Spark improvement would apply to Drill's case of
> > determining the schema operator-by-operator up the DAG?
> > Thanks,
> > - Paul
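A rough sketch of the per-operator, schema-driven code generation Paul describes above. All class and method names here are made up for illustration; they are not Drill's real codegen classes:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    class SchemaDrivenOperator {
        interface Schema {}
        interface RecordBatch { Schema schema(); }
        interface CompiledEval { RecordBatch apply(RecordBatch b); }

        // One compiled implementation per schema actually observed; the
        // schema is only known once the first batch reaches this operator.
        // (A real implementation would key on schema equality, not identity.)
        private final Map<Schema, CompiledEval> compiled = new ConcurrentHashMap<>();

        RecordBatch process(RecordBatch batch) {
            CompiledEval eval =
                compiled.computeIfAbsent(batch.schema(), this::generateFor);
            return eval.apply(batch);   // fast path for every later batch
        }

        private CompiledEval generateFor(Schema s) {
            // Drill emits and compiles operator code specialized to the
            // schema at this point; an identity function stands in here.
            return b -> b;
        }
    }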
> > On Friday, August 3, 2018, 8:57:29 AM PDT, 丁乔毅(智乔) <[email protected]> wrote:
> >
> > Hi, all.
> >
> > I'm very new to Apache Drill.
> >
> > I'm quite interested in the implementation of Drill's query execution.
> > After a little bit of source code reading, I found it is built on a
> > processing model in a data-centric, push-based style, which is very
> > similar to the idea behind the Spark whole-stage codegen improvement
> > (JIRA ticket https://issues.apache.org/jira/browse/SPARK-12795).
> >
> > And I wonder, is there any detailed documentation about this? What are
> > the considerations behind this design in the Drill project? : )
> >
> > Regards,
> > Qiaoyi
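Joel's pipelined-versus-staged distinction earlier in the thread could be sketched like this. This is toy code showing only the shape of the two models; it is neither Drill's nor Spark's actual scheduler:

    import java.util.List;

    class ExecutionModels {
        interface Task { void run(); }
        interface Exchange { void shuffle(); }

        // Spark-style stage barrier: every task of a stage must finish before
        // the exchange runs and the next stage can be scheduled, so a few
        // straggler tasks can leave the rest of the cluster idle.
        static void staged(List<List<Task>> stages, List<Exchange> exchanges) {
            for (int i = 0; i < stages.size(); i++) {
                stages.get(i).parallelStream().forEach(Task::run); // wait for ALL
                if (i < exchanges.size()) {
                    exchanges.get(i).shuffle();
                }
            }
        }

        // Drill-style pipeline: all fragments start up front and record
        // batches stream between them continuously, with no per-stage barrier
        // (until a blocking operator such as a sort forces one).
        static void pipelined(List<Runnable> fragments) {
            fragments.forEach(f -> new Thread(f).start());
        }
    }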
