Hi Paul and Drill developers,

Sorry if this is slightly off-topic, but I noticed that Drill's foreman
collects the metadata of all queried files during the PLANNING state (see,
e.g., the MetadataGatherer class), at least for Parquet when using the dfs
plugin. This costs a lot of time when the number of queried files is
substantial, since the MetadataGatherer task is not distributed across
cluster nodes. What is the reason behind this collection? It doesn't seem
to match the schema-on-read philosophy, but maybe it's just me or my
setup; I am still very new to Drill.
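
To make the concern concrete, here is a rough Python sketch (not Drill
code; the file names and row counts are made up) of the difference between
one node reading every footer itself and fanning the reads out:

```python
from concurrent.futures import ThreadPoolExecutor

files = [f"part-{i}.parquet" for i in range(8)]

def read_footer(path):
    # Stand-in for reading one Parquet footer (row count, column stats, ...);
    # in reality each call costs a filesystem round trip.
    return (path, 1000)

# What I observe today: one node reads every footer, one after another.
sequential = [read_footer(f) for f in files]

# What a distributed/parallel gather could look like: fan the reads out.
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = list(pool.map(read_footer, files))

print(sequential == parallel)  # same metadata, less wall-clock time at scale
```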

Also, I appreciate the advantage of metastore-free operation: in theory it
makes Drill more reliable and less costly to run. At the same time, there
is the Drill Metadata store project, which suggests the evolution is
actually shifting away from a metastore-free system. What are the reasons
for that evolution? Or is it going to be an additional, optional feature?

Thanks,
Best Regards,
Alex

On Tue, Aug 7, 2018 at 10:25 PM, Paul Rogers <[email protected]>
wrote:

> Hi Qiaoyi,
>
> In general, optimal performance occurs when a system knows the schema at
> the start and can fully optimize based on that schema. Think C or C++
> compilers compared with Java or Python.
>
> On the other hand, the JVM HotSpot optimizer has shown that one can
> achieve very good performance via incremental optimization, but at the cost
> of extreme runtime complexity. The benefit is that Java is much more
> flexible, machine independent, and simpler than C or C++ (at least for
> non-system applications.)
>
> Python is the other extreme: it is so dynamic that the literature has
> shown that it is very difficult to optimize Python at the compiler or
> runtime level. (Though, there is some interesting research on this topic.
> See [1], [2].)
>
> Drill is somewhere in the middle. Drill does not do code generation at the
> start like Impala or Spark do. Nor is it fully interpreted. Rather, Drill
> is roughly like Java: code generation is done at runtime based on the
> observed data types. (The JVM does machine code generation based on
> observed execution patterns.) The advantage is that Drill is able to
> achieve its pretty-good performance without the cost of a metadata system
> to provide schema at plan time.
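
A toy Python model of that idea (hypothetical names, nothing like Drill's
actual codegen API): an operator compiles a specialized kernel the first
time it sees a batch's schema, then reuses it for every later batch with
the same schema.

```python
_kernel_cache = {}

def kernel_for(schema):
    """schema: tuple of (column_name, type_name) observed in the first batch."""
    if schema not in _kernel_cache:
        # "Code generation": emit source text specialized to the observed
        # layout, so later batches run compiled code with no per-row checks.
        idx = {name: i for i, (name, _) in enumerate(schema)}
        src = f"def kernel(batch):\n    return [row[{idx['amount']}] * 2 for row in batch]"
        ns = {}
        exec(compile(src, "<generated>", "exec"), ns)
        _kernel_cache[schema] = ns["kernel"]
    return _kernel_cache[schema]

schema = (("name", "VARCHAR"), ("amount", "INT"))
batch = [("a", 10), ("b", 20)]
print(kernel_for(schema)(batch))  # → [20, 40]
```

The cache is what makes the scheme pay off: codegen cost is paid once per
observed schema, not once per batch.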
>
> So, to get the absolute fastest performance (think Impala), you must pay
> the cost of a metadata system for all queries. Drill gets nearly as good
> performance without the complexity of the Hive metastore -- a pretty good
> tradeoff.
>
> If I may ask, what is your area of interest? Are you looking to use Drill
> for some project? Or, just interested in how Drill works?
>
> Thanks,
> - Paul
>
> [1] Towards Practical Gradual Typing: https://blog.acolyer.org/2015/08/03/towards-practical-gradual-typing/
> [2] Is Sound Gradual Typing Dead? https://blog.acolyer.org/2016/02/05/is-sound-gradual-typing-dead/
>
>     On Sunday, August 5, 2018, 11:36:52 PM PDT, 丁乔毅(智乔) <
> [email protected]> wrote:
>
>  Thanks Paul, good to know the design principles of the Drill query
> execution processing model.
> I am very new to Drill, please bear with me.
>
> One more question.
> As you mentioned, schema-free processing is the key feature giving Drill
> an advantage over Spark. Is there any performance consideration behind
> this design, beyond the techniques of dynamic codegen and vectorized
> computation?
>
> Regards,
> Qiaoyi
>
>
> ------------------------------------------------------------------
> From: Paul Rogers <[email protected]>
> Sent: Saturday, August 4, 2018, 02:27
> To: dev <[email protected]>
> Subject: Re: Is Drill query execution processing model just the same idea
> with the Spark whole-stage codegen improvement
>
> Hi Qiaoyi,
> As you noted, Drill and Spark have similar models -- but with important
> differences.
> Drill is schema-on-read (also called "schema-less"). In particular, this
> means that Drill does not know the schema of the data until the first row
> (actually "record batch") arrives at each operator. Once Drill sees that
> first batch, it has a data schema, and can generate the corresponding code;
> but only for that one operator.
> The above process repeats up the fragment ("fragment" is Drill's term for
> a Spark stage.)
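
As a rough illustration (plain Python, hypothetical names, not Drill's
operator interface), such an operator learns its schema from the first
batch rather than from a catalog:

```python
class ProjectOperator:
    """Toy schema-on-read operator: the schema is unknown until the
    first record batch arrives; then it is inferred and cached."""
    def __init__(self, column):
        self.column = column
        self.schema = None  # not known at plan time

    def process(self, batch):  # batch: list of dict rows
        if self.schema is None:
            # First batch: discover the schema now, at run time.
            self.schema = {k: type(v).__name__ for k, v in batch[0].items()}
        return [row[self.column] for row in batch]

op = ProjectOperator("city")
out = op.process([{"city": "Oslo", "pop": 700_000}])
print(out, op.schema)  # → ['Oslo'] {'city': 'str', 'pop': 'int'}
```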
> I believe that Spark requires (or at least allows) the user to define a
> schema up front. This is particularly true for the more modern data frame
> APIs.
> Do you think the Spark improvement would apply to Drill's case of
> determining the schema operator-by-operator up the DAG?
> Thanks,
> - Paul
>
>
>
>     On Friday, August 3, 2018, 8:57:29 AM PDT, 丁乔毅(智乔) <
> [email protected]> wrote:
>
>
> Hi, all.
>
> I'm very new to Apache Drill.
>
> I'm quite interested in the implementation of Drill's query execution.
> After a bit of source-code reading, I found that it is built on a
> data-centric, push-based processing model, which is very similar to the
> idea behind Spark's whole-stage codegen improvement (JIRA ticket
> https://issues.apache.org/jira/browse/SPARK-12795).
>
> I wonder, is there any detailed documentation about this? What were the
> design considerations behind this in the Drill project? : )
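
For readers unfamiliar with the distinction, here is a minimal Python
sketch of a push-based operator chain (hypothetical, not Drill or Spark
code): each operator pushes batches into the next, instead of the root
pulling rows up iterator-style:

```python
class Sink:
    """Terminal operator: accumulates whatever is pushed into it."""
    def __init__(self):
        self.rows = []
    def push(self, batch):
        self.rows.extend(batch)

class Filter:
    """Push-based: a parent pushes batches in; survivors go downstream."""
    def __init__(self, predicate, downstream):
        self.predicate, self.downstream = predicate, downstream
    def push(self, batch):
        self.downstream.push([r for r in batch if self.predicate(r)])

sink = Sink()
op = Filter(lambda r: r % 2 == 0, sink)
for batch in ([1, 2, 3], [4, 5, 6]):
    op.push(batch)
print(sink.rows)  # → [2, 4, 6]
```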
>
> Regards,
> Qiaoyi
>
