Hi Paul (and all),

I would like to come back to your remark in the context of metadata
collection distribution: "Drill does not have a general purpose task
distribution (map/reduce) mechanism".

The question is: does it have to be a full-blown map-reduce mechanism, or
could it be something simpler and more straightforward, like a map with a
single reduce task (star distribution), with no resiliency handling, no
persisting of intermediate results, etc. - i.e. a single task failure just
fails the entire job and query?

For instance, the following seems to apply to the Parquet metadata
collection task:
- The entire distributed operation should not take more than a couple of
seconds. Even for hundreds of thousands of queried files or more, it should
scale out well thanks to DFS scalability. Say, if metadata collection by the
foreman takes about 10-15 seconds on a 50K-file query, in theory it should
take around a second or less if the work is distributed across 15 nodes or
more (to be verified :-) - see the rough arithmetic after this list).
- In most environments the cumulative cluster resources taken by such a job
are low - we are (hopefully) talking about at most several GBs of
processed/transferred data. Hence the cost of failing the entire job on an
individual task/node failure is also low - much lower than in giant
map-reduce jobs, where preserving completed work is important to avoid
wasting resources.
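
Rough arithmetic behind the first point, using the same numbers (the
per-footer cost is only derived from the observed 10-15 seconds, not
measured separately):

  50,000 footers in 10-15 s on one node  ->  roughly 0.2-0.3 ms per footer
  50,000 footers over 15 nodes           ->  ~3,300 footers per node  ->  ~0.7-1 s
  (plus RPC fan-out and aggregation overhead, which should be small at this
  scale)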

The above could also apply to other tasks that parallelize well and involve
a similar volume of processed data.

To summarize - could it be that distributing work like Parquet metadata
collection and pruning is a "low-hanging fruit"? E.g. converting the
multi-threaded MetadataGatherer job into a distributed one, by turning the
local threads' work into RPCs to the cluster's Drillbits, should in theory
not be a huge change? A minimal sketch of what I have in mind is below.
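
To illustrate the "map with a single reduce" shape of it - this is only a
sketch, and every type in it (DrillbitStub, ParquetFileMetadata) is a
hypothetical placeholder, not an existing Drill class:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

public class DistributedMetadataSketch {

  /** Hypothetical RPC stub to one Drillbit. */
  interface DrillbitStub {
    // Read the footers of a batch of files (ideally local to that node)
    // and return the per-file metadata the planner needs.
    CompletableFuture<List<ParquetFileMetadata>> readFooters(List<String> files);
  }

  /** Placeholder for whatever per-file metadata the planner needs. */
  static class ParquetFileMetadata { }

  static List<ParquetFileMetadata> gather(List<DrillbitStub> drillbits,
                                          List<List<String>> assignments) throws Exception {
    // "Map" step: one RPC per Drillbit, each covering its slice of the files.
    List<CompletableFuture<List<ParquetFileMetadata>>> futures = new ArrayList<>();
    for (int i = 0; i < drillbits.size(); i++) {
      futures.add(drillbits.get(i).readFooters(assignments.get(i)));
    }
    // Single "reduce" in the foreman: collect everything; any exception from
    // any node simply fails the whole query - no retries, no persisted state.
    List<ParquetFileMetadata> all = new ArrayList<>();
    for (CompletableFuture<List<ParquetFileMetadata>> f : futures) {
      all.addAll(f.get());
    }
    return all;
  }
}

The foreman would then feed the aggregated metadata into pruning and
distribution planning exactly as it does today.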

Could this be a function that the newly developed metadata APIs could provide,
or is this not in scope? (@Vitalii & Vova, I could not listen to today's
hangouts session unfortunately, sorry for possible ignorance)

Thanks,
Best Regards,
Alex

On Thu, Aug 9, 2018 at 7:51 PM Paul Rogers <par0...@yahoo.com.invalid>
wrote:

> Hi Alex,
>
> Perhaps Parth can jump in here as he has deeper knowledge of Parquet.
>
> My understanding is that the planning-time metadata is used for partition
> (directory) and row group pruning. By scanning the footer of each Parquet
> row group, Drill can determine whether that group contains data that the
> query must process. Only groups that could contain target data are included
> in the list of groups to be scanned by the query's scan operators.
>
> For example, suppose we do a query over a date range and a US state:
>
> SELECT saleDate, prodId, quantity
> FROM `sales`
> WHERE dir0 BETWEEN '2018-07-01' AND '2018-07-31'
>   AND state = 'CA'
>
> Suppose sales is a directory that contains subdirectories partitioned by
> date, and that state is a dictionary-encoded field with metadata in the
> Parquet footer.
>
> In this case, Drill reads the directory structure to do the partition
> pruning at planning time. Only the matching directories are added to the
> list of files to be scanned at run time.
>
> Then, Drill scans the footers of each Parquet row group to find those with
> `state` fields that contain the value of 'CA'. (I believe the footer
> scanning is done only for matching partitions, but I've not looked at that
> code recently.)
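>
> Roughly, that per-row-group check boils down to something like the sketch
> below. This is illustrative only - it uses the parquet-mr footer API
> directly rather than Drill's actual pruning code, the class and method
> names are made up, and it assumes a string (BINARY) column:
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.Path;
> import org.apache.parquet.hadoop.ParquetFileReader;
> import org.apache.parquet.hadoop.metadata.BlockMetaData;
> import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
> import org.apache.parquet.hadoop.metadata.ParquetMetadata;
> import org.apache.parquet.io.api.Binary;
>
> public class FooterPruneSketch {
>   // True if any row group in the file *may* contain column == value,
>   // judged only by the min/max statistics stored in the footer.
>   static boolean mayContain(Path file, String column, String value) throws Exception {
>     ParquetMetadata footer = ParquetFileReader.readFooter(new Configuration(), file);
>     for (BlockMetaData rowGroup : footer.getBlocks()) {
>       for (ColumnChunkMetaData col : rowGroup.getColumns()) {
>         if (!col.getPath().toDotString().equals(column)) {
>           continue;
>         }
>         if (!col.getStatistics().hasNonNullValue()) {
>           return true;  // no usable stats: cannot prune safely
>         }
>         String min = ((Binary) col.getStatistics().genericGetMin()).toStringUsingUTF8();
>         String max = ((Binary) col.getStatistics().genericGetMax()).toStringUsingUTF8();
>         if (min.compareTo(value) <= 0 && max.compareTo(value) >= 0) {
>           return true;  // this row group may contain the value
>         }
>       }
>     }
>     return false;  // every row group's stats exclude the value: prune the file
>   }
> }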
>
> Finally, Drill passes the matching set of files to the distribution
> planner, which assigns files to minor fragments distributed across the
> Drill cluster. The result attempts to maximize locality and minimize skew.
>
> Could this have been done differently? There is a trade-off. If the work
> is done in the planner, then it can be very expensive for large data sets.
> But, on the other hand, the resulting distribution plan has minimal skew:
> each file that is scheduled for scanning does, in fact, need to be scanned.
>
> Suppose that Drill pushed the partition and row group filtering to the
> scan operator. In this case, operators that received files outside of the
> target date range would do nothing, while those with files in the range
> would need to do further work, resulting in possible skew. (One could argue
> that, if files are randomly distributed across nodes, then, on average,
> each scan operator will see roughly the same number of included and
> excluded files, reducing skew. This would be an interesting experiment to
> try.)
>
> Another approach might be to do a two-step execution. First distribute the
> work of doing the partition pruning and row group footer checking. Gather
> the results. Then, finish planning the query and distribute the execution
> as is done today. Here, much new code would be needed to implement this new
> form of task, which is probably why it has not yet been done.
>
> Might be interesting for someone to prototype that to see if we get
> improvement in planning performance. My guess is that a prototype could be
> built that does the "pre-selection" work as a Drill query, just to see if
> distribution of the work helps. If that is a win, then perhaps a more
> robust solution could be developed from there.
>
> The final factor is Drill's crude-but-effective metadata caching
> mechanism. Drill reads footer and directory information for all Parquet
> files in a directory structure and stores it in a series of possibly-large
> JSON files, one in each directory. This mechanism has no good concurrency
> control. (This is a long-standing, hard-to-fix bug.) Centralizing access in
> the Foreman reduces the potential for concurrent writes on metadata refresh
> (but does not eliminate them, since the Foremen are themselves distributed.)
> Further, since the metadata is in one big file (in the root directory), and
> is in JSON (which does not support block-splitting), then it is actually
> not possible to distribute the planning-time selection work. I understand
> that the team is looking at alternative metadata implementations to avoid
> some of these existing issues. (There was some discussion on the dev list
> over the last few months.)
>
> In theory it would be possible to adopt a different metadata structure
> that would better support distributed selection planning. Perhaps farm out
> work at each directory level: in each directory, farm out work for the
> subdirectories (if they match the partition selection) and files.
> Recursively apply this down the directory tree. But, again, Drill does not
> have a general purpose task distribution (map/reduce) mechanism; Drill only
> knows how to distribute queries. Metadata would also need to be stored in a
> format that is versioned (to avoid concurrent read/write conflicts) and
> block-oriented (to allow splitting the work of reading a large metadata file.)
>
> It is interesting to note that several systems (including Netflix's
> Metacat [1]) also hit bottlenecks in metadata access when metadata is
> stored in a relational DB, as Hive does. Hive-oriented tools have the same
> kind of metadata bottleneck, but for different reasons: because of the huge
> load placed on the Hive metastore. So, there is an opportunity for
> innovation in this area by observing, as you did, that metadata itself is a
> big data problem and requires a distributed, concurrent solution.
>
> That's my two-cents. Would be great if folks with more detailed Parquet
> experience could fill in the details (or correct any errors.)
>
> Thanks,
> - Paul
>
>
> [1]
> https://medium.com/netflix-techblog/metacat-making-big-data-discoverable-and-meaningful-at-netflix-56fb36a53520
>
>     On Thursday, August 9, 2018, 9:09:59 AM PDT, Oleksandr Kalinin <
> alexk...@gmail.com> wrote:
>
>  Hi Paul and Drill developers,
>
> I am sorry if this is slightly off-topic, but I noticed that Drill's foreman
> collects metadata of all queried files in the PLANNING state (see e.g. the
> MetadataGatherer class), at least in the case of Parquet when using the dfs
> plugin. That costs a lot of time when the number of queried files is
> substantial, since the MetadataGatherer task is not distributed across
> cluster nodes. What is the
> reason behind this collection? This doesn't seem to match schema-on-read
> philosophy, but then it's maybe just me or my setup, I am still very new to
> Drill.
>
> Also, I appreciate the advantage of metastore-free operations - in theory
> it makes Drill more reliable and less costly to run. At the same time there
> is the Drill Metadata store project, meaning that the evolution is actually
> shifting away from a metastore-free system. What are the reasons for that
> evolution? Or is it going to be an additional, optional feature?
>
> Thanks,
> Best Regards,
> Alex
>
> On Tue, Aug 7, 2018 at 10:25 PM, Paul Rogers <par0...@yahoo.com.invalid>
> wrote:
>
> > Hi Qiaoyi,
> >
> > In general, optimal performance occurs when a system knows the schema at
> > the start and can fully optimize based on that schema. Think C or C++
> > compilers compared with Java or Python.
> >
> > On the other hand, the JVM HotSpot optimizer has shown that one can
> > achieve very good performance via incremental optimization, but at the
> cost
> > of extreme runtime complexity. The benefit is that Java is much more
> > flexible, machine independent, and simpler than C or C++ (at least for
> > non-system applications.)
> >
> > Python is the other extreme: it is so dynamic that the literature has
> > shown that it is very difficult to optimize Python at the compiler or
> > runtime level. (Though, there is some interesting research on this topic.
> > See [1], [2].)
> >
> > Drill is somewhere in the middle. Drill does not do code generation at
> the
> > start like Impala or Spark do. Nor is it fully interpreted. Rather, Drill
> > is roughly like Java: code generation is done at runtime based on the
> > observed data types. (The JVM does machine code generation based on
> > observed execution patterns.) The advantage is that Drill is able to
> > achieve its pretty-good performance without the cost of a metadata system
> > to provide schema at plan time.
> >
> > So, to get the absolute fastest performance (think Impala), you must pay
> > the cost of a metadata system for all queries. Drill gets nearly as good
> > performance without the complexity of the Hive metastore -- a pretty good
> > tradeoff.
> >
> > If I may ask, what is your area of interest? Are you looking to use Drill
> > for some project? Or, just interested in how Drill works?
> >
> > Thanks,
> > - Paul
> >
> > [1] Towards Practical Gradual Typing: https://blog.acolyer.
> > org/2015/08/03/towards-practical-gradual-typing/
> > [2] Is Sound Gradual Typing Dead? https://blog.acolyer.
> > org/2016/02/05/is-sound-gradual-typing-dead/
> >
> >
> >
> >
> >
> >
> >
> >    On Sunday, August 5, 2018, 11:36:52 PM PDT, 丁乔毅(智乔) <
> > qiaoyi.din...@alibaba-inc.com> wrote:
> >
> >  Thanks Paul, good to know the design principles of the Drill query
> > execution process model.
> > I am very new to Drill, please bear with me.
> >
> > One more question.
> > As you mentioned, schema-free processing is the key advantage over Spark;
> > is there any performance consideration behind this design, apart from the
> > techniques of dynamic codegen and vectorized computation?
> >
> > Regards,
> > Qiaoyi
> >
> >
> > ------------------------------------------------------------------
> > From: Paul Rogers <par0...@yahoo.com.INVALID>
> > Sent: Saturday, August 4, 2018, 02:27
> > To: dev <dev@drill.apache.org>
> > Subject: Re: Is Drill query execution processing model just the same idea with
> > the Spark whole-stage codegen improvement
> >
> > Hi Qiaoyi,
> > As you noted, Drill and Spark have similar models -- but with important
> > differences.
> > Drill is schema-on-read (also called "schema less"). In particular, this
> > means that Drill does not know the schema of the data until the first row
> > (actually "record batch") arrives at each operator. Once Drill sees that
> > first batch, it has a data schema, and can generate the corresponding
> code;
> > but only for that one operator.
> > The above process repeats up the fragment ("fragment" is Drill's term for
> > a Spark stage.)
> > I believe that Spark requires (or at least allows) the user to define a
> > schema up front. This is particularly true for the more modern data frame
> > APIs.
> > Do you think the Spark improvement would apply to Drill's case of
> > determining the schema operator-by-operator up the DAG?
> > Thanks,
> > - Paul
> >
> >
> >
> >    On Friday, August 3, 2018, 8:57:29 AM PDT, 丁乔毅(智乔) <
> > qiaoyi.din...@alibaba-inc.com> wrote:
> >
> >
> > Hi, all.
> >
> > I'm very new to Apache Drill.
> >
> > I'm quite interested in Drill query execution's implementation.
> > After a little bit of source code reading, I found it is built on a
> > processing model quite like a data-centric, push-based style, which is
> > very similar to the idea behind the Spark whole-stage codegen
> > improvement (JIRA ticket
> https://issues.apache.org/jira/browse/SPARK-12795)
> >
> > And I wonder, is there any detailed documentation about this? What's the
> > consideration behind this design in the Drill project? :)
> >
> > Regards,
> > Qiaoyi
> >
>
