Hi Paul, Parth,

Thanks a lot for your insightful, enlightening comments.

It seems possible to avoid or minimise the described problem through careful
selection of input file size and the number of files per query. However, the
"real life" observation is that these factors are often not easy to control.
In such cases, distributed metadata collection would likely improve response
time greatly, because reading metadata off DFS should scale out very well.
Prototyping this bit as a proof of value sounds very interesting. As I gain
more experience with Drill, I will try to keep an eye on this topic and come
up with some concrete, meaningful figures to better understand the potential
gain.

I can only confirm a rather negative experience with RDBMS as the underlying
store for Hadoop components. Basically, one of the Hadoop ecosystem's key value
propositions is that it solves some typical scalability and performance
issues that exist in RDBMS. Then along comes the Hive metastore (or the Oozie
database), which introduces exactly those problems back into the equation :)

Best Regards,
Alex

On Fri, Aug 10, 2018 at 2:32 AM, Parth Chandra <par...@apache.org> wrote:

> In addition to directory and row group pruning, the physical plan
> generation looks at data locality for every row group and schedules the
> scan for the row group on the node where data is local. (Remote reads can
> kill performance like nothing else can).
>
> Short answer: query planning requires metadata, with schema, statistics, and
> locality being the most important pieces of metadata we need. Drill will
> try to infer the schema where it is not available, but if the user can provide
> a schema, Drill will do a better job. The same goes for statistics: Drill can
> plan a query without stats, but if stats are available, the execution
> plan will be much more efficient.
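>
> For example, one long-standing way to hand Drill a schema is a view with
> explicit casts over the raw files (the workspace, path, and column names
> here are only illustrative, reusing Paul's sales example below):
>
>   CREATE VIEW dfs.tmp.sales_v AS
>   SELECT CAST(saleDate AS DATE) AS saleDate,
>          CAST(prodId AS INT) AS prodId,
>          CAST(quantity AS INT) AS quantity
>   FROM dfs.`/data/sales`;
>
> Queries against the view then start out with known column types, which gives
> the planner and the generated code something firmer to work with.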
>
> So the job boils down to gathering metadata efficiently. As the metadata
> size increases, the task becomes similar to that of executing a query on a
> large data set. The metadata cache is a starting point for collecting
> metadata. Essentially, it means we have pre-fetched metadata. However, with
> a large amount of metadata, say several GB, this approach becomes the
> bottleneck and the solution is really to distribute the metadata collection
> as well. That's where the metastore comes in.
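>
> (As a concrete example of that pre-fetch step: today the cache is typically
> built with something like
>
>   REFRESH TABLE METADATA dfs.`/data/sales`;
>
> where the path is made up. The command walks the directory tree once and
> writes the cache files, so that subsequent planning reads the cache instead
> of every individual footer.)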
>
> The next level of complexity will then shift to the planning process. A
> single thread processing a few GB of metadata could itself become the
> bottleneck. So the next step in Drill's evolution will be to partition and
> distribute the planning process itself. That will be fun :)
>
>
> On Thu, Aug 9, 2018 at 10:51 AM, Paul Rogers <par0...@yahoo.com.invalid>
> wrote:
>
> > Hi Alex,
> >
> > Perhaps Parth can jump in here as he has deeper knowledge of Parquet.
> >
> > My understanding is that the planning-time metadata is used for partition
> > (directory) and row group pruning. By scanning the footer of each Parquet
> > row group, Drill can determine whether that group contains data that the
> > query must process. Only groups that could contain target data are included
> > in the list of groups to be scanned by the query's scan operators.
> >
> > For example, suppose we do a query over a date range and a US state:
> >
> > SELECT saleDate, prodId, quantity
> > FROM `sales`
> > WHERE dir0 BETWEEN '2018-07-01' AND '2018-07-31'
> >   AND state = 'CA'
> >
> > Suppose sales is a directory that contains subdirectories partitioned by
> > date, and that state is a dictionary-encoded field with metadata in the
> > Parquet footer.
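> >
> > For concreteness, the assumed layout would be something like (file names
> > made up):
> >
> >   sales/
> >     2018-07-01/part-00000.parquet
> >     2018-07-02/part-00000.parquet
> >     ...
> >
> > so that dir0 resolves to the first-level directory name under `sales`.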
> >
> > In this case, Drill reads the directory structure to do the partition
> > pruning at planning time. Only the matching directories are added to the
> > list of files to be scanned at run time.
> >
> > Then, Drill scans the footers of each Parquet row group to find those with
> > `state` fields that contain the value of 'CA'. (I believe the footer
> > scanning is done only for matching partitions, but I've not looked at that
> > code recently.)
> >
> > Finally, Drill passes the matching set of files to the distribution
> > planner, which assigns files to minor fragments distributed across the
> > Drill cluster. The result attempts to maximize locality and minimize skew.
> >
> > Could this have been done differently? There is a trade-off. If the work
> > is done in the planner, then it can be very expensive for large data sets.
> > But, on the other hand, the resulting distribution plan has minimal skew:
> > each file that is scheduled for scanning does, in fact, need to be scanned.
> >
> > Suppose that Drill pushed the partition and row group filtering to the
> > scan operator. In this case, operators that received files outside of the
> > target date range would do nothing, while those with files in the range
> > would need to do further work, resulting in possible skew. (One could argue
> > that, if files are randomly distributed across nodes, then, on average,
> > each scan operator will see roughly the same number of included and
> > excluded files, reducing skew. This would be an interesting experiment to
> > try.)
> >
> > Another approach might be to do a two-step execution. First distribute the
> > work of doing the partition pruning and row group footer checking. Gather
> > the results. Then, finish planning the query and distribute the execution
> > as is done today. Here, much new code would be needed to implement this new
> > form of task, which is probably why it has not yet been done.
> >
> > Might be interesting for someone to prototype that to see if we get an
> > improvement in planning performance. My guess is that a prototype could be
> > built that does the "pre-selection" work as a Drill query, just to see if
> > distribution of the work helps. If that is a win, then perhaps a more
> > robust solution could be developed from there.
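> >
> > A rough sketch of what that pre-selection pass might look like, expressed
> > as an ordinary Drill query over the same (assumed) sales layout, using the
> > implicit filepath and filename columns:
> >
> >   SELECT DISTINCT filepath, filename
> >   FROM `sales`
> >   WHERE dir0 BETWEEN '2018-07-01' AND '2018-07-31';
> >
> > The distributed scan would do the directory pruning, and the result would
> > be the candidate file list that the planner currently builds in a single
> > thread.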
> >
> > The final factor is Drill's crude-but-effective metadata caching
> > mechanism. Drill reads footer and directory information for all Parquet
> > files in a directory structure and stores it in a series of possibly-large
> > JSON files, one in each directory. This mechanism has no good concurrency
> > control. (This is a long-standing, hard-to-fix bug.) Centralizing access in
> > the Foreman reduces the potential for concurrent writes on metadata refresh
> > (but does not eliminate them, since the Foremen are themselves distributed.)
> > Further, since the metadata is in one big file (in the root directory), and
> > is in JSON (which does not support block-splitting), it is actually
> > not possible to distribute the planning-time selection work. I understand
> > that the team is looking at alternative metadata implementations to avoid
> > some of these existing issues. (There was some discussion on the dev list
> > over the last few months.)
> >
> > In theory it would be possible to adopt a different metadata structure
> > that would better support distributed selection planning. Perhaps farm out
> > work at each directory level: in each directory, farm out work for the
> > subdirectories (if they match the partition selection) and files.
> > Recursively apply this down the directory tree. But, again, Drill does not
> > have a general-purpose task distribution (map/reduce) mechanism; Drill only
> > knows how to distribute queries. The metadata could also be stored in a
> > format that is versioned (to avoid concurrent read/write conflicts) and
> > block-oriented (to allow splitting the work of reading a large metadata
> > file.)
> >
> > It is interesting to note that several systems (including Netflix's
> > Metacat [1]) also hit bottlenecks in metadata access when metadata is
> > stored in a relational DB, as Hive does. Hive-oriented tools have the same
> > kind of metadata bottleneck, but for different reasons: because of the huge
> > load placed on the Hive metastore. So, there is an opportunity for
> > innovation in this area by observing, as you did, that metadata itself is a
> > big data problem and requires a distributed, concurrent solution.
> >
> > That's my two cents. Would be great if folks with more detailed Parquet
> > experience could fill in the details (or correct any errors.)
> >
> > Thanks,
> > - Paul
> >
> >
> > [1] https://medium.com/netflix-techblog/metacat-making-big-data-discoverable-and-meaningful-at-netflix-56fb36a53520
> >
> >     On Thursday, August 9, 2018, 9:09:59 AM PDT, Oleksandr Kalinin <
> > alexk...@gmail.com> wrote:
> >
> >  Hi Paul and Drill developers,
> >
> > I am sorry if this is slightly off-topic, but I noticed that Drill's foreman
> > collects metadata for all queried files in the PLANNING state (ref. e.g. the
> > MetadataGatherer class), at least in the case of Parquet when using the dfs
> > plugin. That costs a lot of time when the number of queried files is
> > substantial, since the MetadataGatherer task is not distributed across
> > cluster nodes. What is the reason behind this collection? This doesn't seem
> > to match the schema-on-read philosophy, but then maybe it's just me or my
> > setup; I am still very new to Drill.
> >
> > Also, I appreciate the advantage of metastore-free operations - in theory
> > it makes Drill more reliable and less costly to run. At the same time, there
> > is the Drill metadata store project, meaning that the evolution is actually
> > shifting away from a metastore-free system. What are the reasons for that
> > evolution? Or is it going to be an additional, optional feature?
> >
> > Thanks,
> > Best Regards,
> > Alex
> >
> > On Tue, Aug 7, 2018 at 10:25 PM, Paul Rogers <par0...@yahoo.com.invalid>
> > wrote:
> >
> > > Hi Qiaoyi,
> > >
> > > In general, optimal performance occurs when a system knows the schema at
> > > the start and can fully optimize based on that schema. Think C or C++
> > > compilers compared with Java or Python.
> > >
> > > On the other hand, the JVM HotSpot optimizer has shown that one can
> > > achieve very good performance via incremental optimization, but at the
> > > cost of extreme runtime complexity. The benefit is that Java is much more
> > > flexible, machine independent, and simpler than C or C++ (at least for
> > > non-system applications.)
> > >
> > > Python is the other extreme: it is so dynamic that the literature has
> > > shown that it is very difficult to optimize Python at the compiler or
> > > runtime level. (Though, there is some interesting research on this topic.
> > > See [1], [2].)
> > >
> > > Drill is somewhere in the middle. Drill does not do code generation at the
> > > start like Impala or Spark do. Nor is it fully interpreted. Rather, Drill
> > > is roughly like Java: code generation is done at runtime based on the
> > > observed data types. (The JVM does machine code generation based on
> > > observed execution patterns.) The advantage is that Drill is able to
> > > achieve its pretty-good performance without the cost of a metadata system
> > > to provide schema at plan time.
> > >
> > > So, to get the absolute fastest performance (think Impala), you must pay
> > > the cost of a metadata system for all queries. Drill gets nearly as good
> > > performance without the complexity of the Hive metastore -- a pretty good
> > > tradeoff.
> > >
> > > If I may ask, what is your area of interest? Are you looking to use Drill
> > > for some project? Or, just interested in how Drill works?
> > >
> > > Thanks,
> > > - Paul
> > >
> > > [1] Towards Practical Gradual Typing: https://blog.acolyer.org/2015/08/03/towards-practical-gradual-typing/
> > > [2] Is Sound Gradual Typing Dead? https://blog.acolyer.org/2016/02/05/is-sound-gradual-typing-dead/
> > >
> > >
> > >    On Sunday, August 5, 2018, 11:36:52 PM PDT, 丁乔毅(智乔) <
> > > qiaoyi.din...@alibaba-inc.com> wrote:
> > >
> > >  Thanks Paul, good to know the design principles of the Drill query
> > > execution processing model.
> > > I am very new to Drill, so please bear with me.
> > >
> > > One more question.
> > > As you mentioned, schema-free processing is the key feature giving Drill
> > > an advantage over Spark. Is there any performance consideration behind
> > > this design, apart from the techniques of dynamic codegen and vectorized
> > > computation?
> > >
> > > Regards,
> > > Qiaoyi
> > >
> > >
> > > ------------------------------------------------------------------
> > > From: Paul Rogers <par0...@yahoo.com.INVALID>
> > > Sent: Saturday, August 4, 2018, 02:27
> > > To: dev <dev@drill.apache.org>
> > > Subject: Re: Is Drill query execution processing model just the same idea
> > > with the Spark whole-stage codegen improvement
> > >
> > > Hi Qiaoyi,
> > > As you noted, Drill and Spark have similar models -- but with important
> > > differences.
> > > Drill is schema-on-read (also called "schema less"). In particular, this
> > > means that Drill does not know the schema of the data until the first row
> > > (actually "record batch") arrives at each operator. Once Drill sees that
> > > first batch, it has a data schema, and can generate the corresponding
> > > code; but only for that one operator.
> > > The above process repeats up the fragment ("fragment" is Drill's term for
> > > a Spark stage.)
> > > I believe that Spark requires (or at least allows) the user to define a
> > > schema up front. This is particularly true for the more modern data frame
> > > APIs.
> > > Do you think the Spark improvement would apply to Drill's case of
> > > determining the schema operator-by-operator up the DAG?
> > > Thanks,
> > > - Paul
> > >
> > >
> > >
> > >    On Friday, August 3, 2018, 8:57:29 AM PDT, 丁乔毅(智乔) <
> > > qiaoyi.din...@alibaba-inc.com> wrote:
> > >
> > >
> > > Hi, all.
> > >
> > > I'm very new to Apache Drill.
> > >
> > > I'm quite interested in the implementation of Drill's query execution.
> > > After a little bit of source code reading, I found it is built on a
> > > processing model quite like a data-centric, push-based style, which is
> > > very similar to the idea behind the Spark whole-stage codegen
> > > improvement (JIRA ticket https://issues.apache.org/jira/browse/SPARK-12795).
> > >
> > > And I wonder, is there any detailed documentation about this? What was the
> > > consideration behind this design in the Drill project? :)
> > >
> > > Regards,
> > > Qiaoyi
>
