Re: 回复：Is Drill query execution processing model just the same idea with the Spark whole-stage codegen improvement

Vitalii Diravka Sat, 11 Aug 2018 10:41:47 -0700

Hi all!

I agree with Paul and Parth, Hive Metastore with it's RDMBS is the easiest
way to manage metadata and statistics in better way than now. And it can be
used not only for Parquet,
so it will be good enhancement for Drill. Of course Drill will have own API
for Metastore, so later other tools can be used instead of Hive Metastore.
It is also interesting to note there is a work on implementing usage of
HBase for Hive Metastore [1].


But Paul correctly mentioned that this system will require explicit
maintaining of consistency between metadata and data. This is not easy task
to make it automatically.
Besides that as was mentioned the central metadata store is a bottleneck
for distributed tools in Hadoop.
Therefore Drill Metastore can be used only optionally for some cases to
improve the performance and reliability of reading big amount of data.

In perspective looks like Drill can use the Iceberg format to store the
data [2]. It has own metadata maintaining mechanism based on separate meta
files, which are updated automatically.
I think this format can bring a lot of useful possibilities for Drill, but
it is still under development.

[1] https://issues.apache.org/jira/browse/HIVE-9452
[2] https://github.com/Netflix/iceberg


Kind regards
Vitalii


On Sat, Aug 11, 2018 at 5:05 AM Paul Rogers <[email protected]>
wrote:

> Can't resist just a final couple of thoughts on this.
>
> First, we discussed how Drill infers schema from input files. We've
> discussed elsewhere how that can lead to ambiguties (reader 1 does not have
> any way to know what schema reader 2 might read, and so reader 1 has to
> guess a type if columns are missing. Often Drill readers guess nullable
> int, leading to type collisions downstream.)
>
> We also discussed that Drill uses Parquet metadata to plan queries, which
> can, at scale, cause long planning times.
>
> It is ironic that the two don't work together: all that Parquet planning
> does not translate to telling the readers the expected column types. If all
> our files have column X, but only half have the newly-added column Y, Drill
> does not tell both readers that "If you see a file without Y, assume that Y
> has type T." Parquet will still guess nullable int.
>
> So, at the very least, would be great to pass along type information when
> available, such as with Parquet. This would require an extension to Drill's
> SchemaPath class to create a "SchemaElement" class that includes the type,
> when known.
>
>
> Second, I want to echo is the observation that an RDBMS is often not the
> most scalable solution for big data. I'm sure in the early days it was
> easiest to choose an RDBMS before distributed DBs were widely available.
> But, if something new is done today, the design should be based on a
> distributed storage; we should not follow existing tools in choosing an
> RDBMS.
>
> A related issue is that Hive metastore seems to get a bad rap because
> users often repeatedly ask to refresh metadata so that they can see the
> newest files. Apparently, the metadata can become stale unless someone
> manually requests to update it, and users learn to do so with frightening
> frequency. Drill suffers a bit from the stale data problem with its Parquet
> metadata files; some user has to refresh them. (As mentioned, if two or
> more users do so concurrently, file corruption can occur which, one hopes,
> a distributed DB can help resolve.)
>
>  A modern metastore should automagically update file info as files arrive.
> This post explains an HDFS hook that could be used:
> https://community.hortonworks.com/questions/57661/hdfs-best-way-to-trigger-execution-at-file-arrival.html
> Haven't tried it myself, but it looks promising. If Drill queues up
> refreshes based on actual file changes, then Drill controls refreshes and
> could sidestep the over-frequent refresh issue (and even the concurrency
> bug.)
>
>
> In short, if Drill provides a metastore, I hope Drill can learn from the
> experience (good and bad) of prior tools to come up when a solution that is
> even more scalable than that which is on offer today.
>
>
> Thanks,
> - Paul
>
>
>
>     On Friday, August 10, 2018, 4:55:37 PM PDT, Oleksandr Kalinin <
> [email protected]> wrote:
>
>  Hi Paul, Parth,
>
> Thanks a lot for your insightful, enlightening comments.
>
> It seems that it is possible to avoid or minimise described problem by
> careful selection of input file size and number of files for the query.
> However, "real life" observation is that these factors are often not easy
> to control. In such cases distributed metadata collection would likely
> improve response time greatly because reading metadata off DFS should scale
> out very well. Prototyping this bit for proof of value sounds very
> interesting. As I will be gaining more experience with Drill, I will try to
> keep an eye on this topic and come up with some concrete meaningful figures
> to better understand potential gain.
>
> I can only confirm rather negative experience with RDBMS as underlying
> store for Hadoop components. Basically, one of Hadoop ecosystem's key value
> propositions is that it solves some typical scalability and performance
> issues existing in RDBMS. Then there comes Hive meta store (or Oozie
> database) that introduce exactly those problems back into the equation :)
>
> Best Regards,
> Alex
>
> On Fri, Aug 10, 2018 at 2:32 AM, Parth Chandra <[email protected]> wrote:
>
> > In addition to directory and row group pruning, the physical plan
> > generation looks at data locality for every row group and schedules the
> > scan for the row group on the node where data is local. (Remote reads can
> > kill performance like nothing else can).
> >
> > Short answer, query planning requires metadata; schema, statistics, and
> > locality being the most important pieces of metadata we need. Drill will
> > try to infer schema where it is not available but if the user can provide
> > schema, Drill will do a better job. The same with statistics, Drill can
> > plan a query without stats but if the stats are available, the execution
> > plan will be much more efficient.
> >
> > So the job boils down to gathering metadata efficiently. As the metadata
> > size increases, the task becomes similar to that of executing a query on
> a
> > large data set. The metadata cache is a starting point for collecting
> > metadata. Essentially, it means we have pre-fetched metadata. However,
> with
> > a large amount of metadata, say several GB, this approach becomes the
> > bottleneck and the solution is really to distribute the metadata
> collection
> > as well. That's where the metastore comes in.
> >
> > The next level of complexity will then shift to the planning process. A
> > single thread processing a few GB of metadata could itself become the
> > bottleneck. So the next step in Drill's evolution will be to partition
> and
> > distribute the planning process itself. That will be fun :)
> >
> >
> > On Thu, Aug 9, 2018 at 10:51 AM, Paul Rogers <[email protected]>
> > wrote:
> >
> > > Hi Alex,
> > >
> > > Perhaps Parth can jump in here as he has deeper knowledge of Parquet.
> > >
> > > My understanding is that the planning-time metadata is used for
> partition
> > > (directory) and row group pruning. By scanning the footer of each
> Parquet
> > > row group, Drill can determine whether that group contains data that
> the
> > > query must process. Only groups that could contain target data are
> > included
> > > in the list of groups to be scanned by the query's scan operators.
> > >
> > > For example, suppose we do a query over a date range and a US state:
> > >
> > > SELECT saleDate, prodId, quantity FROM `sales`WHERE dir0 BETWEEN
> > > '2018-07-01` AND '2018-07-31'AND state = 'CA'
> > >
> > > Suppose sales is a directory that contains subdirectories partitioned
> by
> > > date, and that state is a dictionary-encoded field with metadata in the
> > > Parquet footer.
> > >
> > > In this case, Drill reads the directory structure to do the partition
> > > pruning at planning time. Only the matching directories are added to
> the
> > > list of files to be scanned at run time.
> > >
> > > Then, Drill scans the footers of each Parquet row group to find those
> > with
> > > `state` fields that contain the value of 'CA'. (I believe the footer
> > > scanning is done only for matching partitions, but I've not looked at
> > that
> > > code recently.)
> > >
> > > Finally, Drill passes the matching set of files to the distribution
> > > planner, which assigns files to minor fragments distributed across the
> > > Drill cluster. The result attempts to maximize locality and minimize
> > skew.
> > >
> > > Could this have been done differently? There is a trade-off. If the
> work
> > > is done in the planner, then it can be very expensive for large data
> > sets.
> > > But, on the other hand, the resulting distribution plan has minimal
> skew:
> > > each file that is scheduled for scanning does, in fact, need to be
> > scanned.
> > >
> > > Suppose that Drill pushed the partition and row group filtering to the
> > > scan operator. In this case, operators that received files outside of
> the
> > > target date range would do nothing, while those with files in the range
> > > would need to do further work, resulting in possible skew. (One could
> > argue
> > > that, if files are randomly distributed across nodes, then, on average,
> > > each scan operator will see roughly the same number of included and
> > > excluded files, reducing skew. This would be an interesting experiment
> to
> > > try.)
> > >
> > > Another approach might be to do a two-step execution. First distribute
> > the
> > > work of doing the partition pruning and row group footer checking.
> Gather
> > > the results. Then, finish planning the query and distribute the
> execution
> > > as is done today. Here, much new code would be needed to implement this
> > new
> > > form of task, which is probably why it has not yet been done.
> > >
> > > Might be interesting for someone to prototype that to see if we get
> > > improvement in planning performance. My guess is that, a prototype
> could
> > be
> > > built that does the "pre-selection" work as a Drill query, just to see
> if
> > > distribution of the work helps. If that is a win, then perhaps a more
> > > robust solution could be developed from there.
> > >
> > > The final factor is Drill's crude-but-effective metadata caching
> > > mechanism. Drill reads footer and directory information for all Parquet
> > > files in a directory structure and stores it in a series of
> > possibly-large
> > > JSON files, one in each directory. This mechanism has no good
> concurrency
> > > control. (This is a long-standing, hard-to-fix bug.) Centralizing
> access
> > in
> > > the Foreman reduces the potential for concurrent writes on metadata
> > refresh
> > > (but does not eliminate them since the Forman are themselves
> > distributed.)
> > > Further, since the metadata is in one big file (in the root directory),
> > and
> > > is in JSON (which does not support block-splitting), then it is
> actually
> > > not possible to distribute the planning-time selection work. I
> understand
> > > that the team is looking at alternative metadata implementations to
> avoid
> > > some of these existing issues. (There was some discussion on the dev
> list
> > > over the last few months.)
> > >
> > > In theory it would be possible to adopt a different metadata structure
> > > that would better support distributed selection planning. Perhaps farm
> > out
> > > work at each directory level: in each directory, farm out work for the
> > > subdirectories (if they match the partition selection) and files.
> > > Recursively apply this down the directory tree. But, again, Drill does
> > not
> > > have a general purpose task distribution (map/reduce) mechanism; Drill
> > only
> > > knows how to distribute queries. Store metadata in a format that is
> > > versioned (to avoid concurrent read/write access), and block-oriented
> (to
> > > allow splitting the work to read a large metadata file.)
> > >
> > > It is interesting to note that several systems (including NetFlix's
> > > MetaCat, [1]) also hit bottlenecks in metadata access when metadata is
> > > stored in a relational DB, as Hive does. Hive-oriented tools have the
> > same
> > > kind of metadata bottleneck, but for different reasons: because of the
> > huge
> > > load placed on the Hive metastore. So, there is an opportunity for
> > > innovation in this area by observing, as you did, that metadata itself
> > is a
> > > big data problem and  requires a distributed, concurrent solution.
> > >
> > > That's my two-cents. Would be great if folks with more detailed Parquet
> > > experience could fill in the details (or correct any errors.)
> > >
> > > Thanks,
> > > - Paul
> > >
> > >
> > > [1] https://medium.com/netflix-techblog/metacat-
> > > making-big-data-discoverable-and-meaningful-at-netflix-56fb36a53520
> > >
> > >    On Thursday, August 9, 2018, 9:09:59 AM PDT, Oleksandr Kalinin <
> > > [email protected]> wrote:
> > >
> > >  Hi Paul and Drill developers,
> > >
> > > I am sorry for slight off-topic maybe, but I noticed that Drill's
> foreman
> > > collects metadata of all queried files in PLANNING state (ref. class
> > > e.g. MetadataGatherer), at least in case of Parquet when using dfs
> > plugin.
> > > That costs a lot of time when number of queried files is substantial
> > since
> > > MetadataGatherer task is not distributed across cluster nodes. What is
> > the
> > > reason behind this collection? This doesn't seem to match
> schema-on-read
> > > philosophy, but then it's maybe just me or my setup, I am still very
> new
> > to
> > > Drill.
> > >
> > > Also, I appreciate the advantage of metastore-free operations - in
> theory
> > > it makes Drill more reliable and less costly to run. At the same time
> > there
> > > is Drill Metadata store project, meaning that evolution is actually
> > > shifting from metastore-free system. What are the reasons of that
> > > evolution? Or is it going to be an additional optional feature?
> > >
> > > Thanks,
> > > Best Regards,
> > > Alex
> > >
> > > On Tue, Aug 7, 2018 at 10:25 PM, Paul Rogers <[email protected]
> >
> > > wrote:
> > >
> > > > Hi Qiaoyi,
> > > >
> > > > In general, optimal performance occurs when a system knows the schema
> > at
> > > > the start and can fully optimize based on that schema. Think C or C++
> > > > compilers compared with Java or Python.
> > > >
> > > > On the other hand, the JVM HotSpot optimizer has shown that one can
> > > > achieve very good performance via incremental optimization, but at
> the
> > > cost
> > > > of extreme runtime complexity. The benefit is that Java is much more
> > > > flexible, machine independent, and simpler than C or C++ (at least
> for
> > > > non-system applications.)
> > > >
> > > > Python is the other extreme: it is so dynamic that the literature has
> > > > shown that it is very difficult to optimize Python at the compiler or
> > > > runtime level. (Though, there is some interesting research on this
> > topic.
> > > > See [1], [2].)
> > > >
> > > > Drill is somewhere in the middle. Drill does not do code generation
> at
> > > the
> > > > start like Impala or Spark do. Nor is it fully interpreted. Rather,
> > Drill
> > > > is roughly like Java: code generation is done at runtime based on the
> > > > observed data types. (The JVM does machine code generation based on
> > > > observed execution patterns.) The advantage is that Drill is able to
> > > > achieve its pretty-good performance without the cost of a metadata
> > system
> > > > to provide schema at plan time.
> > > >
> > > > So, to get the absolute fastest performance (think Impala), you must
> > pay
> > > > the cost of a metadata system for all queries. Drill gets nearly as
> > good
> > > > performance without the complexity of the Hive metastore -- a pretty
> > good
> > > > tradeoff.
> > > >
> > > > If I may ask, what is your area of interest? Are you looking to use
> > Drill
> > > > for some project? Or, just interested in how Drill works?
> > > >
> > > > Thanks,
> > > > - Paul
> > > >
> > > > [1] Towards Practical Gradual Typing: https://blog.acolyer.
> > > > org/2015/08/03/towards-practical-gradual-typing/
> > > > [2] Is Sound Gradual Typing Dead? https://blog.acolyer.
> > > > org/2016/02/05/is-sound-gradual-typing-dead/
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >    On Sunday, August 5, 2018, 11:36:52 PM PDT, 丁乔毅(智乔) <
> > > > [email protected]> wrote:
> > > >
> > > >  Thanks Paul, good to know the design principals of the Drill query
> > > > execution process model.
> > > > I am very new to Drill, please bear with me.
> > > >
> > > > One more question.
> > > > As you mentioned, the schema-free processing is the key feature to be
> > > > advantage over Spark, is there any performance consideration behind
> > this
> > > > design except the techniques of the dynamic codegen and vectorization
> > > > computation?
> > > >
> > > > Regards,
> > > > Qiaoyi
> > > >
> > > >
> > > > ------------------------------------------------------------------
> > > > 发件人：Paul Rogers <[email protected]>
> > > > 发送时间：2018年8月4日(星期六) 02:27
> > > > 收件人：dev <[email protected]>
> > > > 主 题：Re: Is Drill query execution processing model just the same idea
> > with
> > > > the Spark whole-stage codegen improvement
> > > >
> > > > Hi Qiaoyi,
> > > > As you noted, Drill and Spark have similar models -- but with
> important
> > > > differences.
> > > > Drill is schema-on-read (also called "schema less"). In particular,
> > this
> > > > means that Drill does not know the schema of the data until the first
> > row
> > > > (actually "record batch") arrives at each operator. Once Drill sees
> > that
> > > > first batch, it has a data schema, and can generate the corresponding
> > > code;
> > > > but only for that one operator.
> > > > The above process repeats up the fragment ("fragment" is Drill's term
> > for
> > > > a Spark stage.)
> > > > I believe that Spark requires (or at least allows) the user to
> define a
> > > > schema up front. This is particularly true for the more modern data
> > frame
> > > > APIs.
> > > > Do you think the Spark improvement would apply to Drill's case of
> > > > determining the schema operator-by-opeartor up the DAG?
> > > > Thanks,
> > > > - Paul
> > > >
> > > >
> > > >
> > > >    On Friday, August 3, 2018, 8:57:29 AM PDT, 丁乔毅(智乔) <
> > > > [email protected]> wrote:
> > > >
> > > >
> > > > Hi, all.
> > > >
> > > > I'm very new to Apache Drill.
> > > >
> > > > I'm quite interest in Drill query execution's implementation.
> > > > After a little bit of source code reading, I found it is built on a
> > > > processing model quite like a data-centric pushed-based style, which
> is
> > > > very similar with the idea behind the Spark whole-stage codegen
> > > > improvement(jira ticket https://issues.apache.org/
> > > jira/browse/SPARK-12795)
> > > >
> > > > And I wonder is there any detailed documentation about this? What's
> the
> > > > consideration behind of our design in the Drill project. : )
> > > >
> > > > Regards,
> > > > Qiaoyi
> > > >https://medium.com/netflix-techblog/metacat-making-big-
> > > data-discoverable-and-meaningful-at-netflix-56fb36a53520
> >

Re: 回复：Is Drill query execution processing model just the same idea with the Spark whole-stage codegen improvement

Reply via email to