Actually, I would strongly disagree that a central metadata repository is a good thing for distributed data. HDFS is a great example of how centralized metadata turns into a major reliability and consistency nightmare.
It would be much, much better to keep the metadata distributed, near the data.

On Sat, Aug 11, 2018 at 10:41 AM Vitalii Diravka <[email protected]> wrote:
> Hi all!
>
> I agree with Paul and Parth: Hive Metastore with its RDBMS is the easiest way to manage metadata and statistics better than we do today, and it can be used not only for Parquet, so it would be a good enhancement for Drill. Of course, Drill will have its own Metastore API, so other tools can later be used instead of the Hive Metastore. It is also interesting to note that there is work on using HBase as the backing store for the Hive Metastore [1].
>
> But Paul correctly mentioned that such a system requires explicitly maintaining consistency between metadata and data, and that is not an easy task to automate. Besides that, as was mentioned, a central metadata store is a bottleneck for distributed tools in Hadoop. Therefore the Drill Metastore should be optional, used only in the cases where it improves the performance and reliability of reading large amounts of data.
>
> Longer term, it looks like Drill could use the Iceberg format to store data [2]. It has its own metadata-maintenance mechanism based on separate metadata files, which are updated automatically. I think this format could bring a lot of useful possibilities to Drill, but it is still under development.
>
> [1] https://issues.apache.org/jira/browse/HIVE-9452
> [2] https://github.com/Netflix/iceberg
>
> Kind regards
> Vitalii
>
> On Sat, Aug 11, 2018 at 5:05 AM Paul Rogers <[email protected]> wrote:
> > Can't resist just a final couple of thoughts on this.
> >
> > First, we discussed how Drill infers schema from input files. We've discussed elsewhere how that can lead to ambiguities (reader 1 has no way to know what schema reader 2 might read, so reader 1 has to guess a type if columns are missing; Drill readers often guess nullable INT, leading to type collisions downstream.)
> >
> > We also discussed that Drill uses Parquet metadata to plan queries, which can, at scale, cause long planning times.
> >
> > It is ironic that the two don't work together: all that Parquet planning does not translate into telling the readers the expected column types. If all our files have column X, but only half have the newly-added column Y, Drill does not tell both readers "If you see a file without Y, assume that Y has type T." The reader will still guess nullable INT.
> >
> > So, at the very least, it would be great to pass along type information when it is available, as it is with Parquet. This would require an extension to Drill's SchemaPath class: a "SchemaElement" class that includes the type, when known.
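(Interjecting on the point above: to make the "SchemaElement" idea concrete, here is a rough, purely hypothetical sketch. Nothing below is existing Drill API; the class and enum names are made up for illustration only.)

    import java.util.Optional;

    /**
     * Hypothetical: a column reference that carries an optional type hint,
     * so a reader that finds the column missing can create it with the
     * planner-known type instead of guessing nullable INT.
     */
    public final class ColumnTypeHint {

      /** Toy stand-in for a real type system. */
      public enum TypeHint { INT, BIGINT, FLOAT8, VARCHAR, DATE }

      private final String columnName;
      private final TypeHint type;      // null when the planner does not know the type

      public ColumnTypeHint(String columnName, TypeHint type) {
        this.columnName = columnName;
        this.type = type;
      }

      public String columnName() { return columnName; }

      /** Empty when the type is unknown and the reader must fall back to a guess. */
      public Optional<TypeHint> type() { return Optional.ofNullable(type); }

      /** What a reader might do for a file that lacks the column. */
      public TypeHint typeForMissingColumn() {
        return type().orElse(TypeHint.INT);   // today's behavior: guess INT
      }

      public static void main(String[] args) {
        ColumnTypeHint y = new ColumnTypeHint("Y", TypeHint.VARCHAR);
        System.out.println("Missing column Y would be created as " + y.typeForMissingColumn());
      }
    }

The only real change versus today would be that the planner, having read the Parquet footers anyway, fills in the hint before handing the column list to the readers.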
> >
> > Second, I want to echo the observation that an RDBMS is often not the most scalable solution for big data. I'm sure in the early days it was easiest to choose an RDBMS, before distributed DBs were widely available. But if something new is done today, the design should be based on distributed storage; we should not follow existing tools in choosing an RDBMS.
> >
> > A related issue is that the Hive metastore seems to get a bad rap because users often repeatedly ask to refresh metadata so that they can see the newest files. Apparently, the metadata can become stale unless someone manually requests an update, and users learn to do so with frightening frequency.
> >
> > Drill suffers a bit from the same stale-data problem with its Parquet metadata files; some user has to refresh them. (As mentioned, if two or more users do so concurrently, file corruption can occur, which, one hopes, a distributed DB could help resolve.)
> >
> > A modern metastore should automagically update file info as files arrive. This post explains an HDFS hook that could be used: https://community.hortonworks.com/questions/57661/hdfs-best-way-to-trigger-execution-at-file-arrival.html
> > Haven't tried it myself, but it looks promising. If Drill queues up refreshes based on actual file changes, then Drill controls the refreshes and could sidestep the over-frequent refresh issue (and even the concurrency bug.)
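(Side note on the file-arrival hook: HDFS also exposes arrival events programmatically through its inotify API, which could feed exactly the kind of refresh queue Paul describes. A rough, untested sketch is below; the Hadoop classes are real, but the NameNode URI and the "refresh" action are placeholders, and the inotify stream normally requires HDFS superuser privileges.)

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hdfs.DFSInotifyEventInputStream;
    import org.apache.hadoop.hdfs.client.HdfsAdmin;
    import org.apache.hadoop.hdfs.inotify.Event;
    import org.apache.hadoop.hdfs.inotify.EventBatch;

    /** Illustrative only: watch HDFS for new files and queue a metadata refresh. */
    public class FileArrivalWatcher {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        HdfsAdmin admin = new HdfsAdmin(URI.create("hdfs://namenode:8020"), conf);
        DFSInotifyEventInputStream events = admin.getInotifyEventStream();

        while (true) {
          EventBatch batch = events.take();            // blocks until events arrive
          for (Event event : batch.getEvents()) {
            if (event.getEventType() == Event.EventType.CREATE) {
              String path = ((Event.CreateEvent) event).getPath();
              if (path.endsWith(".parquet")) {
                // A real implementation would enqueue a refresh of the parent
                // directory's metadata; here we just print the path.
                System.out.println("New Parquet file, refresh needed: " + path);
              }
            }
          }
        }
      }
    }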
> >
> > In short, if Drill provides a metastore, I hope Drill can learn from the experience (good and bad) of prior tools to come up with a solution that is even more scalable than what is on offer today.
> >
> > Thanks,
> > - Paul
> >
> > On Friday, August 10, 2018, 4:55:37 PM PDT, Oleksandr Kalinin <[email protected]> wrote:
> >
> > Hi Paul, Parth,
> >
> > Thanks a lot for your insightful, enlightening comments.
> >
> > It seems that it is possible to avoid or minimise the described problem by careful selection of the input file size and the number of files per query. However, the "real life" observation is that these factors are often not easy to control. In such cases, distributed metadata collection would likely improve response time greatly, because reading metadata off the DFS should scale out very well. Prototyping this bit as a proof of value sounds very interesting. As I gain more experience with Drill, I will try to keep an eye on this topic and come up with some concrete, meaningful figures to better understand the potential gain.
> >
> > I can only confirm a rather negative experience with an RDBMS as the underlying store for Hadoop components. Basically, one of the Hadoop ecosystem's key value propositions is that it solves typical scalability and performance issues that exist in RDBMSs. Then along comes the Hive metastore (or the Oozie database) and introduces exactly those problems back into the equation :)
> >
> > Best Regards,
> > Alex
> >
> > On Fri, Aug 10, 2018 at 2:32 AM, Parth Chandra <[email protected]> wrote:
> > > In addition to directory and row group pruning, the physical plan generation looks at data locality for every row group and schedules the scan for the row group on the node where the data is local. (Remote reads can kill performance like nothing else can.)
> > >
> > > Short answer: query planning requires metadata, with schema, statistics, and locality being the most important pieces of metadata we need. Drill will try to infer schema where it is not available, but if the user can provide a schema, Drill will do a better job. The same with statistics: Drill can plan a query without stats, but if the stats are available, the execution plan will be much more efficient.
> > >
> > > So the job boils down to gathering metadata efficiently. As the metadata size increases, the task becomes similar to that of executing a query on a large data set. The metadata cache is a starting point for collecting metadata. Essentially, it means we have pre-fetched the metadata. However, with a large amount of metadata, say several GB, this approach becomes the bottleneck, and the solution is really to distribute the metadata collection as well. That's where the metastore comes in.
> > >
> > > The next level of complexity will then shift to the planning process. A single thread processing a few GB of metadata could itself become the bottleneck. So the next step in Drill's evolution will be to partition and distribute the planning process itself. That will be fun :)
> > >
> > > On Thu, Aug 9, 2018 at 10:51 AM, Paul Rogers <[email protected]> wrote:
> > > > Hi Alex,
> > > >
> > > > Perhaps Parth can jump in here as he has deeper knowledge of Parquet.
> > > >
> > > > My understanding is that the planning-time metadata is used for partition (directory) and row group pruning. By scanning the footer of each Parquet row group, Drill can determine whether that group contains data that the query must process. Only groups that could contain target data are included in the list of groups to be scanned by the query's scan operators.
> > > >
> > > > For example, suppose we do a query over a date range and a US state:
> > > >
> > > > SELECT saleDate, prodId, quantity FROM `sales`
> > > > WHERE dir0 BETWEEN '2018-07-01' AND '2018-07-31' AND state = 'CA'
> > > >
> > > > Suppose sales is a directory that contains subdirectories partitioned by date, and that state is a dictionary-encoded field with metadata in the Parquet footer.
> > > >
> > > > In this case, Drill reads the directory structure to do the partition pruning at planning time. Only the matching directories are added to the list of files to be scanned at run time.
> > > >
> > > > Then, Drill scans the footers of each Parquet row group to find those with `state` fields that contain the value 'CA'. (I believe the footer scanning is done only for matching partitions, but I've not looked at that code recently.)
> > > >
> > > > Finally, Drill passes the matching set of files to the distribution planner, which assigns files to minor fragments distributed across the Drill cluster. The result attempts to maximize locality and minimize skew.
> > > >
> > > > Could this have been done differently? There is a trade-off. If the work is done in the planner, then it can be very expensive for large data sets. But, on the other hand, the resulting distribution plan has minimal skew: each file that is scheduled for scanning does, in fact, need to be scanned.
> > > >
> > > > Suppose instead that Drill pushed the partition and row group filtering down to the scan operator. In that case, operators that received files outside of the target date range would do nothing, while those with files in the range would need to do further work, resulting in possible skew. (One could argue that, if files are randomly distributed across nodes, then, on average, each scan operator will see roughly the same number of included and excluded files, reducing skew. This would be an interesting experiment to try.)
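(To spell out the footer-based pruning Paul describes: it amounts to comparing each row group's min/max column statistics against the filter value and keeping only the groups that could possibly match. A stripped-down, self-contained illustration follows; the RowGroupStats record is just a stand-in for what a Parquet footer actually stores, and this is not Drill's code.)

    import java.util.List;
    import java.util.stream.Collectors;

    /** Toy min/max pruning: keep only row groups that could satisfy state = 'CA'. */
    public class RowGroupPruning {

      /** Stand-in for the per-column statistics kept in a Parquet footer. */
      record RowGroupStats(String file, int rowGroup, String minState, String maxState) {}

      static boolean mightContain(RowGroupStats stats, String value) {
        // The value can only be present if min <= value <= max.
        return stats.minState().compareTo(value) <= 0
            && stats.maxState().compareTo(value) >= 0;
      }

      public static void main(String[] args) {
        List<RowGroupStats> footers = List.of(
            new RowGroupStats("sales/2018-07-01/part-0.parquet", 0, "AK", "FL"),
            new RowGroupStats("sales/2018-07-01/part-0.parquet", 1, "GA", "NY"),
            new RowGroupStats("sales/2018-07-02/part-0.parquet", 0, "CA", "CA"));

        List<RowGroupStats> toScan = footers.stream()
            .filter(f -> mightContain(f, "CA"))
            .collect(Collectors.toList());

        // Only the surviving groups would be handed to the distribution planner.
        toScan.forEach(f ->
            System.out.println("scan " + f.file() + " row group " + f.rowGroup()));
      }
    }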
> > > >
> > > > Another approach might be to do a two-step execution. First, distribute the work of doing the partition pruning and row group footer checking, and gather the results. Then, finish planning the query and distribute the execution as is done today. Here, much new code would be needed to implement this new form of task, which is probably why it has not yet been done.
> > > >
> > > > It might be interesting for someone to prototype that to see if we get an improvement in planning performance. My guess is that a prototype could be built that does the "pre-selection" work as a Drill query, just to see if distributing the work helps. If that is a win, then perhaps a more robust solution could be developed from there.
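(One crude way to play with that "pre-selection as a Drill query" idea from JDBC is sketched below. It leans on Drill's implicit filepath/filename columns and the dir0 partition column to let a distributed query report which files survive the filter. Note that this reads the data rather than just the footers, so it only approximates the real pre-selection step; the connection URL and the /data/sales path are examples, not anything from this thread.)

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    /** Rough experiment: let a distributed Drill query do the file-selection work. */
    public class PreSelectionExperiment {
      public static void main(String[] args) throws Exception {
        String url = "jdbc:drill:zk=local";            // adjust for your cluster
        String sql =
            "SELECT DISTINCT filepath, filename "
          + "FROM dfs.`/data/sales` "
          + "WHERE dir0 BETWEEN '2018-07-01' AND '2018-07-31' AND state = 'CA'";

        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
          while (rs.next()) {
            System.out.println(rs.getString("filepath") + "/" + rs.getString("filename"));
          }
        }
      }
    }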
> > > >
> > > > The final factor is Drill's crude-but-effective metadata caching mechanism. Drill reads footer and directory information for all Parquet files in a directory structure and stores it in a series of possibly-large JSON files, one in each directory. This mechanism has no good concurrency control. (This is a long-standing, hard-to-fix bug.) Centralizing access in the Foreman reduces the potential for concurrent writes on metadata refresh (but does not eliminate them, since the Foremen are themselves distributed.) Further, since the metadata is in one big file (in the root directory), and is in JSON (which does not support block-splitting), it is actually not possible to distribute the planning-time selection work. I understand that the team is looking at alternative metadata implementations to avoid some of these existing issues. (There was some discussion on the dev list over the last few months.)
> > > >
> > > > In theory it would be possible to adopt a different metadata structure that would better support distributed selection planning. Perhaps farm out work at each directory level: in each directory, farm out work for the subdirectories (if they match the partition selection) and the files, and apply this recursively down the directory tree. But, again, Drill does not have a general-purpose task distribution (map/reduce) mechanism; Drill only knows how to distribute queries. Also, store the metadata in a format that is versioned (to avoid concurrent read/write conflicts) and block-oriented (to allow splitting the work of reading a large metadata file.)
> > > >
> > > > It is interesting to note that several systems (including Netflix's MetaCat [1]) also hit bottlenecks in metadata access when metadata is stored in a relational DB, as Hive does. Hive-oriented tools have the same kind of metadata bottleneck, but for different reasons: because of the huge load placed on the Hive metastore. So, there is an opportunity for innovation in this area by observing, as you did, that metadata itself is a big data problem and requires a distributed, concurrent solution.
> > > >
> > > > That's my two cents. Would be great if folks with more detailed Parquet experience could fill in the details (or correct any errors.)
> > > >
> > > > Thanks,
> > > > - Paul
> > > >
> > > > [1] https://medium.com/netflix-techblog/metacat-making-big-data-discoverable-and-meaningful-at-netflix-56fb36a53520
> > > >
> > > > On Thursday, August 9, 2018, 9:09:59 AM PDT, Oleksandr Kalinin <[email protected]> wrote:
> > > >
> > > > Hi Paul and Drill developers,
> > > >
> > > > I am sorry for the slight off-topic, maybe, but I noticed that Drill's foreman collects metadata for all queried files in the PLANNING state (ref. class e.g. MetadataGatherer), at least in the case of Parquet when using the dfs plugin. That costs a lot of time when the number of queried files is substantial, since the MetadataGatherer task is not distributed across cluster nodes. What is the reason behind this collection? It doesn't seem to match the schema-on-read philosophy, but then maybe it's just me or my setup; I am still very new to Drill.
> > > >
> > > > Also, I appreciate the advantage of metastore-free operation - in theory it makes Drill more reliable and less costly to run. At the same time there is the Drill Metadata store project, meaning that the evolution is actually shifting away from a metastore-free system. What are the reasons for that evolution? Or is it going to be an additional, optional feature?
> > > >
> > > > Thanks,
> > > > Best Regards,
> > > > Alex
> > > >
> > > > On Tue, Aug 7, 2018 at 10:25 PM, Paul Rogers <[email protected]> wrote:
> > > > > Hi Qiaoyi,
> > > > >
> > > > > In general, optimal performance occurs when a system knows the schema at the start and can fully optimize based on that schema. Think of C or C++ compilers compared with Java or Python.
> > > > >
> > > > > On the other hand, the JVM HotSpot optimizer has shown that one can achieve very good performance via incremental optimization, but at the cost of extreme runtime complexity. The benefit is that Java is much more flexible, machine independent, and simpler than C or C++ (at least for non-system applications.)
> > > > >
> > > > > Python is the other extreme: it is so dynamic that the literature has shown it is very difficult to optimize Python at the compiler or runtime level. (Though there is some interesting research on this topic; see [1], [2].)
> > > > >
> > > > > Drill is somewhere in the middle. Drill does not do code generation at the start as Impala or Spark do. Nor is it fully interpreted. Rather, Drill is roughly like Java: code generation is done at runtime based on the observed data types. (The JVM does machine code generation based on observed execution patterns.) The advantage is that Drill is able to achieve its pretty-good performance without the cost of a metadata system to provide the schema at plan time.
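(A toy illustration of that "generate the code once you have observed the types" idea, using Janino -- one of the compilers Drill can use for its runtime code generation. This is not what Drill actually generates, just the general pattern, and the Janino calls are written from memory, so treat it as an untested sketch.)

    import org.codehaus.janino.SimpleCompiler;

    /** Generate and compile a class specialized for a type observed at runtime. */
    public class RuntimeCodegenDemo {

      /** The contract the generated class must satisfy. */
      public interface Doubler {
        Object doubleIt(Object value);
      }

      public static void main(String[] args) throws Exception {
        // Pretend the first record batch told us the column is an Integer.
        String observedType = "Integer";

        String source =
            "public class GeneratedDoubler implements "
          + RuntimeCodegenDemo.class.getName() + ".Doubler {\n"
          + "  public Object doubleIt(Object value) {\n"
          + "    " + observedType + " v = (" + observedType + ") value;\n"
          + "    return " + observedType + ".valueOf(v.intValue() + v.intValue());\n"
          + "  }\n"
          + "}\n";

        SimpleCompiler compiler = new SimpleCompiler();
        compiler.cook(source);                         // compile the generated source
        Doubler doubler = (Doubler) compiler.getClassLoader()
            .loadClass("GeneratedDoubler")
            .getDeclaredConstructor()
            .newInstance();

        System.out.println(doubler.doubleIt(21));      // prints 42
      }
    }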
> > > > >
> > > > > So, to get the absolute fastest performance (think Impala), you must pay the cost of a metadata system for all queries. Drill gets nearly as good performance without the complexity of the Hive metastore -- a pretty good tradeoff.
> > > > >
> > > > > If I may ask, what is your area of interest? Are you looking to use Drill for some project? Or are you just interested in how Drill works?
> > > > >
> > > > > Thanks,
> > > > > - Paul
> > > > >
> > > > > [1] Towards Practical Gradual Typing: https://blog.acolyer.org/2015/08/03/towards-practical-gradual-typing/
> > > > > [2] Is Sound Gradual Typing Dead? https://blog.acolyer.org/2016/02/05/is-sound-gradual-typing-dead/
> > > > >
> > > > > On Sunday, August 5, 2018, 11:36:52 PM PDT, 丁乔毅(智乔) <[email protected]> wrote:
> > > > >
> > > > > Thanks Paul, good to know the design principles behind the Drill query execution processing model. I am very new to Drill, please bear with me.
> > > > >
> > > > > One more question. As you mentioned, schema-free processing is the key advantage over Spark; is there any performance consideration behind this design beyond the techniques of dynamic codegen and vectorized computation?
> > > > >
> > > > > Regards,
> > > > > Qiaoyi
> > > > >
> > > > > ------------------------------------------------------------------
> > > > > From: Paul Rogers <[email protected]>
> > > > > Sent: Saturday, August 4, 2018, 02:27
> > > > > To: dev <[email protected]>
> > > > > Subject: Re: Is Drill query execution processing model just the same idea with the Spark whole-stage codegen improvement
> > > > >
> > > > > Hi Qiaoyi,
> > > > >
> > > > > As you noted, Drill and Spark have similar models -- but with important differences.
> > > > >
> > > > > Drill is schema-on-read (also called "schema-less"). In particular, this means that Drill does not know the schema of the data until the first row (actually "record batch") arrives at each operator. Once Drill sees that first batch, it has a data schema and can generate the corresponding code, but only for that one operator. This process repeats up the fragment ("fragment" is Drill's term for a Spark stage.)
> > > > >
> > > > > I believe that Spark requires (or at least allows) the user to define a schema up front. This is particularly true for the more modern data frame APIs.
> > > > >
> > > > > Do you think the Spark improvement would apply to Drill's case of determining the schema operator-by-operator up the DAG?
> > > > >
> > > > > Thanks,
> > > > > - Paul
> > > > >
> > > > > On Friday, August 3, 2018, 8:57:29 AM PDT, 丁乔毅(智乔) <[email protected]> wrote:
> > > > >
> > > > > Hi, all.
> > > > >
> > > > > I'm very new to Apache Drill.
> > > > >
> > > > > I'm quite interested in the implementation of Drill's query execution. After a little bit of source-code reading, I found it is built on a processing model much like the data-centric, push-based style, which is very similar to the idea behind the Spark whole-stage codegen improvement (JIRA ticket https://issues.apache.org/jira/browse/SPARK-12795).
> > > > >
> > > > > I wonder, is there any detailed documentation about this? What were the considerations behind this design in the Drill project? :)
> > > > >
> > > > > Regards,
> > > > > Qiaoyi
