Actually, I would strongly disagree that a central metadata repository is a good thing for distributed data. HDFS is a great example of how centralized metadata turns into a major reliability and consistency nightmare.
It would be much, much better to keep the metadata distributed, near the data.

On Sat, Aug 11, 2018 at 10:41 AM Vitalii Diravka <[email protected]> wrote:
> Hi all!
>
> I agree with Paul and Parth: Hive Metastore with its RDBMS is the easiest way to manage metadata and statistics better than we do today, and it can be used not only for Parquet, so it would be a good enhancement for Drill. Of course, Drill will have its own Metastore API, so other tools can later be used instead of the Hive Metastore. It is also interesting to note that there is work on using HBase as the backing store for the Hive Metastore [1].
>
> But Paul correctly mentioned that such a system requires explicitly maintaining consistency between metadata and data, and that is not an easy task to automate. Besides that, as was mentioned, a central metadata store is a bottleneck for distributed tools in Hadoop. Therefore the Drill Metastore should be optional, used only in the cases where it improves the performance and reliability of reading large amounts of data.
>
> Longer term, it looks like Drill could use the Iceberg format to store data [2]. It has its own metadata-maintenance mechanism based on separate metadata files, which are updated automatically. I think this format could bring a lot of useful possibilities to Drill, but it is still under development.
>
> [1] https://issues.apache.org/jira/browse/HIVE-9452
> [2] https://github.com/Netflix/iceberg
>
> Kind regards
> Vitalii
>
> On Sat, Aug 11, 2018 at 5:05 AM Paul Rogers <[email protected]> wrote:
> > Can't resist just a final couple of thoughts on this.
> >
> > First, we discussed how Drill infers schema from input files. We've discussed elsewhere how that can lead to ambiguities (reader 1 has no way to know what schema reader 2 might read, so reader 1 has to guess a type if columns are missing; Drill readers often guess nullable INT, leading to type collisions downstream.)
> >
> > We also discussed that Drill uses Parquet metadata to plan queries, which can, at scale, cause long planning times.
> >
> > It is ironic that the two don't work together: all that Parquet planning does not translate into telling the readers the expected column types. If all our files have column X, but only half have the newly-added column Y, Drill does not tell both readers "If you see a file without Y, assume that Y has type T." The reader will still guess nullable INT.
> >
> > So, at the very least, it would be great to pass along type information when it is available, as it is with Parquet. This would require an extension to Drill's SchemaPath class: a "SchemaElement" class that includes the type, when known.
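(Interjecting on the point above: to make the "SchemaElement" idea concrete, here is a rough, purely hypothetical sketch. Nothing below is existing Drill API; the class and enum names are made up for illustration only.)

    import java.util.Optional;

    /**
     * Hypothetical: a column reference that carries an optional type hint,
     * so a reader that finds the column missing can create it with the
     * planner-known type instead of guessing nullable INT.
     */
    public final class ColumnTypeHint {

      /** Toy stand-in for a real type system. */
      public enum TypeHint { INT, BIGINT, FLOAT8, VARCHAR, DATE }

      private final String columnName;
      private final TypeHint type;      // null when the planner does not know the type

      public ColumnTypeHint(String columnName, TypeHint type) {
        this.columnName = columnName;
        this.type = type;
      }

      public String columnName() { return columnName; }

      /** Empty when the type is unknown and the reader must fall back to a guess. */
      public Optional<TypeHint> type() { return Optional.ofNullable(type); }

      /** What a reader might do for a file that lacks the column. */
      public TypeHint typeForMissingColumn() {
        return type().orElse(TypeHint.INT);   // today's behavior: guess INT
      }

      public static void main(String[] args) {
        ColumnTypeHint y = new ColumnTypeHint("Y", TypeHint.VARCHAR);
        System.out.println("Missing column Y would be created as " + y.typeForMissingColumn());
      }
    }

The only real change versus today would be that the planner, having read the Parquet footers anyway, fills in the hint before handing the column list to the readers.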
> >
> > Second, I want to echo the observation that an RDBMS is often not the most scalable solution for big data. I'm sure in the early days it was easiest to choose an RDBMS, before distributed DBs were widely available. But if something new is done today, the design should be based on distributed storage; we should not follow existing tools in choosing an RDBMS.
> >
> > A related issue is that the Hive metastore seems to get a bad rap because users often repeatedly ask to refresh metadata so that they can see the newest files. Apparently, the metadata can become stale unless someone manually requests an update, and users learn to do so with frightening frequency.
> >
> > Drill suffers a bit from the same stale-data problem with its Parquet metadata files; some user has to refresh them. (As mentioned, if two or more users do so concurrently, file corruption can occur, which, one hopes, a distributed DB could help resolve.)
> >
> > A modern metastore should automagically update file info as files arrive. This post explains an HDFS hook that could be used: https://community.hortonworks.com/questions/57661/hdfs-best-way-to-trigger-execution-at-file-arrival.html
> > Haven't tried it myself, but it looks promising. If Drill queues up refreshes based on actual file changes, then Drill controls the refreshes and could sidestep the over-frequent refresh issue (and even the concurrency bug.)
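(Side note on the file-arrival hook: HDFS also exposes arrival events programmatically through its inotify API, which could feed exactly the kind of refresh queue Paul describes. A rough, untested sketch is below; the Hadoop classes are real, but the NameNode URI and the "refresh" action are placeholders, and the inotify stream normally requires HDFS superuser privileges.)

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hdfs.DFSInotifyEventInputStream;
    import org.apache.hadoop.hdfs.client.HdfsAdmin;
    import org.apache.hadoop.hdfs.inotify.Event;
    import org.apache.hadoop.hdfs.inotify.EventBatch;

    /** Illustrative only: watch HDFS for new files and queue a metadata refresh. */
    public class FileArrivalWatcher {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        HdfsAdmin admin = new HdfsAdmin(URI.create("hdfs://namenode:8020"), conf);
        DFSInotifyEventInputStream events = admin.getInotifyEventStream();

        while (true) {
          EventBatch batch = events.take();            // blocks until events arrive
          for (Event event : batch.getEvents()) {
            if (event.getEventType() == Event.EventType.CREATE) {
              String path = ((Event.CreateEvent) event).getPath();
              if (path.endsWith(".parquet")) {
                // A real implementation would enqueue a refresh of the parent
                // directory's metadata; here we just print the path.
                System.out.println("New Parquet file, refresh needed: " + path);
              }
            }
          }
        }
      }
    }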
> >
> > In short, if Drill provides a metastore, I hope Drill can learn from the experience (good and bad) of prior tools to come up with a solution that is even more scalable than what is on offer today.
> >
> > Thanks,
> > - Paul
> >
> > On Friday, August 10, 2018, 4:55:37 PM PDT, Oleksandr Kalinin <[email protected]> wrote:
> >
> > Hi Paul, Parth,
> >
> > Thanks a lot for your insightful, enlightening comments.
> >
> > It seems that it is possible to avoid or minimise the described problem by careful selection of the input file size and the number of files per query. However, the "real life" observation is that these factors are often not easy to control. In such cases, distributed metadata collection would likely improve response time greatly, because reading metadata off the DFS should scale out very well. Prototyping this bit as a proof of value sounds very interesting. As I gain more experience with Drill, I will try to keep an eye on this topic and come up with some concrete, meaningful figures to better understand the potential gain.
> >
> > I can only confirm a rather negative experience with an RDBMS as the underlying store for Hadoop components. Basically, one of the Hadoop ecosystem's key value propositions is that it solves typical scalability and performance issues that exist in RDBMSs. Then along comes the Hive metastore (or the Oozie database) and introduces exactly those problems back into the equation :)
> >
> > Best Regards,
> > Alex
> >
> > On Fri, Aug 10, 2018 at 2:32 AM, Parth Chandra <[email protected]> wrote:
> > > In addition to directory and row group pruning, the physical plan generation looks at data locality for every row group and schedules the scan for the row group on the node where the data is local. (Remote reads can kill performance like nothing else can.)
> > >
> > > Short answer: query planning requires metadata, with schema, statistics, and locality being the most important pieces of metadata we need. Drill will try to infer schema where it is not available, but if the user can provide a schema, Drill will do a better job. The same with statistics: Drill can plan a query without stats, but if the stats are available, the execution plan will be much more efficient.
> > >
> > > So the job boils down to gathering metadata efficiently. As the metadata size increases, the task becomes similar to that of executing a query on a large data set. The metadata cache is a starting point for collecting metadata. Essentially, it means we have pre-fetched the metadata. However, with a large amount of metadata, say several GB, this approach becomes the bottleneck, and the solution is really to distribute the metadata collection as well. That's where the metastore comes in.
> > >
> > > The next level of complexity will then shift to the planning process. A single thread processing a few GB of metadata could itself become the bottleneck. So the next step in Drill's evolution will be to partition and distribute the planning process itself. That will be fun :)
> > >
> > > On Thu, Aug 9, 2018 at 10:51 AM, Paul Rogers <[email protected]> wrote:
> > > > Hi Alex,
> > > >
> > > > Perhaps Parth can jump in here as he has deeper knowledge of Parquet.
> > > >
> > > > My understanding is that the planning-time metadata is used for partition (directory) and row group pruning. By scanning the footer of each Parquet row group, Drill can determine whether that group contains data that the query must process. Only groups that could contain target data are included in the list of groups to be scanned by the query's scan operators.
> > > >
> > > > For example, suppose we do a query over a date range and a US state:
> > > >
> > > > SELECT saleDate, prodId, quantity FROM `sales`
> > > > WHERE dir0 BETWEEN '2018-07-01' AND '2018-07-31' AND state = 'CA'
> > > >
> > > > Suppose sales is a directory that contains subdirectories partitioned by date, and that state is a dictionary-encoded field with metadata in the Parquet footer.
> > > >
> > > > In this case, Drill reads the directory structure to do the partition pruning at planning time. Only the matching directories are added to the list of files to be scanned at run time.
> > > >
> > > > Then, Drill scans the footers of each Parquet row group to find those with `state` fields that contain the value 'CA'. (I believe the footer scanning is done only for matching partitions, but I've not looked at that code recently.)
> > > >
> > > > Finally, Drill passes the matching set of files to the distribution planner, which assigns files to minor fragments distributed across the Drill cluster. The result attempts to maximize locality and minimize skew.
> > > >
> > > > Could this have been done differently? There is a trade-off. If the work is done in the planner, then it can be very expensive for large data sets. But, on the other hand, the resulting distribution plan has minimal skew: each file that is scheduled for scanning does, in fact, need to be scanned.
> > > >
> > > > Suppose instead that Drill pushed the partition and row group filtering down to the scan operator. In that case, operators that received files outside of the target date range would do nothing, while those with files in the range would need to do further work, resulting in possible skew. (One could argue that, if files are randomly distributed across nodes, then, on average, each scan operator will see roughly the same number of included and excluded files, reducing skew. This would be an interesting experiment to try.)
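(To spell out the footer-based pruning Paul describes: it amounts to comparing each row group's min/max column statistics against the filter value and keeping only the groups that could possibly match. A stripped-down, self-contained illustration follows; the RowGroupStats record is just a stand-in for what a Parquet footer actually stores, and this is not Drill's code.)

    import java.util.List;
    import java.util.stream.Collectors;

    /** Toy min/max pruning: keep only row groups that could satisfy state = 'CA'. */
    public class RowGroupPruning {

      /** Stand-in for the per-column statistics kept in a Parquet footer. */
      record RowGroupStats(String file, int rowGroup, String minState, String maxState) {}

      static boolean mightContain(RowGroupStats stats, String value) {
        // The value can only be present if min <= value <= max.
        return stats.minState().compareTo(value) <= 0
            && stats.maxState().compareTo(value) >= 0;
      }

      public static void main(String[] args) {
        List<RowGroupStats> footers = List.of(
            new RowGroupStats("sales/2018-07-01/part-0.parquet", 0, "AK", "FL"),
            new RowGroupStats("sales/2018-07-01/part-0.parquet", 1, "GA", "NY"),
            new RowGroupStats("sales/2018-07-02/part-0.parquet", 0, "CA", "CA"));

        List<RowGroupStats> toScan = footers.stream()
            .filter(f -> mightContain(f, "CA"))
            .collect(Collectors.toList());

        // Only the surviving groups would be handed to the distribution planner.
        toScan.forEach(f ->
            System.out.println("scan " + f.file() + " row group " + f.rowGroup()));
      }
    }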
> > > >
> > > > Another approach might be to do a two-step execution. First, distribute the work of doing the partition pruning and row group footer checking, and gather the results. Then, finish planning the query and distribute the execution as is done today. Here, much new code would be needed to implement this new form of task, which is probably why it has not yet been done.
> > > >
> > > > It might be interesting for someone to prototype that to see if we get an improvement in planning performance. My guess is that a prototype could be built that does the "pre-selection" work as a Drill query, just to see if distributing the work helps. If that is a win, then perhaps a more robust solution could be developed from there.
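(One crude way to play with that "pre-selection as a Drill query" idea from JDBC is sketched below. It leans on Drill's implicit filepath/filename columns and the dir0 partition column to let a distributed query report which files survive the filter. Note that this reads the data rather than just the footers, so it only approximates the real pre-selection step; the connection URL and the /data/sales path are examples, not anything from this thread.)

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    /** Rough experiment: let a distributed Drill query do the file-selection work. */
    public class PreSelectionExperiment {
      public static void main(String[] args) throws Exception {
        String url = "jdbc:drill:zk=local";            // adjust for your cluster
        String sql =
            "SELECT DISTINCT filepath, filename "
          + "FROM dfs.`/data/sales` "
          + "WHERE dir0 BETWEEN '2018-07-01' AND '2018-07-31' AND state = 'CA'";

        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
          while (rs.next()) {
            System.out.println(rs.getString("filepath") + "/" + rs.getString("filename"));
          }
        }
      }
    }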
> > > >
> > > > The final factor is Drill's crude-but-effective metadata caching mechanism. Drill reads footer and directory information for all Parquet files in a directory structure and stores it in a series of possibly-large JSON files, one in each directory. This mechanism has no good concurrency control. (This is a long-standing, hard-to-fix bug.) Centralizing access in the Foreman reduces the potential for concurrent writes on metadata refresh (but does not eliminate them, since the Foremen are themselves distributed.) Further, since the metadata is in one big file (in the root directory), and is in JSON (which does not support block-splitting), it is actually not possible to distribute the planning-time selection work. I understand that the team is looking at alternative metadata implementations to avoid some of these existing issues. (There was some discussion on the dev list over the last few months.)
> > > >
> > > > In theory it would be possible to adopt a different metadata structure that would better support distributed selection planning. Perhaps farm out work at each directory level: in each directory, farm out work for the subdirectories (if they match the partition selection) and the files, and apply this recursively down the directory tree. But, again, Drill does not have a general-purpose task distribution (map/reduce) mechanism; Drill only knows how to distribute queries. Also, store the metadata in a format that is versioned (to avoid concurrent read/write conflicts) and block-oriented (to allow splitting the work of reading a large metadata file.)
> > > >
> > > > It is interesting to note that several systems (including Netflix's MetaCat [1]) also hit bottlenecks in metadata access when metadata is stored in a relational DB, as Hive does. Hive-oriented tools have the same kind of metadata bottleneck, but for different reasons: because of the huge load placed on the Hive metastore. So, there is an opportunity for innovation in this area by observing, as you did, that metadata itself is a big data problem and requires a distributed, concurrent solution.
> > > >
> > > > That's my two cents. Would be great if folks with more detailed Parquet experience could fill in the details (or correct any errors.)
> > > >
> > > > Thanks,
> > > > - Paul
> > > >
> > > > [1] https://medium.com/netflix-techblog/metacat-making-big-data-discoverable-and-meaningful-at-netflix-56fb36a53520
> > > >
> > > > On Thursday, August 9, 2018, 9:09:59 AM PDT, Oleksandr Kalinin <[email protected]> wrote:
> > > >
> > > > Hi Paul and Drill developers,
> > > >
> > > > I am sorry for the slight off-topic, maybe, but I noticed that Drill's foreman collects metadata for all queried files in the PLANNING state (ref. class e.g. MetadataGatherer), at least in the case of Parquet when using the dfs plugin. That costs a lot of time when the number of queried files is substantial, since the MetadataGatherer task is not distributed across cluster nodes. What is the reason behind this collection? It doesn't seem to match the schema-on-read philosophy, but then maybe it's just me or my setup; I am still very new to Drill.
> > > >
> > > > Also, I appreciate the advantage of metastore-free operation - in theory it makes Drill more reliable and less costly to run. At the same time there is the Drill Metadata store project, meaning that the evolution is actually shifting away from a metastore-free system. What are the reasons for that evolution? Or is it going to be an additional, optional feature?
> > > >
> > > > Thanks,
> > > > Best Regards,
> > > > Alex
> > > >
> > > > On Tue, Aug 7, 2018 at 10:25 PM, Paul Rogers <[email protected]> wrote:
> > > > > Hi Qiaoyi,
> > > > >
> > > > > In general, optimal performance occurs when a system knows the schema at the start and can fully optimize based on that schema. Think of C or C++ compilers compared with Java or Python.
> > > > >
> > > > > On the other hand, the JVM HotSpot optimizer has shown that one can achieve very good performance via incremental optimization, but at the cost of extreme runtime complexity. The benefit is that Java is much more flexible, machine independent, and simpler than C or C++ (at least for non-system applications.)
> > > > >
> > > > > Python is the other extreme: it is so dynamic that the literature has shown it is very difficult to optimize Python at the compiler or runtime level. (Though there is some interesting research on this topic; see [1], [2].)
> > > > >
> > > > > Drill is somewhere in the middle. Drill does not do code generation at the start as Impala or Spark do. Nor is it fully interpreted. Rather, Drill is roughly like Java: code generation is done at runtime based on the observed data types. (The JVM does machine code generation based on observed execution patterns.) The advantage is that Drill is able to achieve its pretty-good performance without the cost of a metadata system to provide the schema at plan time.
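(A toy illustration of that "generate the code once you have observed the types" idea, using Janino -- one of the compilers Drill can use for its runtime code generation. This is not what Drill actually generates, just the general pattern, and the Janino calls are written from memory, so treat it as an untested sketch.)

    import org.codehaus.janino.SimpleCompiler;

    /** Generate and compile a class specialized for a type observed at runtime. */
    public class RuntimeCodegenDemo {

      /** The contract the generated class must satisfy. */
      public interface Doubler {
        Object doubleIt(Object value);
      }

      public static void main(String[] args) throws Exception {
        // Pretend the first record batch told us the column is an Integer.
        String observedType = "Integer";

        String source =
            "public class GeneratedDoubler implements "
          + RuntimeCodegenDemo.class.getName() + ".Doubler {\n"
          + "  public Object doubleIt(Object value) {\n"
          + "    " + observedType + " v = (" + observedType + ") value;\n"
          + "    return " + observedType + ".valueOf(v.intValue() + v.intValue());\n"
          + "  }\n"
          + "}\n";

        SimpleCompiler compiler = new SimpleCompiler();
        compiler.cook(source);                         // compile the generated source
        Doubler doubler = (Doubler) compiler.getClassLoader()
            .loadClass("GeneratedDoubler")
            .getDeclaredConstructor()
            .newInstance();

        System.out.println(doubler.doubleIt(21));      // prints 42
      }
    }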
> > > > >
> > > > > So, to get the absolute fastest performance (think Impala), you must pay the cost of a metadata system for all queries. Drill gets nearly as good performance without the complexity of the Hive metastore -- a pretty good tradeoff.
> > > > >
> > > > > If I may ask, what is your area of interest? Are you looking to use Drill for some project? Or are you just interested in how Drill works?
> > > > >
> > > > > Thanks,
> > > > > - Paul
> > > > >
> > > > > [1] Towards Practical Gradual Typing: https://blog.acolyer.org/2015/08/03/towards-practical-gradual-typing/
> > > > > [2] Is Sound Gradual Typing Dead? https://blog.acolyer.org/2016/02/05/is-sound-gradual-typing-dead/
> > > > >
> > > > > On Sunday, August 5, 2018, 11:36:52 PM PDT, 丁乔毅(智乔) <[email protected]> wrote:
> > > > >
> > > > > Thanks Paul, good to know the design principles behind the Drill query execution processing model. I am very new to Drill, please bear with me.
> > > > >
> > > > > One more question. As you mentioned, schema-free processing is the key advantage over Spark; is there any performance consideration behind this design beyond the techniques of dynamic codegen and vectorized computation?
> > > > >
> > > > > Regards,
> > > > > Qiaoyi
> > > > >
> > > > > ------------------------------------------------------------------
> > > > > From: Paul Rogers <[email protected]>
> > > > > Sent: Saturday, August 4, 2018, 02:27
> > > > > To: dev <[email protected]>
> > > > > Subject: Re: Is Drill query execution processing model just the same idea with the Spark whole-stage codegen improvement
> > > > >
> > > > > Hi Qiaoyi,
> > > > >
> > > > > As you noted, Drill and Spark have similar models -- but with important differences.
> > > > >
> > > > > Drill is schema-on-read (also called "schema-less"). In particular, this means that Drill does not know the schema of the data until the first row (actually "record batch") arrives at each operator. Once Drill sees that first batch, it has a data schema and can generate the corresponding code, but only for that one operator. This process repeats up the fragment ("fragment" is Drill's term for a Spark stage.)
> > > > >
> > > > > I believe that Spark requires (or at least allows) the user to define a schema up front. This is particularly true for the more modern data frame APIs.
> > > > >
> > > > > Do you think the Spark improvement would apply to Drill's case of determining the schema operator-by-operator up the DAG?
> > > > >
> > > > > Thanks,
> > > > > - Paul
> > > > >
> > > > > On Friday, August 3, 2018, 8:57:29 AM PDT, 丁乔毅(智乔) <[email protected]> wrote:
> > > > >
> > > > > Hi, all.
> > > > >
> > > > > I'm very new to Apache Drill.
> > > > >
> > > > > I'm quite interested in the implementation of Drill's query execution. After a little bit of source-code reading, I found it is built on a processing model much like the data-centric, push-based style, which is very similar to the idea behind the Spark whole-stage codegen improvement (JIRA ticket https://issues.apache.org/jira/browse/SPARK-12795).
> > > > >
> > > > > I wonder, is there any detailed documentation about this? What were the considerations behind this design in the Drill project? :)
> > > > >
> > > > > Regards,
> > > > > Qiaoyi
