Hi Alex,

Perhaps Parth can jump in here, as he has deeper knowledge of Parquet.
My understanding is that the planning-time metadata is used for partition (directory) and row group pruning. By scanning the footer metadata for each Parquet row group, Drill can determine whether that group contains data that the query must process. Only groups that could contain target data are included in the list of groups to be scanned by the query's scan operators. For example, suppose we run a query over a date range and a US state:

  SELECT saleDate, prodId, quantity
  FROM `sales`
  WHERE dir0 BETWEEN '2018-07-01' AND '2018-07-31'
    AND state = 'CA'

Suppose `sales` is a directory that contains subdirectories partitioned by date, and that `state` is a dictionary-encoded field with metadata in the Parquet footer. In this case, Drill reads the directory structure to do the partition pruning at planning time. Only the matching directories are added to the list of files to be scanned at run time. Then, Drill scans the footers of the remaining Parquet files to find the row groups whose `state` column could contain the value 'CA'. (I believe the footer scanning is done only for matching partitions, but I've not looked at that code recently. A toy sketch of this kind of footer check appears below.) Finally, Drill passes the matching set of files to the distribution planner, which assigns files to minor fragments distributed across the Drill cluster. The result attempts to maximize locality and minimize skew.

Could this have been done differently? There is a trade-off. If the work is done in the planner, it can be very expensive for large data sets. On the other hand, the resulting distribution plan has minimal skew: each file that is scheduled for scanning does, in fact, need to be scanned. Suppose instead that Drill pushed the partition and row group filtering down to the scan operators. In that case, operators that received files outside the target date range would do nothing, while those with files in the range would need to do further work, resulting in possible skew. (One could argue that, if files are randomly distributed across nodes, then, on average, each scan operator will see roughly the same number of included and excluded files, reducing skew. This would be an interesting experiment to try.)

Another approach might be a two-step execution. First, distribute the work of doing the partition pruning and row group footer checking, and gather the results. Then finish planning the query and distribute the execution as is done today. Much new code would be needed to implement this new form of task, which is probably why it has not yet been done. It might be interesting for someone to prototype this to see if we get an improvement in planning performance. My guess is that a prototype could be built that does the "pre-selection" work as a Drill query, just to see if distributing the work helps (also sketched below). If that is a win, then perhaps a more robust solution could be developed from there.
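To make the footer check concrete, here is a rough, self-contained sketch of the kind of min/max statistics test involved. The classes and names here (RowGroupStats, ColumnStats, mightContain) are invented purely for illustration; this is not the parquet-mr or Drill API, just the shape of the logic.

import java.util.List;
import java.util.Map;

// Toy stand-ins for the per-row-group column statistics kept in a Parquet
// file footer. These are NOT the real parquet-mr or Drill classes.
class ColumnStats {
  final String min;   // minimum value seen in this column chunk
  final String max;   // maximum value seen in this column chunk
  ColumnStats(String min, String max) { this.min = min; this.max = max; }
}

class RowGroupStats {
  final long rowCount;
  final Map<String, ColumnStats> columns;   // column name -> stats
  RowGroupStats(long rowCount, Map<String, ColumnStats> columns) {
    this.rowCount = rowCount;
    this.columns = columns;
  }
}

public class FooterPruneSketch {

  // Keep a row group only if its [min, max] range for the column could
  // contain the requested value. Missing stats means we cannot prune.
  static boolean mightContain(RowGroupStats group, String column, String value) {
    ColumnStats stats = group.columns.get(column);
    if (stats == null) {
      return true;                    // no stats: must scan it
    }
    return stats.min.compareTo(value) <= 0
        && stats.max.compareTo(value) >= 0;
  }

  public static void main(String[] args) {
    List<RowGroupStats> groups = List.of(
        new RowGroupStats(100_000, Map.of("state", new ColumnStats("AK", "AZ"))),
        new RowGroupStats(100_000, Map.of("state", new ColumnStats("CA", "FL"))));

    for (RowGroupStats group : groups) {
      // Only the second group survives the state = 'CA' filter.
      System.out.println("scan this group? " + mightContain(group, "state", "CA"));
    }
  }
}

In a real footer the statistics are typed rather than plain strings, but the pruning decision is essentially this range test per row group.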
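And here is a back-of-the-envelope sketch of the two-step "pre-selection" idea: fan the footer checks out (here just across local threads, though the point would be to use minor fragments on remote nodes), gather the surviving files, then hand the list to the usual planner. Again, the names (matchingPartitions, checkFooters) are made up for illustration and the footer check is faked; this shows only the shape of the idea, not actual Drill code.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Collectors;

public class PreSelectionSketch {

  // Step 1a: partition pruning -- keep only the directories in the date range.
  static List<String> matchingPartitions(List<String> dirs, String from, String to) {
    return dirs.stream()
        .filter(d -> d.compareTo(from) >= 0 && d.compareTo(to) <= 0)
        .collect(Collectors.toList());
  }

  // Step 1b: stand-in for the footer check on one directory; in a real system
  // this would run on whichever node holds the files and read actual footers.
  static List<String> checkFooters(String dir) {
    return List.of(dir + "/part-0.parquet");   // pretend one file matches
  }

  public static void main(String[] args) throws Exception {
    List<String> partitions = matchingPartitions(
        List.of("2018-06-30", "2018-07-01", "2018-07-15", "2018-08-01"),
        "2018-07-01", "2018-07-31");

    // Fan the per-directory footer checks out to workers, then gather results.
    ExecutorService pool = Executors.newFixedThreadPool(4);
    List<Future<List<String>>> futures = new ArrayList<>();
    for (String dir : partitions) {
      Callable<List<String>> task = () -> checkFooters(dir);
      futures.add(pool.submit(task));
    }
    List<String> filesToScan = new ArrayList<>();
    for (Future<List<String>> f : futures) {
      filesToScan.addAll(f.get());
    }
    pool.shutdown();

    // Step 2: hand the pruned file list to normal planning and distribution.
    System.out.println("Files to scan: " + filesToScan);
  }
}

The interesting measurement would be whether this kind of fan-out pays for itself once directory listings and footer reads start to dominate planning time.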
The final factor is Drill's crude-but-effective metadata caching mechanism. Drill reads footer and directory information for all Parquet files in a directory structure and stores it in a series of possibly large JSON files, one in each directory. This mechanism has no good concurrency control. (This is a long-standing, hard-to-fix bug.) Centralizing access in the Foreman reduces the potential for concurrent writes on metadata refresh (but does not eliminate them, since the Foremen are themselves distributed.) Further, since the metadata is in one big file (in the root directory), and is in JSON (which does not support block splitting), it is actually not possible to distribute the planning-time selection work.

I understand that the team is looking at alternative metadata implementations to avoid some of these existing issues. (There was some discussion on the dev list over the last few months.) In theory, it would be possible to adopt a different metadata structure that would better support distributed selection planning. Perhaps farm out work at each directory level: in each directory, farm out work for the subdirectories (if they match the partition selection) and for the files, and apply this recursively down the directory tree. But, again, Drill does not have a general-purpose task distribution (map/reduce) mechanism; Drill only knows how to distribute queries. The metadata could also be stored in a format that is versioned (to avoid concurrent read/write conflicts) and block-oriented (to allow splitting the work of reading a large metadata file.)

It is interesting to note that several systems (including Netflix's Metacat [1]) also hit bottlenecks in metadata access when metadata is stored in a relational DB, as Hive does. Hive-oriented tools have the same kind of metadata bottleneck, but for a different reason: the huge load placed on the Hive metastore. So, there is an opportunity for innovation in this area by observing, as you did, that metadata itself is a big data problem and requires a distributed, concurrent solution.

That's my two cents. Would be great if folks with more detailed Parquet experience could fill in the details (or correct any errors.)

Thanks,
- Paul

[1] https://medium.com/netflix-techblog/metacat-making-big-data-discoverable-and-meaningful-at-netflix-56fb36a53520

On Thursday, August 9, 2018, 9:09:59 AM PDT, Oleksandr Kalinin <[email protected]> wrote:

Hi Paul and Drill developers,

I am sorry for the slight off-topic maybe, but I noticed that Drill's Foreman collects metadata for all queried files in the PLANNING state (ref. class e.g. MetadataGatherer), at least in the case of Parquet when using the dfs plugin. That costs a lot of time when the number of queried files is substantial, since the MetadataGatherer task is not distributed across cluster nodes. What is the reason behind this collection? This doesn't seem to match the schema-on-read philosophy, but then maybe it's just me or my setup; I am still very new to Drill.

Also, I appreciate the advantage of metastore-free operation - in theory it makes Drill more reliable and less costly to run. At the same time there is the Drill Metadata store project, meaning that the evolution is actually shifting away from a metastore-free system. What are the reasons for that evolution? Or is it going to be an additional, optional feature?

Thanks,
Best Regards,
Alex

On Tue, Aug 7, 2018 at 10:25 PM, Paul Rogers <[email protected]> wrote:
> Hi Qiaoyi,
>
> In general, optimal performance occurs when a system knows the schema at the start and can fully optimize based on that schema. Think C or C++ compilers compared with Java or Python.
>
> On the other hand, the JVM HotSpot optimizer has shown that one can achieve very good performance via incremental optimization, but at the cost of extreme runtime complexity. The benefit is that Java is much more flexible, machine independent, and simpler than C or C++ (at least for non-system applications.)
> Python is the other extreme: it is so dynamic that the literature has shown that it is very difficult to optimize Python at the compiler or runtime level. (Though there is some interesting research on this topic. See [1], [2].)
>
> Drill is somewhere in the middle. Drill does not do code generation at the start like Impala or Spark do. Nor is it fully interpreted. Rather, Drill is roughly like Java: code generation is done at runtime based on the observed data types. (The JVM does machine code generation based on observed execution patterns.) The advantage is that Drill is able to achieve its pretty-good performance without the cost of a metadata system to provide the schema at plan time.
>
> So, to get the absolute fastest performance (think Impala), you must pay the cost of a metadata system for all queries. Drill gets nearly as good performance without the complexity of the Hive metastore -- a pretty good tradeoff.
>
> If I may ask, what is your area of interest? Are you looking to use Drill for some project? Or just interested in how Drill works?
>
> Thanks,
> - Paul
>
> [1] Towards Practical Gradual Typing: https://blog.acolyer.org/2015/08/03/towards-practical-gradual-typing/
> [2] Is Sound Gradual Typing Dead? https://blog.acolyer.org/2016/02/05/is-sound-gradual-typing-dead/
>
> On Sunday, August 5, 2018, 11:36:52 PM PDT, 丁乔毅(智乔) <[email protected]> wrote:
>
> Thanks Paul, good to know the design principles of the Drill query execution process model. I am very new to Drill, please bear with me.
>
> One more question. As you mentioned, schema-free processing is the key feature that gives Drill an advantage over Spark; is there any performance consideration behind this design other than the techniques of dynamic codegen and vectorized computation?
>
> Regards,
> Qiaoyi
>
> ------------------------------------------------------------------
> From: Paul Rogers <[email protected]>
> Sent: Saturday, August 4, 2018, 02:27
> To: dev <[email protected]>
> Subject: Re: Is Drill query execution processing model just the same idea with the Spark whole-stage codegen improvement
>
> Hi Qiaoyi,
>
> As you noted, Drill and Spark have similar models -- but with important differences.
>
> Drill is schema-on-read (also called "schema-less"). In particular, this means that Drill does not know the schema of the data until the first row (actually, "record batch") arrives at each operator. Once Drill sees that first batch, it has a data schema and can generate the corresponding code, but only for that one operator.
>
> The above process repeats up the fragment ("fragment" is Drill's term for a Spark stage.)
>
> I believe that Spark requires (or at least allows) the user to define a schema up front. This is particularly true for the more modern data frame APIs.
>
> Do you think the Spark improvement would apply to Drill's case of determining the schema operator-by-operator up the DAG?
>
> Thanks,
> - Paul
>
> On Friday, August 3, 2018, 8:57:29 AM PDT, 丁乔毅(智乔) <[email protected]> wrote:
>
> Hi, all.
>
> I'm very new to Apache Drill. I'm quite interested in the implementation of Drill's query execution. After a little bit of source code reading, I found it is built on a processing model quite like a data-centric, push-based style, which is very similar to the idea behind the Spark whole-stage codegen improvement (JIRA ticket https://issues.apache.org/jira/browse/SPARK-12795).
>
> And I wonder, is there any detailed documentation about this?
> What is the consideration behind this design in the Drill project? :)
>
> Regards,
> Qiaoyi
