Hi Alex,

Perhaps Parth can jump in here as he has deeper knowledge of Parquet.

My understanding is that the planning-time metadata is used for partition 
(directory) and row group pruning. By scanning each Parquet file's footer, 
Drill can determine whether a given row group contains data that the query must 
process. Only groups that could contain target data are included in the list of 
groups to be scanned by the query's scan operators.

For example, suppose we do a query over a date range and a US state:

SELECT saleDate, prodId, quantity FROM `sales` WHERE dir0 BETWEEN '2018-07-01' 
AND '2018-07-31' AND state = 'CA'

Suppose sales is a directory that contains subdirectories partitioned by date, 
and that state is a dictionary-encoded field with metadata in the Parquet 
footer.

In this case, Drill reads the directory structure to do the partition pruning 
at planning time. Only the matching directories are added to the list of files 
to be scanned at run time.
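
To make that concrete, the directory-pruning step is conceptually something 
like the sketch below. (The class and method names are mine, invented for 
illustration; the real planner logic is, of course, far more general than a 
hard-coded date check.)

  // Illustrative only: prune date-named partition directories against the
  // query's dir0 range. Not Drill's actual planner code.
  import java.time.LocalDate;
  import java.util.ArrayList;
  import java.util.Arrays;
  import java.util.List;

  public class PartitionPruneSketch {

    // Keep only subdirectories whose names fall inside [start, end].
    static List<String> pruneByDate(List<String> subDirs,
                                    LocalDate start, LocalDate end) {
      List<String> kept = new ArrayList<>();
      for (String dir : subDirs) {
        LocalDate dirDate = LocalDate.parse(dir);    // e.g. "2018-07-15"
        if (!dirDate.isBefore(start) && !dirDate.isAfter(end)) {
          kept.add(dir);                             // dir0 BETWEEN start AND end
        }
      }
      return kept;
    }

    public static void main(String[] args) {
      List<String> dirs = Arrays.asList("2018-06-30", "2018-07-01",
                                        "2018-07-15", "2018-08-01");
      System.out.println(pruneByDate(dirs, LocalDate.parse("2018-07-01"),
                                     LocalDate.parse("2018-07-31")));
      // Prints: [2018-07-01, 2018-07-15]
    }
  }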

Then, Drill scans the Parquet footers to find the row groups whose `state` 
column might contain the value 'CA'. (I believe the footer scanning 
is done only for matching partitions, but I've not looked at that code 
recently.)
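
The footer check itself boils down to comparing the query's constant against 
the per-row-group column statistics. A hand-wavy sketch, using made-up types 
rather than the real Parquet or Drill classes:

  // Hypothetical sketch of row-group pruning via footer statistics. ColumnStats
  // stands in for the per-row-group min/max values that Parquet keeps in the
  // file footer; it is not an actual Parquet or Drill class.
  import java.util.List;
  import java.util.stream.Collectors;

  public class RowGroupPruneSketch {

    static class ColumnStats {
      final String min, max;
      ColumnStats(String min, String max) { this.min = min; this.max = max; }

      // The row group can hold 'value' only if min <= value <= max.
      boolean mightContain(String value) {
        return min.compareTo(value) <= 0 && max.compareTo(value) >= 0;
      }
    }

    static class RowGroup {
      final String file;
      final ColumnStats stateStats;   // stats for the `state` column
      RowGroup(String file, ColumnStats stateStats) {
        this.file = file;
        this.stateStats = stateStats;
      }
    }

    // Keep only the row groups whose stats say they might contain the state.
    static List<RowGroup> prune(List<RowGroup> groups, String state) {
      return groups.stream()
          .filter(g -> g.stateStats.mightContain(state))
          .collect(Collectors.toList());
    }
  }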

Finally, Drill passes the matching set of files to the distribution planner, 
which assigns files to minor fragments distributed across the Drill cluster. 
The result attempts to maximize locality and minimize skew.
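
As a rough illustration of that assignment step, here is a toy 
locality-plus-load heuristic. (Again, this is not the actual parallelizer 
code, which is considerably more involved.)

  // Toy sketch of assigning pruned files to minor fragments: prefer a fragment
  // on a node that hosts the file's data, break ties by current load, and fall
  // back to the least-loaded fragment when no local one exists.
  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;
  import java.util.Set;

  public class AssignmentSketch {

    static Map<String, Integer> assign(Map<String, Set<String>> fileHosts,
                                       List<String> fragmentNodes) {
      int[] load = new int[fragmentNodes.size()];
      Map<String, Integer> result = new HashMap<>();
      for (Map.Entry<String, Set<String>> file : fileHosts.entrySet()) {
        int best = -1;
        // First choice: least-loaded fragment running on a node with the data.
        for (int f = 0; f < fragmentNodes.size(); f++) {
          if (file.getValue().contains(fragmentNodes.get(f))
              && (best < 0 || load[f] < load[best])) {
            best = f;
          }
        }
        // Fallback: least-loaded fragment anywhere (no locality possible).
        if (best < 0) {
          best = 0;
          for (int f = 1; f < fragmentNodes.size(); f++) {
            if (load[f] < load[best]) { best = f; }
          }
        }
        load[best]++;
        result.put(file.getKey(), best);
      }
      return result;
    }
  }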

Could this have been done differently? There is a trade-off. If the work is 
done in the planner, then it can be very expensive for large data sets. On the 
other hand, the resulting distribution plan has minimal skew: each file that is 
scheduled for scanning does, in fact, need to be scanned.

Suppose that Drill pushed the partition and row group filtering to the scan 
operator. In this case, operators that received files outside of the target 
date range would do nothing, while those with files in the range would need to 
do further work, resulting in possible skew. (One could argue that, if files 
are randomly distributed across nodes, then, on average, each scan operator 
will see roughly the same number of included and excluded files, reducing skew. 
This would be an interesting experiment to try.)
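
A quick-and-dirty version of that experiment can even be simulated: scatter 
files across operators at random and see how uneven the useful work turns out 
per operator. Something like the following (purely illustrative numbers):

  // Quick simulation of the "random placement reduces skew" argument: scatter
  // files across scan operators at random and report how uneven the useful
  // work (files that survive pruning) ends up per operator.
  import java.util.Random;

  public class SkewSimulation {
    public static void main(String[] args) {
      int files = 10_000, operators = 20;
      double selectivity = 0.1;            // fraction surviving pruning
      int[] useful = new int[operators];
      Random rand = new Random(42);
      for (int i = 0; i < files; i++) {
        int op = rand.nextInt(operators);  // random placement of this file
        if (rand.nextDouble() < selectivity) {
          useful[op]++;                    // this operator gets real work
        }
      }
      int min = Integer.MAX_VALUE, max = 0;
      for (int n : useful) {
        min = Math.min(min, n);
        max = Math.max(max, n);
      }
      System.out.printf("useful files per operator: min=%d max=%d avg=%.1f%n",
          min, max, files * selectivity / operators);
    }
  }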

Another approach might be a two-step execution: first, distribute the work of 
partition pruning and row group footer checking across the cluster and gather 
the results; then finish planning the query and distribute the execution as is 
done today. However, much new code would be needed to implement this new form 
of task, which is probably why it has not yet been done.

It might be interesting for someone to prototype that to see whether planning 
performance improves. My guess is that a prototype could be built that does 
the "pre-selection" work as a Drill query, just to see if distribution of the 
work helps. If that is a win, then perhaps a more robust solution could be 
developed from there.

The final factor is Drill's crude-but-effective metadata caching mechanism. 
Drill reads footer and directory information for all Parquet files in a 
directory structure and stores it in a series of possibly-large JSON files, one 
in each directory. This mechanism has no good concurrency control. (This is a 
long-standing, hard-to-fix bug.) Centralizing access in the Foreman reduces the 
potential for concurrent writes on metadata refresh (but does not eliminate 
them, since the Foremen are themselves distributed). Further, since the metadata 
is in one big file (in the root directory) and is in JSON (which does not 
support block-splitting), it is actually not possible to distribute the 
planning-time selection work. I understand that the team is looking at 
alternative metadata implementations to avoid some of these existing issues. 
(There was some discussion on the dev list over the last few months.)

In theory it would be possible to adopt a different metadata structure that 
would better support distributed selection planning. Perhaps farm out work at 
each directory level: in each directory, farm out work for the subdirectories 
(if they match the partition selection) and files. Recursively apply this down 
the directory tree. But, again, Drill does not have a general-purpose task 
distribution (map/reduce) mechanism; Drill only knows how to distribute 
queries. The metadata would also need to be stored in a format that is 
versioned (to avoid conflicting concurrent reads and writes) and block-oriented 
(to allow splitting the work of reading a large metadata file).
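
To make that concrete, a per-directory metadata record might look roughly like 
the sketch below. (Entirely hypothetical; none of these types exist in Drill 
today.)

  // Entirely hypothetical: a per-directory metadata record that is versioned
  // (so a reader can detect a concurrent refresh) and recursive (each directory
  // lists its own files plus child directories, so reading a large tree can be
  // farmed out directory by directory).
  import java.util.List;

  public class DirectoryMetadataSketch {

    static class FileEntry {
      String path;
      long rowCount;
      // Per-column min/max stats would go here, so row-group pruning could be
      // distributed along with the directory walk.
    }

    static class DirectoryMetadata {
      long version;            // bumped on each refresh; readers check it
      String partitionValue;   // e.g. "2018-07-15" for a dir0 partition
      List<FileEntry> files;   // Parquet files directly in this directory
      List<String> childDirs;  // subdirectories, each with its own record
    }
  }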

It is interesting to note that several systems (including Netflix's Metacat 
[1]) have also hit bottlenecks in metadata access when metadata is stored in a 
relational DB, as Hive does. Hive-oriented tools hit the same kind of metadata 
bottleneck, but for a different reason: the huge load placed on the Hive 
metastore. So, there is an opportunity for innovation in this area by 
observing, as you did, that metadata itself is a big data problem and requires 
a distributed, concurrent solution.

That's my two cents. It would be great if folks with more detailed Parquet 
experience could fill in the details (or correct any errors).

Thanks,
- Paul


[1] 
https://medium.com/netflix-techblog/metacat-making-big-data-discoverable-and-meaningful-at-netflix-56fb36a53520
 

    On Thursday, August 9, 2018, 9:09:59 AM PDT, Oleksandr Kalinin 
<[email protected]> wrote:  
 
 Hi Paul and Drill developers,

I am sorry if this is slightly off-topic, but I noticed that Drill's Foreman
collects metadata of all queried files in the PLANNING state (see, e.g., the
MetadataGatherer class), at least in the case of Parquet when using the dfs
plugin. That costs a lot of time when the number of queried files is
substantial, since the MetadataGatherer task is not distributed across cluster
nodes. What is the reason behind this collection? This doesn't seem to match
the schema-on-read philosophy, but then maybe it's just me or my setup; I am
still very new to
Drill.

Also, I appreciate the advantage of metastore-free operations - in theory
it makes Drill more reliable and less costly to run. At the same time there
is the Drill Metadata store project, meaning that the evolution is actually
shifting away from a metastore-free system. What are the reasons for that
evolution? Or is it going to be an additional optional feature?

Thanks,
Best Regards,
Alex

On Tue, Aug 7, 2018 at 10:25 PM, Paul Rogers <[email protected]>
wrote:

> Hi Qiaoyi,
>
> In general, optimal performance occurs when a system knows the schema at
> the start and can fully optimize based on that schema. Think C or C++
> compilers compared with Java or Python.
>
> On the other hand, the JVM HotSpot optimizer has shown that one can
> achieve very good performance via incremental optimization, but at the cost
> of extreme runtime complexity. The benefit is that Java is much more
> flexible, machine independent, and simpler than C or C++ (at least for
> non-system applications.)
>
> Python is the other extreme: it is so dynamic that the literature has
> shown that it is very difficult to optimize Python at the compiler or
> runtime level. (Though, there is some interesting research on this topic.
> See [1], [2].)
>
> Drill is somewhere in the middle. Drill does not do code generation at the
> start like Impala or Spark do. Nor is it fully interpreted. Rather, Drill
> is roughly like Java: code generation is done at runtime based on the
> observed data types. (The JVM does machine code generation based on
> observed execution patterns.) The advantage is that Drill is able to
> achieve its pretty-good performance without the cost of a metadata system
> to provide schema at plan time.
>
> So, to get the absolute fastest performance (think Impala), you must pay
> the cost of a metadata system for all queries. Drill gets nearly as good
> performance without the complexity of the Hive metastore -- a pretty good
> tradeoff.
>
> If I may ask, what is your area of interest? Are you looking to use Drill
> for some project? Or, just interested in how Drill works?
>
> Thanks,
> - Paul
>
> [1] Towards Practical Gradual Typing:
> https://blog.acolyer.org/2015/08/03/towards-practical-gradual-typing/
> [2] Is Sound Gradual Typing Dead?
> https://blog.acolyer.org/2016/02/05/is-sound-gradual-typing-dead/
>
>
>
>
>
>
>
>    On Sunday, August 5, 2018, 11:36:52 PM PDT, 丁乔毅(智乔) <
> [email protected]> wrote:
>
>  Thanks Paul, good to know the design principles of the Drill query
> execution process model.
> I am very new to Drill, please bear with me.
>
> One more question.
> As you mentioned, the schema-free processing is the key advantage over
> Spark. Is there any performance consideration behind this design besides
> the techniques of dynamic codegen and vectorized computation?
>
> Regards,
> Qiaoyi
>
>
> ------------------------------------------------------------------
> From: Paul Rogers <[email protected]>
> Sent: Saturday, August 4, 2018, 02:27
> To: dev <[email protected]>
> Subject: Re: Is Drill query execution processing model just the same idea with
> the Spark whole-stage codegen improvement
>
> Hi Qiaoyi,
> As you noted, Drill and Spark have similar models -- but with important
> differences.
> Drill is schema-on-read (also called "schema less"). In particular, this
> means that Drill does not know the schema of the data until the first row
> (actually "record batch") arrives at each operator. Once Drill sees that
> first batch, it has a data schema, and can generate the corresponding code;
> but only for that one operator.
> The above process repeats up the fragment ("fragment" is Drill's term for
> a Spark stage.)
> I believe that Spark requires (or at least allows) the user to define a
> schema up front. This is particularly true for the more modern data frame
> APIs.
> Do you think the Spark improvement would apply to Drill's case of
> determining the schema operator-by-operator up the DAG?
> Thanks,
> - Paul
>
>
>
>    On Friday, August 3, 2018, 8:57:29 AM PDT, 丁乔毅(智乔) <
> [email protected]> wrote:
>
>
> Hi, all.
>
> I'm very new to Apache Drill.
>
> I'm quite interested in Drill's query execution implementation.
> After a little bit of source code reading, I found it is built on a
> processing model quite like a data-centric, push-based style, which is
> very similar to the idea behind the Spark whole-stage codegen
> improvement (JIRA ticket https://issues.apache.org/jira/browse/SPARK-12795).
>
> And I wonder, is there any detailed documentation about this? What's the
> consideration behind the design in the Drill project? : )
>
> Regards,
> Qiaoyi
