[ https://issues.apache.org/jira/browse/DRILL-6552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16621428#comment-16621428 ]

Paul Rogers commented on DRILL-6552:
------------------------------------

[~okalinin], some thoughts on your very thoughtful summary.
{quote}Part of Drill MS motivation
{quote}
The three key motivations are:

1. Schema (know actual types to avoid ambiguities, a.k.a. schema changes).
2. More efficient partition pruning/row group pruning.
3. Better plans (NDV, histograms, etc.).

These are related, but distinct. Today we have 2 without 1 or 3. HMS can 
probably provide all 3 (maybe without histograms). Item 1 is user provided, 
item 2 is a statement of the disk layout, and item 3 is machine generated.

One thought is to separate the metadata consumer API (for the planner) from the 
producer implementation. For Parquet, you could easily implement 1 and 2 in the 
Foreman itself, using metadata read from the files. (Be sure to consider the 
Parquet type metadata, not just the physical type.)

Then, if this works, a next step might be to use the existing Drill Parquet 
metadata JSON files, when available.

After that, experiment with distribution, such as asking each node for the set 
of files in some partition.

For CSV and JSON files, maybe allow a ".schema" (or some such) file in the data 
directory that provides 1 (the file schema).
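Purely as an illustration, such a ".schema" file might look like the following. (The file layout, field names, and type spellings here are hypothetical, not an existing Drill format.)

```json
{
  "columns": [
    { "name": "event_time",  "type": "TIMESTAMP", "format": "yyyy-MM-dd HH:mm:ss" },
    { "name": "http_status", "type": "INT" },
    { "name": "user_agent",  "type": "VARCHAR" }
  ]
}
```

The reader for that directory would consult this file to resolve column types up front, rather than inferring them row by row.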

The suggestion is, take this in small steps. The trick is determining what 
those steps might be.
{quote}ideas on distributing that work
{quote}
I would suggest that for simplicity, we look at a way to gather stats using the 
existing query framework. [~gparai], is that what you did in your work a while 
back? Basically, farm out the work as a query, then send results back to the 
foreman. Rather than stream results to the user (in Screen), gather them in some 
useful format (similar to a CTAS). That will save the complexity of inventing 
yet another task orchestrator, with all its states, failure modes, race 
conditions, and so on.

[~gparai] said:
{quote}Any suggestions ... to see if we can speed up the existing 
implementation further.
{quote}
Suggestions fall into two buckets:

1. Reducing the scope of the work. (Do we have to scan all the data in every 
file?)
2. Speeding up the implementation.

On item 1, the sampling idea seems valid. There are many fields that have a 
distribution independent of the specific file. For example, the "User Agent" in 
a web log file probably has a moderate set of values independent of the day of 
the month. Sample one day and you've got a good estimate of all days.

Sampling is not always helpful. In the same web log time series, if we sample 
date on a single day, we'd guess there is only one value. But, across a year of 
files we'd have considerably more values.
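To make that pitfall concrete, here is a small synthetic sketch (illustrative Python, not Drill code): sampling one daily file gives a fine NDV estimate for a column like "http_status", whose distribution is file independent, but badly underestimates a column like "date", which has one value per file.

```python
import random

# Synthetic web log: 365 daily files of 1,000 rows each.
# "http_status" has a small, file-independent distribution;
# "date" has exactly one value per file.
random.seed(42)
statuses = [200, 200, 200, 404, 301, 500]   # skewed toward 200
dates = [f"day-{d:03d}" for d in range(365)]
files = {d: [(d, random.choice(statuses)) for _ in range(1000)] for d in dates}

# NDV estimated from a one-day sample.
sample = files["day-000"]
ndv_status_sample = len({status for _, status in sample})
ndv_date_sample = len({date for date, _ in sample})

# True NDV across the full year of files.
all_rows = [row for rows in files.values() for row in rows]
ndv_status_true = len({status for _, status in all_rows})
ndv_date_true = len({date for date, _ in all_rows})

# The one-day sample nails "http_status", but guesses a single value
# for "date", which really has 365 distinct values across the year.
print(ndv_status_sample, ndv_status_true, ndv_date_sample, ndv_date_true)
```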

If we are not sure how to proceed, let the user provide hints somehow. Specify 
which columns to check and the sampling frequency of each.

Heck, for a quick and dirty solution, let the user specify a guess at NDV. Even 
an order of magnitude is helpful. (Knowing the "is_active" field has two 
values, "http_status" has 20 and "date" has 1000 is much better than knowing 
nothing at all.)

On implementation, we talked about using arrays of numbers rather than maps so 
that we'd muck with just a few large vectors rather than a zillion small ones. 
Access would be by offset, not string key. I don't expect much of a performance 
impact, however, relative to the cost of scanning TBs of data. Such a change 
would be easier on memory and lighter on the CPU when run alongside other 
queries.
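A rough sketch of that layout idea (illustrative Python; Drill's value vectors are Java): column names are resolved to offsets once up front, and each per-column accumulator is addressed by position, so no string key is hashed per cell.

```python
# Resolve column names to offsets once, before the scan.
columns = ["date", "http_status", "user_agent"]
col_index = {name: i for i, name in enumerate(columns)}

# One flat accumulator per column, addressed by offset, instead of
# a per-row map keyed by column name.
distinct = [set() for _ in columns]
non_null = [0] * len(columns)

rows = [
    ("2018-06-01", 200, "curl/7.58"),
    ("2018-06-01", 404, "Mozilla/5.0"),
    ("2018-06-02", 200, "curl/7.58"),
]
for row in rows:
    for i, value in enumerate(row):   # offset access, no per-cell key lookup
        if value is not None:
            distinct[i].add(value)
            non_null[i] += 1

ndv = [len(s) for s in distinct]
print(ndv)  # [2, 2, 2]
```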

An even better approach is to keep track of which files have been scanned and 
avoid scanning them again. (At least on HDFS and S3, files are immutable.)
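A minimal sketch of that incremental idea (hypothetical structure, not Drill's API): keep per-file stats keyed by path, and on each refresh scan only paths not already in the cache, which is sound precisely because files on HDFS and S3 are immutable.

```python
def scan_file(path, contents):
    """Pretend to scan one file; returns per-column distinct values."""
    return {col: set(vals) for col, vals in contents.items()}

def refresh_stats(cache, files):
    """files: {path: contents}. Scans only paths missing from cache."""
    scanned = []
    for path, contents in files.items():
        if path not in cache:          # immutable files: never rescan
            cache[path] = scan_file(path, contents)
            scanned.append(path)
    return scanned

cache = {}
day1 = {"/logs/2018-06-01.csv": {"http_status": [200, 404, 200]}}
refresh_stats(cache, day1)             # first refresh scans one file

day2 = dict(day1)
day2["/logs/2018-06-02.csv"] = {"http_status": [200, 500]}
newly = refresh_stats(cache, day2)     # scans only the new file
print(newly)  # ['/logs/2018-06-02.csv']

# Merge per-file stats into a global NDV.
ndv = len(set().union(*(s["http_status"] for s in cache.values())))
print(ndv)  # 3 distinct statuses: 200, 404, 500
```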

It would be really cool to scan files as they arrive, though this is beyond the 
scope of Drill; it would need some daemon triggered by file arrival in HDFS.

That's quite a bit of complexity to tackle all at once! How might this be 
broken down into useful bitesize chunks?

> Drill Metadata management "Drill MetaStore"
> -------------------------------------------
>
>                 Key: DRILL-6552
>                 URL: https://issues.apache.org/jira/browse/DRILL-6552
>             Project: Apache Drill
>          Issue Type: New Feature
>          Components: Metadata
>    Affects Versions: 1.13.0
>            Reporter: Vitalii Diravka
>            Assignee: Vitalii Diravka
>            Priority: Major
>             Fix For: 2.0.0
>
>
> It would be useful for Drill to have some sort of metastore which would 
> enable Drill to remember previously defined schemata so Drill doesn’t have to 
> do the same work over and over again.
> It would allow storing schema and statistics, which would accelerate query 
> validation, planning, and execution. It would also increase the stability of 
> Drill and help avoid various kinds of issues: "schema change 
> Exceptions", "limit 0" optimization, and so on. 
> One of the main candidates is Hive Metastore.
> Starting from version 3.0, Hive Metastore can run as a separate service from 
> the Hive server:
> [https://cwiki.apache.org/confluence/display/Hive/AdminManual+Metastore+3.0+Administration]
> An optional enhancement is storing Drill's profiles, UDFs, and plugin configs 
> in some kind of metastore as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
