[ 
https://issues.apache.org/jira/browse/DRILL-6552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568665#comment-16568665
 ] 

Paul Rogers commented on DRILL-6552:
------------------------------------

Is this under active development? Is a design document available?

Based on experience with several other tools, I wonder if Drill could build on 
its storage plugin, format plugin and UDF model to allow metadata to be seen as 
an extension, customized for various environments.

For example, Drill has its own Parquet metadata system. This PR mentions the 
Hive Metastore (HMS). Companies such as AtScale, Alation and Looker maintain 
their own meta catalogs. And, we have often wished for a simpler, per-file (or 
per-table) system that does not need the complexity and overhead of the HMS.

This seems a perfect opportunity to define a metadata API, allowing the 
community to add a variety of implementations.

The API might allow the planner to request metadata for a table (file, 
directory) at plan time. Plan-time metadata might include schema and 
statistics. For a simple JSON file, the schema might be in the form of a JSON 
schema. For a file created in Hive, the metadata might come from HMS. But, 
regardless of the source, the data would be converted to a form that the 
planner could consume.
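One way such a pluggable API might look, as a minimal sketch only (the interface and class names below are illustrative assumptions, not actual Drill interfaces):

```java
import java.util.*;

// Hypothetical sketch of a pluggable metadata API. A provider resolves
// plan-time metadata (schema plus statistics) for a table, file, or
// directory, regardless of whether the backing store is Drill's Parquet
// metadata cache, the Hive Metastore, or a simple per-table file.
public class MetadataApiSketch {

  interface MetadataProvider {
    // Returns plan-time metadata for the given table path, if known.
    Optional<TableMetadata> resolve(String tablePath);
  }

  static class ColumnMetadata {
    final String name;
    final String type;  // e.g. "VARCHAR", "DATE" -- illustrative type names
    ColumnMetadata(String name, String type) {
      this.name = name;
      this.type = type;
    }
  }

  static class TableMetadata {
    final List<ColumnMetadata> columns;  // schema the planner can consume
    final long estimatedRowCount;        // a statistic for cost estimation
    TableMetadata(List<ColumnMetadata> columns, long estimatedRowCount) {
      this.columns = columns;
      this.estimatedRowCount = estimatedRowCount;
    }
  }

  // Trivial in-memory implementation, standing in for an HMS-backed or
  // file-backed one. The point is the common API, not the storage.
  static class InMemoryProvider implements MetadataProvider {
    private final Map<String, TableMetadata> tables = new HashMap<>();
    void register(String path, TableMetadata md) { tables.put(path, md); }
    public Optional<TableMetadata> resolve(String tablePath) {
      return Optional.ofNullable(tables.get(tablePath));
    }
  }

  public static void main(String[] args) {
    InMemoryProvider provider = new InMemoryProvider();
    provider.register("/data/orders.json", new TableMetadata(
        Arrays.asList(new ColumnMetadata("id", "BIGINT"),
                      new ColumnMetadata("ship_date", "DATE")),
        1_000_000L));
    TableMetadata md = provider.resolve("/data/orders.json").get();
    System.out.println(md.columns.size() + " columns, ~"
        + md.estimatedRowCount + " rows");
  }
}
```

The planner would depend only on MetadataProvider, so Parquet, HMS, or vendor catalogs each become one implementation behind the same API.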

Then, the planner can pass physical schema along to each scan operator. For 
example, in the JSON schema case above, the planner would provide not only the 
project list (list of columns), but also metadata for those columns (data type, 
date formats, perhaps expected string lengths, etc.)
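For illustration, the plan-time schema for a simple JSON file could be expressed as a standard JSON Schema document along these lines (a hypothetical example of the kind of per-file schema meant above, not a format Drill defines):

```json
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "id":        { "type": "integer" },
    "ship_date": { "type": "string", "format": "date" },
    "customer":  { "type": "string", "maxLength": 64 }
  }
}
```

This carries exactly the per-column details mentioned above: data types, a date format, and an expected string length.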

The scan operator would then use the schema, if available, to avoid the kinds 
of schema ambiguities discussed in DRILL-6035.

Further, note that if schema is available for a scan, then schema is available 
for the next downstream operator, say a filter. Since schema is available to 
the filter, the planner can predict the schema coming out of the filter, and so 
on up the DAG. As a result, with schema, the planner can produce better-optimized 
plans (it will know, say, rough record sizes as well as data types).

This, in turn, could allow Drill to generate and compile code at plan time, 
then distribute it to the workers, saving the cost of generating the (now 
identical) code in each of dozens of minor fragments.

Lots of good opportunities here. Look forward to seeing a design document.

> Drill Metadata management "Drill MetaStore"
> -------------------------------------------
>
>                 Key: DRILL-6552
>                 URL: https://issues.apache.org/jira/browse/DRILL-6552
>             Project: Apache Drill
>          Issue Type: New Feature
>          Components: Metadata
>    Affects Versions: 1.13.0
>            Reporter: Vitalii Diravka
>            Assignee: Vitalii Diravka
>            Priority: Major
>             Fix For: 2.0.0
>
>
> It would be useful for Drill to have some sort of metastore that would 
> enable Drill to remember previously defined schemata, so it doesn't have to 
> do the same work over and over again.
> Such a metastore would hold schema and statistics, accelerating query 
> validation, planning, and execution. It would also improve Drill's stability 
> and help avoid various kinds of issues: "schema change" exceptions, "limit 0" 
> optimization, and so on.
> One of the main candidates is the Hive Metastore.
> Starting with version 3.0, the Hive Metastore can run as a service separate 
> from the Hive server:
> [https://cwiki.apache.org/confluence/display/Hive/AdminManual+Metastore+3.0+Administration]
> An optional enhancement is to store Drill's profiles, UDFs, and plugin 
> configs in the metastore as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
