[ 
https://issues.apache.org/jira/browse/DRILL-6552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16581724#comment-16581724
 ] 

Parth Chandra commented on DRILL-6552:
--------------------------------------

Some thoughts I had jotted down on this topic a while ago. (These might be more 
than what people have in mind for the first cut, but I figured I'd throw them 
into the discussion anyway.)

There are three parts to this problem:

1) The design of the schema of the metastore itself (the schema).

2) The storage of the metadata in the metastore (the store).

3) The metadata APIs. 

As an example, the metadata cache for Parquet is a metadata store that describes 
Parquet files, their schema, the rowgroups within the files, and statistics for 
the rowgroups. There have been at least three versions of the information kept 
for Parquet files; i.e., the schema and APIs have had at least three versions. 
The storage layer is simply files on HDFS. This solution was easy to develop, 
but it has shortcomings when it comes to concurrent access and updates, and it 
does not scale well for directories that may have tens of thousands of files.

The schema must be versioned, allowing multiple versions of Drill to access 
data from the same metastore. This means that not only must the schema be 
versioned, but the metastore API must also provide backward and forward 
compatibility.
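
To make the compatibility requirement concrete, here is a minimal sketch of a 
reader that tolerates both older and newer writers. All names (VersionedRecord, 
TableInfoReader, the "rows" and "rowCount" keys) are hypothetical and chosen for 
illustration only; this is not the format of the existing Parquet metadata cache.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical record: every stored object carries the schema version of its writer.
final class VersionedRecord {
    final int schemaVersion;
    final Map<String, String> fields; // all fields, whether this reader knows them or not

    VersionedRecord(int schemaVersion, Map<String, String> fields) {
        this.schemaVersion = schemaVersion;
        this.fields = new HashMap<>(fields);
    }
}

// Hypothetical reader written against schema version 2.
final class TableInfoReader {
    // Returns the row count, or -1 if the record does not carry one.
    static long rowCount(VersionedRecord record) {
        // Backward compatibility: assume version 1 writers used the key "rows".
        if (record.schemaVersion < 2 && record.fields.containsKey("rows")) {
            return Long.parseLong(record.fields.get("rows"));
        }
        // Forward compatibility: newer writers may add fields. Unknown fields are
        // simply ignored; only the key this reader understands is looked up.
        String value = record.fields.get("rowCount");
        return value == null ? -1L : Long.parseLong(value);
    }
}
```

The point of the sketch is only that a reader never fails on a field it does not 
recognize and still understands the keys of older versions.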

The schema representation must be extensible, allowing new objects to be added 
without requiring additional code in the metastore access layer. For instance, 
a new storage plugin may have properties that we are not aware of but that 
still need to be stored and retrieved.
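
One way to picture this kind of extensibility is an object with an open-ended 
property bag, so that the access layer never needs to know the property names 
in advance. The sketch below is an assumption made for illustration; 
MetadataObject and its methods are hypothetical names, not an existing Drill API.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical generic metadata object: a kind plus arbitrary named properties.
final class MetadataObject {
    private final String kind;
    private final Map<String, Object> properties = new LinkedHashMap<>();

    MetadataObject(String kind) {
        this.kind = kind;
    }

    String kind() {
        return kind;
    }

    // Stores any property without the access layer knowing its name up front.
    MetadataObject with(String key, Object value) {
        properties.put(key, value);
        return this;
    }

    // Returns the property if it exists and has the requested type.
    <T> Optional<T> get(String key, Class<T> type) {
        Object value = properties.get(key);
        return type.isInstance(value) ? Optional.of(type.cast(value)) : Optional.empty();
    }
}
```

A new storage plugin could then store its own properties, for example 
new MetadataObject("storage-plugin").with("endpoint", "host:1234"), without any 
change to the code that persists MetadataObject instances.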

  An initial list of items that can be stored:

    Schemas (tables, columns, types). Types may be complex.

    Files, file splits and locality information.

    Table partitioning information.

    Column statistics.

    UDF and built-in function definitions.

    Storage plugin configurations.

    Runtime metadata (query profiles).

The store must allow concurrent reads and writes. Reads are likely to be orders 
of magnitude more frequent than writes. A common use case is Parquet files 
produced by an external source being added to a subdirectory every day or every 
hour while the parent directory (and therefore all subdirectories under it) is 
being queried by end users.
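
For a read-heavy workload like this, one possible approach (an assumption for 
illustration, not a proposal for the actual store) is copy-on-write: readers get 
an immutable snapshot without locking, and a writer publishes a new snapshot 
atomically. DirectoryMetadata is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical in-memory holder for the list of files under a directory.
final class DirectoryMetadata {
    private final AtomicReference<List<String>> files =
        new AtomicReference<>(Collections.<String>emptyList());

    // Readers never block: they receive the current immutable snapshot.
    List<String> snapshot() {
        return files.get();
    }

    // A writer copies the current list, appends, and swaps the new list in atomically.
    // The update function has no side effects, so it is safe if it is retried.
    void addFiles(List<String> newFiles) {
        files.updateAndGet(current -> {
            List<String> next = new ArrayList<>(current);
            next.addAll(newFiles);
            return Collections.unmodifiableList(next);
        });
    }
}
```

A query that took its snapshot before new files arrived keeps planning against 
a consistent view, while later queries see the added files.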

The implementation must scale: it should be able to store metadata for hundreds 
of thousands of data files and serve hundreds of concurrent reads of that 
metadata.

It is highly desirable that the Drill planner and execution engine be able to 
access the metadata without knowledge of the underlying store. The underlying 
store may be the file system, a relational database, a NoSQL database, another 
metastore, or even in-memory. This necessarily implies an API design that 
separates out the underlying storage. Note that this also allows an existing 
metastore (like the Hive Metastore) to be used.
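
The separation described above could be sketched roughly as follows. The 
interface and class names (MetadataStore, InMemoryMetadataStore, PlannerSide) 
and the use of plain strings for metadata are assumptions made to keep the 
example short; they are not an existing or proposed Drill API.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical store-agnostic API: callers see only this interface.
interface MetadataStore {
    Optional<String> read(String tableName);

    void write(String tableName, String metadata);
}

// One possible backing store. A file-system, RDBMS, or Hive Metastore backed
// implementation would implement the same interface.
final class InMemoryMetadataStore implements MetadataStore {
    private final Map<String, String> data = new ConcurrentHashMap<>();

    @Override
    public Optional<String> read(String tableName) {
        return Optional.ofNullable(data.get(tableName));
    }

    @Override
    public void write(String tableName, String metadata) {
        data.put(tableName, metadata);
    }
}

// Stand-in for planner code: it depends on MetadataStore, never on a concrete store.
final class PlannerSide {
    static boolean hasMetadata(MetadataStore store, String tableName) {
        return store.read(tableName).isPresent();
    }
}
```

Swapping the backing store then means supplying a different MetadataStore 
implementation, with no change to the planner-side code.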

The initial implementation of the metastore may need to support more than one 
implementation of the underlying store.

Since accessing the metastore is a critical operation, the metastore 
implementation must not have a single point of failure. 
 
 

> Drill Metadata management "Drill MetaStore"
> -------------------------------------------
>
>                 Key: DRILL-6552
>                 URL: https://issues.apache.org/jira/browse/DRILL-6552
>             Project: Apache Drill
>          Issue Type: New Feature
>          Components: Metadata
>    Affects Versions: 1.13.0
>            Reporter: Vitalii Diravka
>            Assignee: Vitalii Diravka
>            Priority: Major
>             Fix For: 2.0.0
>
>
> It would be useful for Drill to have some sort of metastore which would 
> enable Drill to remember previously defined schemata so Drill doesn’t have to 
> do the same work over and over again.
> It would allow Drill to store schema and statistics, which would speed up 
> query validation, planning and execution. It would also increase the 
> stability of Drill and help avoid different kinds of issues: "schema change 
> Exceptions", "limit 0" optimization and so on. 
> One of the main candidates is Hive Metastore.
> Starting from version 3.0, Hive Metastore can run as a service separate from 
> Hive server:
> [https://cwiki.apache.org/confluence/display/Hive/AdminManual+Metastore+3.0+Administration]
> An optional enhancement is storing Drill's profiles, UDFs, and plugin configs 
> in some kind of metastore as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
