[ 
https://issues.apache.org/jira/browse/DRILL-6552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16620742#comment-16620742
 ] 

Oleksandr Kalinin commented on DRILL-6552:
------------------------------------------

[~vitalii] [~paul-rogers]
Summarizing some ideas from recent related discussions in mailing list with 
your input.

Part of the Drill MS motivation is metadata handling optimization (e.g. "Avoid 
stage of discovering metadata for every query during execution", "Produce 
different kinds of validation before execution stage", "Decrease planning time 
spent for collecting information about files and partitions"). That is not the 
entire scope, but it seems to be the bulk of it. It looks like the 
scalability/performance issue around metadata is largely caused by metadata 
collection/processing being done by a single drillbit (the foreman) in a 
non-distributed fashion. If metadata collection and validation work could be 
distributed across cluster nodes, it would become horizontally scalable and 
thus the issue (hopefully) could be eliminated.

Some ideas on distributing that work were discussed in the mailing list. IMHO 
a simplistic form of work distribution could be sufficient to tackle this 
particular issue: a map-only, non-fault-tolerant type of map-reduce. In simple 
words: split the work into chunks, send them to all or some active drillbits 
for processing, collect/aggregate the replies and proceed. If any interim 
error or timeout occurs, it should be OK to just fail the query - there seems 
to be no strong need for fault tolerance because the cost of the operation is 
low.
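The map-only scatter-gather above could be sketched roughly as follows. This 
is a hypothetical, simplified illustration using plain Java concurrency 
primitives (threads standing in for remote drillbits), not Drill's actual RPC 
or metadata APIs; all class and method names here are invented:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Sketch of map-only, non-fault-tolerant metadata collection: the foreman
// splits file paths into chunks, farms them out to workers (simulated here
// by a thread pool), and fails the whole query fast on any error or timeout
// instead of retrying - no fault tolerance, because the cost of a retry
// from scratch is low.
public class ScatterGatherSketch {

    // Stand-in for the per-file metadata a worker drillbit would return.
    record FileMetadata(String path, long rowCount) {}

    static List<FileMetadata> collectMetadata(List<String> files, int workers,
                                              long timeoutSec) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            // Split the work into roughly equal chunks, one per worker.
            int chunkSize = (files.size() + workers - 1) / workers;
            List<Future<List<FileMetadata>>> futures = new ArrayList<>();
            for (int i = 0; i < files.size(); i += chunkSize) {
                List<String> chunk =
                        files.subList(i, Math.min(i + chunkSize, files.size()));
                futures.add(pool.submit(() -> readChunk(chunk)));
            }
            // Gather: aggregate replies; any exception or timeout propagates
            // and simply fails the query.
            List<FileMetadata> result = new ArrayList<>();
            for (Future<List<FileMetadata>> f : futures) {
                result.addAll(f.get(timeoutSec, TimeUnit.SECONDS));
            }
            return result;
        } finally {
            pool.shutdownNow();
        }
    }

    // Placeholder "map" task: in Drill this would e.g. read Parquet footers.
    static List<FileMetadata> readChunk(List<String> chunk) {
        List<FileMetadata> out = new ArrayList<>();
        for (String path : chunk) {
            out.add(new FileMetadata(path, path.length())); // dummy row count
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        List<String> files = List.of("/data/a.parquet", "/data/b.parquet",
                                     "/data/c.parquet", "/data/d.parquet");
        List<FileMetadata> meta = collectMetadata(files, 2, 10);
        System.out.println(meta.size()); // prints 4
    }
}
```

In a real implementation the chunks would travel over a Drill-internal 
channel to remote drillbits rather than a local thread pool, but the 
control flow - scatter, gather, fail fast - would be the same.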

It was discussed that integrating this new interaction into the existing RPC 
comms code could represent a major challenge. If that turns out to be the 
case, I wonder whether creating a new communication channel for this and 
possibly other future purposes could be considered. It could still be more 
compact than introducing a brand new component and would preserve Drill's 
current operational simplicity - a single component/process type only, no 
centralization, no Hive/RDBMS dependency, etc.

I intended to create a small prototype for testing the potential performance 
gain as suggested by Paul (I am pretty sure it is significant, as this type 
of work is perfect for parallelism), but unfortunately have not had time for 
it so far...

 

> Drill Metadata management "Drill MetaStore"
> -------------------------------------------
>
>                 Key: DRILL-6552
>                 URL: https://issues.apache.org/jira/browse/DRILL-6552
>             Project: Apache Drill
>          Issue Type: New Feature
>          Components: Metadata
>    Affects Versions: 1.13.0
>            Reporter: Vitalii Diravka
>            Assignee: Vitalii Diravka
>            Priority: Major
>             Fix For: 2.0.0
>
>
> It would be useful for Drill to have some sort of metastore which would 
> enable Drill to remember previously defined schemata so Drill doesn't have 
> to do the same work over and over again.
> It would allow storing schema and statistics, which would accelerate query 
> validation, planning and execution. It would also increase the stability of 
> Drill and help avoid different kinds of issues: "schema change Exceptions", 
> "limit 0" optimization and so on.
> One of the main candidates is Hive Metastore.
> Starting from version 3.0, Hive Metastore can be a separate service from 
> the Hive server:
> [https://cwiki.apache.org/confluence/display/Hive/AdminManual+Metastore+3.0+Administration]
> An optional enhancement is storing Drill's profiles, UDFs and plugin 
> configs in some kind of metastore as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
