[
https://issues.apache.org/jira/browse/DRILL-6552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16620742#comment-16620742
]
Oleksandr Kalinin commented on DRILL-6552:
------------------------------------------
[~vitalii] [~paul-rogers]
Summarizing some ideas from recent related discussions in mailing list with
your input.
Part of the Drill MS motivation is metadata handling optimization (e.g. "Avoid
stage of discovering metadata for every query during execution", "Produce
different kinds of validation before execution stage", "Decrease planning time
spent for collecting information about files and partitions"). That is not the
entire scope, but it seems to be the bulk of it. The scalability/performance
issue around metadata appears to be largely caused by metadata collection and
processing being done by a single drillbit (the foreman) in a non-distributed
fashion. If the metadata collection and validation work could be distributed
across the cluster nodes, it would become horizontally scalable and the issue
could (hopefully) be eliminated.
Some ideas on distributing that work were discussed in the mailing list. IMHO a
simplistic form of work distribution could be sufficient to tackle this
particular issue: a map-only, non-fault-tolerant variant of map-reduce. In
simple terms: split the work into chunks, send them to all or some active
drillbits for processing, collect/aggregate the replies, and proceed. If any
interim error or timeout occurs, it should be OK to just fail the query - there
seems to be no strong need for fault tolerance because the cost of the
operation is low.
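To make the scatter/gather idea concrete, here is a minimal sketch in Java of the map-only, fail-fast pattern described above. It uses a thread pool as a stand-in for remote drillbits; the class name, the `collectChunk` work function, and the per-file "stat" value are all illustrative assumptions, not Drill APIs. The point is the control flow: chunk the file list, fan the chunks out, merge the replies, and fail the whole operation on any error or timeout rather than retrying.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch only - not Drill code. The pool stands in for remote
// drillbits; real work distribution would go over an RPC channel.
public class DistributedMetadataSketch {

    // Per-node "map" step: collect some metadata for each file in the chunk.
    // Here the file name length stands in for real stats (row counts, schema).
    static Map<String, Long> collectChunk(List<String> files) {
        Map<String, Long> stats = new HashMap<>();
        for (String f : files) {
            stats.put(f, (long) f.length());
        }
        return stats;
    }

    // Foreman side: split the work into chunks, scatter, gather, fail fast.
    static Map<String, Long> collectAll(List<String> files, int workers,
                                        long timeoutSec) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            int chunkSize = (files.size() + workers - 1) / workers;
            List<Callable<Map<String, Long>>> tasks = new ArrayList<>();
            for (int i = 0; i < files.size(); i += chunkSize) {
                List<String> chunk =
                        files.subList(i, Math.min(i + chunkSize, files.size()));
                tasks.add(() -> collectChunk(chunk));
            }
            Map<String, Long> merged = new HashMap<>();
            // invokeAll cancels tasks still running after the timeout; get()
            // then throws, so any failed or timed-out chunk fails the query.
            for (Future<Map<String, Long>> f :
                    pool.invokeAll(tasks, timeoutSec, TimeUnit.SECONDS)) {
                merged.putAll(f.get());
            }
            return merged;
        } finally {
            pool.shutdownNow();
        }
    }
}
```

With no retries and no partial-result handling, the coordinator logic stays tiny, which is exactly why the non-fault-tolerant variant seems sufficient here: the cost of re-running the whole collection on failure is low.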
It was discussed that integrating this new interaction into the existing RPC
comms code could represent a major challenge. If that turns out to be the case,
I wonder whether a new communication channel could be created for this and
possibly other future purposes. That would still be more compact than
introducing a brand new component, and it would preserve Drill's current
operational simplicity - a single component/process type only, no
centralization, no Hive/RDBMS dependency, etc.
I intended to create a small prototype for testing the potential performance
gain, as suggested by Paul (I am pretty sure it is significant, as this type of
work is perfect for parallelism), but unfortunately have not had time so far...
> Drill Metadata management "Drill MetaStore"
> -------------------------------------------
>
> Key: DRILL-6552
> URL: https://issues.apache.org/jira/browse/DRILL-6552
> Project: Apache Drill
> Issue Type: New Feature
> Components: Metadata
> Affects Versions: 1.13.0
> Reporter: Vitalii Diravka
> Assignee: Vitalii Diravka
> Priority: Major
> Fix For: 2.0.0
>
>
> It would be useful for Drill to have some sort of metastore which would
> enable Drill to remember previously defined schemata so Drill doesn’t have to
> do the same work over and over again.
> It would store schema and statistics, which would accelerate query
> validation, planning and execution. It would also increase Drill's stability
> and help avoid various kinds of issues: "schema change Exceptions", "limit 0"
> optimization and so on.
> One of the main candidates is Hive Metastore.
> Starting from version 3.0, the Hive Metastore can run as a service separate
> from the Hive server:
> [https://cwiki.apache.org/confluence/display/Hive/AdminManual+Metastore+3.0+Administration]
> An optional enhancement is storing Drill's profiles, UDFs, and plugin configs
> in some kind of metastore as well.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)