Discussion about the metadata design

weijie tong Thu, 28 Jun 2018 08:46:11 -0700

Hi all,

    As @aman pointed out to me, the DRILL-2.0 roadmap includes a
description of the metadata design (
https://lists.apache.org/thread.html/74cf48dd78d323535dc942c969e72008884e51f8715f4a20f6f8fb66@%3Cdev.drill.apache.org%3E).
I am interested in taking on the implementation of the metadata part, and
I am starting this discussion thread to hear your ideas about it.


    I have looked into several open source metadata projects, such as
Hive Metastore (
https://cwiki.apache.org/confluence/display/Hive/Design#Design-Metastore),
Netflix Metacat, Apache Atlas and LinkedIn WhereHows (
https://github.com/linkedin/WhereHows). Except for Hive Metastore, these
projects define a highly abstract model on top of the actual physical
metadata, which makes it easier to extend them with new metadata
properties. Hive Metastore's design models the physical metadata directly
and exposes a Thrift interface for different languages, but it depends on
a relational database, which is not good for scalability and performance.
In my opinion, we should use Hive Metastore as our design template, or
simply reuse it, since we don't need a rich metadata management system.
Maybe we should change the backend database to a KV store with high query
performance, such as HBase.
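
To make the idea concrete, here is a minimal sketch of a metadata
interface that is separated from its backing KV store, so that a
relational backend could later be swapped for something like HBase. All
names here (TableMetadata, KVBackend, InMemoryBackend, MetadataStore) and
the "schema/table" key layout are my own assumptions for illustration;
they are not part of Drill or Hive Metastore, and the in-memory backend
only stands in for a real store.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class TableMetadata:
    """Physical metadata of one table (hypothetical, simplified model)."""
    schema: str
    name: str
    columns: Dict[str, str]  # column name -> type name
    properties: Dict[str, str] = field(default_factory=dict)  # e.g. statistics


class KVBackend(ABC):
    """Storage abstraction; a real implementation might wrap a KV store."""

    @abstractmethod
    def put(self, key: str, value: TableMetadata) -> None: ...

    @abstractmethod
    def get(self, key: str) -> Optional[TableMetadata]: ...

    @abstractmethod
    def scan(self, prefix: str) -> List[TableMetadata]: ...


class InMemoryBackend(KVBackend):
    """Stand-in backend that keeps everything in a dict."""

    def __init__(self) -> None:
        self._data: Dict[str, TableMetadata] = {}

    def put(self, key: str, value: TableMetadata) -> None:
        self._data[key] = value

    def get(self, key: str) -> Optional[TableMetadata]:
        return self._data.get(key)

    def scan(self, prefix: str) -> List[TableMetadata]:
        # Return values in key order, as a sorted KV store would on a prefix scan.
        return [v for k, v in sorted(self._data.items()) if k.startswith(prefix)]


class MetadataStore:
    """Metadata interface that only talks to the backend abstraction."""

    def __init__(self, backend: KVBackend) -> None:
        self._backend = backend

    @staticmethod
    def _key(schema: str, name: str) -> str:
        # Assumed key layout: "<schema>/<table>", so one schema is one prefix.
        return f"{schema}/{name}"

    def register_table(self, table: TableMetadata) -> None:
        self._backend.put(self._key(table.schema, table.name), table)

    def get_table(self, schema: str, name: str) -> Optional[TableMetadata]:
        return self._backend.get(self._key(schema, name))

    def list_tables(self, schema: str) -> List[str]:
        return [t.name for t in self._backend.scan(f"{schema}/")]
```

With this split, listing the tables of a schema becomes a prefix scan,
which is the access pattern a sorted KV store handles well.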

   Besides the metadata interface design and the choice of backend
storage, we should also provide an ad hoc query capability, so that users
can calculate statistics such as NDV (number of distinct values) and store
them in the metadata. By the way, maybe we can go further and integrate
VerdictDB (https://github.com/mozafari/verdictdb) to provide richer
approximate query processing.
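
As an illustration of the kind of statistic I mean, here is a small
sketch that computes NDV both exactly and approximately. The approximate
version uses a K-minimum-values (KMV) estimator, which I picked only as an
example technique; it is not how Drill or VerdictDB compute NDV, and the
function names and the default k=256 are assumptions.

```python
import hashlib
import heapq
from typing import Iterable


def _hash01(value) -> float:
    """Map a value to a pseudo-uniform float in [0, 1) via SHA-1."""
    digest = hashlib.sha1(repr(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def exact_ndv(values: Iterable) -> int:
    """Exact number of distinct values; memory grows with the NDV itself."""
    return len(set(values))


def approx_ndv(values: Iterable, k: int = 256) -> int:
    """KMV estimate: keep the k smallest distinct hashes, then use
    (k - 1) / (k-th smallest hash) as the NDV estimate."""
    heap = []        # max-heap (negated hashes) of the k smallest hashes seen
    members = set()  # hashes currently in the heap, to skip duplicates
    for v in values:
        h = _hash01(v)
        if h in members:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            members.add(h)
        elif h < -heap[0]:
            # Replace the current largest of the k smallest hashes.
            evicted = -heapq.heappushpop(heap, -h)
            members.discard(evicted)
            members.add(h)
    if len(heap) < k:
        return len(heap)  # fewer than k distinct hashes: the count is exact
    return int((k - 1) / -heap[0])


# The result could then be stored as a column-level property in the metadata.
column_stats = {"ndv": str(approx_ndv(["a", "b", "a", "c"]))}
```

The approximate version needs memory proportional to k rather than to the
number of distinct values, at the cost of a relative error of roughly
1/sqrt(k).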
