Hi all, @aman told me earlier about the roadmap for Drill 2.0, which includes a description of the metadata design ( https://lists.apache.org/thread.html/74cf48dd78d323535dc942c969e72008884e51f8715f4a20f6f8fb66@%3Cdev.drill.apache.org%3E ). I am interested in taking on the implementation of the metadata part, so I am starting this discussion thread to hear your ideas on the problem.
I have looked into several open source metadata projects, such as Hive Metastore ( https://cwiki.apache.org/confluence/display/Hive/Design#Design-Metastore ), Netflix Metacat, Apache Atlas, and LinkedIn WhereHows ( https://github.com/linkedin/WhereHows ). Except for Hive Metastore, these projects define a high-level abstraction over the actual physical metadata, which makes it easier to extend them with new metadata properties. Hive Metastore's design models the physical metadata directly and exposes a Thrift interface for different languages, but it depends on a relational database, which is not good for scalability and performance. In my opinion, we should use Hive Metastore as our design template, or simply reuse it, since we don't need a rich metadata management system. Maybe we should change the backend database to a key-value store with high query performance, such as HBase. Besides the metadata interface design and the choice of backend storage, we should also provide a random query capability, so that users can calculate statistics such as NDV (number of distinct values) and store them in the metadata. By the way, maybe we could go further and bring in VerdictDB ( https://github.com/mozafari/verdictdb ) to provide richer approximate query processing.
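To make the split between the metadata interface and the backend storage more concrete, here is a rough sketch in Python. It is only an illustration for discussion: all names in it (KVStore, InMemoryKVStore, MetadataStore, sample_ndv, the key layout) are hypothetical and are not Drill or Hive Metastore APIs. The in-memory dictionary stands in for a KV store such as HBase, and the NDV is simply counted on a random sample, which only gives a lower bound on the true value.

```python
from abc import ABC, abstractmethod
import json
import random


class KVStore(ABC):
    """Hypothetical backend abstraction; an HBase-backed class could implement it."""

    @abstractmethod
    def put(self, key, value):
        ...

    @abstractmethod
    def get(self, key):
        ...


class InMemoryKVStore(KVStore):
    """Dict-backed stand-in for a real KV store, used only for this illustration."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        # Returns None when the key is absent.
        return self._data.get(key)


class MetadataStore:
    """Hypothetical metadata interface that is independent of the backend."""

    def __init__(self, backend):
        self._backend = backend

    @staticmethod
    def _key(schema, table, column):
        # Assumed key layout: one entry per column's statistics.
        return f"{schema}/{table}/{column}/stats"

    def put_column_stats(self, schema, table, column, stats):
        payload = json.dumps(stats).encode("utf-8")
        self._backend.put(self._key(schema, table, column), payload)

    def get_column_stats(self, schema, table, column):
        raw = self._backend.get(self._key(schema, table, column))
        return None if raw is None else json.loads(raw.decode("utf-8"))


def sample_ndv(values, sample_size, seed=0):
    """Count distinct values in a uniform random sample.

    This is a lower bound on the true NDV; it is exact only when the
    sample covers all values.
    """
    rng = random.Random(seed)
    if sample_size >= len(values):
        sample = values
    else:
        sample = rng.sample(values, sample_size)
    return len(set(sample))


# Usage: compute a statistic and store it in the metadata.
store = MetadataStore(InMemoryKVStore())
values = [i % 10 for i in range(1000)]
ndv = sample_ndv(values, sample_size=len(values))
store.put_column_stats("dfs", "orders", "region_id", {"ndv": ndv})
```

With this kind of split, changing the backend from a relational database to a KV store would only mean providing another KVStore implementation, while the metadata interface stays the same.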