Notes from HCatalog meetup held October 18, 2011 at Hortonworks. We discussed three main issues:
1) The new Container interfaces in the package org.apache.hcatalog.mapreduce. It was agreed that these interfaces are not yet well defined. We need a proposal for how HCatalog can use them to determine whether a given storage mechanism supports a given feature. It was also agreed that we need equivalent interfaces on the input side.

2) How SerDes and StorageDrivers interact. Some advocated that we remove StorageDrivers and just use SerDes. There was concern that SerDe is a complicated interface and that this would make it hard for users to write connector code for HCatalog. It was suggested that we could make StorageDriver implement SerDe, or provide helper classes that make it easy to write SerDes for simple storage mechanisms. A concrete proposal on how to reconcile the overlap between these two technologies is needed.

3) The desire to store additional types of metadata in HCatalog. People have expressed interest in storing partition-level statistics, tags for a partition, and lineage/provenance data in HCatalog. We have traditionally been concerned that this will lead to scaling issues for a MySQL server (in terms of data size and potentially in terms of server response time). We discussed the possibility of storing this type of data in HBase while keeping the core metadata in an RDBMS. We also discussed the possibility of storing all of the metadata in HBase, since DataNucleus supports HBase. However, the DataNucleus support for HBase is marked as experimental, and their website explicitly says it should not yet be used in production. There was also concern that HBase is not yet stable enough to store metadata without risk of loss. In addition, the availability of tools for operations such as backup and restore on HBase is unclear.

Alan
