Notes from HCatalog meetup held October 18, 2011 at Hortonworks.

We discussed three main issues:

1) The new Container interfaces in the package org.apache.hcatalog.mapreduce.
It was agreed that these interfaces are not yet well defined.  We need a
proposal for how HCatalog can use these interfaces to determine whether a
given storage mechanism supports a given feature.  It was also agreed that we
need equivalent interfaces on the input side.

2) How SerDes and StorageDrivers interact.  Some advocated that we remove
StorageDrivers and just use SerDes.  There was concern that SerDe is a
complicated interface and that relying on it alone would make it hard for
users to write connector code for HCatalog.  It was suggested that we could
make StorageDriver implement SerDe, or provide helper classes that make it
easy to write SerDes for simple storage mechanisms.  A concrete proposal on
how to reconcile the overlap between these two interfaces is needed.

3) The desire to store additional types of metadata in HCatalog.  People have 
expressed interest in storing partition level statistics, tags for a partition, 
and lineage/provenance data in HCatalog.  We have traditionally been concerned 
that this will lead to scaling issues for a MySQL server (in terms of data size 
and potentially in terms of server response time).  We discussed the 
possibility of storing this type of data in HBase while keeping the core 
metadata in an RDBMS.  We also discussed the possibility of storing all of the
metadata in HBase, since DataNucleus supports HBase.  However, the DataNucleus
support for HBase is marked as experimental, and their website explicitly says
it should not yet be used in production.  There was also concern that HBase is
not yet stable enough to store metadata without risk of loss.  Finally, it is
unclear what tools are available to perform operations such as backup and
restore on HBase.


Alan.
