Hello,

I have a question about the focus of the HCatalog project, as I am not sure
whether this is the right place to look.

My task is to solve problems of project-specific metadata handling and
data life cycles. For a research project we
have a collaborative wiki solution in which we edit our dataset descriptions
and the procedure documentation for data analysis and data preparation. That
means we do not just have data for different time periods; we also use
different algorithms to aggregate or filter the data into different shapes
for later comparison.

One possible solution would be to write well-documented Hive or Pig
scripts to do the work, but then we would have to keep track of all the
scripts, and over time that becomes unmanageable...
So the question is whether we could map the descriptions in our
documentation system directly to metadata in Hive (I am not sure whether Pig
has such metadata as well), or whether the HCatalog project would be the
right place to link our documentation workspace to.
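
To make that idea a bit more concrete, here is a minimal HiveQL sketch of
what I have in mind, assuming the wiki descriptions could be exported as
simple key-value pairs (the table name, property keys, and URL are only
hypothetical examples, not anything that exists today):

  -- Hypothetical table holding one aggregated "shape" of the data,
  -- with descriptive metadata attached as table properties.
  CREATE TABLE sessions_aggregated (
    user_id  STRING,
    sessions INT
  )
  TBLPROPERTIES (
    'doc.wiki.page'    = 'https://wiki.example.org/datasets/sessions_aggregated',
    'doc.source.table' = 'raw_logs_2012',
    'doc.algorithm'    = 'aggregation_v2'
  );

  -- Provenance details added after a processing run.
  ALTER TABLE sessions_aggregated SET TBLPROPERTIES (
    'doc.created.at' = '2012-06-01T14:30:00',
    'doc.created.on' = 'cluster-node-07'
  );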

Did I understand the aim of HCatalog correctly: it is a toolset that provides
fluent interaction between data sources and several processing systems
(Pig, Hive, MR), and it is not a tool for storing metadata (e.g. by what
tool was a dataset created, from what raw dataset, at what time, on what
machine)?

For a programmer these questions might not be very interesting, but for
someone who wants to optimize business use cases it would be helpful to have
such metadata generated by the script or job. Based on this metadata we could
compare cluster simulation results to real-world (meta)data.

Is something like this already known, or would this be a good point to start
such a project, based on our (semi-)manual experience with data life cycle
tools?

Best wishes,

Mirko
