Hello, I have a question about the focus of the HCatalog project, as I am not sure whether I am looking in the right place here.
My task is to solve problems of project-specific metadata handling and data life cycles. For a research project we have a collaborative wiki solution in which we edit our dataset descriptions and the procedure documentation for data analysis and data preparation. That means we do not just have data for different time periods; we also use different algorithms to aggregate or filter the data into different shapes for later comparison.

One possible solution would be to write well-documented Hive or Pig scripts to do the work, but then we would have to track all of those scripts, and over time our heads would explode... So the question is whether we could map the descriptions in our documentation system directly to metadata in Hive (I am not sure whether Pig has such metadata as well), or whether the HCatalog project would be the right place to link our documentation workspace to; a small sketch of what I mean follows at the end of this mail.

Did I understand the aim of HCatalog correctly: it is a toolset that provides fluent interaction between data sources and several processing systems (Pig, Hive, MapReduce), and it is not a tool for storing metadata (e.g. by what tool a dataset was created, from what raw dataset, at what time, on what machine)? For a programmer these questions might not be so interesting, but when one wants to optimize business use cases it would be helpful to have such metadata generated by the script or job. Based on this metadata we could compare cluster simulation results to real-world (meta)data.

Is there something like this already known, or would it be a good starting point for such a project, based on our (semi-)manual experience with data life cycle tools?
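To make this concrete, here is a rough sketch of the kind of metadata I have in mind, expressed as Hive table properties (the table name, property keys, and values are all invented for illustration; I do not know yet whether this is the intended way to use them):

  -- create a derived dataset and attach provenance as table properties
  CREATE TABLE sessions_2012q1_aggregated (
    user_id        BIGINT,
    session_count  INT
  )
  COMMENT 'sessions aggregated with the 30-minute-timeout algorithm'
  TBLPROPERTIES (
    'source_dataset'  = 'raw_clickstream_2012q1',
    'created_by'      = 'aggregate_sessions.hql',
    'algorithm'       = 'session window, 30 min timeout',
    'created_on_host' = 'cluster-node-07',
    'wiki_page'       = 'https://wiki.example.org/datasets/sessions_aggregated'
  );

  -- properties could also be added after the job has run
  ALTER TABLE sessions_2012q1_aggregated
    SET TBLPROPERTIES ('run_time_seconds' = '842');

Ideally these properties would be generated by the script or job itself rather than maintained by hand, and kept in sync with the corresponding wiki pages.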
Best wishes, Mirko