Storing across partitions (Dynamic Partitioning) from HCatStorer/HCatOutputFormat
---------------------------------------------------------------------------------

                 Key: HCATALOG-42
                 URL: https://issues.apache.org/jira/browse/HCATALOG-42
             Project: HCatalog
          Issue Type: Improvement
    Affects Versions: 0.2
            Reporter: Sushanth Sowmyan


HCatalog lets users abstract away underlying storage details and refer to data 
as tables and partitions. Under this model, the storage abstraction is about 
classifying how data is organized rather than where it is stored. A user 
specifies the partitions to be stored and leaves it to HCatalog to figure out 
how and where to store them.

When reading, a user can specify that they are interested in a table and give 
various partition key-value combinations to prune on, much like a SQL where 
clause. When writing, however, the abstraction is not so seamless. We still 
require the end user to write data out to the table partition by partition. 
Each store must name the exact key-value pairs of its target partition, we 
require this knowledge in advance, and we require the writer to have already 
grouped the data accordingly before attempting to store.
For example, the following Pig script illustrates this:

--
A = load 'raw' using HCatLoader(); 
... 
split Z into for_us if region == 'us', for_eu if region == 'eu', for_asia if 
region == 'asia'; 
store for_us into 'processed' using HCatStorer('ds=20110110, region=us'); 
store for_eu into 'processed' using HCatStorer('ds=20110110, region=eu'); 
store for_asia into 'processed' using HCatStorer('ds=20110110, region=asia'); 
--

The major issue here is that MapReduce programs and Pig scripts need to be 
aware of every possible value of a key. That list has to be maintained and 
updated whenever new values are introduced, which is not always easy or even 
possible. With more partitions, scripts become cumbersome. And if each 
partition being written launches a separate HCatalog store, we increase the 
load on the HCatalog server and launch more jobs for the store, by a factor of 
the number of partitions.

It would be far preferable for HCatalog to figure out the required partitions 
from the data being written, which would allow us to simplify the above script 
into the following:

--
A = load 'raw' using HCatLoader(); 
... 
store Z into 'processed' using HCatStorer('ds=20110110'); 
--
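
To make the intended behaviour concrete, here is a rough sketch in Python of 
how a writer could derive the target partitions from the data itself. This is 
only an illustration of the idea, not HCatalog code: the function name 
split_by_dynamic_partitions, its parameters, and the dict-based records are 
all hypothetical. The writer fixes some partition keys up front (ds here) and 
the remaining keys (region) are read from each record.

```python
from collections import defaultdict


def split_by_dynamic_partitions(records, static_spec, partition_keys):
    """Group records by the full partition spec they belong to.

    records        -- iterable of dicts (hypothetical stand-in for rows)
    static_spec    -- key-values fixed by the writer, e.g. {"ds": "20110110"}
    partition_keys -- all partition keys of the table, in table order
    """
    # Keys the writer did not pin down must be taken from the data.
    dynamic_keys = [k for k in partition_keys if k not in static_spec]

    groups = defaultdict(list)
    for record in records:
        spec = dict(static_spec)
        for key in dynamic_keys:
            spec[key] = record[key]
        # A tuple of (key, value) pairs in table order identifies a partition.
        partition = tuple((k, spec[k]) for k in partition_keys)
        groups[partition].append(record)
    return dict(groups)


records = [
    {"region": "us", "value": 1},
    {"region": "eu", "value": 2},
    {"region": "us", "value": 3},
]
partitions = split_by_dynamic_partitions(
    records, {"ds": "20110110"}, ["ds", "region"]
)
for spec, rows in sorted(partitions.items()):
    print(", ".join(f"{k}={v}" for k, v in spec), "->", len(rows), "record(s)")
```

The point of the sketch is that the set of region values never appears in the 
calling code: a new value in the data simply yields a new partition, with no 
change to the script and a single store call.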


