[
https://issues.apache.org/jira/browse/HCATALOG-42?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sushanth Sowmyan updated HCATALOG-42:
-------------------------------------
Attachment: HCATALOG-42.10.patch
One more patch update:
+ isEmpty() check on partitionsToAdd to explicitly check for empty writes
+ spelling correction from STRFROM to STRFORM
> Storing across partitions (Dynamic Partitioning) from
> HCatStorer/HCatOutputFormat
> --------------------------------------------------------------------------------
>
> Key: HCATALOG-42
> URL: https://issues.apache.org/jira/browse/HCATALOG-42
> Project: HCatalog
> Issue Type: Improvement
> Affects Versions: 0.2
> Reporter: Sushanth Sowmyan
> Assignee: Sushanth Sowmyan
> Fix For: 0.2
>
> Attachments: HCATALOG-42.10.patch, hadoop_archive-0.3.1.jar
>
>
> HCatalog lets people abstract away underlying storage details and refer to
> data as tables and partitions. In this view, the storage abstraction is
> about classifying how data is organized rather than about where it is
> stored. A user specifies the partitions to be stored and leaves it to
> HCatalog to figure out how and where to store them.
> When reading data, a user can specify that they are interested in a table
> and give partition key-value combinations to prune by, much like a SQL
> where clause. When writing, however, the abstraction is not so seamless. We
> still require the end user to write data to the table
> partition-by-partition. Each partition requires fine-grained knowledge of
> its key-value pairs, we require that knowledge in advance, and we require
> the writer to have already grouped the data accordingly before attempting
> to store it.
> For example, the following Pig script illustrates this:
> --
> A = load 'raw' using HCatLoader();
> ...
> split Z into for_us if region == 'us', for_eu if region == 'eu', for_asia if
> region == 'asia';
> store for_us into 'processed' using HCatStorage("ds=20110110, region=us");
> store for_eu into 'processed' using HCatStorage("ds=20110110, region=eu");
> store for_asia into 'processed' using HCatStorage("ds=20110110, region=asia");
> --
> This has a major problem: MapReduce programs and Pig scripts need to be
> aware of all the possible values of a key, and that list must be maintained
> and updated whenever new values are introduced, which is not always easy or
> even possible. With more partitions, scripts become cumbersome. And if each
> partition being written launches a separate HCatalog store, we increase the
> load on the HCatalog server and launch more jobs for the store, by a factor
> of the number of partitions.
> It would be preferable for HCatalog to figure out all the required
> partitions from the data being written, which would allow us to simplify
> the above script into the following:
> --
> A = load 'raw' using HCatLoader();
> ...
> store Z into 'processed' using HCatStorage("ds=20110110");
> --
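
To make the proposal concrete, here is a minimal Python sketch of the idea of
deriving partitions from the data itself. It is not HCatalog code and does not
reflect the patch: the function name group_by_partition, its parameters, and
the sample rows are hypothetical and exist only for illustration. The statically
specified key (ds) is combined with the value of the dynamic key (region) found
in each record, so the writer never has to enumerate the regions up front.

```python
from collections import defaultdict


def group_by_partition(records, static_spec, dynamic_keys):
    """Group records by their full partition spec.

    static_spec holds the key-value pairs given by the writer (e.g. ds);
    dynamic_keys names the partition keys whose values are read from
    each record (e.g. region). Both names are hypothetical.
    """
    groups = defaultdict(list)
    for record in records:
        spec = dict(static_spec)
        for key in dynamic_keys:
            spec[key] = record[key]
        # A sorted tuple of (key, value) pairs is hashable and order-independent.
        groups[tuple(sorted(spec.items()))].append(record)
    return dict(groups)


# Illustrative data only; a real job would read these rows from the table.
rows = [
    {"region": "us", "value": 1},
    {"region": "eu", "value": 2},
    {"region": "us", "value": 3},
]

partitions = group_by_partition(rows, {"ds": "20110110"}, ["region"])
for spec, members in sorted(partitions.items()):
    print(dict(spec), len(members))
```

A new region value appearing in the data simply produces a new group, and an
empty input produces no partitions at all, which is the case a writer would
want to detect explicitly.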