[ 
https://issues.apache.org/jira/browse/HCATALOG-42?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sushanth Sowmyan updated HCATALOG-42:
-------------------------------------

    Attachment: HCATALOG-42.10.patch

One more patch update:
 + isEmpty() check on partitionsToAdd to explicitly guard against empty writes (see the sketch below)
 + spelling correction from STRFROM to STRFORM
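For context, the guard is roughly of this shape. This is only an illustrative Java sketch; the class, interface and method names below are simplified stand-ins, not the actual patch.

--
import java.util.List;

// Minimal sketch of the empty-write guard; names here are simplified
// stand-ins, not the actual HCatalog code.
class CommitSketch {

  interface MetastoreClient {
    void addPartitions(List<String> partitionSpecs);
  }

  // Register only partitions that actually received data; if the job wrote
  // nothing, skip the metastore call instead of registering an empty list.
  static void registerPartitions(MetastoreClient client, List<String> partitionsToAdd) {
    if (partitionsToAdd.isEmpty()) {
      return; // empty write: nothing to add
    }
    client.addPartitions(partitionsToAdd);
  }
}
--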


> Storing across partitions(Dynamic Partitioning) from 
> HCatStorer/HCatOutputFormat
> --------------------------------------------------------------------------------
>
>                 Key: HCATALOG-42
>                 URL: https://issues.apache.org/jira/browse/HCATALOG-42
>             Project: HCatalog
>          Issue Type: Improvement
>    Affects Versions: 0.2
>            Reporter: Sushanth Sowmyan
>            Assignee: Sushanth Sowmyan
>             Fix For: 0.2
>
>         Attachments: HCATALOG-42.10.patch, hadoop_archive-0.3.1.jar
>
>
> HCatalog allows people to abstract away underlying storage details and refer 
> to data as tables and partitions. In this model, the storage abstraction is 
> about classifying how data is organized rather than about where it is stored. 
> A user specifies the partitions to be stored and leaves it to HCatalog to 
> figure out how and where to do so.
> When it comes to reading the data, a user can state that they are interested 
> in a table and give various partition key-value combinations to prune by, 
> much as in a SQL-like where clause. When it comes to writing, however, the 
> abstraction is not so seamless. We still require the end user to write data 
> to the table partition-by-partition, to know the exact key-value pairs for 
> each partition in advance, and to have already grouped the data accordingly 
> before attempting to store.
> The following Pig script illustrates this:
> --
> A = load 'raw' using HCatLoader(); 
> ... 
> split Z into for_us if region=='us', for_eu if region=='eu', for_asia if 
> region=='asia'; 
> store for_us into 'processed' using HCatStorer("ds=20110110, region=us"); 
> store for_eu into 'processed' using HCatStorer("ds=20110110, region=eu"); 
> store for_asia into 'processed' using HCatStorer("ds=20110110, region=asia"); 
> --
> This has a major issue: MapReduce programs and Pig scripts need to be aware 
> of all the possible values of a key, and that list has to be maintained and 
> modified whenever new values are introduced, which is not always easy or even 
> possible. With more partitions, the scripts become cumbersome. And since each 
> partition being written launches a separate HCatalog store, we increase the 
> load on the HCatalog server and multiply the number of jobs launched for the 
> store by the number of partitions.
> It would be far preferable if HCatalog could figure out all the required 
> partitions from the data being written, which would allow us to simplify the 
> above script to the following:
> --
> A = load 'raw' using HCatLoader(); 
> ... 
> store Z into 'processed' using HCatStorer("ds=20110110"); 
> --
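
For reference, the kind of per-record partition routing the description above asks for might look roughly like the Java sketch below. All names are illustrative stand-ins, not the actual HCatOutputFormat code.

--
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Rough sketch of dynamic partition routing: the partition is derived from the
// record itself, so the writer never has to enumerate partition values up front.
class DynamicPartitionSketch {

  interface RecordWriter {
    void write(Map<String, String> record);
  }

  interface WriterFactory {
    // e.g. opens a writer for "ds=20110110/region=us"
    RecordWriter newWriter(String partitionSpec);
  }

  private final List<String> partitionKeys;   // e.g. ["ds", "region"]
  private final WriterFactory factory;
  private final Map<String, RecordWriter> writers = new HashMap<String, RecordWriter>();

  DynamicPartitionSketch(List<String> partitionKeys, WriterFactory factory) {
    this.partitionKeys = partitionKeys;
    this.factory = factory;
  }

  // Build the partition spec from the record's own key values and route the
  // record to a writer created lazily for that partition.
  void write(Map<String, String> record) {
    StringBuilder spec = new StringBuilder();
    for (String key : partitionKeys) {
      if (spec.length() > 0) spec.append('/');
      spec.append(key).append('=').append(record.get(key));
    }
    RecordWriter writer = writers.get(spec.toString());
    if (writer == null) {
      writer = factory.newWriter(spec.toString());
      writers.put(spec.toString(), writer);
    }
    writer.write(record);
  }
}
--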

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
