[jira] [Commented] (HCATALOG-42) Storing across partitions(Dynamic Partitioning) from HCatStorer/HCatOutputFormat

[email protected] (JIRA) Fri, 22 Jul 2011 09:10:26 -0700

    [ https://issues.apache.org/jira/browse/HCATALOG-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13069598#comment-13069598 ]


[email protected] commented on HCATALOG-42:
-------------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/1184/
-----------------------------------------------------------

Review request for hcatalog.


Summary
-------

HCATALOG-42 review request on behalf of Sushanth.


This addresses bug HCATALOG-42.
    https://issues.apache.org/jira/browse/HCATALOG-42


Diffs
-----

  trunk/build.xml 1149353
  trunk/src/java/org/apache/hcatalog/common/ErrorType.java 1149353
  trunk/src/java/org/apache/hcatalog/common/HCatConstants.java 1149353
  trunk/src/java/org/apache/hcatalog/common/HCatUtil.java 1149353
  trunk/src/java/org/apache/hcatalog/har/HarOutputCommitterPostProcessor.java PRE-CREATION
  trunk/src/java/org/apache/hcatalog/mapreduce/HCatBaseOutputCommitter.java 1149353
  trunk/src/java/org/apache/hcatalog/mapreduce/HCatBaseOutputFormat.java 1149353
  trunk/src/java/org/apache/hcatalog/mapreduce/HCatEximOutputCommitter.java 1149353
  trunk/src/java/org/apache/hcatalog/mapreduce/HCatEximOutputFormat.java 1149353
  trunk/src/java/org/apache/hcatalog/mapreduce/HCatOutputCommitter.java 1149353
  trunk/src/java/org/apache/hcatalog/mapreduce/HCatOutputCommitterPostProcessor.java PRE-CREATION
  trunk/src/java/org/apache/hcatalog/mapreduce/HCatOutputFormat.java 1149353
  trunk/src/java/org/apache/hcatalog/mapreduce/HCatOutputStorageDriver.java 1149353
  trunk/src/java/org/apache/hcatalog/mapreduce/HCatRecordWriter.java 1149353
  trunk/src/java/org/apache/hcatalog/mapreduce/HCatTableInfo.java 1149353
  trunk/src/java/org/apache/hcatalog/mapreduce/OutputJobInfo.java 1149353
  trunk/src/java/org/apache/hcatalog/pig/HCatEximStorer.java 1149353
  trunk/src/java/org/apache/hcatalog/pig/HCatStorer.java 1149353
  trunk/src/java/org/apache/hcatalog/pig/PigHCatUtil.java 1149353
  trunk/src/java/org/apache/hcatalog/rcfile/RCFileMapReduceOutputFormat.java 1149353
  trunk/src/java/org/apache/hcatalog/rcfile/RCFileOutputDriver.java 1149353
  trunk/src/test/org/apache/hcatalog/mapreduce/HCatMapReduceTest.java 1149353
  trunk/src/test/org/apache/hcatalog/mapreduce/TestHCatDynamicPartitioned.java PRE-CREATION
  trunk/src/test/org/apache/hcatalog/mapreduce/TestHCatEximInputFormat.java 1149353
  trunk/src/test/org/apache/hcatalog/mapreduce/TestHCatEximOutputFormat.java 1149353
  trunk/src/test/org/apache/hcatalog/mapreduce/TestHCatNonPartitioned.java 1149353
  trunk/src/test/org/apache/hcatalog/mapreduce/TestHCatOutputFormat.java 1149353
  trunk/src/test/org/apache/hcatalog/mapreduce/TestHCatPartitioned.java 1149353
  trunk/src/test/org/apache/hcatalog/pig/TestHCatStorer.java 1149353

Diff: https://reviews.apache.org/r/1184/diff


Testing
-------

Unit tests are included.


Thanks,

Ashutosh



> Storing across partitions(Dynamic Partitioning) from 
> HCatStorer/HCatOutputFormat
> --------------------------------------------------------------------------------
>
>                 Key: HCATALOG-42
>                 URL: https://issues.apache.org/jira/browse/HCATALOG-42
>             Project: HCatalog
>          Issue Type: Improvement
>    Affects Versions: 0.2
>            Reporter: Sushanth Sowmyan
>            Assignee: Sushanth Sowmyan
>             Fix For: 0.2
>
>         Attachments: HCATALOG-42.10.patch, hadoop_archive-0.3.1.jar
>
>
> HCatalog lets people abstract away underlying storage details and refer to
> data as tables and partitions. In this view, the storage abstraction is more
> about classifying how data is organized than about where it is stored. A
> user specifies the partitions to be stored and leaves it to HCatalog to
> figure out how and where to store them.
> When reading data, a user can specify that they are interested in reading
> from a table and give various partition key-value combinations to prune by,
> much like a SQL where clause. When writing, however, the abstraction is not
> so seamless. We still require the end user to write data to the table
> partition by partition. Each of these partitions requires fine-grained
> knowledge of its key-value pairs, we require this knowledge in advance, and
> we require the writer to have already grouped the data accordingly before
> attempting to store it.
> For example, the following Pig script illustrates this:
> --
> A = load 'raw' using HCatLoader();
> ...
> split Z into for_us if region == 'us', for_eu if region == 'eu', for_asia if
> region == 'asia';
> store for_us into 'processed' using HCatStorer('ds=20110110, region=us');
> store for_eu into 'processed' using HCatStorer('ds=20110110, region=eu');
> store for_asia into 'processed' using HCatStorer('ds=20110110,
> region=asia');
> --
> The major issue here is that MapReduce programs and Pig scripts need to be
> aware of all possible values of a key. That list must be maintained and
> updated whenever new values are introduced, which may not always be easy or
> even possible. With more partitions, scripts become cumbersome. And if each
> partition being written launches a separate HCatalog store, we increase the
> load on the HCatalog server and launch more jobs for the store, by a factor
> of the number of partitions.
> It would be preferable for HCatalog to figure out all the required
> partitions from the data being written, which would let us simplify the
> above script to the following:
> --
> A = load 'raw' using HCatLoader();
> ...
> store Z into 'processed' using HCatStorer('ds=20110110');
> --
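
The idea behind the simplified script can be sketched in a few lines of
Python. This is only an illustration of deriving partitions from the data,
not HCatalog's implementation: the function name group_by_dynamic_partition,
the record layout, and the sample values are all hypothetical. The caller
fixes the static key (ds) and the remaining key (region) is read from each
record.

```python
from collections import defaultdict


def group_by_dynamic_partition(records, static_partition, dynamic_keys):
    """Group records by their full partition spec.

    static_partition: key-values fixed by the caller (e.g. {"ds": "20110110"}).
    dynamic_keys: partition keys whose values are read from each record.
    Hypothetical helper for illustration only; not an HCatalog API.
    """
    groups = defaultdict(list)
    for record in records:
        spec = dict(static_partition)
        for key in dynamic_keys:
            # The partition value comes from the data itself, so the caller
            # never has to enumerate the possible values in advance.
            spec[key] = record[key]
        # A sorted tuple of items is hashable and independent of key order.
        groups[tuple(sorted(spec.items()))].append(record)
    return dict(groups)


# Made-up sample rows standing in for relation Z in the script above.
records = [
    {"user": "a", "region": "us"},
    {"user": "b", "region": "eu"},
    {"user": "c", "region": "us"},
    {"user": "d", "region": "asia"},
]

partitions = group_by_dynamic_partition(records, {"ds": "20110110"}, ["region"])
for spec, rows in sorted(partitions.items()):
    print(dict(spec), len(rows))
```

A new region value appearing in the data simply produces a new group, with
no change to the calling code, which is the property the issue asks for.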

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
