[ 
https://issues.apache.org/jira/browse/HCATALOG-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13069651#comment-13069651
 ] 

[email protected] commented on HCATALOG-42:
-------------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/1184/#review1167
-----------------------------------------------------------



trunk/build.xml
<https://reviews.apache.org/r/1184/#comment2411>

    Instead of checking it into lib, is it possible to pull this lib through Ivy?



trunk/src/java/org/apache/hcatalog/common/HCatConstants.java
<https://reviews.apache.org/r/1184/#comment2412>

    Do we need these keys? Looks like you are propagating the token already 
acquired in the first job to the second one.



trunk/src/java/org/apache/hcatalog/common/HCatConstants.java
<https://reviews.apache.org/r/1184/#comment2413>

    Would it be better to externalize this property? Otherwise one has to 
recompile HCatalog to turn this off.
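    A minimal sketch of what externalizing the switch could look like. The 
property name hcat.har.enabled and the class are hypothetical, not the actual 
HCatalog API; in HCatalog the flag would come from the job's Configuration, but 
plain system properties keep the sketch self-contained. The point is that the 
value is read at runtime, so no recompile is needed to change it.

```java
public class ConfigSketch {
    // Hypothetical property name; in HCatalog this would be read from the
    // job's Configuration rather than baked in as a compile-time constant.
    static final String HAR_ENABLED_KEY = "hcat.har.enabled";

    static boolean harEnabled() {
        // Defaults to true; can be flipped with -Dhcat.har.enabled=false
        // without recompiling anything.
        return Boolean.parseBoolean(System.getProperty(HAR_ENABLED_KEY, "true"));
    }

    public static void main(String[] args) {
        System.out.println("har enabled: " + harEnabled());
    }
}
```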



trunk/src/java/org/apache/hcatalog/common/HCatUtil.java
<https://reviews.apache.org/r/1184/#comment2414>

    Do we need all these methods? It looks like you no longer need to acquire a 
token from the JT.



trunk/src/java/org/apache/hcatalog/common/HCatUtil.java
<https://reviews.apache.org/r/1184/#comment2415>

    It looks a little unnatural to do logging this way. We should take advantage 
of the logging library to accomplish these tasks.
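    A sketch of the guarded-logging idiom this suggests. The class and method 
names are illustrative only, and java.util.logging is used here just to keep 
the sketch self-contained; the same shape applies with whatever logging library 
the project standardizes on.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class LoggingSketch {
    private static final Logger LOG = Logger.getLogger(LoggingSketch.class.getName());

    // Let the logging framework decide whether the message is emitted,
    // instead of hand-rolling a debug flag and printing manually.
    static String logPartitionCount(int partitions) {
        String msg = "committing " + partitions + " partitions";
        if (LOG.isLoggable(Level.FINE)) {
            LOG.fine(msg);
        }
        return msg;
    }

    public static void main(String[] args) {
        System.out.println(logPartitionCount(3));
    }
}
```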



trunk/src/java/org/apache/hcatalog/har/HarOutputCommitterPostProcessor.java
<https://reviews.apache.org/r/1184/#comment2416>

    Commented-out code. Either remove it or uncomment it. I think these logging 
statements should be enabled.



trunk/src/java/org/apache/hcatalog/har/HarOutputCommitterPostProcessor.java
<https://reviews.apache.org/r/1184/#comment2417>

    Commented-out code.



trunk/src/java/org/apache/hcatalog/mapreduce/HCatOutputCommitter.java
<https://reviews.apache.org/r/1184/#comment2418>

    Looks like you are not using this map later on. We should probably remove 
it.



trunk/src/java/org/apache/hcatalog/mapreduce/HCatOutputCommitter.java
<https://reviews.apache.org/r/1184/#comment2419>

    This, I think, is because you don't have a list of base committers 
initially. But it may be possible to get that list from HCatRecordWriter at 
this point and then call commitTask() on them. Calling commitTask() from 
close() looks awkward; if there is a way, we should avoid that.



trunk/src/java/org/apache/hcatalog/mapreduce/HCatOutputCommitter.java
<https://reviews.apache.org/r/1184/#comment2420>

    Instead of special-casing for dynamic partitioning, we should always call 
baseCommitter.setupTask() here. In the dynamic-partition case this becomes a 
no-op here, and we can call it again later from write() to do the real work.
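    A minimal sketch of the delegation pattern being suggested. All class and 
method names here are illustrative, not the actual HCatalog API: setupTask() 
always delegates to whatever base committers exist; under dynamic partitioning 
none exist yet at setup time, so the call is naturally a no-op, and the real 
setup happens from write() when a partition's base committer is created.

```java
import java.util.ArrayList;
import java.util.List;

public class CommitterSketch {
    interface BaseCommitter { void setupTask(); }

    static class HCatCommitterSketch {
        private final List<BaseCommitter> baseCommitters = new ArrayList<>();
        final List<String> calls = new ArrayList<>();

        // No special-casing: with dynamic partitions the list is empty at
        // setup time, so delegating is naturally a no-op.
        void setupTask() {
            for (BaseCommitter c : baseCommitters) {
                c.setupTask();
            }
        }

        // Under dynamic partitioning the base committer for a partition is
        // created lazily on first write, and set up at that point.
        void write(String partition) {
            BaseCommitter c = () -> calls.add("setupTask:" + partition);
            baseCommitters.add(c);
            c.setupTask();
        }
    }

    public static void main(String[] args) {
        HCatCommitterSketch hc = new HCatCommitterSketch();
        hc.setupTask();        // no base committers yet: no-op
        hc.write("region=us"); // real setup happens here
        System.out.println(hc.calls); // [setupTask:region=us]
    }
}
```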



trunk/src/java/org/apache/hcatalog/mapreduce/HCatOutputCommitter.java
<https://reviews.apache.org/r/1184/#comment2421>

    Can you explain the purpose of this boolean? Looks like it can be derived 
from existing variables.



trunk/src/java/org/apache/hcatalog/mapreduce/HCatOutputCommitter.java
<https://reviews.apache.org/r/1184/#comment2422>

    Looks like this isn't needed anymore.



trunk/src/java/org/apache/hcatalog/mapreduce/HCatOutputCommitter.java
<https://reviews.apache.org/r/1184/#comment2423>

    Why do we depend on this variable? Shouldn't we always do the fs.exists() check?



trunk/src/java/org/apache/hcatalog/mapreduce/HCatOutputCommitterPostProcessor.java
<https://reviews.apache.org/r/1184/#comment2424>

    I wonder if we are too early in introducing this interface. There is only 
one implementation, which is always called (in the dynamic-partition case). I 
think we should introduce interfaces only when there is a known use case for 
them. Furthermore, when other use cases do come up, there is a good likelihood 
that we will need to tweak the interface to accommodate them.



trunk/src/java/org/apache/hcatalog/pig/HCatStorer.java
<https://reviews.apache.org/r/1184/#comment2425>

    Looks like some of these are not required.



trunk/src/java/org/apache/hcatalog/pig/PigHCatUtil.java
<https://reviews.apache.org/r/1184/#comment2426>

    Regarding retrieve..: do you want to rename it to getFromUDF..?



trunk/src/test/org/apache/hcatalog/pig/TestHCatStorer.java
<https://reviews.apache.org/r/1184/#comment2427>

    In all of these test cases, you should also assert the metadata, by 
retrieving it from the metastore and verifying that it matches what you expect.
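    A sketch of the kind of assertion meant here. The helper and class names 
are hypothetical, and the metastore lookup is stubbed with a plain Map so the 
sketch is self-contained; in the real tests the actual partition spec would 
come from the metastore client after the store completes.

```java
import java.util.Map;

public class MetadataAssertSketch {
    // Compare the partition spec the test expects against the one actually
    // recorded in the metastore (stubbed here as a Map).
    static void assertPartitionSpec(Map<String, String> expected,
                                    Map<String, String> actual) {
        if (!expected.equals(actual)) {
            throw new AssertionError("partition spec mismatch: expected "
                + expected + " but metastore returned " + actual);
        }
    }

    public static void main(String[] args) {
        Map<String, String> expected = Map.of("ds", "20110110", "region", "us");
        Map<String, String> fromMetastore = Map.of("ds", "20110110", "region", "us");
        assertPartitionSpec(expected, fromMetastore);
        System.out.println("metadata verified");
    }
}
```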


- Ashutosh


On 2011-07-22 16:09:41, Ashutosh Chauhan wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/1184/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2011-07-22 16:09:41)
bq.  
bq.  
bq.  Review request for hcatalog.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  HCatalog-42 review request on behalf of Sushanth
bq.  
bq.  
bq.  This addresses bug HCATALOG-42.
bq.      https://issues.apache.org/jira/browse/HCATALOG-42
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    trunk/build.xml 1149353 
bq.    trunk/src/java/org/apache/hcatalog/common/ErrorType.java 1149353 
bq.    trunk/src/java/org/apache/hcatalog/common/HCatConstants.java 1149353 
bq.    trunk/src/java/org/apache/hcatalog/common/HCatUtil.java 1149353 
bq.    trunk/src/java/org/apache/hcatalog/har/HarOutputCommitterPostProcessor.java PRE-CREATION 
bq.    trunk/src/java/org/apache/hcatalog/mapreduce/HCatBaseOutputCommitter.java 1149353 
bq.    trunk/src/java/org/apache/hcatalog/mapreduce/HCatBaseOutputFormat.java 1149353 
bq.    trunk/src/java/org/apache/hcatalog/mapreduce/HCatEximOutputCommitter.java 1149353 
bq.    trunk/src/java/org/apache/hcatalog/mapreduce/HCatEximOutputFormat.java 1149353 
bq.    trunk/src/java/org/apache/hcatalog/mapreduce/HCatOutputCommitter.java 1149353 
bq.    trunk/src/java/org/apache/hcatalog/mapreduce/HCatOutputCommitterPostProcessor.java PRE-CREATION 
bq.    trunk/src/java/org/apache/hcatalog/mapreduce/HCatOutputFormat.java 1149353 
bq.    trunk/src/java/org/apache/hcatalog/mapreduce/HCatOutputStorageDriver.java 1149353 
bq.    trunk/src/java/org/apache/hcatalog/mapreduce/HCatRecordWriter.java 1149353 
bq.    trunk/src/java/org/apache/hcatalog/mapreduce/HCatTableInfo.java 1149353 
bq.    trunk/src/java/org/apache/hcatalog/mapreduce/OutputJobInfo.java 1149353 
bq.    trunk/src/java/org/apache/hcatalog/pig/HCatEximStorer.java 1149353 
bq.    trunk/src/java/org/apache/hcatalog/pig/HCatStorer.java 1149353 
bq.    trunk/src/java/org/apache/hcatalog/pig/PigHCatUtil.java 1149353 
bq.    trunk/src/java/org/apache/hcatalog/rcfile/RCFileMapReduceOutputFormat.java 1149353 
bq.    trunk/src/java/org/apache/hcatalog/rcfile/RCFileOutputDriver.java 1149353 
bq.    trunk/src/test/org/apache/hcatalog/mapreduce/HCatMapReduceTest.java 1149353 
bq.    trunk/src/test/org/apache/hcatalog/mapreduce/TestHCatDynamicPartitioned.java PRE-CREATION 
bq.    trunk/src/test/org/apache/hcatalog/mapreduce/TestHCatEximInputFormat.java 1149353 
bq.    trunk/src/test/org/apache/hcatalog/mapreduce/TestHCatEximOutputFormat.java 1149353 
bq.    trunk/src/test/org/apache/hcatalog/mapreduce/TestHCatNonPartitioned.java 1149353 
bq.    trunk/src/test/org/apache/hcatalog/mapreduce/TestHCatOutputFormat.java 1149353 
bq.    trunk/src/test/org/apache/hcatalog/mapreduce/TestHCatPartitioned.java 1149353 
bq.    trunk/src/test/org/apache/hcatalog/pig/TestHCatStorer.java 1149353 
bq.  
bq.  Diff: https://reviews.apache.org/r/1184/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  Unit tests are included
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Ashutosh
bq.  
bq.



> Storing across partitions(Dynamic Partitioning) from 
> HCatStorer/HCatOutputFormat
> --------------------------------------------------------------------------------
>
>                 Key: HCATALOG-42
>                 URL: https://issues.apache.org/jira/browse/HCATALOG-42
>             Project: HCatalog
>          Issue Type: Improvement
>    Affects Versions: 0.2
>            Reporter: Sushanth Sowmyan
>            Assignee: Sushanth Sowmyan
>             Fix For: 0.2
>
>         Attachments: HCATALOG-42.10.patch, hadoop_archive-0.3.1.jar
>
>
> HCatalog allows people to abstract away underlying storage details and refer 
> to data as tables and partitions. In this model, the storage abstraction is 
> about classifying how data is organized rather than about where it is stored. 
> A user thus specifies the partitions to be stored and leaves it to HCatalog 
> to figure out how and where to do so.
> When it comes to reading the data, a user can specify that they are 
> interested in reading from the table, along with partition key-value 
> combinations to prune, much as in a SQL-like WHERE clause. When it comes to 
> writing, however, the abstraction is not so seamless. We still require the 
> end user to write data out to the table partition by partition. These 
> partitions require fine-grained knowledge of their key-value pairs, we 
> require that knowledge in advance, and we require the writer to have already 
> grouped the data accordingly before attempting to store.
> For example, the following pig script illustrates this:
> --
> A = load 'raw' using HCatLoader();
> ...
> split Z into for_us if region=='us', for_eu if region=='eu', for_asia if region=='asia';
> store for_us into 'processed' using HCatStorage("ds=20110110, region=us");
> store for_eu into 'processed' using HCatStorage("ds=20110110, region=eu");
> store for_asia into 'processed' using HCatStorage("ds=20110110, region=asia");
> --
> This has a major issue: MapReduce programs and Pig scripts need to be aware 
> of all possible values of a key, and that knowledge must be maintained and 
> modified whenever new values are introduced, which may not always be easy or 
> even possible. With more partitions, scripts begin to look cumbersome. And 
> if each partition being written launches a separate HCatalog store, we 
> increase the load on the HCatalog server and multiply the jobs launched for 
> the store by a factor of the number of partitions.
> It would be far preferable if HCatalog could figure out all the required 
> partitions from the data being written, which would let us simplify the 
> above script to the following:
> --
> A = load 'raw' using HCatLoader(); 
> ... 
> store Z into 'processed' using HCatStorage("ds=20110110"); 
> --

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
