[jira] [Commented] (HCATALOG-42) Storing across partitions(Dynamic Partitioning) from HCatStorer/HCatOutputFormat

[email protected] (JIRA) Fri, 22 Jul 2011 11:57:30 -0700

    [ 
https://issues.apache.org/jira/browse/HCATALOG-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13069677#comment-13069677
 ]

[email protected] commented on HCATALOG-42:
-------------------------------------------------------

bq.  On 2011-07-22 18:01:31, Ashutosh Chauhan wrote:
bq.  > trunk/build.xml, line 110
bq.  > <https://reviews.apache.org/r/1184/diff/1/?file=26836#file26836line110>
bq.  >
bq.  >     Instead of checking it in lib, is it possible to pull this lib 
through ivy ?

I don't know if this is exported by any particular artifact. It was part of 
hadoop-core, but as of 0.20.203/CDH3, it is not present there. Back in hadoop 
trunk and 0.20.204 onwards, it's supposed to have made a comeback. For now, 
better to keep it here for now and then pull it out once the primary hadoop 
version we depend on has it, I think.

bq.  On 2011-07-22 18:01:31, Ashutosh Chauhan wrote:
bq.  > trunk/src/java/org/apache/hcatalog/common/HCatConstants.java, lines 67-69
bq.  > <https://reviews.apache.org/r/1184/diff/1/?file=26838#file26838line67>
bq.  >
bq.  >     Do we need these keys? Looks like you are propagating the token 
already acquired in the first job to the second one.

This is done to mimic our usage of the hive delegation token, because we need 
to refer to it to be able to cancel this.

It might be possible to experiment with pulling the credentials out of the 
jobcontext and after checking .getKind() to see if it's a MR token, and 
checking whether or not we set it as opposed to it having been set by oozie or 
other external clients, but this was the easy way of it. It's worth 
experimenting with a secure cluster to optimize this.

bq.  On 2011-07-22 18:01:31, Ashutosh Chauhan wrote:
bq.  > trunk/src/java/org/apache/hcatalog/common/HCatConstants.java, line 78
bq.  > <https://reviews.apache.org/r/1184/diff/1/?file=26838#file26838line78>
bq.  >
bq.  >     Will it be better to externalize this property? Otherwise one has to 
recompile hcatalog turn this off.

The current idea is that this check is not really necessary, and it has to be 
an architectural discussion before we decide to enable it again. I used a 
boolean setting and if-guards instead of outright commenting out the code or 
not having it there so as to reverse it later on. If we discuss this and decide 
to have this feature enabled or configurable, we can make it configurable.

bq.  On 2011-07-22 18:01:31, Ashutosh Chauhan wrote:
bq.  > trunk/src/java/org/apache/hcatalog/common/HCatUtil.java, line 317
bq.  > <https://reviews.apache.org/r/1184/diff/1/?file=26839#file26839line317>
bq.  >
bq.  >     Do we need all these methods ? Looks like you no longer need not to 
acquire token from JT.

I do need to acquire the token from the JobTracker and I do use it - it is 
being used even if not explicitly set - that's what the exporting of the tokens 
dir does.

bq.  On 2011-07-22 18:01:31, Ashutosh Chauhan wrote:
bq.  > 
trunk/src/java/org/apache/hcatalog/har/HarOutputCommitterPostProcessor.java, 
line 41
bq.  > <https://reviews.apache.org/r/1184/diff/1/?file=26840#file26840line41>
bq.  >
bq.  >     Commented code. Either remove it or uncomment it. I think these 
logging statements should be enabled.

(refer above to needing to pull logging out to a separate class, then we can 
stop instantiating static LOGs altogether)

bq.  On 2011-07-22 18:01:31, Ashutosh Chauhan wrote:
bq.  > trunk/src/java/org/apache/hcatalog/mapreduce/HCatOutputCommitter.java, 
line 91
bq.  > <https://reviews.apache.org/r/1184/diff/1/?file=26845#file26845line91>
bq.  >
bq.  >     Looks like you are not using this map later on. We should probably 
remove it.

Agreed, the functionality was replaced by discoverPartitions(), since we needed 
more info than this provided. This I'll remove and update the patch.

(TODO:remove baseCommitters)

bq.  On 2011-07-22 18:01:31, Ashutosh Chauhan wrote:
bq.  > 
trunk/src/java/org/apache/hcatalog/har/HarOutputCommitterPostProcessor.java, 
lines 135-145
bq.  > <https://reviews.apache.org/r/1184/diff/1/?file=26840#file26840line135>
bq.  >
bq.  >     Commented code.

I figured it might be useful in the future if we need to debug whether or not 
the har is valid. This I can live without.

(TODO:remove commented code)

bq.  On 2011-07-22 18:01:31, Ashutosh Chauhan wrote:
bq.  > trunk/src/java/org/apache/hcatalog/common/HCatUtil.java, lines 329-366
bq.  > <https://reviews.apache.org/r/1184/diff/1/?file=26839#file26839line329>
bq.  >
bq.  >     Looks little unnatural to do logging this way. We should take 
advantage of the logging library to accomplish these tasks.

Agreed on that. :)

We should pull out a bunch of these log functions into a proper log class, and 
also start using log4j proper instead of of only when we debug. Not immediately 
in the scope of this patch, I'd suggest.

bq.  On 2011-07-22 18:01:31, Ashutosh Chauhan wrote:
bq.  > trunk/src/java/org/apache/hcatalog/mapreduce/HCatOutputCommitter.java, 
line 108
bq.  > <https://reviews.apache.org/r/1184/diff/1/?file=26845#file26845line108>
bq.  >
bq.  >     This I think is because you don't have a list of base committers 
initially. But it may be possible to get that list from HCatRecordWriter at 
this point and then call commitTask() on them. Calling commitTask from close() 
looks awkward, if there is a way we should avoid that.

There isn't a really good way for HCatRecordWriter to communicate this to the 
HCatOutputCommitter. The current approach comes closest to what we consider the 
equivalent of the underlying committer's commitTask(), and the RecordWriter is 
a good place to put it, because that is very much in the task scope.

bq.  On 2011-07-22 18:01:31, Ashutosh Chauhan wrote:
bq.  > trunk/src/java/org/apache/hcatalog/mapreduce/HCatOutputCommitter.java, 
lines 132-137
bq.  > <https://reviews.apache.org/r/1184/diff/1/?file=26845#file26845line132>
bq.  >
bq.  >     Instead of special casing for dynamic partitioning we should always 
call baseComitter.setupTask() here. In case of dynamic partitions, this will 
become noop from here and latter call it again from write() to do the real work.

Again, as above, there is no way for us to know the baseCommitter in 
HCatOutputCommitter in a dynamic partitioning usecase.

bq.  On 2011-07-22 18:01:31, Ashutosh Chauhan wrote:
bq.  > trunk/src/java/org/apache/hcatalog/mapreduce/HCatOutputCommitter.java, 
lines 179-185
bq.  > <https://reviews.apache.org/r/1184/diff/1/?file=26845#file26845line179>
bq.  >
bq.  >     Looks like this isn't needed anymore.

This is still needed and used. Without getting the jobclient token, we cannot 
launch the har task.

bq.  On 2011-07-22 18:01:31, Ashutosh Chauhan wrote:
bq.  > trunk/src/java/org/apache/hcatalog/mapreduce/HCatOutputCommitter.java, 
line 143
bq.  > <https://reviews.apache.org/r/1184/diff/1/?file=26845#file26845line143>
bq.  >
bq.  >     Can you explain the purpose of this boolean? Looks like it can be 
derived from existing variables.

It serves as multiple-call guard for discoverPartitions() which can be a costly 
call to make, and calling multiple times should be avoided if it can be helped.

One thing I will agree with - it's cleaner to put this callguard inside 
discoverPartitions and have it be in only one place rather than in other 
functions that call it. I can make that change.

(TODO: move partitionsDiscovered check to inside discoverPartitions() )

bq.  On 2011-07-22 18:01:31, Ashutosh Chauhan wrote:
bq.  > trunk/src/java/org/apache/hcatalog/mapreduce/HCatOutputCommitter.java, 
line 641
bq.  > <https://reviews.apache.org/r/1184/diff/1/?file=26845#file26845line641>
bq.  >
bq.  >     Why we depend on this variable? Shouldn't we always do fs.exists() 
check?

There are two issues here.

a) The first is needing to check before a move, and fs.rename will fail in 
cases where fs.exists would have alerted us. So yes, we do always do a check, 
and we're good.
b) The second is an ability to do a check for all the moves that we would do, 
to see if any of them would have failed, before we even begin trying to move. 
This is needed especially for the case where we har, because that will not have 
the luxury of falling back on the add_partitions() failing before we do the 
move, and so, if there are any existing partitions in the directory, then it's 
needed that we know before we start trying to copy and fail mid-way. The dryRun 
flag gives us that ability to pre-check before we begin moving.

bq.  On 2011-07-22 18:01:31, Ashutosh Chauhan wrote:
bq.  > 
trunk/src/java/org/apache/hcatalog/mapreduce/HCatOutputCommitterPostProcessor.java,
 line 32
bq.  > <https://reviews.apache.org/r/1184/diff/1/?file=26846#file26846line32>
bq.  >
bq.  >     I wonder if we are too early in introducing this interface. There is 
only one impl which is always called (in case of dynamic partitions). I think 
we should introduce interfaces only if there is a known use case for it. 
Further when other use cases do come up, good likelihood that we need to tweak 
the interface to accomodate them.

This I agree with, and that's why I didn't attempt to make the usage so generic 
as to allow other PostProcessors beyond har. The reason it's there at all is 
because I wanted to keep the har code pluggable if we wanted to modify later. 
I'm okay with removing the interface and just using the HarPostProcessor itself 
as is without the @Overrides

(TODO: discuss and potentially remove)

bq.  On 2011-07-22 18:01:31, Ashutosh Chauhan wrote:
bq.  > trunk/src/java/org/apache/hcatalog/pig/HCatStorer.java, lines 112-118
bq.  > <https://reviews.apache.org/r/1184/diff/1/?file=26853#file26853line112>
bq.  >
bq.  >     Looks like some of these are not required.

No, as in comments above, we still do use it.

bq.  On 2011-07-22 18:01:31, Ashutosh Chauhan wrote:
bq.  > trunk/src/java/org/apache/hcatalog/pig/PigHCatUtil.java, lines 409-420
bq.  > <https://reviews.apache.org/r/1184/diff/1/?file=26854#file26854line409>
bq.  >
bq.  >     retrieve.. do you want to rename it to getFromUDF..

True. Unnecessary obfuscation. Sorry! :)

(TODO:rename)

- Sushanth

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/1184/#review1167
-----------------------------------------------------------

On 2011-07-22 16:09:41, Ashutosh Chauhan wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/1184/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2011-07-22 16:09:41)
bq.  
bq.  
bq.  Review request for hcatalog.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  HCatalog-42 review request on behalf of Sushanth
bq.  
bq.  
bq.  This addresses bug HCATALOG-42.
bq.      https://issues.apache.org/jira/browse/HCATALOG-42
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    trunk/build.xml 1149353 
bq.    trunk/src/java/org/apache/hcatalog/common/ErrorType.java 1149353 
bq.    trunk/src/java/org/apache/hcatalog/common/HCatConstants.java 1149353 
bq.    trunk/src/java/org/apache/hcatalog/common/HCatUtil.java 1149353 
bq.    
trunk/src/java/org/apache/hcatalog/har/HarOutputCommitterPostProcessor.java 
PRE-CREATION 
bq.    
trunk/src/java/org/apache/hcatalog/mapreduce/HCatBaseOutputCommitter.java 
1149353 
bq.    trunk/src/java/org/apache/hcatalog/mapreduce/HCatBaseOutputFormat.java 
1149353 
bq.    
trunk/src/java/org/apache/hcatalog/mapreduce/HCatEximOutputCommitter.java 
1149353 
bq.    trunk/src/java/org/apache/hcatalog/mapreduce/HCatEximOutputFormat.java 
1149353 
bq.    trunk/src/java/org/apache/hcatalog/mapreduce/HCatOutputCommitter.java 
1149353 
bq.    
trunk/src/java/org/apache/hcatalog/mapreduce/HCatOutputCommitterPostProcessor.java
 PRE-CREATION 
bq.    trunk/src/java/org/apache/hcatalog/mapreduce/HCatOutputFormat.java 
1149353 
bq.    
trunk/src/java/org/apache/hcatalog/mapreduce/HCatOutputStorageDriver.java 
1149353 
bq.    trunk/src/java/org/apache/hcatalog/mapreduce/HCatRecordWriter.java 
1149353 
bq.    trunk/src/java/org/apache/hcatalog/mapreduce/HCatTableInfo.java 1149353 
bq.    trunk/src/java/org/apache/hcatalog/mapreduce/OutputJobInfo.java 1149353 
bq.    trunk/src/java/org/apache/hcatalog/pig/HCatEximStorer.java 1149353 
bq.    trunk/src/java/org/apache/hcatalog/pig/HCatStorer.java 1149353 
bq.    trunk/src/java/org/apache/hcatalog/pig/PigHCatUtil.java 1149353 
bq.    
trunk/src/java/org/apache/hcatalog/rcfile/RCFileMapReduceOutputFormat.java 
1149353 
bq.    trunk/src/java/org/apache/hcatalog/rcfile/RCFileOutputDriver.java 
1149353 
bq.    trunk/src/test/org/apache/hcatalog/mapreduce/HCatMapReduceTest.java 
1149353 
bq.    
trunk/src/test/org/apache/hcatalog/mapreduce/TestHCatDynamicPartitioned.java 
PRE-CREATION 
bq.    
trunk/src/test/org/apache/hcatalog/mapreduce/TestHCatEximInputFormat.java 
1149353 
bq.    
trunk/src/test/org/apache/hcatalog/mapreduce/TestHCatEximOutputFormat.java 
1149353 
bq.    trunk/src/test/org/apache/hcatalog/mapreduce/TestHCatNonPartitioned.java 
1149353 
bq.    trunk/src/test/org/apache/hcatalog/mapreduce/TestHCatOutputFormat.java 
1149353 
bq.    trunk/src/test/org/apache/hcatalog/mapreduce/TestHCatPartitioned.java 
1149353 
bq.    trunk/src/test/org/apache/hcatalog/pig/TestHCatStorer.java 1149353 
bq.  
bq.  Diff: https://reviews.apache.org/r/1184/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  Unit tests are included
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Ashutosh
bq.  
bq.

> Storing across partitions(Dynamic Partitioning) from 
> HCatStorer/HCatOutputFormat
> --------------------------------------------------------------------------------
>
>                 Key: HCATALOG-42
>                 URL: https://issues.apache.org/jira/browse/HCATALOG-42
>             Project: HCatalog
>          Issue Type: Improvement
>    Affects Versions: 0.2
>            Reporter: Sushanth Sowmyan
>            Assignee: Sushanth Sowmyan
>             Fix For: 0.2
>
>         Attachments: HCATALOG-42.10.patch, hadoop_archive-0.3.1.jar
>
>
> HCatalog allows people to abstract away underlying storage details and refer 
> to them as tables and partitions. To this notion, the storage abstraction is 
> more about classifying how data is organized, rather than bothering about 
> where it is stored. A user thus then specifies partitions to be stored and 
> leaves the job to HCatalog to figure out how and where it needs to do so.
> When it comes to reading the data, a user is able to specify that they're 
> interested in reading from the table and specify various partition key value 
> combinations to prune, as if specifying a SQL-like where clause. However, 
> when it comes to writing, the abstraction is not so seamless. We still 
> require of the end user to write out data to the table 
> partition-by-partition. And these partitions require fine-grained knowledge 
> of what key-value-pairs they require, and we require this knowledge in 
> advance, and we require the writer to have already grouped the requisite data 
> accordingly before attempting to store.
> For example, the following pig script illustrates this:
> --
> A = load 'raw' using HCatLoader(); 
> ... 
> split Z into for_us if region='us', for_eu if region='eu', for_asia if 
> region='asia'; 
> store for_us into 'processed' using HCatStorage("ds=20110110, region=us"); 
> store for_eu into 'processed' using HCatStorage("ds=20110110, region=eu"); 
> store for_asia into 'processed' using HCatStorage("ds=20110110, 
> region=asia"); 
> --
> This has a major issue in that MapReduce programs and pig scripts need to be 
> aware of all the possible values of a key, and that needs to be maintained, 
> and modified if needed when new values are introduced, which may/may not 
> always be easy or even possible. With more partitions, scripts begin to look 
> cumbersome. And if each partition being written launches a separate HCatalog 
> store, we are increasing the load on the HCatalog server and launching more 
> jobs for the store by a factor of the number of partitions
> It would be much more preferable if HCatalog were to be able to figure out 
> all the partitions required from the data being written, which would allow us 
> to simplify the above script into the following:
> --
> A = load 'raw' using HCatLoader(); 
> ... 
> store Z into 'processed' using HCatStorage("ds=20110110"); 
> --

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HCATALOG-42) Storing across partitions(Dynamic Partitioning) from HCatStorer/HCatOutputFormat

Reply via email to