[ https://issues.apache.org/jira/browse/HCATALOG-490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13449080#comment-13449080 ]

Arup Malakar commented on HCATALOG-490:
---------------------------------------

Hi Travis, don't go by the table specification as it is not real data. Both the 
schema and the data in my comment are cooked up; I am using them for some other 
testing.
The bug I am trying to highlight is that when dynamic partitioning is used, 
HCatStorer is supposed to figure out the partitions from the input data 
automatically and then write to the respective partitions.

From http://incubator.apache.org/hcatalog/docs/r0.4.0/dynpartition.html#Overview :

{quote}
In cases where you want to write data to multiple partitions simultaneously, 
this can be done by placing partition columns in the data and not specifying 
partition values when storing the data.

A = load 'raw' using HCatLoader(); 
... 
store Z into 'processed' using HCatStorer(); 
The way dynamic partitioning works is that HCatalog locates partition columns 
in the data passed to it and uses the data in these columns to split the rows 
across multiple partitions. (The data passed to HCatalog must have a schema 
that matches the schema of the destination table and hence should always 
contain partition columns.) It is important to note that partition columns 
can’t contain null values or the whole process will fail.

It is also important to note that all partitions created during a single run 
are part of a transaction and if any part of the process fails none of the 
partitions will be added to the table.
{quote}
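
To make the two modes concrete, here is a minimal sketch (hypothetical relation/table names and partition value; the static form follows the documented HCatStorer constructor that takes a partition spec):

{code}
-- static partitioning: the partition value is fixed in the store clause,
-- so every record in Z lands in the single datestamp=20120901 partition
store Z into 'processed' using HCatStorer('datestamp=20120901');

-- dynamic partitioning: no partition value is given, so HCatStorer is expected
-- to read the partition column from each record and route rows to the matching
-- partitions on its own, which is the behaviour this bug breaks
store Z into 'processed' using HCatStorer();
{code}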

Now, going by this, I understand HCatStorer should take care of loading the 
input data as long as the schema conforms and the partition columns are present 
and non-null. But it looks like when the HCatStorer() job gets split into two 
map tasks, the first one creates the directory for the partition successfully, 
while the second task, seeing that the directory already exists, throws an 
error instead of writing its data into the existing directory.
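
For reference, one possible workaround until this is fixed would be to skip dynamic partitioning and do a static-partitioned store per partition value. This is just a sketch, using the hypothetical column names from the description below, and it assumes the static form where the partition value is passed to the HCatStorer constructor and the partition column is dropped from the stored data:

{code}
-- route each partition value to its own relation so no single store
-- statement ever has to create more than one partition
split in into act1 if action == 1, act2 if action == 2;

-- drop the partition column and store each split with a static partition spec
p1 = foreach act1 generate user, timespent, query_term, ip_addr, estimated_revenue, page_info;
p2 = foreach act2 generate user, timespent, query_term, ip_addr, estimated_revenue, page_info;
store p1 into 'page_views_20000000_0' using org.apache.hcatalog.pig.HCatStorer('action=1');
store p2 into 'page_views_20000000_0' using org.apache.hcatalog.pig.HCatStorer('action=2');
{code}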

                
> HCatStorer() throws an error when the same partition key is present in records 
> in more than one task running as part of the same job
> ------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HCATALOG-490
>                 URL: https://issues.apache.org/jira/browse/HCATALOG-490
>             Project: HCatalog
>          Issue Type: Bug
>            Reporter: Arup Malakar
>            Assignee: Arup Malakar
>
> I have a file with ~240MB of data. One of the columns in the input data is 
> 'action', and its value is either 1 or 2. 
> When I try to load it using the following script:
> {code}
> in = load '/user/malakar/page_views_20000000_0/part-00000' USING PigStorage(',')
>      AS (user:chararray, timespent:int, query_term:chararray, ip_addr:int,
>          estimated_revenue:int, page_info:chararray, action:int);
> STORE in into 'page_views_20000000_0' USING org.apache.hcatalog.pig.HCatStorer();
> {code}
> It throws the following exception:
> {quote}
> org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://tasktrackerhost:8020/user/hive/warehouse/page_views_20000000_0/_DYN0.7622108853605496/action=1 already exists
>     at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:121)
>     at org.apache.hcatalog.mapreduce.FileRecordWriterContainer.write(FileRecordWriterContainer.java:200)
>     at org.apache.hcatalog.mapreduce.FileRecordWriterContainer.write(FileRecordWriterContainer.java:52)
>     at org.apache.hcatalog.pig.HCatBaseStorer.putNext(HCatBaseStorer.java:235)
>     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat$PigRecordWriter.write(PigOutputFormat.java:139)
>     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat$PigRecordWriter.write(PigOutputFormat.java:98)
>     at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:639)
>     at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
>     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapOnly$Map.collect(PigMapOnly.java:48)
>     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:269)
>     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:262)
>     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
>     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>     at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:396)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1093)
>     at org.apache.hadoop.mapred.Child.main(Child.java:249)
> {quote}

