[
https://issues.apache.org/jira/browse/HCATALOG-490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13449080#comment-13449080
]
Arup Malakar commented on HCATALOG-490:
---------------------------------------
Hi Travis, don't go by the table specification, as it is not real data. Both the
schema and the data in my comment are made up; I am using them for some other
testing.
The bug I am trying to highlight is this: when dynamic partitioning is used,
HCatStorer is supposed to figure out the partitions from the input data
automatically and then write to the respective partitions.
From
http://incubator.apache.org/hcatalog/docs/r0.4.0/dynpartition.html#Overview :
{quote}
In cases where you want to write data to multiple partitions simultaneously,
this can be done by placing partition columns in the data and not specifying
partition values when storing the data.
A = load 'raw' using HCatLoader();
...
store Z into 'processed' using HCatStorer();
The way dynamic partitioning works is that HCatalog locates partition columns
in the data passed to it and uses the data in these columns to split the rows
across multiple partitions. (The data passed to HCatalog must have a schema
that matches the schema of the destination table and hence should always
contain partition columns.) It is important to note that partition columns
can’t contain null values or the whole process will fail.
It is also important to note that all partitions created during a single run
are part of a transaction and if any part of the process fails none of the
partitions will be added to the table.
{quote}
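To make the quoted description concrete, here is a minimal Python sketch of the
idea of splitting rows by the values in their partition columns. It is not
HCatalog code; the function name split_by_partition and the sample rows are
hypothetical and only illustrate the documented behaviour, including the
failure on null partition values.

```python
from collections import defaultdict


def split_by_partition(rows, partition_cols):
    """Group rows by the values found in their partition columns.

    Illustrative only: mirrors the documented rule that a null
    partition value makes the whole operation fail.
    """
    partitions = defaultdict(list)
    for row in rows:
        key = tuple((col, row[col]) for col in partition_cols)
        if any(value is None for _, value in key):
            raise ValueError("null partition value in row: %r" % (row,))
        partitions[key].append(row)
    return dict(partitions)


# Hypothetical sample data with 'action' as the partition column.
rows = [
    {"user": "a", "action": 1},
    {"user": "b", "action": 2},
    {"user": "c", "action": 1},
]
result = split_by_partition(rows, ["action"])
for key, members in sorted(result.items()):
    print(key, len(members))
```

In this sketch every row with action=1 ends up under the same key no matter
where in the input it appears, which is the behaviour I expect from the storer
across tasks as well.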
Going by this, I understand that HCatStorer should take care of loading the
input data as long as the schema conforms and the partition columns are present
and non-null. But it looks like when the HCatStorer() job gets split into two
map tasks, the first one creates the directory for the partition successfully,
but the second task, seeing that the directory already exists, throws an error
instead of putting the data inside the existing directory.
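The failure mode described above can be simulated with a small Python sketch.
This is not the HCatalog implementation and not a proposed patch; the function
names (write_strict, write_tolerant, run_two_tasks) and the part-file naming
are assumptions made purely for illustration, and the two "tasks" run one after
the other rather than concurrently.

```python
import os
import tempfile


def _write_part_file(part_dir, task_id, lines):
    # Each task writes its own file so tasks never overwrite each other.
    path = os.path.join(part_dir, "part-%05d" % task_id)
    with open(path, "w") as handle:
        handle.write("\n".join(lines) + "\n")


def write_strict(base, partition, task_id, lines):
    """Mimics the observed behaviour: refuse to write if the directory exists."""
    part_dir = os.path.join(base, partition)
    if os.path.exists(part_dir):
        raise FileExistsError("Output directory %s already exists" % part_dir)
    os.makedirs(part_dir)
    _write_part_file(part_dir, task_id, lines)


def write_tolerant(base, partition, task_id, lines):
    """Mimics the expected behaviour: reuse the partition directory."""
    part_dir = os.path.join(base, partition)
    os.makedirs(part_dir, exist_ok=True)
    _write_part_file(part_dir, task_id, lines)


def run_two_tasks(writer):
    """Simulate two tasks of one job that both see rows for action=1."""
    with tempfile.TemporaryDirectory() as base:
        writer(base, "action=1", 0, ["row from task 0"])
        writer(base, "action=1", 1, ["row from task 1"])
        return sorted(os.listdir(os.path.join(base, "action=1")))


print(run_two_tasks(write_tolerant))
try:
    run_two_tasks(write_strict)
except FileExistsError as err:
    print("second task failed:", type(err).__name__)
```

With the strict writer the second task fails the same way as in the stack trace
below, while the tolerant writer leaves one part file per task in the shared
partition directory.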
> HCatStorer() throws error when the same partition key is present in records
> in more than one tasks running as part of the same job
> ------------------------------------------------------------------------------------------------------------------------------------
>
> Key: HCATALOG-490
> URL: https://issues.apache.org/jira/browse/HCATALOG-490
> Project: HCatalog
> Issue Type: Bug
> Reporter: Arup Malakar
> Assignee: Arup Malakar
>
> I have a file with ~240 MB of data. One of the columns in the input data is
> 'action', and its value is either 1 or 2.
> When I try to load it using the following script:
> {code}
> in = load '/user/malakar/page_views_20000000_0/part-00000' USING
> PigStorage(',') AS (user:chararray, timespent:int, query_term:chararray,
> ip_addr:int, estimated_revenue:int, page_info:chararray, action:int);
> STORE in into 'page_views_20000000_0' USING
> org.apache.hcatalog.pig.HCatStorer();
> {code}
> It throws the following exception:
> {quote}
> org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://tasktrackerhost:8020/user/hive/warehouse/page_views_20000000_0/_DYN0.7622108853605496/action=1 already exists
> at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:121)
> at org.apache.hcatalog.mapreduce.FileRecordWriterContainer.write(FileRecordWriterContainer.java:200)
> at org.apache.hcatalog.mapreduce.FileRecordWriterContainer.write(FileRecordWriterContainer.java:52)
> at org.apache.hcatalog.pig.HCatBaseStorer.putNext(HCatBaseStorer.java:235)
> at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat$PigRecordWriter.write(PigOutputFormat.java:139)
> at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat$PigRecordWriter.write(PigOutputFormat.java:98)
> at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:639)
> at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
> at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapOnly$Map.collect(PigMapOnly.java:48)
> at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:269)
> at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:262)
> at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
> at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:396)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1093)
> at org.apache.hadoop.mapred.Child.main(Child.java:249)
> {quote}