[
https://issues.apache.org/jira/browse/HCATALOG-490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13446959#comment-13446959
]
Travis Crawford commented on HCATALOG-490:
------------------------------------------
I think what's happening is that you're writing to a partitioned table, but
your data already contains a column with the same name as the partition key.
All records written by a single store statement go into the same partition.
Take a look at http://incubator.apache.org/hcatalog/docs/r0.4.0/loadstore.html
and you'll see this example:
{code}
store z into 'web_data' using
org.apache.hcatalog.pig.HCatStorer('datestamp=20110924');
{code}
Notice how the partition spec is given as an argument to the storer. Partition
columns are virtual columns added at runtime, not stored in the records
themselves.
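Applied to the script in this issue, that could look roughly like the untested
sketch below. It assumes the table is partitioned by "action" and that each
partition value gets its own store statement. The aliases (views, act1, act2,
out1, out2) are made up for illustration.
{code}
-- Sketch only: aliases are hypothetical; assumes 'action' is the partition key.
views = load '/user/malakar/page_views_20000000_0/part-00000' using
    PigStorage(',') as (user:chararray, timespent:int, query_term:chararray,
    ip_addr:int, estimated_revenue:int, page_info:chararray, action:int);
-- Build one relation per partition value.
split views into act1 if action == 1, act2 if action == 2;
-- Drop 'action' so the records themselves do not carry the partition key.
out1 = foreach act1 generate user, timespent, query_term, ip_addr,
    estimated_revenue, page_info;
out2 = foreach act2 generate user, timespent, query_term, ip_addr,
    estimated_revenue, page_info;
-- Each store statement names the single partition it writes to.
store out1 into 'page_views_20000000_0' using
    org.apache.hcatalog.pig.HCatStorer('action=1');
store out2 into 'page_views_20000000_0' using
    org.apache.hcatalog.pig.HCatStorer('action=2');
{code}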
In the case of your data I don't think you want to partition by "action", since
records in the same partition could have different options. You might instead
load the data into a non-partitioned table. If the data arrives in batches on
some schedule, you might consider adding a datetime partition column to
distinguish the batches.
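A date-partitioned load might then look like the following untested sketch. It
assumes the table is partitioned only by a hypothetical "datestamp" column, with
"action" kept as an ordinary data column. The DATE parameter and the script
name in the comment are likewise assumptions.
{code}
-- Sketch only: assumes a hypothetical 'datestamp' partition column on the table.
-- Run with, for example: pig -param DATE=20120901 load_page_views.pig
views = load '/user/malakar/page_views_20000000_0/part-00000' using
    PigStorage(',') as (user:chararray, timespent:int, query_term:chararray,
    ip_addr:int, estimated_revenue:int, page_info:chararray, action:int);
-- 'action' stays in the records; only 'datestamp' comes from the partition spec.
store views into 'page_views_20000000_0' using
    org.apache.hcatalog.pig.HCatStorer('datestamp=$DATE');
{code}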
> HCatStorer() throws error when the same partition key is present in records
> in more than one tasks running as part of the same job
> ------------------------------------------------------------------------------------------------------------------------------------
>
> Key: HCATALOG-490
> URL: https://issues.apache.org/jira/browse/HCATALOG-490
> Project: HCatalog
> Issue Type: Bug
> Reporter: Arup Malakar
> Assignee: Arup Malakar
>
> I have a file with ~240MB data. One of the columns in input data was 'action'
> and the value is either 1 or 2.
> When I try to load it using the following script:
> {code}
> in = load '/user/malakar/page_views_20000000_0/part-00000' USING
> PigStorage(',') AS (user:chararray, timespent:int, query_term:chararray,
> ip_addr:int, estimated_revenue:int, page_info:chararray, action:int);
> STORE in into 'page_views_20000000_0' USING
> org.apache.hcatalog.pig.HCatStorer();
> {code}
> It throws the following exception:
> {quote}
> org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://tasktrackerhost:8020/user/hive/warehouse/page_views_20000000_0/_DYN0.7622108853605496/action=1 already exists
> at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:121)
> at org.apache.hcatalog.mapreduce.FileRecordWriterContainer.write(FileRecordWriterContainer.java:200)
> at org.apache.hcatalog.mapreduce.FileRecordWriterContainer.write(FileRecordWriterContainer.java:52)
> at org.apache.hcatalog.pig.HCatBaseStorer.putNext(HCatBaseStorer.java:235)
> at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat$PigRecordWriter.write(PigOutputFormat.java:139)
> at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat$PigRecordWriter.write(PigOutputFormat.java:98)
> at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:639)
> at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
> at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapOnly$Map.collect(PigMapOnly.java:48)
> at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:269)
> at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:262)
> at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
> at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:396)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1093)
> at org.apache.hadoop.mapred.Child.main(Child.java:249)
> {quote}