[ https://issues.apache.org/jira/browse/HIVE-4329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14099343#comment-14099343 ]

David Chen commented on HIVE-4329:
----------------------------------

Hi Sushanth,

Thank you for taking a look at this ticket.

I agree that it would be ideal to get Hive to a point where a unified 
StorageHandler interface can replace the current use of HiveOutputFormat and 
FileSinkOperator.RecordWriter (which should really be named HiveRecordWriter). 
However, that is a larger, longer-term undertaking, whereas this ticket fixes 
the fact that it is currently not possible to write via HCatalog to storage 
formats whose (Hive)OutputFormats only implement getHiveRecordWriter and not 
getRecordWriter.

The new tests I added as part of HIVE-7286 demonstrated that solving only the 
type compatibility issue mentioned earlier in this ticket is not sufficient. 
The type error for AvroContainerOutputFormat masks the real issue, which is 
that AvroContainerOutputFormat's getRecordWriter (like 
ParquetHiveOutputFormat's) does nothing but throw an exception saying that 
"this method should not be called."
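To make the failure mode concrete, here is a minimal stand-in sketch (simplified interfaces and class names of my own, not the real Hive API) of an output format that, like AvroContainerOutputFormat, only meaningfully implements getHiveRecordWriter while its getRecordWriter is a placeholder that throws:

```java
// Simplified stand-ins for FileSinkOperator.RecordWriter and the MR-style
// RecordWriter; the real Hive interfaces take more parameters.
interface HiveRecordWriter {
    void write(Object value);
}

interface MrRecordWriter {
    void write(Object key, Object value);
}

class AvroLikeOutputFormat {
    // The real write path: this is what core Hive ends up calling.
    HiveRecordWriter getHiveRecordWriter() {
        return value -> System.out.println("wrote: " + value);
    }

    // The MR-style entry point is a stub, mirroring the behavior described
    // above for AvroContainerOutputFormat and ParquetHiveOutputFormat.
    MrRecordWriter getRecordWriter() {
        throw new UnsupportedOperationException("this method should not be called");
    }
}

public class Demo {
    public static void main(String[] args) {
        AvroLikeOutputFormat of = new AvroLikeOutputFormat();
        of.getHiveRecordWriter().write("row-1"); // succeeds
        try {
            of.getRecordWriter();                // the path HCatalog currently takes
        } catch (UnsupportedOperationException e) {
            System.out.println("getRecordWriter failed: " + e.getMessage());
        }
    }
}
```

So even with the key-type mismatch fixed, any caller that goes through getRecordWriter still fails immediately.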

This is why my fix for this issue takes this approach, which is based on the 
one taken by core Hive. To my understanding, Hive accepts both MR 
OutputFormats and HiveOutputFormats but ends up calling getHiveRecordWriter 
in both cases. When given an MR OutputFormat, Hive detects that it is not a 
HiveOutputFormat and wraps it in HivePassThroughOutputFormat.
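The dispatch described above can be sketched as follows. This is an illustrative simplification with hypothetical names (HiveStyleOutputFormat, MrStyleOutputFormat, PassThroughOutputFormat, resolve), not the real Hive code paths:

```java
interface HiveRecordWriter {
    void write(Object value);
}

// Stand-in for HiveOutputFormat: exposes getHiveRecordWriter directly.
interface HiveStyleOutputFormat {
    HiveRecordWriter getHiveRecordWriter();
}

// Stand-in for a plain MR OutputFormat: only a (key, value) record writer.
interface MrStyleOutputFormat {
    void writeRecord(Object key, Object value);
}

// Analogue of HivePassThroughOutputFormat: adapts an MR-style format so
// callers can treat it uniformly as a Hive-style one.
class PassThroughOutputFormat implements HiveStyleOutputFormat {
    private final MrStyleOutputFormat wrapped;

    PassThroughOutputFormat(MrStyleOutputFormat wrapped) {
        this.wrapped = wrapped;
    }

    public HiveRecordWriter getHiveRecordWriter() {
        // File-based formats ignore the key, so a null key is passed through.
        return value -> wrapped.writeRecord(null, value);
    }
}

public class Dispatch {
    // Either kind of OutputFormat ends up going through getHiveRecordWriter.
    static HiveStyleOutputFormat resolve(Object outputFormat) {
        if (outputFormat instanceof HiveStyleOutputFormat) {
            return (HiveStyleOutputFormat) outputFormat;
        }
        return new PassThroughOutputFormat((MrStyleOutputFormat) outputFormat);
    }

    public static void main(String[] args) {
        MrStyleOutputFormat mr = (k, v) -> System.out.println("mr wrote " + v);
        resolve(mr).getHiveRecordWriter().write("row-1");
    }
}
```

The point of the sketch is the direction of adaptation: the MR-style format is wrapped to look Hive-style, so getHiveRecordWriter is the single method every write path converges on.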

My understanding is that your main concern is that this patch may turn 
HCatOutputFormat into a HiveOutputFormat. However, that is not the case. This 
patch does not change the HCatalog interface; it changes the way 
HCatOutputFormat wraps the underlying OutputFormat so that it can properly 
handle HiveOutputFormats, which is required to make writing through HCatalog 
possible for Avro and Parquet.

> HCatalog should use getHiveRecordWriter rather than getRecordWriter
> -------------------------------------------------------------------
>
>                 Key: HIVE-4329
>                 URL: https://issues.apache.org/jira/browse/HIVE-4329
>             Project: Hive
>          Issue Type: Bug
>          Components: HCatalog, Serializers/Deserializers
>    Affects Versions: 0.14.0
>         Environment: discovered in Pig, but it looks like the root cause 
> impacts all non-Hive users
>            Reporter: Sean Busbey
>            Assignee: David Chen
>         Attachments: HIVE-4329.0.patch
>
>
> Attempting to write to a HCatalog defined table backed by the AvroSerde fails 
> with the following stacktrace:
> {code}
> java.lang.ClassCastException: org.apache.hadoop.io.NullWritable cannot be 
> cast to org.apache.hadoop.io.LongWritable
>       at 
> org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat$1.write(AvroContainerOutputFormat.java:84)
>       at 
> org.apache.hcatalog.mapreduce.FileRecordWriterContainer.write(FileRecordWriterContainer.java:253)
>       at 
> org.apache.hcatalog.mapreduce.FileRecordWriterContainer.write(FileRecordWriterContainer.java:53)
>       at 
> org.apache.hcatalog.pig.HCatBaseStorer.putNext(HCatBaseStorer.java:242)
>       at org.apache.hcatalog.pig.HCatStorer.putNext(HCatStorer.java:52)
>       at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat$PigRecordWriter.write(PigOutputFormat.java:139)
>       at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat$PigRecordWriter.write(PigOutputFormat.java:98)
>       at 
> org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:559)
>       at 
> org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:85)
> {code}
> The proximal cause of this failure is that the AvroContainerOutputFormat's 
> signature mandates a LongWritable key and HCat's FileRecordWriterContainer 
> forces a NullWritable. I'm not sure of a general fix, other than redefining 
> HiveOutputFormat to mandate a WritableComparable.
> It looks like accepting WritableComparable is what's done in the other Hive 
> OutputFormats, and there's no reason AvroContainerOutputFormat couldn't also 
> be changed, since it ignores the key. Fixing things so that 
> FileRecordWriterContainer can always use NullWritable could then be spun off 
> into a separate issue?
> The underlying cause for failure to write to AvroSerde tables is that 
> AvroContainerOutputFormat doesn't meaningfully implement getRecordWriter, so 
> fixing the above will just push the failure into the placeholder RecordWriter.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
