[ https://issues.apache.org/jira/browse/AVRO-808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tom White updated AVRO-808:
---------------------------
Attachment: AVRO-808.patch
Here's an updated patch with a unit test and javadoc.
Some code is duplicated from AvroInputFormat and AvroRecordReader, which
would be good to eliminate. Do we need to do this? It could be achieved by
introducing a common superclass.
The eagle-eyed reviewer will notice that the test trims output lines, because
the job adds a trailing tab character to each line. I couldn't find a way of
avoiding this.
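
To illustrate where the trailing tab comes from, here is a minimal sketch. It
assumes the value is an empty string and that each output line is written as
key, separator, value. The class and method names are hypothetical, and this is
a stand-in for illustration, not Hadoop's actual writer or the test in the patch.

```java
// Hypothetical stand-in for a key/value text line writer (not Hadoop code).
final class TrailingTabSketch {

    // Joins key and value with the separator, as a key/value text writer would.
    static String formatLine(String key, String value, String separator) {
        return key + separator + value;
    }

    // Removes trailing whitespace only, which is the kind of trimming a test
    // could apply before comparing against the expected JSON lines.
    static String stripTrailing(String line) {
        int end = line.length();
        while (end > 0 && Character.isWhitespace(line.charAt(end - 1))) {
            end--;
        }
        return line.substring(0, end);
    }

    public static void main(String[] args) {
        // Assumption: the key carries the JSON text and the value is empty.
        String line = formatLine("{\"x\":1}", "", "\t");
        System.out.println("ends with tab: " + line.endsWith("\t"));
        System.out.println("trimmed: " + stripTrailing(line));
    }
}
```

With an empty value the separator is still written, so the line ends in a tab
until it is trimmed.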
# Changing the value type to NullWritable would fix the test, but it makes the
input format less useful for Streaming: the input then appears with a trailing
"(null)", which is the toString representation of NullWritable instances.
(Arguably, Streaming should be fixed to special-case NullWritables and ignore
them.)
# I thought setting "mapred.textoutputformat.separator" to the empty string
would be a workaround, but I found that this is interpreted as null, and hence
the default value (a tab) is used. (When a Configuration is written to a file
and then read back, empty properties are read as null, not as empty strings.
This is probably a bug - I haven't investigated further.)
# I thought changing the key to NullWritable and the value to Text might help,
by using the ignore key feature in Streaming (MAPREDUCE-1785). However, this is
not desirable for a couple of reasons: it's not available before 0.22, and
you also lose the sort by key, which is generally expected when using the
identity map and reduce.
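
The separator behavior in the second item can be modeled roughly as follows.
This is a sketch under stated assumptions: a plain map stands in for the
Configuration, roundTrip() is a hypothetical helper that imitates the reported
write-then-read behavior (empty values coming back as null), and get() imitates
a lookup that falls back to a default when the property is null. None of this is
Hadoop's actual code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the separator lookup described above (not Hadoop code).
final class SeparatorFallbackSketch {

    static final String SEPARATOR_KEY = "mapred.textoutputformat.separator";

    // Imitates the reported round trip: empty properties are read back as null.
    static Map<String, String> roundTrip(Map<String, String> props) {
        Map<String, String> result = new HashMap<>();
        for (Map.Entry<String, String> e : props.entrySet()) {
            String v = e.getValue();
            result.put(e.getKey(), (v == null || v.isEmpty()) ? null : v);
        }
        return result;
    }

    // Returns the property, or the default when it is missing or null.
    static String get(Map<String, String> props, String name, String defaultValue) {
        String v = props.get(name);
        return v == null ? defaultValue : v;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(SEPARATOR_KEY, "");

        // In memory, the empty separator is honored.
        System.out.println("before: [" + get(conf, SEPARATOR_KEY, "\t") + "]");

        // After the modeled round trip, the tab default wins again.
        Map<String, String> reread = roundTrip(conf);
        System.out.println("is tab after: " + get(reread, SEPARATOR_KEY, "\t").equals("\t"));
    }
}
```

Under these assumptions the empty separator only survives until the
configuration is serialized, which matches the workaround failing inside a job.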
Thoughts?
> Add AvroAsTextInputFormat for turning Avro Data Files to text
> -------------------------------------------------------------
>
> Key: AVRO-808
> URL: https://issues.apache.org/jira/browse/AVRO-808
> Project: Avro
> Issue Type: New Feature
> Components: java
> Reporter: Tom White
> Assignee: Tom White
> Attachments: AVRO-808.patch, AVRO-808.patch
>
>
> This is the analog of SequenceFileAsTextInputFormat for Avro Data Files. This
> would be useful for streaming as it converts Avro data to their JSON
> representation, or the raw bytes in the case of a "bytes" schema.