[ https://issues.apache.org/jira/browse/AVRO-1976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Busbey updated AVRO-1976:
------------------------------
Labels: beginner (was: newbie)
> Add Input/OutputFormat to read/write encoded objects
> ----------------------------------------------------
>
> Key: AVRO-1976
> URL: https://issues.apache.org/jira/browse/AVRO-1976
> Project: Avro
> Issue Type: Improvement
> Components: java
> Environment: hadoop
> Reporter: Marius Posta
> Assignee: Marius Posta
> Priority: Minor
> Labels: beginner
> Original Estimate: 1h
> Remaining Estimate: 1h
>
> In certain cases, performance of some Avro map-reduce jobs can be
> considerably improved by de-coupling Avro encoding from actual Avro container
> file IO.
> In my case, a complex schema (100+ record fields) and large HDFS blocks
> resulted in Spark jobs where many workers were idling while a couple of
> them were busy decoding their input splits. Furthermore, the objects then
> needed to be re-encoded in order to be shuffled, which crippled
> performance further.
> I propose the addition of an InputFormat which reads a container file input
> split as key-value pairs in which the key is the file header and the value is
> the decompressed file data block. I also propose an OutputFormat which
> follows the same logic for writing.
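
To make the proposed key-value split concrete, here is a minimal sketch of the reading side using only the JDK. It is not the reporter's patch and not part of Avro: the class name RawBlockReader and its methods are hypothetical. It assumes the whole container file is already in memory and was written with the "null" codec, so the "decompressed" block is simply the block's raw bytes. It relies only on the container file framing from the Avro specification (4-byte magic, metadata map, 16-byte sync marker, then blocks of object count, byte size, data, sync marker).

```java
import java.util.Arrays;

/**
 * Hypothetical sketch (not part of Avro): walks the framing of an Avro
 * object container file and hands back each data block still encoded.
 * Assumes the file is fully in memory and uses the "null" codec.
 */
final class RawBlockReader {
    private static final int MAGIC_SIZE = 4;  // 'O', 'b', 'j', 1
    private static final int SYNC_SIZE = 16;  // marker after the header and after each block

    private final byte[] file;
    private int pos;
    private long blockCount;

    /** Header bytes (magic, metadata map, sync marker): the proposed key. */
    final byte[] header;

    RawBlockReader(byte[] file) {
        this.file = file;
        this.pos = MAGIC_SIZE;
        // The metadata is an Avro map: chunks prefixed by an entry count,
        // terminated by a chunk whose count is zero.
        for (long n = readLong(); n != 0; n = readLong()) {
            if (n < 0) {      // a negative count is followed by the chunk's byte size
                n = -n;
                readLong();
            }
            for (long i = 0; i < 2 * n; i++) {  // each entry is a key, then a value
                int len = (int) readLong();     // both are length-prefixed
                pos += len;
            }
        }
        pos += SYNC_SIZE;
        header = Arrays.copyOfRange(file, 0, pos);
    }

    boolean hasNext() {
        return pos < file.length;
    }

    /** Returns the next block's payload without decoding it: the proposed value. */
    byte[] nextBlock() {
        blockCount = readLong();         // number of objects in the block
        int size = (int) readLong();     // payload size in bytes
        byte[] data = Arrays.copyOfRange(file, pos, pos + size);
        pos += size + SYNC_SIZE;         // step over the payload and the sync marker
        return data;
    }

    /** Number of objects in the block most recently returned by nextBlock(). */
    long getBlockCount() {
        return blockCount;
    }

    /** Reads one Avro long: variable-length, zig-zag encoded. */
    private long readLong() {
        long n = 0;
        int shift = 0;
        int b;
        do {
            b = file[pos++] & 0xff;
            n |= (long) (b & 0x7f) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return (n >>> 1) ^ -(n & 1);
    }
}
```

Under these assumptions, an InputFormat along the proposed lines would emit (header, nextBlock()) pairs, leaving the per-record decoding to whichever downstream task needs the objects. A real implementation would also have to handle compressed codecs, sync-marker validation, and split boundaries, none of which are shown here.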