[ https://issues.apache.org/jira/browse/AVRO-1976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Busbey updated AVRO-1976:
------------------------------
    Status: Patch Available  (was: Open)

> Add Input/OutputFormat to read/write encoded objects
> ----------------------------------------------------
>
>                 Key: AVRO-1976
>                 URL: https://issues.apache.org/jira/browse/AVRO-1976
>             Project: Avro
>          Issue Type: Improvement
>          Components: java
>         Environment: hadoop
>            Reporter: Marius Posta
>            Assignee: Marius Posta
>            Priority: Minor
>              Labels: newbie
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> In certain cases, performance of some Avro map-reduce jobs can be
> considerably improved by de-coupling Avro encoding from actual Avro container
> file IO.
> In my case, a complex schema (100+ record fields) and large HDFS blocks
> resulted in Spark jobs where a lot of workers were idling while a couple of
> them were busy decoding their input splits. Furthermore, the objects then
> needed to be re-encoded in order to be shuffled about, which crippled
> performance further.
> I propose the addition of an InputFormat which reads a container file input
> split as key-value pairs in which the key is the file header and the value is
> the decompressed file data block. Also, an OutputFormat which follows the
> same logic for writing.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
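
For readers skimming the archive, the idea in the description above can be sketched roughly as follows. This is not the attached patch and it uses neither the Avro nor the Hadoop API. It is a JDK-only toy, and the container layout, the RawBlockSketch class, and the RawBlock pair are all hypothetical names invented for illustration. The point it tries to show is that the reading side does file IO and decompression only, and hands out (header, decompressed block) pairs whose contents are decoded later by whoever consumes them.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Toy container layout (NOT the Avro file format): a UTF header string, a block
// count, then per block: compressed length, raw length, compressed bytes.
class RawBlockSketch {

    /** Hypothetical pair: file header as key, decompressed block as value. */
    static final class RawBlock {
        final String header;   // stands in for the container file header
        final byte[] payload;  // decompressed, but still encoded, block bytes
        RawBlock(String header, byte[] payload) { this.header = header; this.payload = payload; }
    }

    /** The "output format" side: compresses already-encoded blocks and writes them. */
    static byte[] writeContainer(String header, List<byte[]> blocks) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeUTF(header);
            out.writeInt(blocks.size());
            for (byte[] block : blocks) {
                byte[] compressed = deflate(block);
                out.writeInt(compressed.length);
                out.writeInt(block.length);
                out.write(compressed);
            }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) { throw new UncheckedIOException(e); }
    }

    /** The "input format" side: file IO and decompression only, no object decoding. */
    static List<RawBlock> readRawBlocks(byte[] file) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(file));
            String header = in.readUTF();
            int count = in.readInt();
            List<RawBlock> result = new ArrayList<>();
            for (int i = 0; i < count; i++) {
                byte[] compressed = new byte[in.readInt()];
                int rawLength = in.readInt();
                in.readFully(compressed);
                result.add(new RawBlock(header, inflate(compressed, rawLength)));
            }
            return result;
        } catch (IOException e) { throw new UncheckedIOException(e); }
    }

    static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[256];
        while (!deflater.finished()) {
            out.write(buffer, 0, deflater.deflate(buffer));
        }
        deflater.end();
        return out.toByteArray();
    }

    static byte[] inflate(byte[] compressed, int rawLength) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        byte[] raw = new byte[rawLength];
        try {
            int offset = 0;
            while (offset < rawLength && !inflater.finished()) {
                int n = inflater.inflate(raw, offset, rawLength - offset);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalStateException("truncated block");
                }
                offset += n;
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt block", e);
        } finally {
            inflater.end();
        }
        return raw;
    }

    public static void main(String[] args) {
        byte[] file = writeContainer("{\"type\":\"string\"}", Arrays.asList(
                "alpha".getBytes(StandardCharsets.UTF_8), "beta".getBytes(StandardCharsets.UTF_8)));
        for (RawBlock block : readRawBlocks(file)) {
            // Decoding is deferred to the consumer of the pair, e.g. after a shuffle.
            // Here a UTF-8 string conversion stands in for real Avro decoding.
            System.out.println(block.header + " -> " + new String(block.payload, StandardCharsets.UTF_8));
        }
    }
}
```

In a real job the pairs would presumably be repartitioned before any decoding happens, so that the decode work spreads across all workers rather than only those holding the input splits, and the still-encoded bytes could be shuffled without a re-encode step.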