Marius Posta created AVRO-1976:
----------------------------------

             Summary: Add Input/OutputFormat to read/write encoded objects
                 Key: AVRO-1976
                 URL: https://issues.apache.org/jira/browse/AVRO-1976
             Project: Avro
          Issue Type: Improvement
          Components: java
         Environment: hadoop
            Reporter: Marius Posta
            Priority: Minor

In certain cases, the performance of Avro map-reduce jobs can be improved considerably by decoupling Avro encoding and decoding from the actual Avro container file I/O.

In my case, a complex schema (100+ record fields) and large HDFS blocks resulted in Spark jobs in which many workers sat idle while a couple of them were busy decoding their input splits. Furthermore, the objects then had to be re-encoded in order to be shuffled, which hurt performance further.

I propose adding an InputFormat that reads a container file input split as key-value pairs, in which the key is the file header and the value is a decompressed file data block, along with an OutputFormat that follows the same logic for writing.
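To illustrate the read side of the proposal, here is a minimal sketch of pulling (header, decompressed block) pairs out of a container file without decoding any records. It is not the proposed InputFormat itself and uses no Hadoop or Avro classes: the class name RawAvroBlockReader and all of its members are hypothetical, and the layout it assumes (magic bytes, metadata map, 16-byte sync marker, then blocks of record count, byte size, data, sync marker) follows the Avro object container file specification. Only the "null" and "deflate" codecs are handled here.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

/** Hypothetical sketch: read container file blocks as raw bytes, leaving record decoding for later. */
public class RawAvroBlockReader {
    private static final byte[] MAGIC = {'O', 'b', 'j', 1};

    /** File header (the proposed key): metadata map (schema, codec) plus the sync marker. */
    public static final class Header {
        public final Map<String, byte[]> meta = new LinkedHashMap<>();
        public final byte[] sync = new byte[16];
    }

    /** One data block (the proposed value): record count and decompressed, still-encoded records. */
    public static final class Block {
        public final long count;
        public final byte[] data;
        Block(long count, byte[] data) { this.count = count; this.data = data; }
    }

    /** Reads an Avro long (variable-length, zig-zag encoded). */
    public static long readLong(InputStream in) throws IOException {
        return readLong(in, in.read());
    }

    private static long readLong(InputStream in, int first) throws IOException {
        long raw = 0;
        int b = first;
        for (int shift = 0; ; shift += 7, b = in.read()) {
            if (b < 0) throw new EOFException("truncated long");
            if (shift > 63) throw new IOException("malformed long");
            raw |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) break;
        }
        return (raw >>> 1) ^ -(raw & 1); // undo zig-zag encoding
    }

    /** Reads Avro bytes: a long length followed by that many bytes. */
    public static byte[] readBytes(DataInputStream in) throws IOException {
        byte[] buf = new byte[Math.toIntExact(readLong(in))];
        in.readFully(buf);
        return buf;
    }

    public static Header readHeader(DataInputStream in) throws IOException {
        byte[] magic = new byte[4];
        in.readFully(magic);
        if (!Arrays.equals(magic, MAGIC)) throw new IOException("not an Avro container file");
        Header h = new Header();
        // The metadata is an Avro map: a series of counted blocks ended by a zero count.
        for (long n = readLong(in); n != 0; n = readLong(in)) {
            if (n < 0) { n = -n; readLong(in); } // a negative count is followed by a byte size
            for (long i = 0; i < n; i++) {
                String key = new String(readBytes(in), StandardCharsets.UTF_8);
                h.meta.put(key, readBytes(in));
            }
        }
        in.readFully(h.sync);
        return h;
    }

    /** Returns the next block, or null at a clean end of file. */
    public static Block readBlock(DataInputStream in, Header h) throws IOException {
        int first = in.read();
        if (first < 0) return null;
        long count = readLong(in, first);
        byte[] data = readBytes(in);
        byte[] sync = new byte[16];
        in.readFully(sync);
        if (!Arrays.equals(sync, h.sync)) throw new IOException("sync marker mismatch");
        byte[] codec = h.meta.get("avro.codec");
        String name = codec == null ? "null" : new String(codec, StandardCharsets.UTF_8);
        if (name.equals("deflate")) data = inflate(data);
        else if (!name.equals("null")) throw new IOException("unsupported codec: " + name);
        return new Block(count, data);
    }

    /** Decompresses a raw deflate stream (no zlib header), as used by the "deflate" codec. */
    public static byte[] inflate(byte[] compressed) throws IOException {
        Inflater inflater = new Inflater(true);
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && !inflater.finished()) throw new IOException("truncated deflate data");
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IOException(e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }
}
```

A record reader built along these lines would call readHeader once per file and then emit one pair per readBlock call, so that the expensive per-record decoding can happen after the shuffle rather than in the few tasks that own the input splits. Seeking to the first sync marker inside a split is left out of this sketch.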