Marius Posta created AVRO-1976:
----------------------------------

             Summary: Add Input/OutputFormat to read/write encoded objects
                 Key: AVRO-1976
                 URL: https://issues.apache.org/jira/browse/AVRO-1976
             Project: Avro
          Issue Type: Improvement
          Components: java
         Environment: hadoop
            Reporter: Marius Posta
            Priority: Minor


In certain cases, the performance of some Avro MapReduce jobs can be 
considerably improved by decoupling Avro encoding and decoding from the actual 
Avro container file IO.

In my case, a complex schema (100+ record fields) and large HDFS blocks 
resulted in Spark jobs where many workers sat idle while a couple of them were 
busy decoding their input splits. Furthermore, the objects then needed to be 
re-encoded in order to be shuffled, which hurt performance even further.

I propose the addition of an InputFormat which reads a container file input 
split as key-value pairs in which the key is the file header and the value is 
the decompressed file data block. I also propose a matching OutputFormat which 
follows the same logic for writing.
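
To illustrate the idea, here is a rough sketch of the block-level framing such 
an InputFormat/OutputFormat pair could build on, written in plain Java with no 
Avro or Hadoop dependency. It follows the object container file layout from 
the Avro specification (magic bytes, metadata map, 16-byte sync marker, then 
blocks of object count, byte size, data and sync marker). The class and method 
names (RawAvroBlocks, readHeader, readBlocks, writeBlock) are hypothetical and 
not part of Avro. The sketch assumes the whole file is already in memory and 
leaves block data exactly as stored, so blocks written with a codec other than 
"null" would still need to be decompressed.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper, not part of Avro: parses block framing without decoding records. */
final class RawAvroBlocks {
    static final byte[] MAGIC = {'O', 'b', 'j', 1};
    static final int SYNC_SIZE = 16;

    /** File header: metadata map (schema, codec) plus the 16-byte sync marker. */
    static final class Header {
        final Map<String, byte[]> meta = new LinkedHashMap<>();
        final byte[] sync = new byte[SYNC_SIZE];
    }

    /** One data block, still encoded; only its framing has been parsed. */
    static final class Block {
        final long objectCount;
        final byte[] data;
        Block(long objectCount, byte[] data) { this.objectCount = objectCount; this.data = data; }
    }

    /** Reads an Avro long (variable-length, zig-zag encoded). */
    static long readLong(ByteBuffer in) {
        long raw = 0;
        int shift = 0;
        byte b;
        do {
            b = in.get();
            raw |= (long) (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return (raw >>> 1) ^ -(raw & 1);
    }

    /** Writes an Avro long (variable-length, zig-zag encoded). */
    static void writeLong(ByteArrayOutputStream out, long value) {
        long raw = (value << 1) ^ (value >> 63);
        while ((raw & ~0x7FL) != 0) {
            out.write((int) ((raw & 0x7F) | 0x80));
            raw >>>= 7;
        }
        out.write((int) raw);
    }

    /** Reads a length-prefixed byte sequence. */
    static byte[] readBytes(ByteBuffer in) {
        byte[] bytes = new byte[(int) readLong(in)];
        in.get(bytes);
        return bytes;
    }

    /** Parses the file header; this would become the key of each pair. */
    static Header readHeader(ByteBuffer in) {
        byte[] magic = new byte[MAGIC.length];
        in.get(magic);
        if (!Arrays.equals(magic, MAGIC)) throw new IllegalArgumentException("not an Avro container file");
        Header header = new Header();
        for (long n = readLong(in); n != 0; n = readLong(in)) {
            if (n < 0) { n = -n; readLong(in); } // a negative count is followed by a byte size
            for (long i = 0; i < n; i++) {
                String key = new String(readBytes(in), StandardCharsets.UTF_8);
                header.meta.put(key, readBytes(in));
            }
        }
        in.get(header.sync);
        return header;
    }

    /** Returns the remaining data blocks as raw bytes; these would become the values. */
    static List<Block> readBlocks(ByteBuffer in, Header header) {
        List<Block> blocks = new ArrayList<>();
        byte[] sync = new byte[SYNC_SIZE];
        while (in.hasRemaining()) {
            long count = readLong(in);
            byte[] data = readBytes(in);
            in.get(sync);
            if (!Arrays.equals(sync, header.sync)) throw new IllegalStateException("sync marker mismatch");
            blocks.add(new Block(count, data));
        }
        return blocks;
    }

    /** Writing side: re-frames an already-encoded block without touching its records. */
    static void writeBlock(ByteArrayOutputStream out, Block block, byte[] sync) {
        writeLong(out, block.objectCount);
        writeLong(out, block.data.length);
        out.write(block.data, 0, block.data.length);
        out.write(sync, 0, sync.length);
    }
}
```

With something along these lines, the reading task only does cheap framing 
work, and the expensive record decoding can be spread over all workers after 
the raw blocks have been redistributed.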



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
