[ 
https://issues.apache.org/jira/browse/AVRO-1976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15795859#comment-15795859
 ] 

ASF GitHub Bot commented on AVRO-1976:
--------------------------------------

GitHub user postamar opened a pull request:

    https://github.com/apache/avro/pull/182

    AVRO-1976: Add Input/OutputFormat to read/write encoded objects

    `AvroEncodedInputFormat` reads a container file input split as key-value
    pairs in which the key is the file header and the value is the decompressed
    file data block. `AvroEncodedOutputFormat` follows the same logic for writing.
    See `TestAvroEncodedInputAndOutputFormats` for usage examples.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/postamar/avro AVRO-1976

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/avro/pull/182.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #182
    
----
commit 74d904b463fd7ae6acc1998350de437ea2aa8a83
Author: Marius Posta <[email protected]>
Date:   2017-01-03T18:52:52Z

    AVRO-1976: Add Input/OutputFormat to read/write encoded objects

----


> Add Input/OutputFormat to read/write encoded objects
> ----------------------------------------------------
>
>                 Key: AVRO-1976
>                 URL: https://issues.apache.org/jira/browse/AVRO-1976
>             Project: Avro
>          Issue Type: Improvement
>          Components: java
>         Environment: hadoop
>            Reporter: Marius Posta
>            Priority: Minor
>              Labels: newbie
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> In certain cases, performance of some Avro map-reduce jobs can be 
> considerably improved by de-coupling Avro encoding from actual Avro container 
> file IO.
> In my case, a complex schema (100+ record fields) and large HDFS blocks 
> resulted in Spark jobs where a lot of workers were idling while a couple of 
> them were busy decoding their input splits. Furthermore, the objects then 
> needed to be re-encoded in order to be shuffled about, which crippled 
> performance further.
> I propose the addition of an InputFormat which reads a container file input 
> split as key-value pairs in which the key is the file header and the value is 
> the decompressed file data block. Also, an OutputFormat which follows the 
> same logic for writing.
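
To make the proposal concrete, here is a minimal JDK-only sketch of the reading
side. It is not the code from the pull request: the class name
`RawContainerReader` and its fields are hypothetical, it reads a whole stream
rather than a Hadoop input split, and it assumes only the `null` and `deflate`
codecs. It follows the object container file layout in the Avro specification
and returns the header plus each data block as decompressed bytes, without
decoding any records.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.zip.Inflater;
import java.util.zip.InflaterInputStream;

/** Hypothetical illustration; not the classes added by the pull request. */
public class RawContainerReader {
    /** Key: the file header (metadata map plus 16-byte sync marker). */
    public final Map<String, byte[]> meta = new LinkedHashMap<>();
    public final byte[] sync = new byte[16];
    /** Values: one decompressed, still-encoded byte array per data block. */
    public final List<byte[]> blocks = new ArrayList<>();
    /** Number of records in each block, parallel to blocks. */
    public final List<Long> counts = new ArrayList<>();

    /** Reads an Avro long: zigzag-encoded, variable-length, low bits first. */
    static long readLong(InputStream in) throws IOException {
        long n = 0;
        int shift = 0, b;
        do {
            b = in.read();
            if (b < 0) throw new EOFException("truncated long");
            n |= (long) (b & 0x7f) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return (n >>> 1) ^ -(n & 1);
    }

    /** Reads a length-prefixed byte sequence (Avro "bytes" and "string"). */
    static byte[] readBytes(DataInputStream in) throws IOException {
        byte[] buf = new byte[(int) readLong(in)];
        in.readFully(buf);
        return buf;
    }

    /** Inflates raw DEFLATE data (no zlib header), as the "deflate" codec stores it. */
    static byte[] inflate(byte[] data) throws IOException {
        Inflater inf = new Inflater(true);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (InputStream in = new InflaterInputStream(new ByteArrayInputStream(data), inf)) {
            byte[] buf = new byte[8192];
            for (int n; (n = in.read(buf)) != -1; ) out.write(buf, 0, n);
        } finally {
            inf.end();
        }
        return out.toByteArray();
    }

    public static RawContainerReader read(InputStream raw) throws IOException {
        PushbackInputStream pb = new PushbackInputStream(raw);
        DataInputStream in = new DataInputStream(pb);
        RawContainerReader r = new RawContainerReader();
        byte[] magic = new byte[4];
        in.readFully(magic);
        if (!Arrays.equals(magic, new byte[] {'O', 'b', 'j', 1}))
            throw new IOException("not an Avro container file");
        // Metadata is an Avro map: blocks of key/value pairs, ended by a zero count.
        for (long n = readLong(in); n != 0; n = readLong(in)) {
            if (n < 0) { n = -n; readLong(in); } // negative count: a byte size follows
            for (long i = 0; i < n; i++) {
                String key = new String(readBytes(in), StandardCharsets.UTF_8);
                r.meta.put(key, readBytes(in));
            }
        }
        in.readFully(r.sync);
        byte[] c = r.meta.get("avro.codec");
        String codec = c == null ? "null" : new String(c, StandardCharsets.UTF_8);
        byte[] marker = new byte[16];
        // Each block: record count, byte size, payload, then the sync marker again.
        for (int first; (first = pb.read()) >= 0; ) {
            pb.unread(first);
            long count = readLong(in);
            byte[] data = readBytes(in);
            in.readFully(marker);
            if (!Arrays.equals(marker, r.sync)) throw new IOException("bad sync marker");
            if (codec.equals("deflate")) data = inflate(data);
            else if (!codec.equals("null")) throw new IOException("unsupported codec: " + codec);
            r.counts.add(count);
            r.blocks.add(data);
        }
        return r;
    }
}
```

With something like this, the header (which carries `avro.schema`) and the raw
block bytes can be passed around as opaque values, and record decoding can be
deferred to whichever worker actually needs the objects.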



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
