[ 
https://issues.apache.org/jira/browse/AVRO-986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13174936#comment-13174936
 ] 

Doug Cutting commented on AVRO-986:
-----------------------------------

+1 This patch sounds like the right way to fix this to me.

If we were to instead fix this in Java then I don't think we should try to make 
the splitter smarter, since splitting is single-threaded and that's not 
scalable.  Rather we should make sync(0) skip over the metadata.  But there 
probably shouldn't be any sync markers in the metadata anyway...
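
A minimal sketch of the "make sync(0) skip over the metadata" idea, assuming the reader already knows where the header metadata ends. The class name, the `metadataEnd` parameter, and the 128-byte sample layout are all hypothetical and are not taken from the Avro code base; the only detail borrowed from the report is a stray marker at offset 32.

```java
public class HeaderAwareSyncSketch {
    // Offset of the first occurrence of marker at or after 'from', or -1.
    static int indexOf(byte[] data, byte[] marker, int from) {
        outer:
        for (int i = Math.max(from, 0); i + marker.length <= data.length; i++) {
            for (int j = 0; j < marker.length; j++) {
                if (data[i + j] != marker[j]) continue outer;
            }
            return i;
        }
        return -1;
    }

    // Position just past the first marker found at or after
    // max(position, metadataEnd), or -1 if there is none.
    // 'metadataEnd' is a hypothetical value the reader would have to track.
    static int sync(byte[] file, byte[] marker, int position, int metadataEnd) {
        int at = indexOf(file, marker, Math.max(position, metadataEnd));
        return at < 0 ? -1 : at + marker.length;
    }

    // Hypothetical 16-byte marker with values 1..16.
    static byte[] sampleMarker() {
        byte[] m = new byte[16];
        for (int i = 0; i < m.length; i++) m[i] = (byte) (i + 1);
        return m;
    }

    // Hypothetical file: a stray marker copy at offset 32 (inside the
    // metadata) and the real header-terminating marker at offset 80.
    static byte[] sampleFile() {
        byte[] file = new byte[128];
        byte[] m = sampleMarker();
        System.arraycopy(m, 0, file, 32, m.length);
        System.arraycopy(m, 0, file, 80, m.length);
        return file;
    }

    public static void main(String[] args) {
        byte[] file = sampleFile();
        byte[] marker = sampleMarker();
        // Naive scan from 0 stops after the metadata copy.
        System.out.println(sync(file, marker, 0, 0));   // prints 48
        // Skipping the metadata (assumed to end at 64) lands after the real marker.
        System.out.println(sync(file, marker, 0, 64));  // prints 96
    }
}
```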
                
> Avro files generated from avro-c don't work with the Java mapred 
> implementation.
> -------------------------------------------------------------------------------
>
>                 Key: AVRO-986
>                 URL: https://issues.apache.org/jira/browse/AVRO-986
>             Project: Avro
>          Issue Type: Bug
>          Components: c, java
>         Environment: avro-c 1.6.2-SNAPSHOT
> avro-java 1.6.2-SNAPSHOT
> hadoop 0.20.2
>            Reporter: Michael Cooper
>            Priority: Critical
>              Labels: c, hadoop, java, mapreduce
>         Attachments: 0001-Remove-sync-marker-from-metadata-in-header.patch
>
>
> When a file generated from the Avro-C implementation is fed into Hadoop, it 
> will fail with "Block size invalid or too large for this implementation: -49".
> This is caused by the sync marker, namely the one that Avro-C puts into the 
> header...
> The org.apache.avro.mapred.AvroRecordReader uses a FileSplit object to work 
> out where it should read from, but this class is not particularly smart: it 
> just divides the file into equal-size chunks, the first beginning at 
> position 0.
> So org.apache.avro.mapred.AvroRecordReader gets 0 as the start of its chunk, 
> and calls
> {code:title=AvroRecordReader.java}
> reader.sync(split.getStart());   // sync to start
> {code}
> Then org.apache.avro.file.DataFileReader::seek() goes to 0 and searches 
> for a sync marker.
> It encounters one at position 32: the one in the header metadata map, 
> "avro.sync".
> No other implementations add the sync marker in the metadata map, and none 
> read it from there, not even the C version.
> I suggest we remove this from the header as the simplest solution.
> Another solution would be to create an AvroFileSplit class in mapred that 
> knows where the blocks are, and provides the correct locations in the first 
> place.
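
The failure described above can be reproduced in miniature without Avro. The following is a rough sketch under stated assumptions: the class name, the 128-byte layout, and the marker bytes are hypothetical, and only the offset 32 for the metadata copy comes from the report. It shows that a scan starting at position 0 stops at the metadata copy of the marker rather than at the marker that actually ends the header.

```java
import java.util.Arrays;

public class SyncScanSketch {
    static final int SYNC_SIZE = 16;

    // Offset of the first occurrence of marker at or after 'from', or -1.
    static int findSync(byte[] file, byte[] marker, int from) {
        for (int i = Math.max(from, 0); i + marker.length <= file.length; i++) {
            if (Arrays.equals(Arrays.copyOfRange(file, i, i + marker.length), marker)) {
                return i;
            }
        }
        return -1;
    }

    // Hypothetical 16-byte sync marker (0xA0..0xAF).
    static byte[] marker() {
        byte[] m = new byte[SYNC_SIZE];
        for (int i = 0; i < SYNC_SIZE; i++) m[i] = (byte) (0xA0 + i);
        return m;
    }

    // Hypothetical zero-filled file with the real header-terminating marker
    // at offset 80 and, optionally, a copy inside the metadata at offset 32.
    static byte[] fakeFile(boolean markerInMetadata) {
        byte[] file = new byte[128];
        byte[] m = marker();
        if (markerInMetadata) System.arraycopy(m, 0, file, 32, SYNC_SIZE);
        System.arraycopy(m, 0, file, 80, SYNC_SIZE);
        return file;
    }

    public static void main(String[] args) {
        // With the marker duplicated in the metadata, the scan stops too early.
        System.out.println(findSync(fakeFile(true), marker(), 0));   // prints 32
        // With the metadata copy removed, the scan reaches the real marker.
        System.out.println(findSync(fakeFile(false), marker(), 0));  // prints 80
    }
}
```

Whatever follows the early match is then interpreted as block data, which is consistent with the reader reporting a nonsensical block size.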
