[
https://issues.apache.org/jira/browse/AVRO-986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13174936#comment-13174936
]
Doug Cutting commented on AVRO-986:
-----------------------------------
+1 This patch sounds like the right way to fix this to me.
If we were to instead fix this in Java then I don't think we should try to make
the splitter smarter, since splitting is single-threaded and that's not
scalable. Rather we should make sync(0) skip over the metadata. But there
probably shouldn't be any sync markers in the metadata anyway...
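
The idea of making sync(0) skip the metadata can be sketched as follows. This is a toy model over an in-memory byte array, not the actual DataFileReader code: the class HeaderAwareSyncSketch, its methods, and the headerEnd parameter are hypothetical names, and the sketch assumes the reader already knows where the header ends because it parsed the header when the file was opened.

```java
class HeaderAwareSyncSketch {
    /** Returns the index of the first occurrence of pattern at or after from, or -1. */
    static int indexOf(byte[] data, byte[] pattern, int from) {
        for (int i = Math.max(from, 0); i + pattern.length <= data.length; i++) {
            boolean match = true;
            for (int j = 0; j < pattern.length && match; j++) {
                match = data[i + j] == pattern[j];
            }
            if (match) {
                return i;
            }
        }
        return -1;
    }

    /** Naive sync: scan forward from position and stop just past the first marker found. */
    static int naiveSync(byte[] file, byte[] marker, int position) {
        int at = indexOf(file, marker, position);
        // Toy choice: treat "no marker found" as end of file.
        return at < 0 ? file.length : at + marker.length;
    }

    /**
     * Header-aware sync: a position inside the header is clamped to the first
     * block, so marker-like bytes in the metadata are never examined.
     * headerEnd is assumed to be known from parsing the header (hypothetical parameter).
     */
    static int headerAwareSync(byte[] file, byte[] marker, int headerEnd, int position) {
        if (position < headerEnd) {
            return headerEnd;
        }
        return naiveSync(file, marker, position);
    }

    public static void main(String[] args) {
        byte[] marker = new byte[16];
        for (int i = 0; i < marker.length; i++) {
            marker[i] = (byte) (0xA0 + i);
        }
        // Toy layout: 4 filler bytes, marker copy in "metadata", header sync, 5 block bytes, sync.
        byte[] file = new byte[4 + 16 + 16 + 5 + 16];
        System.arraycopy(marker, 0, file, 4, 16);
        System.arraycopy(marker, 0, file, 20, 16);
        System.arraycopy(marker, 0, file, 41, 16);
        int headerEnd = 36;
        System.out.println("naive sync(0):        " + naiveSync(file, marker, 0));
        System.out.println("header-aware sync(0): " + headerAwareSync(file, marker, headerEnd, 0));
    }
}
```

The work stays in each reader rather than in the single-threaded splitter, which is the scalability point made above.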
> Avro files generated from avro-c don't work with the Java mapred
> implementation.
> -------------------------------------------------------------------------------
>
> Key: AVRO-986
> URL: https://issues.apache.org/jira/browse/AVRO-986
> Project: Avro
> Issue Type: Bug
> Components: c, java
> Environment: avro-c 1.6.2-SNAPSHOT
> avro-java 1.6.2-SNAPSHOT
> hadoop 0.20.2
> Reporter: Michael Cooper
> Priority: Critical
> Labels: c, hadoop, java, mapreduce
> Attachments: 0001-Remove-sync-marker-from-metadata-in-header.patch
>
>
> When a file generated from the Avro-C implementation is fed into Hadoop, it
> will fail with "Block size invalid or too large for this implementation: -49".
> This is caused by a sync marker, namely the copy that Avro-C writes into the
> header metadata.
> The org.apache.avro.mapred.AvroRecordReader uses a FileSplit object to work
> out where it should read from, but this class is not particularly smart: it
> just divides the file up into equal-size chunks, the first starting at
> position 0.
> So org.apache.avro.mapred.AvroRecordReader gets 0 as the start of its chunk,
> and calls
> {code:title=AvroRecordReader.java}
> reader.sync(split.getStart()); // sync to start
> {code}
> Then org.apache.avro.file.DataFileReader::seek() goes to 0 and searches
> for a sync marker.
> It encounters one at position 32: the copy stored in the header metadata map
> under "avro.sync".
> No other implementations add the sync marker in the metadata map, and none
> read it from there, not even the C version.
> I suggest we remove this from the header as the simplest solution.
> Another solution would be to create an AvroFileSplit class in mapred that
> knows where the blocks are, and provides the correct locations in the first
> place.
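
The failure described in the issue can be reproduced in miniature. The sketch below is a simplified model, not the real Avro container encoding: SyncScanSketch, toyMarker, toyHeader, toyFile and scanPastSync are hypothetical names, and the metadata is written as plain ASCII rather than as an Avro map. It only shows that a forward scan from position 0 stops inside the header when the metadata contains a copy of the sync marker.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class SyncScanSketch {
    /** A 16-byte toy marker (0xA0..0xAF) that cannot collide with the ASCII filler. */
    static byte[] toyMarker() {
        byte[] m = new byte[16];
        for (int i = 0; i < m.length; i++) {
            m[i] = (byte) (0xA0 + i);
        }
        return m;
    }

    private static void put(ByteArrayOutputStream out, byte[] b) {
        out.write(b, 0, b.length);
    }

    /** Toy header: magic, ASCII "metadata" (optionally with a marker copy), header sync. */
    static byte[] toyHeader(byte[] marker, boolean markerInMetadata) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        put(out, new byte[] {'O', 'b', 'j', 1});
        put(out, "avro.schema\"string\"".getBytes(StandardCharsets.US_ASCII));
        if (markerInMetadata) {
            // Models the extra "avro.sync" metadata entry described in the issue.
            put(out, "avro.sync".getBytes(StandardCharsets.US_ASCII));
            put(out, marker);
        }
        put(out, marker); // the header's own sync marker
        return out.toByteArray();
    }

    /** Toy file: header, one opaque block, trailing sync marker. */
    static byte[] toyFile(byte[] marker, boolean markerInMetadata) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        put(out, toyHeader(marker, markerInMetadata));
        put(out, "block-data".getBytes(StandardCharsets.US_ASCII));
        put(out, marker);
        return out.toByteArray();
    }

    /** Returns the offset just past the first marker at or after start, or -1 if none. */
    static int scanPastSync(byte[] file, byte[] marker, int start) {
        for (int i = start; i + marker.length <= file.length; i++) {
            if (Arrays.equals(Arrays.copyOfRange(file, i, i + marker.length), marker)) {
                return i + marker.length;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] marker = toyMarker();
        for (boolean inMeta : new boolean[] {false, true}) {
            int headerEnd = toyHeader(marker, inMeta).length;
            int landed = scanPastSync(toyFile(marker, inMeta), marker, 0);
            System.out.println("markerInMetadata=" + inMeta
                + " headerEnd=" + headerEnd + " scanLandsAt=" + landed);
        }
    }
}
```

With the marker copy in the metadata, the scan lands before the end of the header, so whatever follows is misread as a block; without it, the scan lands exactly at the first block. That is why removing the entry from the header is the simplest fix.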