Avro files generated from avro-c don't work with the Java mapred implementation.
--------------------------------------------------------------------------------

                 Key: AVRO-986
                 URL: https://issues.apache.org/jira/browse/AVRO-986
             Project: Avro
          Issue Type: Bug
          Components: c, java
         Environment: avro-c 1.6.2-SNAPSHOT
avro-java 1.6.2-SNAPSHOT
hadoop 0.20.2
            Reporter: Michael Cooper
            Priority: Critical


When a file generated by the Avro-C implementation is fed into Hadoop, the job 
fails with "Block size invalid or too large for this implementation: -49".

This is caused by a sync marker, specifically the extra copy that Avro-C writes 
into the file header's metadata map.
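
A quick way to confirm the extra entry is to dump it with something like the 
sketch below (InspectSyncMeta is a hypothetical helper name, not part of Avro; 
getMeta() simply returns null when the key is absent):
{code:title=InspectSyncMeta.java}
import java.io.File;

import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;

/**
 * Hypothetical helper (not part of Avro): reports whether a container file
 * carries the non-standard "avro.sync" metadata entry written by avro-c.
 */
public class InspectSyncMeta {
  public static void main(String[] args) throws Exception {
    DataFileReader<Object> reader = new DataFileReader<Object>(
        new File(args[0]), new GenericDatumReader<Object>());
    byte[] sync = reader.getMeta("avro.sync");  // null when the key is absent
    System.out.println(sync == null
        ? "no avro.sync metadata entry"
        : "avro.sync metadata entry present (" + sync.length + " bytes)");
    reader.close();
  }
}
{code}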

The org.apache.avro.mapred.AvroRecordReader uses a FileSplit object to work out 
where it should read from, but FileSplit is not format-aware: it simply divides 
the file into equal-sized chunks, the first of which starts at position 0.
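
Roughly, the split computation behaves like the sketch below (a simplification, 
not the actual hadoop 0.20 FileInputFormat source; SplitSketch and its 
parameters are illustrative only):
{code:title=SplitSketch.java}
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileSplit;

/**
 * Simplified sketch of how hadoop 0.20 carves an input file into FileSplits
 * (a paraphrase, not the actual FileInputFormat source): cuts fall at fixed
 * byte offsets with no knowledge of Avro block boundaries, and the first
 * split always starts at position 0.
 */
public class SplitSketch {
  static List<FileSplit> splitsFor(Path file, long fileLength, long splitSize) {
    List<FileSplit> splits = new ArrayList<FileSplit>();
    for (long offset = 0; offset < fileLength; offset += splitSize) {
      long length = Math.min(splitSize, fileLength - offset);
      splits.add(new FileSplit(file, offset, length, new String[0]));
    }
    return splits;
  }
}
{code}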

So org.apache.avro.mapred.AvroRecordReader gets 0 as the start of its chunk and 
calls:
{code:title=AvroRecordReader.java}
reader.sync(split.getStart());   // sync to start
{code}
Then org.apache.avro.file.DataFileReader seeks to position 0 and scans forward 
for a sync marker. It encounters one at position 32: the copy stored in the 
header metadata map under the key "avro.sync". The reader resumes just past 
this false marker and treats the bytes that follow as a data-block header, 
which is what produces the bogus block size of -49.

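The scan behaves roughly like the sketch below (a paraphrase of the general 
algorithm, not the exact DataFileReader source; SyncScanSketch is an 
illustrative name):
{code:title=SyncScanSketch.java}
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Arrays;

/**
 * Paraphrase of the forward scan performed when syncing (a sketch, not the
 * exact DataFileReader source): slide a 16-byte window from the given start
 * position until it matches the file's sync marker.  Per the spec the header
 * is magic ("Obj" 0x01), then the metadata map, then the 16-byte sync marker,
 * so a copy of the marker embedded inside the metadata map is matched first.
 */
public class SyncScanSketch {
  static long findSync(RandomAccessFile in, long start, byte[] syncMarker)
      throws IOException {
    byte[] window = new byte[syncMarker.length];
    for (long pos = start; pos + window.length <= in.length(); pos++) {
      in.seek(pos);
      in.readFully(window);
      if (Arrays.equals(window, syncMarker)) {
        return pos + window.length;  // the reader resumes just past the marker
      }
    }
    return -1;                       // no marker found before end of file
  }
}
{code}
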
No other implementation writes the sync marker into the metadata map, and none 
reads it from there, not even the C implementation itself.

I suggest we remove this entry from the header as the simplest solution.
Another solution would be to create an AvroFileSplit class in mapred that knows 
where the Avro blocks are and provides the correct start positions in the first 
place; a minimal sketch of that idea follows.
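
No such class exists today; the following shows only the shape it might take 
(the name AvroFileSplit, the blockStart field, and getBlockStart() are all 
hypothetical):
{code:title=AvroFileSplit.java}
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileSplit;

/**
 * Hypothetical class, sketched only: an input format would compute the real
 * Avro block boundaries up front and record them here, so AvroRecordReader
 * could start at an exact block offset instead of calling reader.sync() and
 * risking a false match on the marker copy in the header metadata.
 * (Writable serialization of blockStart is omitted for brevity.)
 */
public class AvroFileSplit extends FileSplit {
  private final long blockStart;  // first Avro block boundary at or after
                                  // the raw byte-offset start of this split

  public AvroFileSplit(Path file, long start, long length, String[] hosts,
                       long blockStart) {
    super(file, start, length, hosts);
    this.blockStart = blockStart;
  }

  /** Where the record reader should begin, instead of reader.sync(getStart()). */
  public long getBlockStart() {
    return blockStart;
  }
}
{code}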
