Corby Wilson created MAPREDUCE-6127:
---------------------------------------

             Summary: SequenceFile crashes with encrypted files
                 Key: MAPREDUCE-6127
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6127
             Project: Hadoop Map/Reduce
          Issue Type: Bug
    Affects Versions: 2.2.0
         Environment: Amazon EMR 3.0.4
            Reporter: Corby Wilson


Encrypted files are often padded to allow for proper encryption on a 2^n-bit 
boundary.  As a result, the encrypted file might be a few bytes bigger than the 
unencrypted file.

We have a case where an encrypted file is 2 bytes bigger due to padding.

When we run a Hive job on the file to get a record count (select count(*) from 
<table>), it runs org.apache.hadoop.mapred.SequenceFileRecordReader and loads 
the file through a custom FS InputStream.
The InputStream decrypts the file as it is read in. Splits are handled 
properly because the stream implements both Seekable and PositionedReadable.

When the org.apache.hadoop.io.SequenceFile class initializes, it reads the 
file size from the FileMetadata, which returns the size of the encrypted 
file on disk (or, in this case, in S3).
However, the actual decrypted file is 2 bytes smaller, so the InputStream 
returns EOF (-1) before the SequenceFile thinks it is done.
As a result, SequenceFile$Reader tries to run next->readRecordLength 
after the file has been closed, and we get a crash.

The SequenceFile class SHOULD instead pay attention to the EOF marker from 
the stream, rather than the file size reported in the metadata, and set the 
'more' flag accordingly.
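The proposed behavior can be sketched as follows. This is a hypothetical, simplified illustration (not the actual SequenceFile code; the class name and method are invented for the example): the reader loop trusts the stream's EOF rather than a byte count derived from metadata, so trailing encryption padding cannot push it past the end of the real payload.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;

// Hypothetical sketch: stop on the stream's EOF instead of trusting the
// on-disk length reported by file metadata.
public class EofAwareReader {

    // Reads 4-byte record-length headers (the readRecordLength step) until
    // the stream reports EOF. Returns how many headers were read.
    static int countRecordLengths(DataInputStream in) throws IOException {
        int records = 0;
        while (true) {
            try {
                in.readInt();           // analogous to next->readRecordLength
                records++;
            } catch (EOFException eof) {
                // Trust the stream: padding made the reported file size
                // larger than the decrypted payload, so EOF arrives early.
                // This is where the 'more' flag would be set to false
                // instead of letting the exception escape as a crash.
                return records;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Simulate a decrypted payload 2 bytes shorter than the
        // metadata-reported size: two full ints plus 2 trailing pad bytes.
        byte[] payload = {0, 0, 0, 5, 0, 0, 0, 7, 0, 0};
        int n = countRecordLengths(
                new DataInputStream(new ByteArrayInputStream(payload)));
        System.out.println(n); // prints 2
    }
}
```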
Sample stack dump:

2014-10-10 21:25:27,160 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.io.IOException: java.io.IOException: java.io.EOFException
        at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderNextException(HiveIOExceptionHandlerChain.java:121)
        at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderNextException(HiveIOExceptionHandlerUtil.java:77)
        at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.doNextWithExceptionHandler(HadoopShimsSecure.java:304)
        at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.next(HadoopShimsSecure.java:220)
        at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:199)
        at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:185)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:52)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:433)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:344)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
Caused by: java.io.IOException: java.io.EOFException
        at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderNextException(HiveIOExceptionHandlerChain.java:121)
        at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderNextException(HiveIOExceptionHandlerUtil.java:77)
        at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:276)
        at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.doNext(CombineHiveRecordReader.java:101)
        at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.doNext(CombineHiveRecordReader.java:41)
        at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.next(HiveContextAwareRecordReader.java:108)
        at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.doNextWithExceptionHandler(HadoopShimsSecure.java:302)
        ... 11 more
Caused by: java.io.EOFException
        at java.io.DataInputStream.readInt(DataInputStream.java:392)
        at org.apache.hadoop.io.SequenceFile$Reader.readRecordLength(SequenceFile.java:2332)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2363)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2500)
        at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:82)
        at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:274)
        ... 15 more
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)