[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14170979#comment-14170979
 ] 

Corby Wilson commented on MAPREDUCE-6127:
-----------------------------------------

Yes, this is in Amazon S3.  We are using the AWS SDK to do client-side 
encryption.

I've worked around the issue locally by forcing the 
'x-amz-meta-x-amz-unencrypted-content-length' metadata entry on file write and 
then checking for it on read.

The fact that the AWS SDK doesn't set this metadata itself appears to be a bug 
in the SDK; it should have set it for me.

Our custom InputStream implements PositionedReadable and Seekable, and I've 
overridden all relevant methods to return proper codes up to EOF, so 
{{seek(length - 1); read()}} will always work.
Since the stream is block-buffered, we make sure the stream is reset properly 
on a reverse seek.
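
The EOF contract described above can be sketched as follows. This is a 
hypothetical, byte-array-backed simplification (not the actual EMR client 
code): after {{seek(length - 1)}}, {{read()}} returns the final byte, and the 
next {{read()}} returns -1.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical sketch: a stream that reports EOF at the *unencrypted*
// length, the behavior the comment above says the custom stream provides.
class PaddedAwareStream extends InputStream {
    private final byte[] plaintext;  // already-decrypted content
    private long pos = 0;

    PaddedAwareStream(byte[] plaintext) {
        this.plaintext = plaintext;
    }

    // Analogous to Seekable.seek(long)
    public void seek(long newPos) throws IOException {
        if (newPos < 0 || newPos > plaintext.length) {
            throw new IOException("Cannot seek to " + newPos);
        }
        pos = newPos;
    }

    @Override
    public int read() {
        if (pos >= plaintext.length) {
            return -1;  // EOF at the unencrypted length, not the padded size
        }
        return plaintext[(int) pos++] & 0xff;
    }
}
```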

> SequenceFile crashes with encrypted files that are shorter than 
> FileSystem.getStatus(path)
> ------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6127
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6127
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 2.2.0
>         Environment: Amazon EMR 3.0.4
>            Reporter: Corby Wilson
>
> Encrypted files are often padded to allow for proper encryption on a 2^n-bit 
> boundary.  As a result, the encrypted file can be a few bytes bigger than 
> the unencrypted file.
> We have a case where an encrypted file is 2 bytes bigger due to padding.
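
The padding arithmetic behind the size mismatch can be illustrated with 
standard JDK crypto (this is an illustration of block-cipher padding in 
general, not the EMR code path): with AES in CBC mode and PKCS5 padding, 
ciphertext is rounded up to the next 16-byte block, so a 14-byte plaintext 
yields a 16-byte object in S3, 2 bytes bigger.

```java
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

// Illustration only: compute the ciphertext length for a given plaintext
// length under AES/CBC/PKCS5Padding (all-zero key and IV, for demo purposes).
class PaddingDemo {
    static int encryptedLength(int plaintextLength) throws Exception {
        byte[] key = new byte[16];  // demo key, never use a zero key in practice
        byte[] iv = new byte[16];
        Cipher c = Cipher.getInstance("AES/CBC/PKCS5Padding");
        c.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"),
               new IvParameterSpec(iv));
        return c.doFinal(new byte[plaintextLength]).length;
    }
}
```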
> When we run a Hive job on the file to get a record count (select count(*) 
> from <table>), it runs org.apache.hadoop.mapred.SequenceFileRecordReader and 
> loads the file through a custom FS InputStream.
> The InputStream decrypts the file as it is read in.  Splits are properly 
> handled as it implements both Seekable and PositionedReadable.
> When the org.apache.hadoop.io.SequenceFile class initializes, it reads the 
> file size from the file metadata, which reports the size of the encrypted 
> file on disk (or, in this case, in S3).
> However, the actual file size is 2 bytes less, so the InputStream will 
> return EOF (-1) before the SequenceFile thinks it's done.
> As a result, SequenceFile$Reader tries to run next->readRecordLength after 
> the file has been closed, and we get a crash.
> The SequenceFile class SHOULD instead pay attention to the EOF marker from 
> the stream, rather than the file size reported in the metadata, and set the 
> 'more' flag accordingly.
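
The fix proposed above — letting the stream's EOF, rather than the metadata 
length, decide whether more records remain — can be sketched like this. It is 
a hypothetical simplification, not the actual SequenceFile$Reader code:

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical reader loop: stop when the stream itself reports EOF,
// instead of looping until a byte count derived from file metadata.
class EofDrivenReader {
    static List<Integer> readLengths(DataInputStream in) throws IOException {
        List<Integer> lengths = new ArrayList<>();
        while (true) {
            int recordLength;
            try {
                recordLength = in.readInt();  // analogous to readRecordLength()
            } catch (EOFException eof) {
                break;  // 'more' = false: the stream, not the metadata, decides
            }
            in.skipBytes(recordLength);       // skip over the record body
            lengths.add(recordLength);
        }
        return lengths;
    }
}
```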
> Sample stack dump from crash
> 2014-10-10 21:25:27,160 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.io.IOException: java.io.IOException: java.io.EOFException
> 	at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderNextException(HiveIOExceptionHandlerChain.java:121)
> 	at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderNextException(HiveIOExceptionHandlerUtil.java:77)
> 	at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.doNextWithExceptionHandler(HadoopShimsSecure.java:304)
> 	at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.next(HadoopShimsSecure.java:220)
> 	at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:199)
> 	at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:185)
> 	at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:52)
> 	at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:433)
> 	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:344)
> 	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:415)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
> 	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
> Caused by: java.io.IOException: java.io.EOFException
> 	at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderNextException(HiveIOExceptionHandlerChain.java:121)
> 	at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderNextException(HiveIOExceptionHandlerUtil.java:77)
> 	at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:276)
> 	at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.doNext(CombineHiveRecordReader.java:101)
> 	at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.doNext(CombineHiveRecordReader.java:41)
> 	at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.next(HiveContextAwareRecordReader.java:108)
> 	at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.doNextWithExceptionHandler(HadoopShimsSecure.java:302)
> 	... 11 more
> Caused by: java.io.EOFException
> 	at java.io.DataInputStream.readInt(DataInputStream.java:392)
> 	at org.apache.hadoop.io.SequenceFile$Reader.readRecordLength(SequenceFile.java:2332)
> 	at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2363)
> 	at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2500)
> 	at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:82)
> 	at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:274)
> 	... 15 more



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
