Prasanth J created HIVE-6326:
--------------------------------

             Summary: Split generation in ORC may generate wrong split 
boundaries because of unaccounted padded bytes
                 Key: HIVE-6326
                 URL: https://issues.apache.org/jira/browse/HIVE-6326
             Project: Hive
          Issue Type: Bug
          Components: Serializers/Deserializers
    Affects Versions: 0.13.0
            Reporter: Prasanth J
            Assignee: Prasanth J


HIVE-5091 added padding to ORC files to keep ORC stripes from straddling HDFS
blocks. The length of these padding bytes is not stored in the stripe
information. OrcInputFormat.getSplits() uses stripeInformation.getLength() for
split computation. stripeInformation.getLength() is the sum of the index
length, data length, and stripe footer length. It does not account for the
padding bytes, which may result in wrong split boundaries.

The fix is to use the offset of the next stripe to determine where the current
stripe ends, so that the current stripe's length includes the padding bytes as
well.
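
A minimal sketch of that idea, not the actual patch: the Stripe class below is
a hypothetical stand-in that carries only an offset and the unpadded length
(index + data + footer), and paddedLengths() is an illustrative helper name.
It assumes padding only ever sits between two stripes, so the last stripe
falls back to its own length.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class StripeSpans {
    /** Hypothetical stand-in for a stripe's metadata; not the real ORC class. */
    static final class Stripe {
        final long offset;  // byte offset of the stripe in the file
        final long length;  // index + data + footer; excludes any padding

        Stripe(long offset, long length) {
            this.offset = offset;
            this.length = length;
        }
    }

    /**
     * For each stripe, returns the number of bytes it spans including any
     * padding that follows it, derived from the next stripe's offset.
     * Assumption: the last stripe has no trailing padding to account for.
     */
    static List<Long> paddedLengths(List<Stripe> stripes) {
        List<Long> result = new ArrayList<>();
        for (int i = 0; i < stripes.size(); i++) {
            Stripe current = stripes.get(i);
            if (i + 1 < stripes.size()) {
                // Next offset minus current offset covers stripe plus padding.
                result.add(stripes.get(i + 1).offset - current.offset);
            } else {
                result.add(current.length);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up numbers: the first stripe ends at byte 100, but the second
        // starts at byte 128, so 28 bytes of padding sit in between.
        List<Stripe> stripes = Arrays.asList(
            new Stripe(3, 97),
            new Stripe(128, 100));
        System.out.println(paddedLengths(stripes));
    }
}
```

With the unpadded length, a split covering the first stripe would end 28 bytes
short of where the second stripe actually begins; deriving the span from the
next offset closes that gap.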



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)