Jeff Plaisance created PIG-3655:
-----------------------------------

             Summary: BinStorage and InterStorage approach to record markers is broken
                 Key: PIG-3655
                 URL: https://issues.apache.org/jira/browse/PIG-3655
             Project: Pig
          Issue Type: Bug
    Affects Versions: 0.12.0
            Reporter: Jeff Plaisance


The record readers for these storage formats seek to the first record in an
input split by scanning for the byte sequence 1 2 3 110 for BinStorage, or
1 2 3 followed by a byte in the ranges 19-21, 28-30, or 36-45 for InterStorage.
If this sequence occurs in the data for any reason other than to mark the start
of a tuple (for example, the integer 16909166 stored big endian encodes to
exactly the BinStorage marker bytes), it can cause mysterious failures in pig
jobs because the record reader will try to decode garbage and fail.
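
To illustrate the collision (this is not Pig code, just a standalone check),
the 4-byte big endian encoding of 16909166 is 0x01 0x02 0x03 0x6E, i.e. the
bytes 1 2 3 110 that BinStorage treats as a record marker:

import java.nio.ByteBuffer;

public class MarkerCollision {
    public static void main(String[] args) {
        // 16909166 == 0x0102036E, so its big endian bytes are 1 2 3 110,
        // which is exactly the BinStorage record marker sequence.
        byte[] bytes = ByteBuffer.allocate(4).putInt(16909166).array();
        for (byte b : bytes) {
            System.out.print((b & 0xFF) + " ");   // prints: 1 2 3 110
        }
        System.out.println();
    }
}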

For this approach of using an unlikely sequence to mark record boundaries, it
is important to reduce the probability of the sequence occurring naturally in
the data by making the record marker sufficiently long. Hadoop SequenceFile
uses 128 bits for this and randomly generates the sequence for each file
(selecting a fixed, predetermined value opens up the possibility of an
adversary intentionally sending you that value). This makes it extremely
unlikely that collisions will occur. In the long run I think that pig should
also be doing this.
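
A rough sketch of what a per-file random marker could look like (hypothetical
helper, not actual Pig or SequenceFile code; the writer would record the marker
in the file header and emit it between records, and the reader would scan for
it when seeking into a split):

import java.security.SecureRandom;
import java.util.Arrays;

// Hypothetical 128-bit sync marker generated per file, similar in spirit to
// SequenceFile's sync marker.
public class SyncMarker {
    public static final int LENGTH = 16;   // 128 bits
    private final byte[] marker = new byte[LENGTH];

    public SyncMarker() {
        new SecureRandom().nextBytes(marker);   // random per file, not a fixed constant
    }

    public byte[] bytes() {
        return marker.clone();   // written into the file header and between records
    }

    public boolean matches(byte[] buf, int offset) {
        // A reader seeking into a split scans for these 16 bytes.
        return Arrays.equals(Arrays.copyOfRange(buf, offset, offset + LENGTH), marker);
    }
}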

As a quick fix it might be good to save the current position in the file before 
entering readDatum, and if an exception is thrown seek back to the saved 
position and resume trying to find the next record marker.
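
Something along these lines, assuming a seekable input stream; readDatum and
skipToNextMarker here stand in for whatever the real reader uses and are not
the actual Pig API:

import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.pig.data.Tuple;

abstract class RecoveringReaderSketch {
    abstract Object readDatum(FSDataInputStream in) throws IOException;
    abstract void skipToNextMarker(FSDataInputStream in, long splitEnd) throws IOException;

    Tuple readNextRecord(FSDataInputStream in, long splitEnd) throws IOException {
        while (in.getPos() < splitEnd) {
            long savedPos = in.getPos();        // remember where this attempt started
            try {
                return (Tuple) readDatum(in);   // try to decode a record here
            } catch (IOException | RuntimeException e) {
                // We were fooled by bytes that merely looked like a record marker:
                // rewind to just past the false marker and resume scanning.
                in.seek(savedPos + 1);
                skipToNextMarker(in, splitEnd);
            }
        }
        return null;                            // no more records in this split
    }
}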


