[ 
https://issues.apache.org/jira/browse/PIG-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16084853#comment-16084853
 ] 

Rohini Palaniswamy commented on PIG-3655:
-----------------------------------------

SequenceFile writes a sync marker every few records (after every SYNC_INTERVAL 
bytes), which readers use to determine record boundaries when reading across 
splits. This is unlike the InterStorage logic, which writes out a record marker 
for every single record, which is actually bad. 

https://github.com/apache/hadoop/blob/62857be2110aaded84a93fc9891742a1271b2b85/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/SequenceFile.java#L1397-L1402

Can we get rid of the original record markers and just write a sync marker, 
similar to SequenceFile? We can keep the sync interval and sync marker size as 
configurable parameters.

For the default 2000-byte sync interval and a 16-byte sync marker, over 16 
records:
    - Overhead would be roughly the same for a record size of 399 bytes: 48 
bytes in both cases (16 x 3-byte record markers vs. 3 x 16-byte sync markers).
    - Any record smaller than that has less overhead. For example, if the 
record size is 125 bytes, 3-byte record markers add 48 bytes of overhead for 
16 records, while there will be only one 16-byte sync marker.
    - Records bigger than 399 bytes (especially in the case of bags) will have 
higher overhead than 3-byte record markers.

If we use a 10-byte default sync marker size, over 10 records:
    - Overhead would be roughly the same for a record size of 600 bytes: 30 
bytes in both cases (10 x 3-byte record markers vs. 3 x 10-byte sync markers).
    - Any record smaller than that has less overhead. For example, if the 
record size is 200 bytes, 3-byte record markers add 30 bytes of overhead for 
10 records, while there will be only one 10-byte sync marker.
    - Records bigger than 600 bytes (especially in the case of bags) will have 
higher overhead than 3-byte record markers.

So a 10-byte sync marker size should be decent for our use cases. Anyone 
running into collisions can increase it to 16 bytes.
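
A quick sanity check of the arithmetic above (using integer division for the 
number of sync markers that fit in the record data written):

```python
def record_marker_overhead(n_records, marker=3):
    # InterStorage today: one 3-byte marker per record.
    return n_records * marker

def sync_marker_overhead(n_records, record_size, sync_size, interval=2000):
    # Proposed scheme: one sync marker per `interval` bytes of record data.
    return (n_records * record_size // interval) * sync_size

# 16-byte sync marker: break-even around 399-byte records, 48 bytes each way.
assert record_marker_overhead(16) == 48
assert sync_marker_overhead(16, 399, 16) == 48
# 125-byte records: a single 16-byte sync marker vs. 48 bytes of record markers.
assert sync_marker_overhead(16, 125, 16) == 16

# 10-byte sync marker: break-even at 600-byte records, 30 bytes each way.
assert record_marker_overhead(10) == 30
assert sync_marker_overhead(10, 600, 10) == 30
# 200-byte records: a single 10-byte sync marker vs. 30 bytes of record markers.
assert sync_marker_overhead(10, 200, 10) == 10
```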

> BinStorage and InterStorage approach to record markers is broken
> ----------------------------------------------------------------
>
>                 Key: PIG-3655
>                 URL: https://issues.apache.org/jira/browse/PIG-3655
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.2.0, 0.3.0, 0.4.0, 0.5.0, 0.6.0, 0.7.0, 0.8.0, 0.8.1, 
> 0.9.0, 0.9.1, 0.9.2, 0.10.0, 0.11, 0.10.1, 0.12.0, 0.11.1
>            Reporter: Jeff Plaisance
>            Assignee: Adam Szita
>         Attachments: PIG-3655.0.patch, PIG-3655.1.patch, PIG-3655.2.patch
>
>
> The way that the record readers for these storage formats seek to the first 
> record in an input split is to find the byte sequence 1 2 3 110 for 
> BinStorage or 1 2 3 19-21|28-30|36-45 for InterStorage. If this sequence 
> occurs in the data for any reason other than to mark the start of a tuple 
> (for example, the integer 16909166 stored big endian encodes to the byte 
> sequence for BinStorage), it can cause mysterious failures in pig jobs 
> because the record reader will try to decode garbage and fail.
> For this approach of using an unlikely sequence to mark record boundaries, it 
> is important to reduce the probability of the sequence occurring naturally in 
> the data by making the record marker sufficiently long. Hadoop SequenceFile 
> uses 128 bits for this and randomly generates the sequence for each file 
> (selecting a fixed, predetermined value opens up the possibility of a 
> malicious party intentionally sending you that value). This makes it 
> extremely unlikely that collisions will occur. In the long run I think that 
> pig should also be doing this.
> As a quick fix it might be good to save the current position in the file 
> before entering readDatum, and if an exception is thrown seek back to the 
> saved position and resume trying to find the next record marker.
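
The quick fix described in the last paragraph could look roughly like this. 
This is a simplified, self-contained model with hypothetical names and a toy 
record format (the BinStorage marker followed by a 4-byte length-prefixed 
payload), standing in for the real readDatum/record reader logic: save the 
position before decoding, and on failure seek back and resume scanning for the 
next marker.

```python
import struct

MARKER = bytes([1, 2, 3, 110])  # BinStorage-style record marker

def read_records(data: bytes):
    """Toy reader: on a decode failure, seek back to the saved position
    and resume scanning for the next record marker."""
    records, pos = [], 0
    while True:
        hit = data.find(MARKER, pos)
        if hit < 0:
            return records
        saved = hit  # remember the position before attempting to decode
        try:
            body = hit + len(MARKER)
            if body + 4 > len(data):
                raise ValueError("truncated record")
            (length,) = struct.unpack(">i", data[body:body + 4])
            if length < 0 or body + 4 + length > len(data):
                raise ValueError("bad length")
            records.append(data[body + 4:body + 4 + length])
            pos = body + 4 + length
        except ValueError:
            # Decode failed: this marker occurrence was a false positive
            # (the sequence appeared inside the data). Resume scanning
            # one byte past the saved position instead of failing the job.
            pos = saved + 1
```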



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
