[ 
https://issues.apache.org/jira/browse/PIG-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16094798#comment-16094798
 ] 

Rohini Palaniswamy commented on PIG-3655:
-----------------------------------------

Comments on the read logic:
1) In readSyncFullyOrEOF() method, {{if (b == -1) return false}} should be done 
separately for the first byte read and not inside the while block. If we 
encounter -1 half way reading sync marker, it should throw the Corrupt data 
file exception.
2) skipUntilMarkerOrEOF() should be skipUntilMarkerOrSplitEndOrEOF(). If you 
have not found the syncmarker after reading till split end, then you need to 
bail. Currently it reads till EOF which is wasteful. With that change you can 
also get rid of  {{if (in.getPosition()-syncMarker.length > end) return false}} 
condition in nextKeyValue()
3) Could you add a test for sync marker starting exactly at second split end. 
Listing out all we need including what you already have
  -  Sync marker only at the beginning of the file. Nothing between splits.
  -  Sync marker ending exactly at the first split end
  -  Sync marker starting exactly at the second split beginning
  -  Sync marker in between two split boundaries (sync marker starting at first 
split and ending at second split)
  -  Sync marker after first split end (few bytes after beginning of the second 
split)
Can you add these as comments to each test as well to make it more clear. Test 
names are good, but a comment will help understand better.

> BinStorage and InterStorage approach to record markers is broken
> ----------------------------------------------------------------
>
>                 Key: PIG-3655
>                 URL: https://issues.apache.org/jira/browse/PIG-3655
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.2.0, 0.3.0, 0.4.0, 0.5.0, 0.6.0, 0.7.0, 0.8.0, 0.8.1, 
> 0.9.0, 0.9.1, 0.9.2, 0.10.0, 0.11, 0.10.1, 0.12.0, 0.11.1
>            Reporter: Jeff Plaisance
>            Assignee: Adam Szita
>         Attachments: PIG-3655.0.patch, PIG-3655.1.patch, PIG-3655.2.patch, 
> PIG-3655.3.patch
>
>
> The way that the record readers for these storage formats seek to the first 
> record in an input split is to find the byte sequence 1 2 3 110 for 
> BinStorage or 1 2 3 19-21|28-30|36-45 for InterStorage. If this sequence 
> occurs in the data for any reason (for example the integer 16909166 stored 
> big endian encodes to the byte sequence for BinStorage) other than to mark 
> the start of a tuple it can cause mysterious failures in pig jobs because the 
> record reader will try to decode garbage and fail.
> For this approach of using an unlikely sequence to mark record boundaries, it 
> is important to reduce the probability of the sequence occuring naturally in 
> the data by ensuring that your record marker is sufficiently long. Hadoop 
> SequenceFile uses 128 bits for this and randomly generates the sequence for 
> each file (selecting a fixed, predetermined value opens up the possibility of 
> a mean person intentionally sending you that value). This makes it extremely 
> unlikely that collisions will occur. In the long run I think that pig should 
> also be doing this.
> As a quick fix it might be good to save the current position in the file 
> before entering readDatum, and if an exception is thrown seek back to the 
> saved position and resume trying to find the next record marker.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Reply via email to