[ 
https://issues.apache.org/jira/browse/PIG-814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-814:
-------------------------------

    Status: Patch Available  (was: Open)

The basic issue is that we use the ctrl-A, ctrl-B, ctrl-C byte sequence to
identify the beginning of a record in the BinStorage format. We keep parsing
the input stream until we see this sequence, then pass the stream to another
function that reads the tuple representing the record. The tuple itself is
stored in BinStorage as a byte representing the tuple type (the tuple marker),
followed by the tuple size stored as an integer (in Java serialization
format), and then the actual tuple fields, each stored in Java serialization
format with a type marker prefix.
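
To make the layout concrete, here is a rough sketch of how one record would be
laid out for a tuple of integer fields. It is not the BinStorage code: the
class name, the marker byte values for TUPLE_MARKER and INT_MARKER, and the
use of big-endian four-byte integers (as java.io.DataOutput.writeInt would
produce) are all assumptions made for illustration.

```java
import java.nio.ByteBuffer;

public class RecordLayoutSketch {
    // ctrl-A, ctrl-B, ctrl-C: the record-begin sequence described above.
    static final byte[] RECORD_MARKER = {0x01, 0x02, 0x03};
    // Placeholder values, assumed for this sketch only.
    static final byte TUPLE_MARKER = 0x6E;
    static final byte INT_MARKER = 0x0A;

    // Lays out one record: record marker, tuple marker, tuple size,
    // then each field prefixed with its type marker.
    static byte[] writeRecord(int[] fields) {
        int size = RECORD_MARKER.length + 1 + 4 + fields.length * (1 + 4);
        ByteBuffer buf = ByteBuffer.allocate(size); // big-endian by default
        buf.put(RECORD_MARKER);
        buf.put(TUPLE_MARKER);
        buf.putInt(fields.length);
        for (int field : fields) {
            buf.put(INT_MARKER);
            buf.putInt(field);
        }
        return buf.array();
    }
}
```

With two integer fields this gives 18 bytes: 3 for the record marker, 1 for
the tuple marker, 4 for the size, and 5 per field.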

An exception is thrown when the data itself contains ctrl-A, ctrl-B, ctrl-C
(for example, in the serialized form of a field in some tuple). This can
happen when the RandomSampleLoader (used in order by) tries to uniformly
sample 100 tuples and lands in some part of the data that has this sequence
but is not a record-begin sequence put in by BinStorage.
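
As an illustration of how ordinary data can contain the sequence: the integer
66051 is 0x00010203, so its big-endian four-byte encoding is the bytes
00 01 02 03, which contains ctrl-A, ctrl-B, ctrl-C. The sketch below shows a
naive three-byte scan matching inside such a field. The class and method names
are hypothetical, and the big-endian integer encoding is an assumption about
the field format.

```java
import java.nio.ByteBuffer;

public class FalseMarkerSketch {
    // ctrl-A, ctrl-B, ctrl-C
    static final byte[] RECORD_MARKER = {0x01, 0x02, 0x03};

    // Four-byte big-endian encoding, the same byte order that
    // java.io.DataOutput.writeInt uses.
    static byte[] encodeInt(int value) {
        return ByteBuffer.allocate(4).putInt(value).array();
    }

    // Returns the index of the first occurrence of pattern in data at or
    // after position from, or -1 if there is none.
    static int indexOf(byte[] data, byte[] pattern, int from) {
        for (int i = Math.max(from, 0); i + pattern.length <= data.length; i++) {
            int j = 0;
            while (j < pattern.length && data[i + j] == pattern[j]) {
                j++;
            }
            if (j == pattern.length) {
                return i;
            }
        }
        return -1;
    }
}
```

Scanning the encoding of 66051 for RECORD_MARKER finds a match at offset 1,
even though no record begins there. A reader that starts at an arbitrary
position, as a sampler does, has no way to tell this apart from a real record
boundary.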

The fix is to look for ctrl-A, ctrl-B, ctrl-C and additionally TUPLEMARKER
before trying to read the tuple. This decreases the probability of finding all
four markers in the data (and it also fixes the error for this particular
query).
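
A minimal sketch of the idea follows. It is not the attached patch: it scans a
byte array rather than an input stream, the class and method names are
hypothetical, and the TUPLE_MARKER value is a placeholder.

```java
public class FourByteMarkerSketch {
    // Placeholder value for the tuple type marker, assumed for this sketch.
    static final byte TUPLE_MARKER = 0x6E;
    // ctrl-A, ctrl-B, ctrl-C followed by the tuple marker.
    static final byte[] RECORD_START = {0x01, 0x02, 0x03, TUPLE_MARKER};

    // Returns the index where the four-byte record start begins, searching
    // from position from onward, or -1 if it does not occur.
    static int findRecordStart(byte[] data, int from) {
        for (int i = Math.max(from, 0); i + RECORD_START.length <= data.length; i++) {
            int j = 0;
            while (j < RECORD_START.length && data[i + j] == RECORD_START[j]) {
                j++;
            }
            if (j == RECORD_START.length) {
                return i;
            }
        }
        return -1;
    }
}
```

A stray ctrl-A, ctrl-B, ctrl-C in the data is now skipped unless the next byte
also happens to be the tuple marker. This makes a false match less likely but
does not rule it out.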


> Make Binstorage more robust when data contains record markers
> -------------------------------------------------------------
>
>                 Key: PIG-814
>                 URL: https://issues.apache.org/jira/browse/PIG-814
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.2.1
>            Reporter: Pradeep Kamath
>            Assignee: Pradeep Kamath
>             Fix For: 0.3.0
>
>         Attachments: PIG-814.patch
>
>
> When the inputstream for BinStorage is at a position where the data has the 
> record marker sequence, the code incorrectly assumes that it is at the 
> beginning of a record (tuple) and calls DataReaderWriter.readDatum() trying 
> to read the tuple. The problem is more likely when RandomSampleLoader (used 
> in the order by implementation) skips ahead in the input stream for sampling 
> and calls BinStorage.getNext(). The code should be more robust in such cases.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.