[ https://issues.apache.org/jira/browse/PIG-814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Pradeep Kamath updated PIG-814:
-------------------------------

    Status: Patch Available  (was: Open)

The basic issue is that we use the ctrl-A, ctrl-B, ctrl-C sequence to identify the beginning of a record in the BinStorage format. We keep parsing the input stream until we see this sequence. After seeing it, we pass the input stream to another function that reads the tuple representing the record. The tuple itself is stored in BinStorage as a byte representing the tuple type (the tuple marker), followed by the tuple size stored as an integer (in Java serialization format), and then the actual tuple fields, each stored in Java serialization format with a type marker prefix.

An exception is thrown when the data itself contains ctrl-A, ctrl-B, ctrl-C (for example, in the serialized form of a field in some tuple). This can happen when the RandomSampleLoader (used in order by) tries to uniformly sample 100 tuples and lands in a part of the data that has this sequence but is not a record-begin sequence written by BinStorage.

The fix will be to look for ctrl-A, ctrl-B, ctrl-C and additionally TUPLEMARKER before trying to read the tuple. This decreases the probability of finding all four markers in the data (and it also fixes the error for this particular query).

> Make Binstorage more robust when data contains record markers
> -------------------------------------------------------------
>
>                 Key: PIG-814
>                 URL: https://issues.apache.org/jira/browse/PIG-814
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.2.1
>            Reporter: Pradeep Kamath
>            Assignee: Pradeep Kamath
>             Fix For: 0.3.0
>
>         Attachments: PIG-814.patch
>
>
> When the inputstream for BinStorage is at a position where the data has the
> record marker sequence, the code incorrectly assumes that it is at the
> beginning of a record (tuple) and calls DataReaderWriter.readDatum() trying
> to read the tuple.
> The problem is more likely when RandomSampleLoader (used
> in order by implementation) skips ahead in the input stream for sampling and calls
> BinStorage.getNext(). The code should be more robust in such cases.
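
To illustrate the fix described in the comment, here is a minimal Java sketch of scanning for the three record-marker bytes followed by the tuple marker. It is not the code from PIG-814.patch: the class name, method name, and constants are hypothetical, and the tuple marker value of 110 is an assumption (the real constant is defined in Pig's own code).

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not part of Pig.
class RecordMarkerScan {
    // ctrl-A, ctrl-B, ctrl-C as named in the comment.
    static final int RECORD_1 = 0x01;
    static final int RECORD_2 = 0x02;
    static final int RECORD_3 = 0x03;
    // Assumed value for the tuple type byte; check Pig's source for the real one.
    static final int TUPLE_MARKER = 110;

    /**
     * Advances the stream to just past the next
     * RECORD_1 RECORD_2 RECORD_3 TUPLE_MARKER sequence.
     * Returns false if the stream ends before the sequence is found.
     */
    static boolean seekToRecord(InputStream in) throws IOException {
        int[] pattern = {RECORD_1, RECORD_2, RECORD_3, TUPLE_MARKER};
        int matched = 0;
        int b;
        while ((b = in.read()) != -1) {
            if (b == pattern[matched]) {
                matched++;
                if (matched == pattern.length) {
                    return true;
                }
            } else {
                // The four pattern bytes are distinct, so after a mismatch the
                // only possible partial match is the current byte starting anew.
                matched = (b == pattern[0]) ? 1 : 0;
            }
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {
            0x41,
            0x01, 0x02, 0x03, 0x42,   // marker bytes inside field data: not a record start
            0x01, 0x02, 0x03, 110,    // record markers plus tuple marker
            0, 0, 0, 2                // tuple size as a 4-byte big-endian int
        };
        InputStream in = new ByteArrayInputStream(data);
        if (seekToRecord(in)) {
            int size = new DataInputStream(in).readInt();
            System.out.println("tuple size = " + size);
        }
    }
}
```

As the comment notes, requiring the fourth byte only lowers the chance of a false match; field data that happens to contain all four bytes in order would still be mistaken for a record start.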