[ https://issues.apache.org/jira/browse/PIG-5373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16725831#comment-16725831 ]
Adam Szita commented on PIG-5373:
---------------------------------

Attached [^PIG-5373.0.patch], which corrects the reading of sync markers by using a FIFO and comparing the FIFO content with the expected marker. A test case is attached that verifies, by brute force, that such prefix scenarios are handled correctly.

[~nkollar], [~rohini] can you take a look please?

> InterRecordReader might skip records if certain sync markers are used
> ---------------------------------------------------------------------
>
>                 Key: PIG-5373
>                 URL: https://issues.apache.org/jira/browse/PIG-5373
>             Project: Pig
>          Issue Type: Bug
>   Affects Versions: 0.17.0
>            Reporter: Adam Szita
>            Assignee: Adam Szita
>            Priority: Major
>         Attachments: PIG-5373.0.patch
>
>
> Due to a bug in InterRecordReader#skipUntilMarkerOrSplitEndOrEOF(), sync
> markers can go unidentified while reading the interim binary file used to
> hold data between jobs.
> Sync markers are placed in such files on writing and later help when reading
> the data. They are randomly generated, and it seems that for some rare
> combinations of a marker and the data preceding it, the marker cannot be
> found.
> This can result in reading through all the bytes (looking for the marker),
> reaching split end or EOF, and extracting no records at all.
> This symptom is also observable in the JobHistory stats: a job affected by
> this issue will have tasks whose HDFS_BYTES_READ or FILE_BYTES_READ is about
> equal to the number of bytes in the split, while MAP_INPUT_RECORDS=0.
> One such (test) example is this:
> {code:java}
> marker: [-128, -128, 4], data: [127, -1, 2, -128, -128, -128, 4, 1, 2, 3]{code}
> Due to the bug, markers whose prefix overlaps with the last data chunk are
> not seen by the reader.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
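
For illustration only, here is a minimal sketch of why the example marker above can be missed and how a FIFO comparison avoids that. This is not the code from PIG-5373.0.patch or from InterRecordReader: the class name `SyncMarkerScan` and the methods `naiveScan`, `fifoScan` and `fifoMatches` are hypothetical, and the naive matcher is only an assumed example of a scan that drops already-matched bytes on a mismatch.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

final class SyncMarkerScan {

    /**
     * Hypothetical naive matcher: on a mismatch it discards the partial match
     * and only re-tests the current byte against the first marker byte.
     * Returns the offset just past the marker, or -1 if none was found.
     */
    static long naiveScan(InputStream in, byte[] marker) throws IOException {
        int matched = 0;
        long pos = 0;
        int b;
        while ((b = in.read()) != -1) {
            pos++;
            if ((byte) b == marker[matched]) {
                if (++matched == marker.length) {
                    return pos;
                }
            } else {
                matched = ((byte) b == marker[0]) ? 1 : 0;
            }
        }
        return -1;
    }

    /**
     * FIFO-based matcher: keeps the last marker.length bytes in a circular
     * buffer and compares the whole window with the marker after every byte.
     */
    static long fifoScan(InputStream in, byte[] marker) throws IOException {
        byte[] fifo = new byte[marker.length];
        long pos = 0;
        int b;
        while ((b = in.read()) != -1) {
            fifo[(int) (pos % marker.length)] = (byte) b;
            pos++;
            if (pos >= marker.length && fifoMatches(fifo, pos, marker)) {
                return pos;
            }
        }
        return -1;
    }

    /** Compares the window, oldest byte first, with the marker. */
    static boolean fifoMatches(byte[] fifo, long pos, byte[] marker) {
        int oldest = (int) (pos % marker.length);
        for (int i = 0; i < marker.length; i++) {
            if (fifo[(oldest + i) % marker.length] != marker[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        byte[] marker = {-128, -128, 4};
        byte[] data = {127, -1, 2, -128, -128, -128, 4, 1, 2, 3};
        System.out.println(naiveScan(new ByteArrayInputStream(data), marker));
        System.out.println(fifoScan(new ByteArrayInputStream(data), marker));
    }
}
```

With the marker and data from the issue description, the naive scan consumes two of the three leading -128 bytes as a partial match, fails on the third, and then cannot line up with the real marker, so it runs to the end of the data. The FIFO scan compares the full window of the last three bytes after every read and so finds the marker ending at offset 7.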