[ 
https://issues.apache.org/jira/browse/NIFI-3495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15871136#comment-15871136
 ] 

Joseph Witt commented on NIFI-3495:
-----------------------------------

I have verified that this corrects the issue observed.  Before this patch, data 
from the site below was split very incorrectly.  After this patch, the results 
appear to be split perfectly.

http://standards.ieee.org/develop/regauth/oui/oui.csv

It still needs a code review.

> TextLineDemarcator sets the wrong index when read ahead is performed in isEol 
> operation
> ---------------------------------------------------------------------------------------
>
>                 Key: NIFI-3495
>                 URL: https://issues.apache.org/jira/browse/NIFI-3495
>             Project: Apache NiFi
>          Issue Type: Bug
>            Reporter: Oleg Zhurakousky
>            Assignee: Oleg Zhurakousky
>            Priority: Critical
>             Fix For: 1.2.0
>
>
> This condition is very rare. It only occurs when a read ahead (a call to 
> _fill()_) is made inside the _isEol_ operation. The read ahead sets a new 
> index, which is then reset inside the main _nextOffsetInfo_ operation. 
> The fix is to track whether _isEol_ had to perform a read ahead and, if it 
> did, not reset the index.
> More details:
> While this component is modeled after the standard Java BufferedReader, which 
> simply reads and returns lines (delimited by CR, LF, or both), this reader 
> also records how each line terminated (i.e., EOF, CR, LF, or CR and LF) 
> and returns that information to the caller as OffsetInfo. 
> For example, if you have a record "foo\r\nbar" and read it with 
> BufferedReader, you will get 'foo' and 'bar'. However, you will not know that 
> there was a CR and LF between the two tokens, and therefore will not be able 
> to restore the record to its original state if you need to. The 
> TextLineDemarcator returns OffsetInfo, which holds the delimiter and other 
> information.
> To accomplish this, every time we see CR (13) we need to peek at the 
> next byte to see if it is LF (10). At the end of the buffer, such a peek 
> becomes complicated because we need to read more data. We did read more data, 
> but did not handle the index properly, essentially setting it back to the old 
> value after a new one had been set inside _fill()_.
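
For illustration only, the peek-after-CR behaviour described above could be 
sketched roughly as follows. This is not the actual TextLineDemarcator code: 
the class name LineDemarcatorSketch, its fields, and the splitAll helper are 
hypothetical, and it returns each line with its terminator as a String instead 
of an OffsetInfo. The point it tries to show is that once fill() has run during 
the peek, the index refers to the new buffer contents and must not be 
overwritten with a value saved before the read ahead.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified sketch; not the real NiFi TextLineDemarcator. */
final class LineDemarcatorSketch {
    private final InputStream in;
    private final byte[] buffer;
    private int index;      // position of the next unread byte in buffer
    private int available;  // number of valid bytes currently in buffer

    LineDemarcatorSketch(InputStream in, int bufferSize) {
        this.in = in;
        this.buffer = new byte[bufferSize];
    }

    /** Refills the buffer and moves index to 0. Returns false at EOF. */
    private boolean fill() throws IOException {
        int read = in.read(buffer, 0, buffer.length);
        index = 0;
        available = Math.max(read, 0);
        return read > 0;
    }

    /** Returns the next line including its terminator, or null at EOF. */
    String nextLine() throws IOException {
        ByteArrayOutputStream line = new ByteArrayOutputStream();
        while (true) {
            if (index >= available && !fill()) {
                // EOF: return whatever was collected, or null if nothing was
                return line.size() == 0 ? null : asString(line);
            }
            byte b = buffer[index++];
            line.write(b);
            if (b == '\n') {
                return asString(line);
            }
            if (b == '\r') {
                // Peek at the next byte. At the end of the buffer this needs
                // a read ahead; fill() then sets index for the new buffer, and
                // that value is deliberately left alone afterwards.
                boolean hasNext = index < available || fill();
                if (hasNext && buffer[index] == '\n') {
                    line.write(buffer[index++]);
                }
                return asString(line);
            }
        }
    }

    private static String asString(ByteArrayOutputStream out) {
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    /** Convenience helper (hypothetical): splits text using a small buffer. */
    static List<String> splitAll(String text, int bufferSize) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        LineDemarcatorSketch d =
                new LineDemarcatorSketch(new ByteArrayInputStream(bytes), bufferSize);
        List<String> lines = new ArrayList<>();
        try {
            for (String l = d.nextLine(); l != null; l = d.nextLine()) {
                lines.add(l);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }
}
```

With a buffer size of 4 and the input "foo\r\nbar", the CR is the last byte of 
the first buffer, so the peek has to refill before it can see the LF. Because 
each returned line keeps its terminator, concatenating the lines restores the 
original record.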



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
