[ 
https://issues.apache.org/jira/browse/CAMEL-12698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568885#comment-16568885
 ] 

ASF GitHub Bot commented on CAMEL-12698:
----------------------------------------

GitHub user MakotoTheKnight opened a pull request:

    https://github.com/apache/camel/pull/2454

    CAMEL-12698: Use the Stream API to read files instead of Scanner

    This change introduces a fix to the Bindy module to address what could be 
seen as surprising behavior from `java.util.Scanner` given certain Unicode code 
points.
    
    Previously, Bindy leveraged `Scanner` to read lines in a file.  However, 
`Scanner` applies its own notion of a line separator, which also covers 
characters such as U+0085, U+2028 and U+2029, and as such may not always read a 
complete line in.
    
    In the case identified, we came across a circumstance in which we received 
(in error) the [NEL 
character](http://www.fileformat.info/info/unicode/char/85/index.htm) in our 
data set.  Because `Scanner` honors the intent behind this character, it will 
break any line that it sees with this character in two.  This is not expected 
in Bindy; we expect to read whole lines instead.  The use of `Scanner` 
unintentionally brought this bug to light, as I'm not personally convinced that 
`Scanner` is technically *wrong*.
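
    To illustrate the behavior described above, here is a minimal sketch that is 
not taken from the patch.  The class name `ScannerNelDemo` and the sample data 
are assumptions made up for this example; only `java.util.Scanner` itself is 
real.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

final class ScannerNelDemo {
    // Collects every line that Scanner.nextLine() reports for the given text.
    static List<String> scannerLines(String text) {
        List<String> lines = new ArrayList<>();
        try (Scanner scanner = new Scanner(text)) {
            while (scanner.hasNextLine()) {
                lines.add(scanner.nextLine());
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // Hypothetical data: two CSV records, the first with a stray NEL (U+0085).
        String csv = "1,foo\u0085bar\n2,baz\n";
        // Scanner treats U+0085 as a line separator, so the first record is split
        // and three lines come back instead of two.
        for (String line : scannerLines(csv)) {
            System.out.println("[" + line + "]");
        }
    }
}
```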
    
    The fix leverages a `BufferedReader` and `Stream`s instead to read lines, 
which [has the same 
expectations](http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/687fd7c7986d/src/share/classes/java/io/BufferedReader.java#l561)
 of line termination as 
[`BufferedReader#readLine`](https://docs.oracle.com/javase/8/docs/api/java/io/BufferedReader.html#readLine--),
 which would be `\r`, `\n`, or `\r\n`.
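
    For comparison, a rough sketch of reading the same kind of input through 
`BufferedReader#lines`.  This is not the code in the pull request; the class 
name `BufferedReaderLinesDemo` and the sample data are invented for the 
example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.List;
import java.util.stream.Collectors;

final class BufferedReaderLinesDemo {
    // Reads all lines using BufferedReader.lines(), which follows readLine() semantics.
    static List<String> readerLines(String text) {
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            return reader.lines().collect(Collectors.toList());
        } catch (IOException e) {
            // close() declares IOException; rethrow it unchecked to keep the demo simple.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical data: the first record contains a stray NEL (U+0085).
        String csv = "1,foo\u0085bar\n2,baz\n";
        // The NEL stays inside the first record; only \n, \r and \r\n end a line.
        System.out.println(readerLines(csv).size());
    }
}
```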
    
    Note:  the peculiar exception handling inside of the stream is due to the 
fact that checked exceptions can't be propagated, so we have to wrap them in 
unchecked exceptions instead.
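
    The wrapping pattern mentioned in the note looks roughly like the sketch 
below.  It is only an illustration: `RecordException`, `parseRecord` and 
`parseAll` are hypothetical names, not Bindy APIs, and the real patch may wrap 
a different exception type.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class CheckedInStreamDemo {
    // Hypothetical checked exception standing in for whatever a record parser throws.
    static final class RecordException extends Exception {
        RecordException(String message) {
            super(message);
        }
    }

    // Hypothetical parser: rejects blank lines with a checked exception.
    static String[] parseRecord(String line) throws RecordException {
        if (line.trim().isEmpty()) {
            throw new RecordException("blank record");
        }
        return line.split(",");
    }

    static List<String[]> parseAll(Stream<String> lines) {
        return lines.map(line -> {
            try {
                return parseRecord(line);
            } catch (RecordException e) {
                // Function.apply declares no checked exceptions, so the lambda
                // has to rethrow as an unchecked exception with the cause attached.
                throw new RuntimeException(e);
            }
        }).collect(Collectors.toList());
    }
}
```

    A caller that needs the original exception can catch the `RuntimeException` 
outside the stream pipeline and inspect `getCause()`.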

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/MakotoTheKnight/camel fix-bindy-parser

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/camel/pull/2454.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2454
    
----
commit 08fe4e9092db446b07762436c9c0aa070cf680dd
Author: Jason Black <makototheknight@...>
Date:   2018-07-26T06:07:11Z

    CAMEL-12698: Use the Stream API to read files instead of Scanner

----


> Unmarshaling a CSV file with the NEL (next line) character will cause Bindy 
> to misread the entire file
> ------------------------------------------------------------------------------------------------------
>
>                 Key: CAMEL-12698
>                 URL: https://issues.apache.org/jira/browse/CAMEL-12698
>             Project: Camel
>          Issue Type: Bug
>          Components: camel-bindy
>    Affects Versions: 2.22.0
>            Reporter: Jason Black
>            Priority: Major
>
> I am using Apache Camel to process a lot of large CSV files, and relying on 
> Bindy to assist with unmarshalling them into POJOs.
> We have an upstream data bug which causes a record of ours to contain the 
> Unicode character 
> [NEL|http://www.fileformat.info/info/unicode/char/85/index.htm], but while 
> we're working through the cause of that, I found it curious as to what Bindy 
> is actually doing with it.  We rely on the unmarshal process to perform a 
> batch insert, and because our POJO is missing certain fields, we started 
> observing that the 
> Bindy is relying on Scanner to read lines in a large file; however, Scanner 
> also does some parsing of its own and, if it sees the NEL character, regards 
> it as a newline character.  The modern Files API does not make this 
> distinction and treats only \n, \r, or \r\n as line terminators.
> There are two ways to fix this from what I've been able to smoke test:
>  * Change the Scanner implementation to use a delimiter of the more 
> traditional newline characters
>  * Use Java 8's Files API and stream the file in
> I would personally want to use the Files API to handle this since it's more 
> robust and capable of higher performance, but I'll explore both approaches 
> and see where I end up.
>  
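
The first option listed in the issue description (keeping `Scanner` but
restricting its delimiter) could be sketched as follows.  This is an
illustration only, not code from Bindy or from the pull request; the class
name `ScannerDelimiterDemo` and the sample data are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;

final class ScannerDelimiterDemo {
    // Only the traditional terminators; \r\n comes first so it is consumed as one unit.
    static final Pattern NEWLINE = Pattern.compile("\r\n|\n|\r");

    // Splits the text into records with next() and the custom delimiter,
    // instead of nextLine() and Scanner's built-in line separators.
    static List<String> records(String text) {
        List<String> out = new ArrayList<>();
        try (Scanner scanner = new Scanner(text)) {
            scanner.useDelimiter(NEWLINE);
            while (scanner.hasNext()) {
                out.add(scanner.next());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical data: the first record contains a stray NEL (U+0085).
        for (String record : records("1,foo\u0085bar\n2,baz\r\n3,qux")) {
            System.out.println(record.length());
        }
    }
}
```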



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
