[ 
https://issues.apache.org/jira/browse/CSV-329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary D. Gregory resolved CSV-329.
---------------------------------
    Fix Version/s: 1.15.0
       Resolution: Fixed

> CSVParser with trackBytes=true throws on multi-character delimiters 
> containing supplementary Unicode characters
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: CSV-329
>                 URL: https://issues.apache.org/jira/browse/CSV-329
>             Project: Commons CSV
>          Issue Type: Bug
>            Reporter: Ruiqi Dong
>            Priority: Minor
>             Fix For: 1.15.0
>
>
> *Summary*
> With byte tracking enabled, parsing fails when a multi-character delimiter 
> contains a supplementary Unicode character such as an emoji.
> The parser can handle the delimiter when byte tracking is disabled. The 
> failure is caused by `ExtendedBufferedReader.read(char[], ...)` updating 
> `lastChar` before computing the encoded byte length of the read buffer. For a 
> surrogate pair read into the delimiter lookahead buffer, the low surrogate is 
> checked against the already-updated `lastChar` (the last character of the 
> buffer) instead of the high surrogate that precedes it, so the byte-length 
> calculation throws `CharacterCodingException`.
>  
>  
> *Affected code*
> File: `src/main/java/org/apache/commons/csv/Lexer.java`
> {code:java}
> boolean isDelimiter(final int ch) throws IOException {
>     ...
>     reader.peek(delimiterBuf);
>     ...
>     final int count = reader.read(delimiterBuf, 0, delimiterBuf.length);
>     isLastTokenDelimiter = count != EOF;
>     return isLastTokenDelimiter;
> } {code}
> File: `src/main/java/org/apache/commons/csv/ExtendedBufferedReader.java`
> {code:java}
> public int read(final char[] buf, final int offset, final int length) throws IOException {
>     ...
>     if (len > 0) {
>         ...
>         lastChar = buf[offset + len - 1];
>     } else if (len == EOF) {
>         lastChar = EOF;
>     }
>     if (encoder != null) {
>         this.bytesRead += getEncodedCharLength(buf, offset, len);
>     }
>     position += len;
>     return len;
> } {code}
> `getEncodedCharLength(...)` relies on the previous `lastChar` to pair a low 
> surrogate:
> {code:java}
> if (Character.isSurrogatePair(lChar, cChar)) {
>     return encoder.encode(CharBuffer.wrap(new char[] { lChar, cChar })).limit();
> }
> throw new CharacterCodingException(); {code}
> *Reproducer*
> Add this test to `src/test/java/org/apache/commons/csv/CSVParserTest.java`:
> {code:java}
> @Test
> void testTrackBytesWithSupplementaryCharacterInMultiCharacterDelimiter() throws IOException {
>     final String delimiter = "x😀";
>     final String code = "ax😀b\n";
>     final CSVFormat format = CSVFormat.DEFAULT.builder().setDelimiter(delimiter).get();
>     try (CSVParser parser = CSVParser.builder()
>             .setReader(new StringReader(code))
>             .setFormat(format)
>             .setCharset(UTF_8)
>             .setTrackBytes(true)
>             .get()) {
>         final CSVRecord record = parser.nextRecord();
>         assertNotNull(record);
>         assertEquals("a", record.get(0));
>         assertEquals("b", record.get(1));
>     }
> } {code}
> Run:
> {code:java}
> mvn -q -Dtest=org.apache.commons.csv.CSVParserTest#testTrackBytesWithSupplementaryCharacterInMultiCharacterDelimiter test {code}
> *Observed behavior*
> The test errors:
> {code:java}
> java.nio.charset.CharacterCodingException
>     at org.apache.commons.csv.ExtendedBufferedReader.getEncodedCharLength(ExtendedBufferedReader.java:156)
>     at org.apache.commons.csv.ExtendedBufferedReader.read(ExtendedBufferedReader.java:237)
>     at org.apache.commons.io.input.UnsynchronizedBufferedReader.peek(UnsynchronizedBufferedReader.java:236)
>     at org.apache.commons.csv.Lexer.isDelimiter(Lexer.java:156){code}
> *Expected behavior* 
> Byte tracking should not change whether a valid CSV input can be parsed. The 
> record should parse as two fields:
> {code:java}
> a
> b {code}
> This is a metadata-tracking side effect that changes parser correctness. 
> Enabling byte tracking should add byte-position metadata, not make valid 
> input fail when delimiter lookahead reads a surrogate pair.

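For reference, the byte accounting that byte tracking would be expected to produce for the reproducer input can be worked out by hand. The sketch below is a minimal illustration, not Commons CSV code: the class name `ReproducerByteCount` and its helper method are hypothetical, and it assumes only the UTF-8 charset used in the reproducer.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class, not part of Commons CSV.
final class ReproducerByteCount {

    /** Returns the UTF-8 encoded length of each code point of the input, in order. */
    static int[] utf8LengthsPerCodePoint(final String input) {
        return input.codePoints()
                .map(cp -> new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8).length)
                .toArray();
    }

    public static void main(final String[] args) {
        // Same input as the reproducer: 'a', 'x', U+1F600 (a surrogate pair), 'b', '\n'.
        final String code = "ax\uD83D\uDE00b\n";
        final int[] lengths = utf8LengthsPerCodePoint(code);
        System.out.println(Arrays.toString(lengths));      // prints [1, 1, 4, 1, 1]
        System.out.println(Arrays.stream(lengths).sum());  // prints 8
        System.out.println(code.length());                 // prints 6 (UTF-16 chars, not bytes)
    }
}
```

The emoji occupies two `char` values but four UTF-8 bytes, which is why a byte counter has to treat the high and low surrogate as one unit.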

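The ordering problem described under *Affected code* can also be modelled outside the library. This is a simplified sketch, not the actual `ExtendedBufferedReader` code and not necessarily the fix that shipped in 1.15.0: the class `SurrogateByteCounter` and the methods `encodedLength`, `countBuggy` and `countFixed` are hypothetical names, and the per-character loop is an assumption about how the byte length is accumulated.

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical model of the byte counter, not Commons CSV code.
final class SurrogateByteCounter {
    private final CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
    private int lastChar = -1; // stand-in for the reader's lastChar field

    /** Encoded length of one char; a low surrogate is paired with lastChar. */
    int encodedLength(final char c) throws CharacterCodingException {
        if (!Character.isSurrogate(c)) {
            return encoder.encode(CharBuffer.wrap(new char[] { c })).limit();
        }
        if (Character.isHighSurrogate(c)) {
            return 0; // counted together with the low surrogate that follows
        }
        final char prev = (char) lastChar;
        if (Character.isSurrogatePair(prev, c)) {
            return encoder.encode(CharBuffer.wrap(new char[] { prev, c })).limit();
        }
        throw new CharacterCodingException();
    }

    /** Mirrors the reported ordering: lastChar is updated before counting. */
    long countBuggy(final char[] buf) throws CharacterCodingException {
        lastChar = buf[buf.length - 1]; // updated too early
        long bytes = 0;
        for (final char c : buf) {
            bytes += encodedLength(c); // the low surrogate sees the wrong lastChar
        }
        return bytes;
    }

    /** One possible ordering: advance lastChar only after each char is counted. */
    long countFixed(final char[] buf) throws CharacterCodingException {
        long bytes = 0;
        for (final char c : buf) {
            bytes += encodedLength(c);
            lastChar = c;
        }
        return bytes;
    }

    public static void main(final String[] args) throws CharacterCodingException {
        // The two chars read into the lookahead buffer for the "x" + U+1F600 delimiter.
        final char[] lookahead = "\uD83D\uDE00".toCharArray();
        System.out.println(new SurrogateByteCounter().countFixed(lookahead)); // prints 4
        try {
            new SurrogateByteCounter().countBuggy(lookahead);
        } catch (final CharacterCodingException e) {
            System.out.println("buggy ordering threw " + e.getClass().getSimpleName());
        }
    }
}
```

In this model the early update makes the low surrogate get compared with itself, so `Character.isSurrogatePair` returns false and the exception is thrown, matching the stack trace in the report.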

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
