[ 
https://issues.apache.org/jira/browse/CSV-323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruiqi Dong updated CSV-323:
---------------------------
    Description: 
*Summary*
ExtendedBufferedReader maintains internal byte-tracking state, and CSVParser 
later uses that state when it creates CSVRecord instances.

In the scenario tested below, the parser reports the wrong byte position for 
the second record. With the input "aa[|]bb\ncc[|]dd\n" and the delimiter 
"[|]", the second record starts at byte offset 8, but the parser reports byte 
offset 6.
 
*Affected Code*
Files:
 * src/main/java/org/apache/commons/csv/ExtendedBufferedReader.java
 * src/main/java/org/apache/commons/csv/Lexer.java
 * src/main/java/org/apache/commons/csv/CSVParser.java

{code:java}
@Override
public int read(final char[] buf, final int offset, final int length) throws IOException {
    if (length == 0) {
        return 0;
    }
    final int len = super.read(buf, offset, length);
    if (len > 0) {
        ...
        lastChar = buf[offset + len - 1];
    } else if (len == EOF) {
        lastChar = EOF;
    }
    position += len;
    return len;
} {code}
{code:java}
boolean isDelimiter(final int ch) throws IOException {
    ...
    final int count = reader.read(delimiterBuf, 0, delimiterBuf.length);
    isLastTokenDelimiter = count != EOF;
    return isLastTokenDelimiter;
} {code}
{code:java}
final long startBytePosition = lexer.getBytesRead() + characterOffset;
...
result = new CSVRecord(this, recordList.toArray(Constants.EMPTY_STRING_ARRAY),
        Objects.toString(sb, null), recordNumber, startCharPosition, startBytePosition); {code}
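
One way a 2-byte gap like the one reported could arise is sketched below as a toy model. It rests on an assumption that the excerpts above do not confirm: that single-character reads update the byte counter while the bulk read used for the tail of a multi-character delimiter does not. CountingReader and bytesCountedAfterFirstRecord are hypothetical names, not part of Commons CSV.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Toy model only: CountingReader is hypothetical and is NOT ExtendedBufferedReader.
class ByteCountModel {

    static final class CountingReader {
        private final Reader in;
        private long bytesRead;

        CountingReader(final Reader in) {
            this.in = in;
        }

        // Single-character read: counts one byte per character (ASCII-only model).
        int read() throws IOException {
            final int c = in.read();
            if (c != -1) {
                bytesRead++;
            }
            return c;
        }

        // Bulk read: ASSUMED to skip byte accounting, which is the suspected gap.
        int read(final char[] buf) throws IOException {
            return in.read(buf, 0, buf.length);
        }

        long getBytesRead() {
            return bytesRead;
        }
    }

    // Consumes "aa[|]bb\n" one character at a time, except that the tail of
    // the delimiter ("|]") is consumed with a single bulk read.
    static long bytesCountedAfterFirstRecord() {
        final CountingReader reader =
                new CountingReader(new StringReader("aa[|]bb\ncc[|]dd\n"));
        try {
            reader.read();                // 'a'
            reader.read();                // 'a'
            reader.read();                // '[' starts the delimiter
            reader.read(new char[2]);     // "|]" via bulk read, not counted
            reader.read();                // 'b'
            reader.read();                // 'b'
            reader.read();                // '\n' ends the first record
        } catch (final IOException e) {
            throw new UncheckedIOException(e);
        }
        return reader.getBytesRead();
    }

    public static void main(final String[] args) {
        System.out.println(bytesCountedAfterFirstRecord()); // prints 6, not 8
    }
}
```

Under that assumption, each occurrence of the three-character delimiter would lose two bytes from the count, which matches the reported offset of 6 for a first record containing one delimiter.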
*Reproducer*
Add the following test to 
src/test/java/org/apache/commons/csv/CSVParserTest.java:
{code:java}
@Test
void testBytePositionWithTrackBytesAndMultiCharacterDelimiter() throws IOException {
    final String code = "aa[|]bb\ncc[|]dd\n";
    final CSVFormat format = CSVFormat.DEFAULT.builder().setDelimiter("[|]").get();
    try (CSVParser parser = CSVParser.builder()
            .setReader(new StringReader(code))
            .setFormat(format)
            .setCharset(StandardCharsets.UTF_8)
            .setTrackBytes(true)
            .get()) {
        final Iterator<CSVRecord> it = parser.iterator();
        final CSVRecord first = it.next();
        final CSVRecord second = it.next();

        assertEquals(0, first.getBytePosition());
        assertEquals(8, second.getBytePosition());
    }
}{code}
Run:
{code}
mvn -q -Dtest=org.apache.commons.csv.CSVParserTest#testBytePositionWithTrackBytesAndMultiCharacterDelimiter test {code}
 

Expected behavior:
 # the first record starts at byte offset 0,
 # the second record starts at byte offset 8,

because the prefix "aa[|]bb\n" is exactly 8 ASCII bytes long.
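
The expected offset can be checked independently of the parser. The sketch below only encodes the first record, including its line terminator, as UTF-8 and counts the bytes; ExpectedOffset and byteLength are illustrative names, not part of Commons CSV.

```java
import java.nio.charset.StandardCharsets;

// Illustrative helper, not part of Commons CSV.
class ExpectedOffset {

    // Number of bytes the given text occupies when encoded as UTF-8.
    static int byteLength(final String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(final String[] args) {
        // The first record plus its line terminator; the second record starts right after it.
        final String firstRecord = "aa[|]bb\n";
        System.out.println(byteLength(firstRecord)); // prints 8
    }
}
```

Because every character in the input is ASCII, the byte offset and the character offset of the second record coincide here.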

 

In the tested scenario: 
 # byte tracking is explicitly enabled,
 # the parser successfully returns both records,
 # but the second record is reported at byte offset 6 instead of 8.

The byte-position metadata is therefore unreliable when byte tracking is 
combined with a multi-character delimiter.
 
 


> ExtendedBufferedReader byte tracking leads to an incorrect 
> CSVRecord.getBytePosition()
> --------------------------------------------------------------------------------------
>
>                 Key: CSV-323
>                 URL: https://issues.apache.org/jira/browse/CSV-323
>             Project: Commons CSV
>          Issue Type: Bug
>          Components: Build
>    Affects Versions: 1.14.1
>            Reporter: Ruiqi Dong
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
