Ruiqi Dong created CSV-325:
------------------------------
Summary: CSVParser applies characterOffset to bytePosition, which
breaks getBytePosition() after multi-byte prefixes
Key: CSV-325
URL: https://issues.apache.org/jira/browse/CSV-325
Project: Commons CSV
Issue Type: Bug
Components: Parser
Reporter: Ruiqi Dong
*Summary*
When CSVParser.Builder#setTrackBytes(true) is enabled and parsing starts from
the middle of a larger source, CSVParser adds characterOffset to both the
character position and the byte position. That is only correct when every
character in the skipped prefix encodes to a single byte. If the skipped prefix
contains multi-byte UTF-8 characters, CSVRecord.getBytePosition() returns a
value that is too small.
*Affected code*
File: src/main/java/org/apache/commons/csv/CSVParser.java
{code:java}
final long startCharPosition = lexer.getCharacterPosition() + characterOffset;
final long startBytePosition = lexer.getBytesRead() + characterOffset; {code}
File: src/main/java/org/apache/commons/csv/CSVRecord.java
{code:java}
/**
 * Returns the starting position of this record in the source stream, measured
 * in bytes.
 */
public long getBytePosition() {
    return bytePosition;
}
{code}
*Reproducer*
Add the following test to
src/test/java/org/apache/commons/csv/CSVParserTest.java:
{code:java}
@Test
void testGetBytePositionWithCharacterOffsetAndMultiBytePrefix() throws Exception {
    final String code = "é,x\nb,c\n";
    final long recordOffset = code.indexOf('b');
    final long expectedByteOffset = "é,x\n".getBytes(UTF_8).length;
    try (CSVParser parser = CSVParser.builder()
            .setReader(new StringReader(code.substring((int) recordOffset)))
            .setFormat(CSVFormat.DEFAULT)
            .setCharset(UTF_8)
            .setTrackBytes(true)
            .setCharacterOffset(recordOffset)
            .setRecordNumber(2)
            .get()) {
        final CSVRecord record = parser.nextRecord();
        assertNotNull(record);
        assertEquals(recordOffset, record.getCharacterPosition());
        assertEquals(expectedByteOffset, record.getBytePosition());
    }
}
{code}
Run:
{code:java}
mvn -q -Dtest=org.apache.commons.csv.CSVParserTest#testGetBytePositionWithCharacterOffsetAndMultiBytePrefix test
{code}
Observed behavior:
{code:java}
expected: <5> but was: <4> {code}
The first record prefix is "é,x\n":
* character length: 4
* UTF-8 byte length: 5
getCharacterPosition() correctly reports 4, but getBytePosition() also reports
4, even though the record starts at byte offset 5.
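The gap between the two lengths can be checked in isolation, without Commons CSV. The following is only a minimal sketch: the class name PrefixLengths and its two helper methods are made up for illustration and are not part of the library.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only (not part of Commons CSV).
public class PrefixLengths {

    // Number of UTF-16 chars in the prefix; this is the unit characterOffset uses.
    static int charLength(final String prefix) {
        return prefix.length();
    }

    // Number of bytes the prefix occupies when encoded as UTF-8.
    static int utf8ByteLength(final String prefix) {
        return prefix.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(final String[] args) {
        // "\u00e9" is 'é', written as an escape so the source file encoding does not matter.
        final String prefix = "\u00e9,x\n";
        System.out.println("chars: " + charLength(prefix));     // chars: 4
        System.out.println("bytes: " + utf8ByteLength(prefix)); // bytes: 5
    }
}
```

'é' takes two bytes in UTF-8 while the other three characters take one byte each, which accounts for the difference of one.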
Expected behavior:
If byte tracking is enabled, CSVRecord.getBytePosition() should report the true
byte offset in the source stream. For the reproducer above, the record "b,c"
should start at byte offset 5, not 4.
characterOffset and byte offset are not interchangeable once the skipped prefix
can contain multi-byte characters. The current implementation:
* correctly treats characterOffset as a character-space adjustment
* incorrectly reuses the same value as a byte-space adjustment
As a result, getBytePosition() becomes unreliable for resumed parsing over
UTF-8 or other variable-width encodings.
I think CSVParser likely needs a separate byte-offset input, or it needs to
avoid applying characterOffset to byte positions when no true byte offset is
available.
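As a rough illustration of the first option, the sketch below keeps the two offsets as separate values and applies each one only in its own unit. Everything in it is an assumption made for illustration: ResumeOffsets, ofSkippedPrefix, startCharPosition and startBytePosition are hypothetical names, not existing or proposed Commons CSV API. In a real resumed parse the caller would normally already know the byte offset (for example from a file position) rather than re-encoding the prefix.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical value holder, for illustration only (not Commons CSV API).
public class ResumeOffsets {

    final long characterOffset; // skipped prefix length in chars
    final long byteOffset;      // skipped prefix length in encoded bytes

    ResumeOffsets(final long characterOffset, final long byteOffset) {
        this.characterOffset = characterOffset;
        this.byteOffset = byteOffset;
    }

    // Derives both offsets from the skipped prefix. Only intended for encodings
    // that do not emit a byte order mark, such as UTF-8.
    static ResumeOffsets ofSkippedPrefix(final String prefix, final Charset charset) {
        return new ResumeOffsets(prefix.length(), prefix.getBytes(charset).length);
    }

    // Character position: lexer-relative position plus the character offset.
    long startCharPosition(final long lexerCharPosition) {
        return lexerCharPosition + characterOffset;
    }

    // Byte position: lexer-relative bytes read plus the byte offset,
    // never the character offset.
    long startBytePosition(final long lexerBytesRead) {
        return lexerBytesRead + byteOffset;
    }

    public static void main(final String[] args) {
        final ResumeOffsets offsets = ofSkippedPrefix("\u00e9,x\n", StandardCharsets.UTF_8);
        System.out.println(offsets.startCharPosition(0)); // 4
        System.out.println(offsets.startBytePosition(0)); // 5
    }
}
```

With the prefix from the reproducer, this yields character position 4 and byte position 5 for the record "b,c", matching the expected behavior described above.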