Ruiqi Dong created CSV-324:
------------------------------
Summary: Lexer.isDelimiter() accepts a partial multi-character
delimiter at EOF
Key: CSV-324
URL: https://issues.apache.org/jira/browse/CSV-324
Project: Commons CSV
Issue Type: Bug
Components: Build
Affects Versions: 1.14.1
Reporter: Ruiqi Dong
*Summary*
In the tested scenario below, a truncated multi-character delimiter at EOF is
treated as a real delimiter. The relevant code path appears to be
Lexer.isDelimiter(), which accepts the delimiter whenever the final read of the
suffix returns anything other than EOF, instead of requiring that all
delimiter.length - 1 suffix characters were actually read.
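The difference between "not EOF" and "fully read" can be shown in isolation with plain java.io. The following is a minimal sketch, not Commons CSV code: the class name PartialReadDemo and the helper readSuffix are hypothetical, and the pre-filled buffer assumes that the suffix buffer is reused across calls and still holds the characters of an earlier complete match.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical demo class; it only models the read-count check, not the real Lexer.
final class PartialReadDemo {

    static final int EOF = -1;

    // Performs a single bulk read and returns the count reported by Reader.read.
    static int readSuffix(final Reader reader, final char[] buf) throws IOException {
        return reader.read(buf, 0, buf.length);
    }

    public static void main(final String[] args) throws IOException {
        // Assumption: the buffer still contains "|]" from an earlier full match of "[|]".
        final char[] buf = {'|', ']'};
        // Only one character ("|") is left before the end of the input.
        final int count = readSuffix(new StringReader("|"), buf);

        System.out.println(count != EOF);        // true: a partial read is not EOF
        System.out.println(count == buf.length); // false: the suffix is incomplete
        System.out.println(new String(buf));     // |] (second char was never overwritten)
    }
}
```

A check of the form count != EOF passes here even though only one of the two suffix characters was available, whereas a check against buf.length does not.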
*Affected code*
File: src/main/java/org/apache/commons/csv/Lexer.java
{code:java}
private final char[] delimiter;
private final char[] delimiterBuf;

boolean isDelimiter(final int ch) throws IOException {
    isLastTokenDelimiter = false;
    if (ch != delimiter[0]) {
        return false;
    }
    if (delimiter.length == 1) {
        isLastTokenDelimiter = true;
        return true;
    }
    reader.peek(delimiterBuf);
    for (int i = 0; i < delimiterBuf.length; i++) {
        if (delimiterBuf[i] != delimiter[i + 1]) {
            return false;
        }
    }
    final int count = reader.read(delimiterBuf, 0, delimiterBuf.length);
    isLastTokenDelimiter = count != EOF;
    return isLastTokenDelimiter;
}
{code}
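One possible direction, shown only as a self-contained sketch and not as a patch for Lexer.java, is to accept the delimiter only when the complete suffix has been read and compared. The class DelimiterSketch and its static isDelimiter method are hypothetical; the sketch uses Reader.mark/reset on a plain java.io.Reader instead of the lexer's own reader and fields.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical stand-alone model of a strict multi-character delimiter check.
final class DelimiterSketch {

    // Returns true only if ch followed by the next delimiter.length - 1 characters
    // equals the whole delimiter. On a match the suffix is consumed; otherwise the
    // reader is reset so the characters stay available as data.
    // Assumes the reader supports mark/reset (StringReader and BufferedReader do).
    static boolean isDelimiter(final Reader reader, final int ch, final char[] delimiter)
            throws IOException {
        if (ch != delimiter[0]) {
            return false;
        }
        final int suffixLength = delimiter.length - 1;
        if (suffixLength == 0) {
            return true;
        }
        final char[] buf = new char[suffixLength]; // fresh buffer, no stale contents
        reader.mark(suffixLength);
        int total = 0;
        while (total < suffixLength) {
            final int n = reader.read(buf, total, suffixLength - total);
            if (n < 0) {
                break; // EOF before the full suffix was available
            }
            total += n;
        }
        boolean match = total == suffixLength;
        for (int i = 0; match && i < suffixLength; i++) {
            match = buf[i] == delimiter[i + 1];
        }
        if (!match) {
            reader.reset(); // leave the partial suffix in the stream
        }
        return match;
    }

    public static void main(final String[] args) throws IOException {
        final char[] delimiter = "[|]".toCharArray();
        System.out.println(isDelimiter(new StringReader("|]b"), '[', delimiter)); // true
        System.out.println(isDelimiter(new StringReader("|"), '[', delimiter));   // false
    }
}
```

The key design choice in this sketch is that the read count is compared against the suffix length, so a short read at EOF can never be mistaken for a match.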
*Reproducer*
Add the following test to src/test/java/org/apache/commons/csv/LexerTest.java:
{code:java}
@Test
void testPartialMultiCharacterDelimiterAtEOFIsNotConsumed() throws IOException {
    final CSVFormat format = CSVFormat.DEFAULT.builder().setDelimiter("[|]").get();
    try (Lexer lexer = createLexer("a[|]b[|", format)) {
        assertNextToken(TOKEN, "a", lexer);
        assertNextToken(EOF, "b[|", lexer);
    }
}
{code}
Run:
{code:bash}
mvn -q -Dtest=org.apache.commons.csv.LexerTest#testPartialMultiCharacterDelimiterAtEOFIsNotConsumed test
{code}
Observed behavior:
{noformat}
LexerTest.testPartialMultiCharacterDelimiterAtEOFIsNotConsumed:237 expected: <EOF> but was: <TOKEN>
{noformat}
In other words, the trailing "[|" is not preserved as data. Instead, it is
treated as a delimiter and produces an extra token boundary.
Expected behavior:
The trailing "[|" should remain part of the final token because the full
delimiter "[|]" was not present. The lexer currently treats an incomplete
delimiter as a complete field separator, which changes the parsed token stream.
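For reference, the expected token boundaries can be modelled with a plain string split on the complete delimiter. This is only an illustrative model of the expected result, assuming no quoting or escaping is involved; SplitModel and split are hypothetical names and are not part of Commons CSV.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical reference model: splits only on complete occurrences of the delimiter.
final class SplitModel {

    static List<String> split(final String input, final String delimiter) {
        final List<String> tokens = new ArrayList<>();
        int start = 0;
        int index;
        // indexOf only finds the full delimiter, so a truncated "[|" is never a boundary.
        while ((index = input.indexOf(delimiter, start)) >= 0) {
            tokens.add(input.substring(start, index));
            start = index + delimiter.length();
        }
        tokens.add(input.substring(start)); // the remainder is the final token
        return tokens;
    }

    public static void main(final String[] args) {
        System.out.println(split("a[|]b[|", "[|]")); // [a, b[|]
    }
}
```

With the reproducer input, this model yields the two tokens "a" and "b[|", which matches the assertions in the test above.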