Ruiqi Dong created CSV-324:
------------------------------

             Summary: Lexer.isDelimiter() accepts a partial multi-character 
delimiter at EOF
                 Key: CSV-324
                 URL: https://issues.apache.org/jira/browse/CSV-324
             Project: Commons CSV
          Issue Type: Bug
          Components: Build
    Affects Versions: 1.14.1
            Reporter: Ruiqi Dong


*Summary*
In the scenario tested below, a truncated multi-character delimiter at EOF is 
treated as a complete delimiter. The relevant code path appears to be 
Lexer.isDelimiter(), which accepts the delimiter as long as the final read of 
the suffix does not return EOF, instead of requiring the entire delimiter 
suffix to be read and matched.
*Affected code*
File: src/main/java/org/apache/commons/csv/Lexer.java
{code:java}
private final char[] delimiter;
private final char[] delimiterBuf;

boolean isDelimiter(final int ch) throws IOException {
    isLastTokenDelimiter = false;
    if (ch != delimiter[0]) {
        return false;
    }
    if (delimiter.length == 1) {
        isLastTokenDelimiter = true;
        return true;
    }
    reader.peek(delimiterBuf);
    for (int i = 0; i < delimiterBuf.length; i++) {
        if (delimiterBuf[i] != delimiter[i + 1]) {
            return false;
        }
    }
    final int count = reader.read(delimiterBuf, 0, delimiterBuf.length);
    isLastTokenDelimiter = count != EOF;
    return isLastTokenDelimiter;
}{code}
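For illustration only, the stricter check described in the summary could look like the standalone sketch below. It is not the Commons CSV code or a proposed patch: it uses a plain java.io.BufferedReader with mark()/reset() instead of the Lexer's own reader, and the class name DelimiterSuffixSketch and the method consumeDelimiterSuffix are hypothetical. The sketch accepts the delimiter only when the full suffix was read and every character matches, and otherwise rewinds so the characters stay available as data.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

public class DelimiterSuffixSketch {

    /**
     * Hypothetical helper: assumes the caller has already read delimiter[0].
     * Consumes the rest of the delimiter and returns true only if the reader's
     * next characters are exactly delimiter[1..]; otherwise consumes nothing.
     */
    static boolean consumeDelimiterSuffix(final BufferedReader reader, final char[] delimiter)
            throws IOException {
        final int suffixLength = delimiter.length - 1;
        if (suffixLength == 0) {
            return true; // single-character delimiter, nothing more to check
        }
        final char[] buf = new char[suffixLength];
        reader.mark(suffixLength);
        // Count how many suffix characters were actually available before EOF.
        int filled = 0;
        while (filled < suffixLength) {
            final int n = reader.read(buf, filled, suffixLength - filled);
            if (n == -1) {
                break; // EOF before the whole suffix was seen
            }
            filled += n;
        }
        // A short read can never be a match, regardless of the buffer contents.
        boolean match = filled == suffixLength;
        for (int i = 0; match && i < suffixLength; i++) {
            match = buf[i] == delimiter[i + 1];
        }
        if (!match) {
            reader.reset(); // leave the partial delimiter in the stream as data
        }
        return match;
    }

    public static void main(final String[] args) throws IOException {
        final char[] delimiter = "[|]".toCharArray();
        // The leading '[' is assumed to have been read by the caller already.
        try (BufferedReader full = new BufferedReader(new StringReader("|]b"));
             BufferedReader partial = new BufferedReader(new StringReader("|"))) {
            System.out.println(consumeDelimiterSuffix(full, delimiter));    // complete suffix
            System.out.println(consumeDelimiterSuffix(partial, delimiter)); // truncated at EOF
        }
    }
}
```

The key difference from the code quoted above is that the decision is based on the number of suffix characters actually read, not on whether the last read returned EOF.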
*Reproducer*
Add the following test to src/test/java/org/apache/commons/csv/LexerTest.java:
{code:java}
@Test
void testPartialMultiCharacterDelimiterAtEOFIsNotConsumed() throws IOException {
    final CSVFormat format = CSVFormat.DEFAULT.builder().setDelimiter("[|]").get();
    try (Lexer lexer = createLexer("a[|]b[|", format)) {
        assertNextToken(TOKEN, "a", lexer);
        assertNextToken(EOF, "b[|", lexer);
    }
} {code}
Run:
{code:bash}
mvn -q -Dtest=org.apache.commons.csv.LexerTest#testPartialMultiCharacterDelimiterAtEOFIsNotConsumed test
{code}
Observed behavior:
{code:java}
LexerTest.testPartialMultiCharacterDelimiterAtEOFIsNotConsumed:237
expected: <EOF> but was: <TOKEN> {code}
In other words, the trailing "[|" is not preserved as data. Instead, it is 
treated as a delimiter and produces an extra token boundary.
Expected behavior:
The trailing "[|" should remain part of the final token because the full 
delimiter "[|]" was not present. The lexer recognizes an incomplete 
delimiter as a complete field separator, which changes the parsed token stream.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
