[
https://issues.apache.org/jira/browse/CSV-324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Gary D. Gregory updated CSV-324:
--------------------------------
Assignee: Gary D. Gregory
> Lexer.isDelimiter() accepts a partial multi-character delimiter at EOF
> ----------------------------------------------------------------------
>
> Key: CSV-324
> URL: https://issues.apache.org/jira/browse/CSV-324
> Project: Commons CSV
> Issue Type: Bug
> Components: Build
> Affects Versions: 1.14.1
> Reporter: Ruiqi Dong
> Assignee: Gary D. Gregory
> Priority: Minor
>
> *Summary*
> In the scenario tested below, a truncated multi-character delimiter at EOF is
> treated as a complete delimiter. The relevant code path appears to be
> Lexer.isDelimiter(), which reports a match whenever the final read of the
> delimiter suffix returns anything other than EOF, instead of requiring that
> all delimiterBuf.length suffix characters were actually read.
> *Affected code*
> File: src/main/java/org/apache/commons/csv/Lexer.java
> {code:java}
> private final char[] delimiter;
> private final char[] delimiterBuf;
>
> boolean isDelimiter(final int ch) throws IOException {
>     isLastTokenDelimiter = false;
>     if (ch != delimiter[0]) {
>         return false;
>     }
>     if (delimiter.length == 1) {
>         isLastTokenDelimiter = true;
>         return true;
>     }
>     reader.peek(delimiterBuf);
>     for (int i = 0; i < delimiterBuf.length; i++) {
>         if (delimiterBuf[i] != delimiter[i + 1]) {
>             return false;
>         }
>     }
>     final int count = reader.read(delimiterBuf, 0, delimiterBuf.length);
>     isLastTokenDelimiter = count != EOF;
>     return isLastTokenDelimiter;
> }
> {code}
> *Reproducer*
> Add the following test to src/test/java/org/apache/commons/csv/LexerTest.java:
> {code:java}
> @Test
> void testPartialMultiCharacterDelimiterAtEOFIsNotConsumed() throws IOException {
>     final CSVFormat format = CSVFormat.DEFAULT.builder().setDelimiter("[|]").get();
>     try (Lexer lexer = createLexer("a[|]b[|", format)) {
>         assertNextToken(TOKEN, "a", lexer);
>         assertNextToken(EOF, "b[|", lexer);
>     }
> }
> {code}
> Run:
> {code:java}
> mvn -q -Dtest=org.apache.commons.csv.LexerTest#testPartialMultiCharacterDelimiterAtEOFIsNotConsumed test
> {code}
> Observed behavior:
> {code:java}
> LexerTest.testPartialMultiCharacterDelimiterAtEOFIsNotConsumed:237 expected: <EOF> but was: <TOKEN>
> {code}
> In other words, the trailing "[|" is not preserved as data. Instead, it is
> treated as a delimiter and produces an extra token boundary.
> Expected behavior:
> The trailing "[|" should remain part of the final token because the full
> delimiter "[|]" was not present.
>
> The lexer is recognizing an incomplete delimiter as a complete field
> separator, which changes the parsed token stream.
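
For illustration only, here is a minimal standalone sketch of the stricter
check the report describes: a delimiter counts as matched only when every
suffix character was actually read and compared. This is not the Commons CSV
code or a proposed patch. The class DelimiterSketch, the helper tokenize(), and
the use of BufferedReader.mark()/reset() in place of the lexer's
reader.peek(...) are all assumptions made for this example, and quoting,
escaping, and line handling are ignored.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the lexer; not part of Commons CSV.
final class DelimiterSketch {

    /**
     * Returns true only if ch plus the following characters spell the whole
     * delimiter. The suffix is consumed on a match and left unread otherwise.
     */
    static boolean isDelimiter(final BufferedReader reader, final int ch, final char[] delimiter)
            throws IOException {
        if (ch != delimiter[0]) {
            return false;
        }
        final int suffixLen = delimiter.length - 1;
        if (suffixLen == 0) {
            return true;
        }
        final char[] buf = new char[suffixLen];
        reader.mark(suffixLen);
        int filled = 0;
        // Keep reading until the suffix is complete or the input ends.
        while (filled < suffixLen) {
            final int n = reader.read(buf, filled, suffixLen - filled);
            if (n == -1) {
                break;
            }
            filled += n;
        }
        // A short read at EOF can never be a full delimiter.
        boolean match = filled == suffixLen;
        for (int i = 0; match && i < suffixLen; i++) {
            match = buf[i] == delimiter[i + 1];
        }
        if (!match) {
            reader.reset(); // put the peeked characters back as data
        }
        return match;
    }

    /** Splits input on the delimiter; a partial delimiter stays in the token. */
    static List<String> tokenize(final String input, final String delimiter) {
        final char[] delim = delimiter.toCharArray();
        final List<String> tokens = new ArrayList<>();
        final StringBuilder current = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new StringReader(input))) {
            int ch;
            while ((ch = reader.read()) != -1) {
                if (isDelimiter(reader, ch, delim)) {
                    tokens.add(current.toString());
                    current.setLength(0);
                } else {
                    current.append((char) ch);
                }
            }
        } catch (final IOException e) {
            throw new UncheckedIOException(e);
        }
        tokens.add(current.toString());
        return tokens;
    }

    public static void main(final String[] args) {
        // Input from the reproducer: the trailing "[|" is incomplete.
        System.out.println(tokenize("a[|]b[|", "[|]"));
        // Control case with two complete delimiters.
        System.out.println(tokenize("a[|]b[|]c", "[|]"));
    }
}
```

With the reproducer's input, this sketch yields the two tokens "a" and "b[|",
which matches the expected behavior stated above. The design point is the
filled == suffixLen test: the number of characters read is compared against
the full suffix length rather than against EOF alone.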