[
https://issues.apache.org/jira/browse/TIKA-3155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17175290#comment-17175290
]
Peter Lee commented on TIKA-3155:
---------------------------------
We can disable quote handling in _TextAndCSVParser_ like this:
{code:java}
// withQuote(null) turns off quote handling, so a stray " is read as ordinary data
CSVFormat csvFormat =
        CSVFormat.EXCEL.withDelimiter(params.getDelimiter()).withQuote(null);
{code}
I tested with quoting turned off, and it works.
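
To illustrate why turning quoting off avoids the "EOF reached before encapsulated token finished" error, here is a minimal, stdlib-only sketch. It is a toy model and not the Commons CSV lexer: the class `QuoteModeSketch`, the method `splitLine`, and the sample line are all hypothetical, and doubled-quote escaping is deliberately ignored.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of quote handling in a CSV lexer; NOT the Commons CSV implementation. */
public class QuoteModeSketch {

    /**
     * Splits one line into fields.
     * quote == null: every character except the delimiter is ordinary data.
     * quote != null: the quote character opens and closes an encapsulated token.
     */
    public static List<String> splitLine(String line, char delimiter, Character quote) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (quote != null && c == quote) {
                // Quote-aware mode: toggle the encapsulated state, drop the quote itself.
                inQuotes = !inQuotes;
            } else if (c == delimiter && !inQuotes) {
                // A delimiter outside quotes ends the current field.
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        if (inQuotes) {
            // An unbalanced quote leaves the token open at end of input.
            throw new IllegalStateException("EOF reached before encapsulated token finished");
        }
        fields.add(current.toString());
        return fields;
    }

    public static void main(String[] args) {
        String line = "id,5\" screen,black"; // hypothetical cell with a lone double quote

        // Quoting off: the lone quote stays in the data and the line splits into three fields.
        System.out.println(splitLine(line, ',', null));

        // Quoting on: the lone quote opens a token that is never closed.
        try {
            splitLine(line, ',', '"');
        } catch (IllegalStateException e) {
            System.out.println("quote-aware mode failed: " + e.getMessage());
        }
    }
}
```

One trade-off to keep in mind: with quoting turned off, a properly quoted field that contains the delimiter (for example "a,b") is no longer kept together and is split at the delimiter.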
> Parse Error while extracting CSV files
> --------------------------------------
>
> Key: TIKA-3155
> URL: https://issues.apache.org/jira/browse/TIKA-3155
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.24.1
> Reporter: Akash
> Priority: Major
> Attachments: UTF-8_chars.csv
>
>
> We are getting a parse error when trying to extract CSV files.
> This worked in version 1.9, but it throws an exception in 1.24.1:
>
> {code:java}
> Exception in thread "main" org.apache.tika.exception.TikaException: exception parsing the csv
>     at org.apache.tika.parser.csv.TextAndCSVParser.parse(TextAndCSVParser.java:198)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
>     at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:209)
>     at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:496)
>     at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:149)
> Caused by: java.lang.IllegalStateException: IOException reading next record: java.io.IOException: (startline 39) EOF reached before encapsulated token finished
>     at org.apache.commons.csv.CSVParser$CSVRecordIterator.getNextRecord(CSVParser.java:145)
>     at org.apache.commons.csv.CSVParser$CSVRecordIterator.hasNext(CSVParser.java:155)
>     at org.apache.tika.parser.csv.TextAndCSVParser.parse(TextAndCSVParser.java:178)
>     ... 6 more
> Caused by: java.io.IOException: (startline 39) EOF reached before encapsulated token finished
>     at org.apache.commons.csv.Lexer.parseEncapsulatedToken(Lexer.java:288)
>     at org.apache.commons.csv.Lexer.nextToken(Lexer.java:158)
>     at org.apache.commons.csv.CSVParser.nextRecord(CSVParser.java:674)
>     at org.apache.commons.csv.CSVParser$CSVRecordIterator.getNextRecord(CSVParser.java:142)
> {code}
> The issue occurs when one of the cells contains a double quote.