[ 
https://issues.apache.org/jira/browse/TIKA-3155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17175415#comment-17175415
 ] 

Nick Burch commented on TIKA-3155:
----------------------------------

If we can use quote mode, we should: it makes the output from Tika nicer 
because the quote characters no longer show up in the resulting text.

e.g. {{"test1","test 2",3,"test 4"}} would be best done as 
{{<td>test1</td><td>test 2</td><td>3</td><td>test 4</td>}}, not 
{{<td>"test1"</td>...}}
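
To make the difference concrete, here is a rough sketch in plain Java. It is 
not Commons CSV and not Tika's actual code: the class and method names 
(QuoteModeSketch, splitQuoted, splitUnquoted, toCells) are made up for 
illustration, it handles a single line only, and it does no HTML escaping.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only: not Commons CSV and not Tika code.
class QuoteModeSketch {

    // Split one CSV line, treating '"' as the quote character.
    // A doubled quote ("") inside a quoted field is an escaped literal quote.
    static List<String> splitQuoted(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        current.append('"'); // escaped quote
                        i++;
                    } else {
                        inQuotes = false;    // closing quote
                    }
                } else {
                    current.append(c);
                }
            } else if (c == '"') {
                inQuotes = true;             // opening quote, not part of the text
            } else if (c == ',') {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());
        return fields;
    }

    // Split on commas only; quote characters stay in the cell text.
    static List<String> splitUnquoted(String line) {
        List<String> fields = new ArrayList<>();
        for (String f : line.split(",", -1)) {
            fields.add(f);
        }
        return fields;
    }

    // Wrap each field in a table cell (no HTML escaping in this sketch).
    static String toCells(List<String> fields) {
        StringBuilder sb = new StringBuilder();
        for (String f : fields) {
            sb.append("<td>").append(f).append("</td>");
        }
        return sb.toString();
    }
}
```

With the example line above, toCells(splitQuoted(line)) gives the clean cells, 
while toCells(splitUnquoted(line)) leaves the quote characters in every quoted 
cell.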

I don't know whether Commons CSV offers a lax mode.

Catching the error and retrying with quoting disabled would be another option, 
but it won't work as-is because the stream will already have been consumed, and 
I'm not sure it's enough to enforce writing to a file for
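
A minimal sketch of the catch-and-retry idea, assuming the input is small 
enough to buffer in memory (the in-memory copy stands in for a temporary file). 
Everything here is hypothetical: RetrySketch, parseWithQuotes and 
parseWithoutQuotes are toy stand-ins, not Tika or Commons CSV calls, and the 
"odd number of quotes" check is only a crude proxy for an unterminated quoted 
field.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: buffer the input once so a failed parse can be retried.
class RetrySketch {

    // Read the whole stream into memory; after this the stream is consumed.
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n);
        }
        return out.toByteArray();
    }

    // Toy stand-in for a quote-aware parse: fails on an unbalanced quote,
    // otherwise returns the text with the quote characters stripped.
    static String parseWithQuotes(InputStream in) throws IOException {
        String text = new String(readAll(in), StandardCharsets.UTF_8);
        long quotes = text.chars().filter(c -> c == '"').count();
        if (quotes % 2 != 0) {
            throw new IOException("EOF reached before encapsulated token finished");
        }
        return text.replace("\"", "");
    }

    // Toy stand-in for a parse with quoting disabled: quotes are kept as text.
    static String parseWithoutQuotes(InputStream in) throws IOException {
        return new String(readAll(in), StandardCharsets.UTF_8);
    }

    // Try the quote-aware parse first; on failure, retry from the buffered copy.
    static String parseWithFallback(InputStream in) throws IOException {
        byte[] buffered = readAll(in); // the original stream cannot be re-read
        try {
            return parseWithQuotes(new ByteArrayInputStream(buffered));
        } catch (IOException e) {
            return parseWithoutQuotes(new ByteArrayInputStream(buffered));
        }
    }
}
```

The retry only works because each attempt gets a fresh stream over the 
buffered bytes. For large files the buffer would have to be a temporary file 
instead.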

> Parse Error while extracting CSV files
> --------------------------------------
>
>                 Key: TIKA-3155
>                 URL: https://issues.apache.org/jira/browse/TIKA-3155
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.24.1
>            Reporter: Akash
>            Priority: Major
>         Attachments: UTF-8_chars.csv
>
>
> We are getting a parse error while trying to extract CSV files.
> This worked in version 1.9, but throws an exception in 1.24.1.
>  
> {code:java}
> Exception in thread "main" org.apache.tika.exception.TikaException: exception parsing the csv
>       at org.apache.tika.parser.csv.TextAndCSVParser.parse(TextAndCSVParser.java:198)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>       at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
>       at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:209)
>       at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:496)
>       at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:149)
> Caused by: java.lang.IllegalStateException: IOException reading next record: java.io.IOException: (startline 39) EOF reached before encapsulated token finished
>       at org.apache.commons.csv.CSVParser$CSVRecordIterator.getNextRecord(CSVParser.java:145)
>       at org.apache.commons.csv.CSVParser$CSVRecordIterator.hasNext(CSVParser.java:155)
>       at org.apache.tika.parser.csv.TextAndCSVParser.parse(TextAndCSVParser.java:178)
>       ... 6 more
> Caused by: java.io.IOException: (startline 39) EOF reached before encapsulated token finished
>       at org.apache.commons.csv.Lexer.parseEncapsulatedToken(Lexer.java:288)
>       at org.apache.commons.csv.Lexer.nextToken(Lexer.java:158)
>       at org.apache.commons.csv.CSVParser.nextRecord(CSVParser.java:674)
>       at org.apache.commons.csv.CSVParser$CSVRecordIterator.getNextRecord(CSVParser.java:142)
> {code}
> The issue occurs when one of the cells contains double quotes.



