[ 
https://issues.apache.org/jira/browse/TIKA-2833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16780854#comment-16780854
 ] 

Tim Allison edited comment on TIKA-2833 at 3/4/19 4:06 PM:
-----------------------------------------------------------

The real test will be a run against the full corpus to see how many false 
positives we get: files identified as csv that are actually plain text.

In addition to adding a first-pass (heuristic) detector, I also added a 
backoff: if there is a parse exception, whatever is left in the Reader is 
treated as plain text.  We could customize the reader (wrap it in something) to 
capture the content that is buffered in the o.a.c.csv.CSVParser when the 
exception is hit.
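
To make the reader-wrapping idea concrete, here is a rough sketch using only 
the JDK.  It is not the code in the patch: RecordingReader, strictParse and 
extract are hypothetical names, and strictParse is only a stand-in for a 
buffering parser, not the commons-csv API.  It also simplifies the backoff by 
returning the whole input as plain text on failure, rather than only what is 
left in the Reader.

```java
import java.io.BufferedReader;
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class CsvBackoffSketch {

    /** Hypothetical wrapper: remembers every char a downstream parser pulls from the source. */
    static final class RecordingReader extends FilterReader {
        private final StringBuilder seen = new StringBuilder();

        RecordingReader(Reader in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            int c = super.read();
            if (c != -1) {
                seen.append((char) c);
            }
            return c;
        }

        @Override
        public int read(char[] buf, int off, int len) throws IOException {
            int n = super.read(buf, off, len);
            if (n > 0) {
                seen.append(buf, off, n);
            }
            return n;
        }

        String consumedSoFar() {
            return seen.toString();
        }
    }

    /** Stand-in for a strict parser (assumption): buffers its input and rejects unbalanced quotes. */
    static List<String[]> strictParse(Reader in) throws IOException {
        BufferedReader buffered = new BufferedReader(in); // reads ahead of what it has parsed
        List<String[]> rows = new ArrayList<>();
        String line;
        while ((line = buffered.readLine()) != null) {
            long quotes = line.chars().filter(c -> c == '"').count();
            if (quotes % 2 != 0) {
                throw new IOException("unbalanced quote in: " + line);
            }
            rows.add(line.split(",", -1));
        }
        return rows;
    }

    /** Returns tab-joined rows, or the whole input as plain text if parsing fails. */
    static String extract(Reader source) throws IOException {
        RecordingReader recorder = new RecordingReader(source);
        try {
            StringBuilder out = new StringBuilder();
            for (String[] row : strictParse(recorder)) {
                out.append(String.join("\t", row)).append('\n');
            }
            return out.toString();
        } catch (IOException parseFailure) {
            // Backoff: start from what the parser already pulled into its buffer...
            StringBuilder text = new StringBuilder(recorder.consumedSoFar());
            // ...then drain whatever the parser never reached.
            char[] buf = new char[1024];
            int n;
            while ((n = source.read(buf)) != -1) {
                text.append(buf, 0, n);
            }
            return text.toString();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.print(extract(new StringReader("a,b\n1,2\n")));
        System.out.print(extract(new StringReader("a,b\nnot \"really csv\nmore text\n")));
    }
}
```

In the second call the stand-in parser has already drained the small input 
into its own buffer by the time it fails, so the underlying Reader is empty; 
the text survives only because the wrapper recorded it.  A real version would 
also need to skip what was already emitted as rows, and to tell parse failures 
apart from genuine I/O errors, which this sketch does not attempt.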


was (Author: [email protected]):
The real test will be against the full corpus to see how many false positives 
we have for files identified as csv but are actually plain text.

In addition to adding a first pass (horrifically heuristic) detector, I also 
added backoff if there is a parse exception to treat whatever is left in the 
Reader as if it is plain text.  We could customize the reader (wrap it in 
something) to capture content that is buffered in the o.a.c.csv.CSVParser when 
the exception was hit.

> Add a CSV/TSV detector
> ----------------------
>
>                 Key: TIKA-2833
>                 URL: https://issues.apache.org/jira/browse/TIKA-2833
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Minor
>         Attachments: csv_reports.zip
>
>
> Given initial experimentation, I think we can fairly easily add a fairly 
> robust CSV/TSV detector that will identify well-formed (ha!) csvs and return 
> the charset encoding and the delimiter.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
