[ 
https://issues.apache.org/jira/browse/TIKA-2833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16777934#comment-16777934
 ] 

Tim Allison commented on TIKA-2833:
-----------------------------------

Initial question is where to place this detector. It should only be triggered 
after all of the other user-specified detectors _and_ after MimeTypes.

Some options:
1) Build it into MimeTypes and run it only when MimeTypes is about to return 
{{text/plain}} -- I don't want to hardwire this into MimeTypes, though.
2) Run it in TXTParser before parsing the text...I don't like this because it 
bypasses the usual detector configurability and hardwires it into TXTParser.
3) Manually add it after adding MimeTypes in 
DefaultDetector.getDefaultDetectors() -- I like this because users can 
turn it off via configuration, but it is smelly/hacky.
4) Create a separate class (LowPriorityDetector (ugh!)) or add a parameter for 
sorting that will guarantee that the CSVDetector is run after MimeTypes.
5) Make CSVParser claim that it can parse {{text/plain}}, run its detection 
before the parse and, if it detects regular text rather than a CSV, back off 
to the TXTParser or replicate TXTParser's behavior.  This would allow users to 
turn off the CSVParser and its detection via the usual {{exclude}} option on 
the CSVParser.

Any recommendations/preferences?  I'm currently inclined to 5, but I suspect 
there may be a more elegant answer.
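
To make option 5 concrete, here is a minimal sketch of the "sniff first, then back off" idea. It is deliberately independent of Tika's actual Parser and Detector APIs: the class name, the candidate delimiter set, and the consistency heuristic are all assumptions for illustration, and quoted fields and charset detection are ignored entirely.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch only; none of these names exist in Tika.
final class CsvSniffSketch {

    // Candidate delimiters; this particular set is an assumption.
    private static final char[] CANDIDATES = {',', '\t', ';', '|'};

    /**
     * Returns the first candidate delimiter that occurs the same non-zero
     * number of times on every non-empty line, or null if none does.
     * Quoted fields are not handled, so this is far cruder than a real
     * detector would need to be.
     */
    static Character sniffDelimiter(String text) {
        List<String> lines = new ArrayList<>();
        for (String line : text.split("\\R")) {
            if (!line.isEmpty()) {
                lines.add(line);
            }
        }
        if (lines.size() < 2) {
            return null; // too little evidence to call it a table
        }
        for (char candidate : CANDIDATES) {
            int expected = count(lines.get(0), candidate);
            if (expected > 0 && allLinesHave(lines, candidate, expected)) {
                return candidate;
            }
        }
        return null;
    }

    private static boolean allLinesHave(List<String> lines, char c, int expected) {
        for (String line : lines) {
            if (count(line, c) != expected) {
                return false;
            }
        }
        return true;
    }

    private static int count(String s, char c) {
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == c) {
                n++;
            }
        }
        return n;
    }

    /**
     * Mimics the dispatch in option 5: the parser claims text/plain, sniffs
     * the content, and backs off to plain-text handling when no delimiter
     * is found. The returned labels stand in for the two parse paths.
     */
    static String chooseHandler(String text) {
        return sniffDelimiter(text) != null ? "csv" : "txt";
    }
}
```

With this shape, disabling the CSV path would only require excluding the one class that does the sniffing. A heuristic this naive will misfire on ordinary prose that happens to have the same number of commas on every line, which is part of why the placement and opt-out questions above matter.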



> Add a CSV/TSV detector
> ----------------------
>
>                 Key: TIKA-2833
>                 URL: https://issues.apache.org/jira/browse/TIKA-2833
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Minor
>
> Given initial experimentation, I think we can fairly easily add a reasonably 
> robust CSV/TSV detector that will identify well-formed (ha!) CSVs and return 
> the charset encoding and the delimiter.


