[
https://issues.apache.org/jira/browse/TIKA-2833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16777934#comment-16777934
]
Tim Allison commented on TIKA-2833:
-----------------------------------
The initial question is where to place this detector. It should only be triggered
after all of the other user-specified detectors _and_ after MimeTypes.
Some options:
1) Build it into MimeTypes and run it only once MimeTypes is about to return
{{text/plain}} -- I don't want to hardwire this into MimeTypes, though.
2) Run it in TXTParser before parsing the text...I don't like this because it
bypasses the usual detector configurability and hardwires it into TXTParser.
3) Manually add it after adding MimeTypes in
DefaultDetector.getDefaultDetectors() -- I like this because users can
turn it off through configuration, but it is smelly/hacky.
4) Create a separate class (LowPriorityDetector (ugh!)) or add a parameter for
sorting that will guarantee that the CSVDetector is run after MimeTypes.
5) Make CSVParser allege that it can parse {{text/plain}}, run its detection
before the parse, and if it detects regular text (i.e., not a CSV), back off to
the TXTParser or replicate TXTParser's behavior. This would allow users to
turn off the CSVParser and detection via the usual {{exclude}} option on the
CSVParser.
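To make option 4 concrete, here is a rough sketch of what a sort parameter might look like. Everything in it is hypothetical: {{RankedDetector}}, {{rank}} and {{DetectorOrdering}} are illustration-only names and are not part of Tika's Detector interface or DefaultDetector. The only point is that a numeric rank plus a stable sort would be enough to guarantee that the CSV detector runs after MimeTypes.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical stand-in for a detector with a sort key; NOT Tika's Detector interface. */
final class RankedDetector {
    final String name;
    final int rank; // assumption for this sketch: lower rank runs earlier

    RankedDetector(String name, int rank) {
        this.name = name;
        this.rank = rank;
    }
}

final class DetectorOrdering {
    /** Returns detector names in run order; ties keep their input order (List.sort is stable). */
    static List<String> runOrder(List<RankedDetector> detectors) {
        List<RankedDetector> sorted = new ArrayList<>(detectors);
        sorted.sort(Comparator.comparingInt((RankedDetector d) -> d.rank));
        List<String> names = new ArrayList<>();
        for (RankedDetector d : sorted) {
            names.add(d.name);
        }
        return names;
    }
}
```

With ranks of, say, 0 for user-specified detectors, 50 for MimeTypes and 100 for the CSV detector, the CSV detector always lands last regardless of the order in which the detectors were loaded.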
Any recommendations/preferences? I'm currently inclined toward option 5, but I
suspect there may be a more elegant answer.
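For discussion, a minimal sketch of the "detect, then back off" decision in option 5. This is not Tika code: {{CsvSniffer}}, {{sniffDelimiter}} and {{chooseType}} are hypothetical names, and the heuristic (the same non-zero delimiter count on every non-empty line) is a deliberately naive assumption that ignores quoted fields and charset detection.

```java
/** Hypothetical sketch of option 5's back-off decision; NOT part of Tika. */
final class CsvSniffer {
    static final String TEXT_PLAIN = "text/plain";
    static final String TEXT_CSV = "text/csv";
    static final String TEXT_TSV = "text/tab-separated-values";

    /**
     * Returns ',' or '\t' if at least two non-empty lines all contain the same
     * non-zero number of that character; returns 0 otherwise.
     * Naive on purpose: delimiters inside quoted fields are counted too.
     */
    static char sniffDelimiter(String sample) {
        String[] lines = sample.split("\\R");
        for (char candidate : new char[] {',', '\t'}) {
            int expected = -1;
            int rows = 0;
            boolean consistent = true;
            for (String line : lines) {
                if (line.isEmpty()) {
                    continue;
                }
                int count = 0;
                for (int i = 0; i < line.length(); i++) {
                    if (line.charAt(i) == candidate) {
                        count++;
                    }
                }
                if (expected < 0) {
                    expected = count;
                } else if (count != expected) {
                    consistent = false;
                    break;
                }
                rows++;
            }
            if (consistent && rows >= 2 && expected > 0) {
                return candidate;
            }
        }
        return 0;
    }

    /** Picks the CSV/TSV type when a delimiter is found, else backs off to plain text. */
    static String chooseType(String sample) {
        char delimiter = sniffDelimiter(sample);
        if (delimiter == ',') {
            return TEXT_CSV;
        }
        if (delimiter == '\t') {
            return TEXT_TSV;
        }
        return TEXT_PLAIN;
    }
}
```

In a real parser the {{text/plain}} branch would hand the stream to TXTParser (or replicate its behavior); here it only returns the type string so the decision itself is easy to see.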
> Add a CSV/TSV detector
> ----------------------
>
> Key: TIKA-2833
> URL: https://issues.apache.org/jira/browse/TIKA-2833
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Minor
>
> Given initial experimentation, I think we can easily add a fairly
> robust CSV/TSV detector that will identify well-formed (ha!) CSVs and return
> the charset encoding and the delimiter.