[jira] [Commented] (TIKA-1528) Add an OverrideDetector that overrides other detectors

Nick Burch (JIRA) Thu, 22 Jan 2015 11:41:57 -0800

    [ 
https://issues.apache.org/jira/browse/TIKA-1528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14288086#comment-14288086
 ]


Nick Burch commented on TIKA-1528:
----------------------------------

Ah, right, I think I get it. The SQLiteParser knows it has a table, and knows 
that it needs the JDBCTableParser, but wants to go via an Extractor so that 
each table can be treated individually if required. Is that it?

If so, since you control both ends, you could always cheat... Pop the table 
onto the TikaInputStream as an open container, provide an empty byte array as 
data, put the table mimetype on the metadata along with the table name as the 
resource name, hand that off to the EmbeddedDocumentExtractor, and wait for 
that special TikaInputStream to appear at the table parser. With no data, the 
other parsers will decline to do anything, so the mimetype on the metadata will 
win and your table parser will get the TikaInputStream. You then grab the real 
table details off the open container.
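
A rough sketch of that hand-off, in plain Java so it runs without Tika on the 
classpath. Everything here is a stand-in: CarrierStream plays the role of 
TikaInputStream (which has setOpenContainer/getOpenContainer), a plain Map 
plays the role of Metadata, and TableHandle, TABLE_TYPE and the "resourceName" 
key are hypothetical names invented for the example. In real code the stream 
and metadata would go through the EmbeddedDocumentExtractor rather than being 
passed directly.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.util.Map;

final class OpenContainerHandoff {

    /** Stand-in for TikaInputStream: a stream that can carry an arbitrary object. */
    static class CarrierStream extends ByteArrayInputStream {
        private Object openContainer;

        CarrierStream(byte[] data) {
            super(data);
        }

        void setOpenContainer(Object container) {
            this.openContainer = container;
        }

        Object getOpenContainer() {
            return openContainer;
        }
    }

    /** Hypothetical handle for one table; a real one would also hold the connection. */
    static class TableHandle {
        final String tableName;

        TableHandle(String tableName) {
            this.tableName = tableName;
        }
    }

    /** Hypothetical mimetype that only the table parser claims. */
    static final String TABLE_TYPE = "application/x-hypothetical-jdbc-table";

    /** Container-parser side: empty data, table on the stream, type and name on the metadata. */
    static CarrierStream wrapTable(TableHandle table, Map<String, String> metadata) {
        CarrierStream stream = new CarrierStream(new byte[0]);
        stream.setOpenContainer(table);
        metadata.put("Content-Type", TABLE_TYPE);
        metadata.put("resourceName", table.tableName);
        return stream;
    }

    /** Table-parser side: ignore the (empty) bytes and read the table off the container. */
    static String parseTable(InputStream stream, Map<String, String> metadata) {
        if (!TABLE_TYPE.equals(metadata.get("Content-Type"))) {
            return null; // not ours
        }
        if (!(stream instanceof CarrierStream)) {
            return null; // no side channel available
        }
        Object container = ((CarrierStream) stream).getOpenContainer();
        if (!(container instanceof TableHandle)) {
            return null; // a parser unaware of the trick ends up here with nothing to do
        }
        return "table:" + ((TableHandle) container).tableName;
    }
}
```

The last branch is the catch: anything that does not know to look at the open 
container sees only a zero-length stream.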

It all depends on whether you think a parser that didn't know about the 
special JDBC table connection thingy would ever be able to do something useful 
with the table.

> Add an OverrideDetector that overrides other detectors
> ------------------------------------------------------
>
>                 Key: TIKA-1528
>                 URL: https://issues.apache.org/jira/browse/TIKA-1528
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Priority: Minor
>
> While working on TIKA-1511, I found a need to bypass our current detection 
> mechanism.  I think that there are other use cases for this.  The idea is 
> that a client or a tika-internal call wants to specify the Content-Type for a 
> document and bypass the regular mime detection chain.
> We currently have the TypeDetector that returns the "Content-Type" as 
> specified in the Metadata, but there are two deficiencies in using that class 
> for this purpose:
> * Content-Type is currently ambiguous: when it comes into a Parser or 
> Detector, it could be used as a hint or as a direction.  I'd like the 
> OverrideDetector to use a different metadata key from our usual "Content-Type".
> * The ordering of the TypeDetector is based on alphabetic order of its class 
> name.  I'd like the OverrideDetector to be run first and then short 
> circuit/bypass the other detectors.
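
For illustration only, the behaviour requested in the issue description could 
be sketched like this in plain Java. SimpleDetector and OVERRIDE_KEY are 
hypothetical names, not Tika's Detector interface or an existing metadata key, 
and the fallback loop (first non-default answer wins) is a simplification 
rather than a description of how Tika combines detector results.

```java
import java.util.List;
import java.util.Map;

final class OverrideDetectionSketch {

    /** Hypothetical metadata key, deliberately distinct from "Content-Type". */
    static final String OVERRIDE_KEY = "X-Hypothetical-Content-Type-Override";

    static final String OCTET_STREAM = "application/octet-stream";

    /** Simplified stand-in for a detector: returns a mimetype string or null. */
    interface SimpleDetector {
        String detect(byte[] prefix, Map<String, String> metadata);
    }

    /** Check the override key first; only consult the chain when it is absent. */
    static String detect(byte[] prefix, Map<String, String> metadata,
                         List<SimpleDetector> chain) {
        String forced = metadata.get(OVERRIDE_KEY);
        if (forced != null && !forced.isEmpty()) {
            return forced; // short-circuit: the other detectors never run
        }
        for (SimpleDetector detector : chain) {
            String type = detector.detect(prefix, metadata);
            if (type != null && !OCTET_STREAM.equals(type)) {
                return type;
            }
        }
        return OCTET_STREAM;
    }
}
```

Because the override lives under its own key, a "Content-Type" value already 
on the metadata can keep being treated as a hint.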



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
