Tim Allison created TIKA-2096:
---------------------------------

             Summary: Tika 2.0 -- Supply AutoDetectParser for embedded 
documents if user forgets to pass it in via ParseContext
                 Key: TIKA-2096
                 URL: https://issues.apache.org/jira/browse/TIKA-2096
             Project: Tika
          Issue Type: Improvement
            Reporter: Tim Allison


Currently, if users don't set either a Parser (keyed by Parser.class) or an 
EmbeddedDocumentExtractor in the ParseContext, embedded documents are not 
parsed.  I propose that we supply an AutoDetectParser automatically whenever 
neither a Parser nor an EmbeddedDocumentExtractor is present in the ParseContext.

If a user doesn't want to parse embedded objects, they could pass in an 
EmptyParser as the value for Parser.class.
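
As a rough illustration of the proposed lookup order, here is a sketch in plain 
Java.  It does not use Tika's classes: Parser, EmbeddedDocumentExtractor, 
Context, AUTO_DETECT, EMPTY and resolveEmbeddedParser are all hypothetical 
stand-ins, and the actual change in 2.0 may be wired quite differently.

```java
import java.util.HashMap;
import java.util.Map;

class EmbeddedParserDefaultSketch {

    // Hypothetical stand-ins for Tika's Parser and EmbeddedDocumentExtractor.
    interface Parser { String name(); }
    interface EmbeddedDocumentExtractor { }

    // Stand-ins for AutoDetectParser and EmptyParser; they only carry a name.
    static final Parser AUTO_DETECT = () -> "AutoDetectParser";
    static final Parser EMPTY = () -> "EmptyParser";

    // Minimal class-keyed context, loosely modeled on ParseContext's set/get.
    static final class Context {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) { entries.put(key, value); }

        // Returns null when nothing was registered under the key.
        <T> T get(Class<T> key) { return key.cast(entries.get(key)); }
    }

    // Picks the parser for embedded documents under the proposed default.
    // Returns null when a user-supplied extractor handles embedded documents.
    static Parser resolveEmbeddedParser(Context context) {
        Parser explicit = context.get(Parser.class);
        if (explicit != null) {
            return explicit;          // the user's choice wins, EMPTY included
        }
        if (context.get(EmbeddedDocumentExtractor.class) != null) {
            return null;              // defer to the user's extractor
        }
        return AUTO_DETECT;           // proposed fallback: parse everything
    }

    public static void main(String[] args) {
        Context untouched = new Context();
        System.out.println(resolveEmbeddedParser(untouched).name());

        Context optOut = new Context();
        optOut.set(Parser.class, EMPTY);
        System.out.println(resolveEmbeddedParser(optOut).name());
    }
}
```

With an untouched context the sketch falls back to the auto-detecting stand-in; 
registering the empty stand-in under Parser.class restores container-only 
parsing, which is the opt-out described above.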

In short, let's make the default be "parse everything", and leave it to the 
user to opt out when parsing only the container document is the desired 
behavior.

This is a breaking change.  I propose adding it to 2.0 only.

We were bitten by this on tika-server (TIKA-1584).  Solr (SOLR-7189) has been 
bitten by this. [Kite|https://github.com/kite-sdk/kite/issues/397] is still 
suffering from this.



