Tim Allison created TIKA-2096:
---------------------------------
Summary: Tika 2.0 -- Supply AutoDetectParser for embedded
documents if user forgets to pass it in via ParseContext
Key: TIKA-2096
URL: https://issues.apache.org/jira/browse/TIKA-2096
Project: Tika
Issue Type: Improvement
Reporter: Tim Allison
Currently, if users don't set a Parser (keyed by Parser.class) or an
EmbeddedDocumentExtractor in the ParseContext, embedded documents will not
be parsed. I propose that we supply an AutoDetectParser automatically when neither
a Parser nor an EmbeddedDocumentExtractor is present in the ParseContext.
If a user doesn't want to parse embedded objects, they could pass in an
EmptyParser under the Parser.class key.
In short, let's make the default be "parse everything"; a user who wants to
parse only the container document has to opt out explicitly if that's the
desired behavior.
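To make the proposed default and the opt-out concrete, here is a minimal, self-contained sketch. It does not use Tika itself: Parser, Context, AUTO_DETECT and EMPTY below are simplified, hypothetical stand-ins for Parser, ParseContext, AutoDetectParser and EmptyParser, and the EmbeddedDocumentExtractor branch is left out for brevity. The "current" lookup is only a model of the behavior described above, not a copy of Tika's code.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model of the lookup; these are stand-ins, NOT Tika's real classes.
class EmbeddedParserDefault {

    /** Stand-in for a parser: maps an embedded document name to extracted text. */
    interface Parser {
        String parse(String embeddedName);
    }

    /** Stand-in for AutoDetectParser: always extracts something. */
    static final Parser AUTO_DETECT = name -> "text of " + name;

    /** Stand-in for EmptyParser: accepts any document and extracts nothing. */
    static final Parser EMPTY = name -> "";

    /** Stand-in for ParseContext: a type-keyed bag of user-supplied helpers. */
    static final class Context {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            entries.put(key, value);
        }

        <T> T get(Class<T> key) {
            // Returns null when nothing was registered under this key.
            return key.cast(entries.get(key));
        }
    }

    /** Model of the current behavior: no Parser in the context means embedded documents are skipped. */
    static Parser resolveCurrent(Context context) {
        Parser parser = context.get(Parser.class);
        return parser != null ? parser : EMPTY;
    }

    /** Model of the proposed behavior: no Parser in the context means "parse everything". */
    static Parser resolveProposed(Context context) {
        Parser parser = context.get(Parser.class);
        return parser != null ? parser : AUTO_DETECT;
    }

    public static void main(String[] args) {
        Context context = new Context();

        // User forgot to register a parser for embedded documents.
        System.out.println("current:  [" + resolveCurrent(context).parse("attachment.pdf") + "]");
        System.out.println("proposed: [" + resolveProposed(context).parse("attachment.pdf") + "]");

        // Explicit opt-out: container document only.
        context.set(Parser.class, EMPTY);
        System.out.println("opt-out:  [" + resolveProposed(context).parse("attachment.pdf") + "]");
    }
}
```

With real Tika today, the equivalent workaround is for the caller to register the parser explicitly, e.g. context.set(Parser.class, parser), before calling parse; the proposal removes the need for that step in the common case.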
This is a breaking change. I propose adding it to 2.0 only.
We were bitten by this on tika-server (TIKA-1584). Solr (SOLR-7189) has been
bitten by this. [Kite|https://github.com/kite-sdk/kite/issues/397] is still
suffering from this.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)