Tim Allison created TIKA-4638:
---------------------------------
Summary: Unify sax "style" configurations in 4.x
Key: TIKA-4638
URL: https://issues.apache.org/jira/browse/TIKA-4638
Project: Tika
Issue Type: Task
Reporter: Tim Allison
We've had an ongoing need for easy user configuration of whether to:
a) include embedded filenames in the SAX output
b) include the metadata title in the SAX output
Further, with RMETA or the JSON output of CONCATENATE, if a user wants XHTML as
the SAX output type, there is typically no need to dump the metadata into the
XHTML. We should make this configurable as well.
The key point here and on TIKA-4633 is that the user should only have to touch
one logical configuration object, even though different underlying components
in Tika will act on it. For example, in this case, the metadata and title
output is handled in the XHTMLContentHandler, and the embedded filenames would
be handled in the ParsingEmbeddedDocumentExtractor.
I think for some of the config objects, we should simplify for the user's sake
and not require them to know the underlying components.
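
To make the idea concrete, here is a rough sketch of one configuration object
read by two separate components. Everything in it is hypothetical:
SaxStyleConfig, HeadWriterSketch, EmbeddedEntryWriterSketch, the option names,
the defaults, and the emitted markup are illustrative assumptions, not existing
Tika APIs or a proposed final design. The two writer classes only stand in for
the roles that XHTMLContentHandler and ParsingEmbeddedDocumentExtractor play
above.

```java
import java.util.Map;

// Hypothetical single configuration object; not a Tika class.
// The defaults below are assumptions chosen only for illustration.
final class SaxStyleConfig {
    private boolean includeEmbeddedFileNames = true;
    private boolean includeMetadataTitle = true;
    private boolean includeMetadataInXhtml = true;

    boolean isIncludeEmbeddedFileNames() { return includeEmbeddedFileNames; }
    boolean isIncludeMetadataTitle() { return includeMetadataTitle; }
    boolean isIncludeMetadataInXhtml() { return includeMetadataInXhtml; }

    SaxStyleConfig setIncludeEmbeddedFileNames(boolean v) { includeEmbeddedFileNames = v; return this; }
    SaxStyleConfig setIncludeMetadataTitle(boolean v) { includeMetadataTitle = v; return this; }
    SaxStyleConfig setIncludeMetadataInXhtml(boolean v) { includeMetadataInXhtml = v; return this; }
}

// Stand-in for the component that writes the XHTML head.
// XML escaping is omitted to keep the sketch short.
final class HeadWriterSketch {
    private final SaxStyleConfig config;

    HeadWriterSketch(SaxStyleConfig config) { this.config = config; }

    String writeHead(String title, Map<String, String> metadata) {
        StringBuilder sb = new StringBuilder("<head>");
        if (config.isIncludeMetadataInXhtml()) {
            for (Map.Entry<String, String> e : metadata.entrySet()) {
                sb.append("<meta name=\"").append(e.getKey())
                  .append("\" content=\"").append(e.getValue()).append("\"/>");
            }
        }
        if (config.isIncludeMetadataTitle() && title != null) {
            sb.append("<title>").append(title).append("</title>");
        }
        return sb.append("</head>").toString();
    }
}

// Stand-in for the component that handles embedded documents.
final class EmbeddedEntryWriterSketch {
    private final SaxStyleConfig config;

    EmbeddedEntryWriterSketch(SaxStyleConfig config) { this.config = config; }

    String writeEntry(String fileName) {
        StringBuilder sb = new StringBuilder("<div class=\"package-entry\">");
        if (config.isIncludeEmbeddedFileNames() && fileName != null) {
            sb.append("<h1>").append(fileName).append("</h1>");
        }
        return sb.append("</div>").toString();
    }
}
```

The user sets options in one place, for example
new SaxStyleConfig().setIncludeMetadataInXhtml(false), and passes the same
instance to both components, without needing to know which component consumes
which option.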