Tim Allison created TIKA-4638:
---------------------------------

             Summary: Unify sax "style" configurations in 4.x
                 Key: TIKA-4638
                 URL: https://issues.apache.org/jira/browse/TIKA-4638
             Project: Tika
          Issue Type: Task
            Reporter: Tim Allison


We have an ongoing need for simple user configuration of whether to:

a) include embedded filenames in the SAX output
b) include the metadata title in the SAX output

Further, with RMETA or the JSON output of CONCATENATE, if a user wants XHTML as 
the SAX output type, there is typically no need to also dump the metadata into 
the XHTML, since the metadata is already reported separately. We should make 
this configurable as well.

The key point here and on TIKA-4633 is that the user should only have to touch 
one logical configuration object, even though different underlying components 
in Tika will act on it. For example, in this case, the metadata/title handling 
lives in the XHTMLContentHandler, and the embedded filenames would be handled 
in the ParsingEmbeddedDocumentExtractor.

I think for some of the config objects, we should simplify for the user's sake 
and not require them to know the underlying components.
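
To make the "one logical configuration object" idea concrete, here is a rough 
sketch. Everything in it is hypothetical: SaxStyleConfig, HeadWriterSketch and 
EmbeddedWriterSketch are illustrative names, not existing Tika classes, and the 
string output is a simplified stand-in for real SAX events. The two sketch 
classes only play the roles that the XHTMLContentHandler and the 
ParsingEmbeddedDocumentExtractor would play.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical unified config object; NOT an existing Tika class. */
final class SaxStyleConfig {
    private final boolean includeEmbeddedFileNames;
    private final boolean includeMetadataTitle;
    private final boolean includeMetadataInXhtml;

    SaxStyleConfig(boolean includeEmbeddedFileNames,
                   boolean includeMetadataTitle,
                   boolean includeMetadataInXhtml) {
        this.includeEmbeddedFileNames = includeEmbeddedFileNames;
        this.includeMetadataTitle = includeMetadataTitle;
        this.includeMetadataInXhtml = includeMetadataInXhtml;
    }

    boolean includeEmbeddedFileNames() { return includeEmbeddedFileNames; }
    boolean includeMetadataTitle() { return includeMetadataTitle; }
    boolean includeMetadataInXhtml() { return includeMetadataInXhtml; }
}

/** Stand-in for the component that writes the XHTML head (hypothetical). */
final class HeadWriterSketch {
    static List<String> headElements(SaxStyleConfig cfg, String title,
                                     List<String> metadataNames) {
        List<String> out = new ArrayList<>();
        // The title is emitted only if the shared config asks for it.
        if (cfg.includeMetadataTitle() && title != null) {
            out.add("title:" + title);
        }
        // The metadata dump is likewise controlled by the shared config.
        if (cfg.includeMetadataInXhtml()) {
            for (String name : metadataNames) {
                out.add("meta:" + name);
            }
        }
        return out;
    }
}

/** Stand-in for the component that handles embedded documents (hypothetical). */
final class EmbeddedWriterSketch {
    static List<String> embeddedElements(SaxStyleConfig cfg, String fileName) {
        List<String> out = new ArrayList<>();
        out.add("embedded-start");
        // The same config object drives a different component here.
        if (cfg.includeEmbeddedFileNames() && fileName != null) {
            out.add("filename:" + fileName);
        }
        return out;
    }
}
```

The user would build a single SaxStyleConfig, and both components would read 
from it, so the user never needs to know which component honors which flag.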



