Dear Wiki user, You have subscribed to a wiki page or wiki category on "Tika Wiki" for change notification.
The "RecursiveMetadata" page has been changed by TimothyAllison:
https://wiki.apache.org/tika/RecursiveMetadata?action=diff&rev1=8&rev2=9

  If you aren't interested in seeing text and metadata for the zip file itself, check {{{metadata.get(Metadata.CONTENT_TYPE)}}} for each file Tika parses so that you can skip the archives themselves. For a zip file, the content type is "application/zip".

+ = Integration of the RecursiveParserWrapper into Tika =
+ A RecursiveParserWrapper based on Jukka and Nick's example above was added to Tika in version 1.7.
+
+ The wrapper returns a list of Metadata objects: the first contains the metadata and content for the container document, and the rest contain the metadata and content for each embedded document. The content of each document is stored in "X-TIKA:content", and the embedded document's location in the container document is stored in "X-TIKA:embedded_resource_path" (e.g. "embedded-1/embed1.zip/embed2.zip/embed3.pdf").
+
+ A downside to the wrapper is that it breaks Tika's goal of streaming output: the wrapper caches all metadata and content in memory, so it must be used with care.
+
+ As of Tika 1.7, a JSON view of this output is available in tika-app (the -J option) and tika-server ("/rmeta").
+
+ This format serves as the basis for the upcoming tika-eval module, which will help compare the output of different versions of Tika, or of Tika and other content extractors.
+
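As a rough illustration of the list-of-metadata shape described in the change above, the sketch below models each Metadata object as a plain Java map and filters out the "application/zip" entries. It deliberately does not call Tika: the class name RecursiveMetadataSketch and the helpers entry and skipArchives are hypothetical, the sample paths and content are made up, and the absence of "X-TIKA:embedded_resource_path" on the container entry is an assumption. Only the two "X-TIKA:" key names and the "application/zip" content type come from the page text.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Stand-in model of the wrapper's output; plain maps replace Tika's Metadata class. */
final class RecursiveMetadataSketch {

    // Key names quoted from the wiki text.
    static final String CONTENT = "X-TIKA:content";
    static final String EMBEDDED_PATH = "X-TIKA:embedded_resource_path";
    // "Content-Type" is the string value behind Tika's Metadata.CONTENT_TYPE constant.
    static final String CONTENT_TYPE = "Content-Type";

    /** Hypothetical helper: builds one map standing in for a Metadata object. */
    static Map<String, String> entry(String path, String type, String content) {
        Map<String, String> m = new LinkedHashMap<>();
        if (path != null) {
            m.put(EMBEDDED_PATH, path); // assumed to be absent for the container document
        }
        m.put(CONTENT_TYPE, type);
        m.put(CONTENT, content);
        return m;
    }

    /** Hypothetical helper: keeps only entries whose content type is not application/zip. */
    static List<Map<String, String>> skipArchives(List<Map<String, String>> all) {
        List<Map<String, String>> kept = new ArrayList<>();
        for (Map<String, String> m : all) {
            if (!"application/zip".equals(m.get(CONTENT_TYPE))) {
                kept.add(m);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up sample data: the container first, then the embedded documents.
        List<Map<String, String>> all = new ArrayList<>();
        all.add(entry(null, "application/zip", ""));
        all.add(entry("embedded-1/embed1.zip", "application/zip", ""));
        all.add(entry("embedded-1/embed1.zip/embed2.zip/embed3.pdf",
                "application/pdf", "sample text"));

        // Print the path and type of every non-archive entry.
        for (Map<String, String> m : skipArchives(all)) {
            System.out.println(m.get(EMBEDDED_PATH) + " -> " + m.get(CONTENT_TYPE));
        }
    }
}
```

With real Tika output the same filter would be applied to the Metadata objects the wrapper returns, or to the JSON objects from tika-app's -J option or tika-server's "/rmeta"; the exact calls for obtaining that list depend on the Tika version and are not shown here.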
