[Tika Wiki] Update of "RecursiveMetadata" by TimothyAllison

Apache Wiki Fri, 19 Dec 2014 08:41:56 -0800

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Tika Wiki" for change 
notification.


The "RecursiveMetadata" page has been changed by TimothyAllison:
https://wiki.apache.org/tika/RecursiveMetadata?action=diff&rev1=8&rev2=9

  
  If you aren't interested in seeing text and metadata for the zip file itself, 
you'll want to take a look at {{{metadata.get(Metadata.CONTENT_TYPE)}}} for 
each file Tika parses so you can skip the archives themselves. For a zip file, 
the content type is "application/zip".
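
The content-type check described above might look roughly like the sketch below. To keep it runnable without Tika on the classpath, a plain Map stands in for Tika's Metadata object; the class name ArchiveFilter and the helper skipZipArchives are hypothetical names chosen for illustration, not part of Tika.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a Map<String, String> stands in for Tika's Metadata.
class ArchiveFilter {
    // Same key that Metadata.CONTENT_TYPE resolves to in Tika.
    static final String CONTENT_TYPE = "Content-Type";

    // Returns the entries whose content type is not "application/zip".
    static List<Map<String, String>> skipZipArchives(List<Map<String, String>> parsed) {
        List<Map<String, String>> kept = new ArrayList<>();
        for (Map<String, String> metadata : parsed) {
            // With Tika this lookup would be metadata.get(Metadata.CONTENT_TYPE).
            String type = metadata.get(CONTENT_TYPE);
            if (!"application/zip".equals(type)) {
                kept.add(metadata);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Map<String, String>> parsed = new ArrayList<>();
        parsed.add(Collections.singletonMap(CONTENT_TYPE, "application/zip"));
        parsed.add(Collections.singletonMap(CONTENT_TYPE, "application/pdf"));
        System.out.println(skipZipArchives(parsed));
    }
}
```

Other archive formats report other content types, so a real filter would probably check against a set of types rather than the single zip value used here.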
  
+ = Integration of the RecursiveParserWrapper into Tika =
+ A RecursiveParserWrapper, based on Jukka and Nick's example above, was added 
to Tika in version 1.7.
+ 
+ The wrapper returns a list of Metadata objects -- the first contains the 
metadata+content for the container document and the rest contain the 
metadata+content for each embedded document. The content of each document is 
stored in "X-TIKA:content", and the embedded document's location in the 
container document is stored in "X-TIKA:embedded_resource_path" (e.g. 
"embedded-1/embed1.zip/embed2.zip/embed3.pdf").
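
As a rough illustration of the list layout described above, the sketch below walks such a list and collects each embedded document's content by its path. Plain Maps stand in for Tika's Metadata objects so that the code runs without Tika; the class RecursiveOutputSketch and the helper contentByEmbeddedPath are hypothetical names, and only the two "X-TIKA:" keys are taken from the text above.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: each Map stands in for one Metadata object in the list.
class RecursiveOutputSketch {
    static final String CONTENT_KEY = "X-TIKA:content";
    static final String PATH_KEY = "X-TIKA:embedded_resource_path";

    // Maps each embedded document's path to its extracted content.
    // Index 0 is the container document, so the loop starts at 1.
    static Map<String, String> contentByEmbeddedPath(List<Map<String, String>> metadataList) {
        Map<String, String> result = new LinkedHashMap<>();
        for (int i = 1; i < metadataList.size(); i++) {
            Map<String, String> embedded = metadataList.get(i);
            result.put(embedded.get(PATH_KEY), embedded.get(CONTENT_KEY));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> container = new HashMap<>();
        container.put(CONTENT_KEY, "container text");
        Map<String, String> embedded = new HashMap<>();
        embedded.put(PATH_KEY, "embedded-1/embed1.zip/embed2.zip/embed3.pdf");
        embedded.put(CONTENT_KEY, "pdf text");
        System.out.println(contentByEmbeddedPath(Arrays.asList(container, embedded)));
    }
}
```

The LinkedHashMap keeps the embedded documents in the order they appear in the list.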
+ 
+ A downside of the wrapper is that it breaks Tika's goal of streaming output 
-- it caches all metadata+content in memory. Use it with care, especially on 
large or deeply nested container documents.
+ 
+ As of Tika 1.7, a JSONified view of this output was integrated into tika-app 
(the -J option) and tika-server ("/rmeta").
+ 
+ This format serves as the basis for the upcoming tika-eval module that will 
help with comparisons of the output of different versions of Tika or other 
content extractors.
+
