[ 
https://issues.apache.org/jira/browse/TIKA-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14027967#comment-14027967
 ] 

Tim Allison commented on TIKA-1329:
-----------------------------------

I agree on all points.  The proposal in this patch is to be able to grab a 
List<Metadata> after a document is parsed.  The user can select text, XML or 
HTML as the content format, and the content is then stored in a specific 
field within the metadata.  The first item in the list is the container 
document, and there is no guaranteed order to the items after the first.  
I'm also storing (thanks to you!) the relative path as a separate field 
within the metadata.  
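
As a rough illustration of the shape of that output, here is a minimal sketch 
in plain Java.  It does not use Tika: a Map<String, String> stands in for 
Metadata, and the Doc class and the "sketch:" field names are hypothetical, 
invented only for this example.  It shows one entry per document, the 
container first, with content and relative path stored as ordinary fields.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RecursiveMetadataSketch {

    // Hypothetical field names; the keys used by the actual patch are not stated here.
    public static final String NAME_KEY = "sketch:name";
    public static final String CONTENT_KEY = "sketch:content";
    public static final String PATH_KEY = "sketch:relative_path";

    /** Hypothetical stand-in for a document that may hold embedded documents. */
    public static final class Doc {
        final String name;
        final String text;
        final List<Doc> embedded = new ArrayList<>();

        public Doc(String name, String text) {
            this.name = name;
            this.text = text;
        }

        public Doc add(Doc child) {
            embedded.add(child);
            return this;
        }
    }

    /** Returns one map per document; index 0 is always the container. */
    public static List<Map<String, String>> flatten(Doc container) {
        List<Map<String, String>> out = new ArrayList<>();
        out.add(toMetadata(container, ""));
        collectEmbedded(container, "", out);
        return out;
    }

    // Walks the embedded documents and appends one entry for each of them.
    // Callers should not rely on the order of entries after the first.
    private static void collectEmbedded(Doc parent, String parentPath,
                                        List<Map<String, String>> out) {
        for (Doc child : parent.embedded) {
            String path = parentPath + "/" + child.name;
            out.add(toMetadata(child, path));
            collectEmbedded(child, path, out);
        }
    }

    // Content is stored as just another field next to the name and path.
    private static Map<String, String> toMetadata(Doc doc, String path) {
        Map<String, String> md = new LinkedHashMap<>();
        md.put(NAME_KEY, doc.name);
        md.put(CONTENT_KEY, doc.text);
        if (!path.isEmpty()) {
            md.put(PATH_KEY, path); // the container itself has no relative path
        }
        return md;
    }

    public static void main(String[] args) {
        Doc zip = new Doc("bundle.zip", "")
                .add(new Doc("report.docx", "Quarterly report")
                        .add(new Doc("image1.png", "")))
                .add(new Doc("notes.txt", "Some notes"));
        for (Map<String, String> md : flatten(zip)) {
            System.out.println(md);
        }
    }
}
```

Note that the sketch holds every entry in memory until the whole tree has 
been walked, which is the same trade-off against streaming discussed below.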

I don't like the current design because it violates the streaming design of 
Tika...it caches data in memory.  However, given that a parser might stop to 
process an embedded document before it has completed the content (and even the 
metadata?) of the main document, this approach offers a somewhat clean view of 
a document.  I also don't like that the user doesn't have more freedom to 
select a ContentHandler.  However, it has been an extremely easy format to use 
to compare outputs from different versions of Tika in the early stages of 
development for TIKA-1302.  

This format also takes a different view of "content": it could be considered 
just another type of metadata (the text we could extract from this document, 
as opposed to the actual bytes).



> Add RecursiveParserWrapper aka Jukka's (and Nick's) RecursiveMetadataParser
> ---------------------------------------------------------------------------
>
>                 Key: TIKA-1329
>                 URL: https://issues.apache.org/jira/browse/TIKA-1329
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Tim Allison
>            Priority: Minor
>             Fix For: 1.7
>
>
> Jukka and Nick have a great demo of parsing metadata recursively on the 
> [wiki|http://wiki.apache.org/tika/RecursiveMetadata].  For TIKA-1302, I'd 
> like to use something similar, and I think that others may find it useful for 
> tika-app and tika-server.
> I took the code from the wiki and made some modifications.  I'm not sure if 
> we should put this in parsers or in a new module for "examples."  Given that 
> I think this would be useful for tika-app and tika-server, I'd prefer 
> parsers, but I'm open to any input...including "let's not."
> I opened up a review board issue here: 
> [rb|http://reviews.apache.org/r/22433]



--
This message was sent by Atlassian JIRA
(v6.2#6252)
