[ https://issues.apache.org/jira/browse/TIKA-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14027967#comment-14027967 ]
Tim Allison commented on TIKA-1329:
-----------------------------------
I agree on all points. The proposal in this patch is to be able to grab a
List<Metadata> after a document is parsed. The user can select whether they
want text, xml or html for the content format, and then the content is stored
in a specific field within the metadata. The first item in the list is the
container document, and there is not necessarily any order to the list after
the first item. I'm also storing (thanks to you!) the relative path as a
separate field within the metadata.
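
A rough sketch of the output shape described above, using plain Java
collections instead of the real Tika classes so it runs on its own. The Doc
class, the flatten method, and the field names "content" and "embedded_path"
are made up for illustration; the patch returns List<Metadata> and may name
things differently. The sketch happens to emit embedded documents depth-first,
but as noted, callers should only rely on the container being first.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Stand-in for the proposal: every document (container or embedded) becomes
// one metadata map, and its extracted content is just another field in it.
class RecursiveMetadataSketch {

    // Hypothetical field names; the actual patch may use different keys.
    static final String CONTENT_KEY = "content";
    static final String PATH_KEY = "embedded_path";

    // Hypothetical minimal document model: a name, extracted text, children.
    static final class Doc {
        final String name;
        final String text;
        final List<Doc> embedded = new ArrayList<>();

        Doc(String name, String text) {
            this.name = name;
            this.text = text;
        }

        Doc add(Doc child) {
            embedded.add(child);
            return this;
        }
    }

    // Flattens a document tree into a list; index 0 is always the container.
    // Everything is held in memory, which is the trade-off discussed here.
    static List<Map<String, String>> flatten(Doc container) {
        List<Map<String, String>> out = new ArrayList<>();
        collect(container, "", out);
        return out;
    }

    private static void collect(Doc doc, String parentPath,
                                List<Map<String, String>> out) {
        String path = parentPath + "/" + doc.name;
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("name", doc.name);
        metadata.put(PATH_KEY, path);        // relative path as its own field
        metadata.put(CONTENT_KEY, doc.text); // content stored like metadata
        out.add(metadata);
        for (Doc child : doc.embedded) {
            collect(child, path, out);
        }
    }

    public static void main(String[] args) {
        Doc container = new Doc("report.zip", "")
                .add(new Doc("notes.txt", "some extracted text"))
                .add(new Doc("inner.zip", "")
                        .add(new Doc("deep.txt", "more text")));
        for (Map<String, String> metadata : flatten(container)) {
            System.out.println(metadata);
        }
    }
}
```

Since each entry carries its own path and content, comparing the output of two
runs comes down to comparing two lists of maps.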
I don't like the current design because it violates Tika's streaming design:
it caches data in memory. However, given that a parser might stop to
process an embedded document before it has completed the content (and even the
metadata?) of the main document, this approach offers a somewhat clean view of
a document. I also don't like that the user doesn't have more freedom to select
a ContentHandler. However, it has been an extremely easy format to use to
compare outputs from different versions of Tika in the early stages of
development for TIKA-1302.
This format also takes a different view of "content," namely it could be
considered just another type of metadata (this is the text we could extract
from this document vs. the actual bytes).
> Add RecursiveParserWrapper aka Jukka's (and Nick's) RecursiveMetadataParser
> ---------------------------------------------------------------------------
>
> Key: TIKA-1329
> URL: https://issues.apache.org/jira/browse/TIKA-1329
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Reporter: Tim Allison
> Priority: Minor
> Fix For: 1.7
>
>
> Jukka and Nick have a great demo of parsing metadata recursively on the
> [wiki|http://wiki.apache.org/tika/RecursiveMetadata]. For TIKA-1302, I'd
> like to use something similar, and I think that others may find it useful for
> tika-app and tika-server.
> I took the code from the wiki and made some modifications. I'm not sure if
> we should put this in parsers or in a new module for "examples." Given that
> I think this would be useful for tika-app and tika-server, I'd prefer
> parsers, but I'm open to any input...including "let's not."
> I opened up a review board issue here:
> [rb|http://reviews.apache.org/r/22433]