[
https://issues.apache.org/jira/browse/TIKA-1420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146187#comment-14146187
]
Nick Burch commented on TIKA-1420:
----------------------------------
For now, I'd suggest putting this into the Examples package; the
additional dependency should then be fine.
Character-wise, you might need some sort of rolling buffer for the
detection, in case the number gets split across multiple characters() calls
(e.g. part of it is styled and part is not, so it sits in different tags, or
it just falls across a text size boundary). For the initial version, though,
just checking the characters before passing them on should work fine.
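
A rough sketch of what the rolling-buffer check could look like, for
illustration only. Everything named here is an assumption rather than
existing Tika code: the class name RollingNumberHandler, the "7 or more
digits" pattern, and the plain List that stands in for the Metadata object
so the sketch runs on the JDK alone. It also forwards only characters() and
endDocument(); a real decorator would need to forward every SAX event.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: checks the characters before passing them on,
// keeping a small rolling buffer so a number split across calls is still seen.
class RollingNumberHandler extends DefaultHandler {
    // Assumed pattern for illustration: a run of 7 or more ASCII digits.
    private static final Pattern NUMBER = Pattern.compile("[0-9]{7,}");

    private final ContentHandler delegate;
    // Stand-in for the Metadata object; with Tika this could be metadata.add(...).
    private final List<String> found = new ArrayList<>();
    // Holds only a trailing digit run that the next call might continue.
    private final StringBuilder pending = new StringBuilder();

    RollingNumberHandler(ContentHandler delegate) {
        this.delegate = delegate;
    }

    List<String> getFound() {
        return found;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        pending.append(ch, start, length);
        scan(false);
        delegate.characters(ch, start, length); // pass the text on unchanged
    }

    @Override
    public void endDocument() throws SAXException {
        scan(true); // nothing more can arrive, so flush what is left
        delegate.endDocument();
    }

    private void scan(boolean atEnd) {
        Matcher m = NUMBER.matcher(pending);
        int consumed = 0;
        while (m.find()) {
            // A match touching the end of the buffer may continue in the next call.
            if (!atEnd && m.end() == pending.length()) {
                break;
            }
            found.add(m.group());
            consumed = m.end();
        }
        // Keep only the unrecorded trailing digits; drop everything before them.
        int tailStart = pending.length();
        while (tailStart > consumed && isAsciiDigit(pending.charAt(tailStart - 1))) {
            tailStart--;
        }
        pending.delete(0, tailStart);
    }

    private static boolean isAsciiDigit(char c) {
        return c >= '0' && c <= '9';
    }
}
```

Feeding it "call 555" and then "1234 now" in two separate characters()
calls should record the single value 5551234, since only the trailing digits
are carried over between calls. Note that this sketch joins digits across
tag boundaries unconditionally, which may or may not be what you want.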
> Add Metadata Extraction to Arbitrary Parsers
> --------------------------------------------
>
> Key: TIKA-1420
> URL: https://issues.apache.org/jira/browse/TIKA-1420
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Reporter: Tyler Palsulich
> Priority: Minor
>
> Suppose you wish to extract information from arbitrary file types and add it
> to a Metadata Object. This type of task is best handled by a... Handler. But,
> Handlers do not have access to the Metadata Object passed to a Parser.
> So, I see a few ways we could do this using existing functionality.
> 1) Make an intermediate XML representation of the desired metadata in a
>    handler, then convert the XML to the Metadata after parsing.
> 2) Create a second Parser which extracts the desired information.
>    a) Assume the Handler passed to this Parser is already filled with
>       content. So, we could simply get whatever content from the Handler
>       and populate the Metadata directly.
>    b) Create a new Stream in the first Parser to pass to the second, which
>       in turn populates the Metadata.
> None of these options seem ideal. Is there a better way to handle this
> scenario? Or, can we create some sort of... wrapper for a Handler which can
> accept a Metadata Object to populate directly?