[ 
https://issues.apache.org/jira/browse/TIKA-3137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17159441#comment-17159441
 ] 

Tim Allison commented on TIKA-3137:
-----------------------------------

Rough draft pushed to: https://github.com/apache/tika/tree/TIKA-3137

Let me know what you think.

> Enable a metadata filter for the RecursiveParserWrapper
> -------------------------------------------------------
>
>                 Key: TIKA-3137
>                 URL: https://issues.apache.org/jira/browse/TIKA-3137
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>
> The RecursiveParserWrapper is designed to extract all metadata from every 
> embedded file.  Some users may need more targeted ways of filtering the 
> metadata to save on resources, e.g. memory, disk or transfer-size/bandwidth 
> in tika-server.
> Some use cases that come to mind:
> * A user only wants the title, author and content fields.
> * A user doesn't want content from EMF files, but does want the content from 
> a PDF embedded inside an EMF file.
> * This could be an avenue for text-based enrichment, e.g. run NER on the 
> content field and add those recognized entities to the Metadata; or tika-eval 
> statistics...
> The last point may require further discussion.  We have some handlers that 
> require buffering the full text of a document and then running extraction 
> (Phone number extractor?).  The downside to this is that we're storing two 
> copies of the data in memory.  For the RecursiveParserWrapper (RPW) at 
> least, it would be more efficient to do postprocessing on the one buffered copy.
> Some open questions: how do we configure the choice of filter(s)?  Do we 
> apply this to the AutoDetectParser?  ...
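
To make the first two use cases above concrete, here is a minimal, self-contained sketch. It is not the code on the TIKA-3137 branch and does not use Tika's classes: each file's metadata is modelled as a plain Map, and the MetadataFilter interface, the method names, the field keys ("title", "author", "content", "Content-Type") and the "image/emf" media type string are all assumptions made up for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class MetadataFilterSketch {

    /** Hypothetical hook: mutates the metadata of one (embedded) file in place. */
    interface MetadataFilter {
        void filter(Map<String, String> metadata);
    }

    /** Use case 1: keep only the named fields, drop everything else. */
    static MetadataFilter includeOnly(Set<String> fieldsToKeep) {
        return metadata -> metadata.keySet().retainAll(fieldsToKeep);
    }

    /** Use case 2: drop the content field for files of one media type. */
    static MetadataFilter dropContentFor(String mediaType) {
        return metadata -> {
            if (mediaType.equals(metadata.get("Content-Type"))) {
                metadata.remove("content");
            }
        };
    }

    /** Applies every filter, in order, to every file's metadata. */
    static void applyAll(List<Map<String, String>> metadataList,
                         List<MetadataFilter> filters) {
        for (Map<String, String> metadata : metadataList) {
            for (MetadataFilter f : filters) {
                f.filter(metadata);
            }
        }
    }

    public static void main(String[] args) {
        // Stand-in for the wrapper's output: one map per file, container first.
        Map<String, String> emf = new LinkedHashMap<>();
        emf.put("Content-Type", "image/emf");   // assumed media type string
        emf.put("title", "Chart");
        emf.put("content", "EMF text");

        Map<String, String> pdf = new LinkedHashMap<>();
        pdf.put("Content-Type", "application/pdf");
        pdf.put("title", "Report");
        pdf.put("author", "A. Writer");
        pdf.put("content", "PDF text");

        List<Map<String, String>> metadataList = new ArrayList<>();
        metadataList.add(emf);
        metadataList.add(pdf);

        // Order matters: the type-based filter needs Content-Type,
        // which the field whitelist removes afterwards.
        applyAll(metadataList, List.of(
                dropContentFor("image/emf"),
                includeOnly(Set.of("title", "author", "content"))));

        System.out.println(metadataList);
    }
}
```

Because each file carries its own metadata map, dropping the EMF's content leaves the embedded PDF's content untouched. The sketch also shows why filter ordering would probably need to be part of whatever configuration mechanism is chosen.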



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
