[ 
https://issues.apache.org/jira/browse/TIKA-3137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-3137.
-------------------------------
    Fix Version/s: 1.25
         Assignee: Tim Allison
       Resolution: Fixed

I'm declaring victory on this for now. 

[~Mandalka], please do take a look and let us know if we can make any 
improvements.

> Enable a metadata filter for the RecursiveParserWrapper
> -------------------------------------------------------
>
>                 Key: TIKA-3137
>                 URL: https://issues.apache.org/jira/browse/TIKA-3137
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Assignee: Tim Allison
>            Priority: Major
>             Fix For: 1.25
>
>
> The RecursiveParserWrapper is designed to extract all metadata from every 
> embedded file.  Some users may need more targeted ways of filtering the 
> metadata to save on resources, e.g. memory, disc or transfer-size/bandwidth 
> in tika-server.
> Some use cases that come to mind:
> * A user only wants the title, author and content fields.
> * A user doesn't want content from EMF files, but does want the content from 
> a PDF embedded inside an EMF file.
> * This could be an avenue for text-based enrichment, e.g. run NER on the 
> content field and add those recognized entities to the Metadata; or tika-eval 
> statistics...
> The last point may require further discussion.  We have some handlers that 
> require buffering the full text of a document and then running extraction 
> (Phone number extractor?).  The downside to this is that we're storing two 
> copies of the data in memory.  For at least the RPW, it would be more 
> efficient to do postprocessing on the one buffered copy.
> Some open questions: how do we configure the choice of filter(s)? Do we apply 
> this to the AutoDetectParser? ...
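
To make the first two use cases in the description concrete, here is a minimal sketch in plain Java. It is not Tika's API and not necessarily how the fix for this issue is implemented. Each Map stands in for one per-file metadata object in the list the RecursiveParserWrapper produces. The class and method names are hypothetical, and the "X-TIKA:content" key and "image/emf" type in the usage note are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.UnaryOperator;

// Hypothetical sketch only: plain Maps stand in for Tika Metadata objects.
class MetadataFilterSketch {

    // Assumed key names, chosen for illustration.
    static final String CONTENT = "X-TIKA:content";
    static final String CONTENT_TYPE = "Content-Type";

    // Use case 1: keep only the requested fields (e.g. title, author, content).
    static Map<String, String> includeFields(Map<String, String> metadata,
                                             Set<String> keep) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : metadata.entrySet()) {
            if (keep.contains(e.getKey())) {
                out.put(e.getKey(), e.getValue());
            }
        }
        return out;
    }

    // Use case 2: drop the content field for the given content types only.
    // Other files, including files embedded inside a dropped type, keep
    // their content because each file has its own metadata object.
    static Map<String, String> dropContentForTypes(Map<String, String> metadata,
                                                   Set<String> types) {
        Map<String, String> out = new LinkedHashMap<>(metadata);
        String type = out.get(CONTENT_TYPE);
        if (type != null && types.contains(type)) {
            out.remove(CONTENT);
        }
        return out;
    }

    // Apply one filter to every per-file metadata object in the list.
    static List<Map<String, String>> applyToAll(List<Map<String, String>> perFile,
                                                UnaryOperator<Map<String, String>> filter) {
        List<Map<String, String>> out = new ArrayList<>();
        for (Map<String, String> m : perFile) {
            out.add(filter.apply(m));
        }
        return out;
    }
}
```

For example, applyToAll(list, m -> MetadataFilterSketch.dropContentForTypes(m, emfTypes)) with emfTypes containing "image/emf" would strip the content of an EMF entry while leaving the content of a PDF embedded in it untouched. Because each filter takes one metadata object and returns one, filters of this shape can be chained, which is one possible answer to the configuration question raised in the description.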



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
