[ 
https://issues.apache.org/jira/browse/TIKA-3137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17159523#comment-17159523
 ] 

Markus Mandalka commented on TIKA-3137:
---------------------------------------

"A user only wants the title, author and content fields."

It should support both paradigms: "include only these fields and exclude all 
others" and "get all fields but exclude these fields". (Filtering by whole 
prefixes/suffixes would be great too.)

I know that I do not want certain fields, such as technical metadata like the 
color format of images, but I do want to index and search all other fields 
(including some I do not know about yet, either because I have not yet seen a 
file with them or because they are future interesting/relevant metadata fields).
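
A minimal sketch of the two modes, to make the request concrete. This is not 
Tika's API: the function name filter_metadata, the plain dict standing in for a 
metadata object, the glob-style patterns and the sample field names are all 
assumptions made for illustration.

```python
from fnmatch import fnmatchcase


def filter_metadata(metadata, include=None, exclude=()):
    """Return a filtered copy of a metadata dict (hypothetical helper).

    include: glob patterns for fields to keep; None keeps every field.
    exclude: glob patterns for fields to drop; checked after include.
    """
    def matches(name, patterns):
        return any(fnmatchcase(name, pattern) for pattern in patterns)

    return {
        name: value
        for name, value in metadata.items()
        if (include is None or matches(name, include))
        and not matches(name, exclude)
    }


# Sample document; the field names are illustrative only.
doc = {
    "title": "Report",
    "author": "Jane Doe",
    "content": "Some text",
    "tiff:BitsPerSample": "8",
    "Image Width": "640 pixels",
}

# Mode 1: include only these fields and exclude all others.
include_only = filter_metadata(doc, include=["title", "author", "content"])

# Mode 2: keep all fields except those matching a prefix pattern.
exclude_some = filter_metadata(doc, exclude=["tiff:*", "Image *"])
```

In the second mode a field that was never seen before passes through 
untouched, which is the behaviour I am after for not-yet-known fields.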

You can see some examples of how I exclude Tika fields from text analysis here 
(a quick and dirty solution to avoid spending analysis resources on, and 
filling the search index with, technical/file-format metadata that is 
irrelevant for my search use cases):

[https://github.com/opensemanticsearch/open-semantic-etl/tree/master/etc/opensemanticsearch/blacklist/textanalysis]

When I implemented this some time ago, I wished I had a structured list of 
classified Tika metadata fields, to separate the many technical metadata fields 
that are not relevant for my use case (like image colors) from the metadata 
fields that are relevant, such as those holding content like the author name.

> Enable a metadata filter for the RecursiveParserWrapper
> -------------------------------------------------------
>
>                 Key: TIKA-3137
>                 URL: https://issues.apache.org/jira/browse/TIKA-3137
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>
> The RecursiveParserWrapper is designed to extract all metadata from every 
> embedded file.  Some users may need more targeted ways of filtering the 
> metadata to save on resources, e.g. memory, disc or transfer-size/bandwidth 
> in tika-server.
> Some use cases that come to mind:
> * A user only wants the title, author and content fields.
> * A user doesn't want content from EMF files, but does want the content from 
> a PDF embedded inside an EMF file.
> * This could be an avenue for text-based enrichment, e.g. run NER on the 
> content field and add those recognized entities to the Metadata; or tika-eval 
> statistics...
> The last point may require further discussion.  We have some handlers that 
> require buffering the full text of a document and then running extraction 
> (Phone number extractor?).  The downside to this is that we're storing two 
> copies of the data in memory.  For at least the RPW, it would be more 
> efficient to do postprocessing on the one buffered copy.
> Some open questions: how do we configure the choice of filter(s), do we apply 
> this to the AutoDetectParser...



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
