[
https://issues.apache.org/jira/browse/TIKA-3137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17159523#comment-17159523
]
Markus Mandalka commented on TIKA-3137:
---------------------------------------
"A user only wants the title, author and content fields."
It should support both paradigms: "include only these fields and exclude all
others" and "get all fields but exclude these fields" (filtering whole
prefixes/suffixes would be great too).
I know that I do not want certain fields, such as technical metadata like the
color format of images, but I do want to index and search all other fields
(including some I may not know about yet because I have not seen a file with
them, or interesting/relevant metadata fields added in the future).
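As a rough illustration of the two modes described above, here is a minimal
sketch in plain Java. It is not Tika's API: the class FieldFilter and its
methods are hypothetical names, and the metadata is modelled as a simple
Map of field name to values rather than a Tika Metadata object.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

/** Hypothetical helper (not part of Tika): decides which metadata fields to keep. */
public final class FieldFilter {

    private FieldFilter() {}

    /** "Include only" mode: keep the listed fields, drop everything else. */
    public static Predicate<String> includeOnly(Set<String> fields) {
        return fields::contains;
    }

    /**
     * "Exclude" mode: keep every field except the listed names and any
     * name starting with one of the given prefixes.
     */
    public static Predicate<String> excludeOnly(Set<String> fields, List<String> prefixes) {
        return name -> !fields.contains(name)
                && prefixes.stream().noneMatch(name::startsWith);
    }

    /** Returns a filtered copy of the metadata, preserving insertion order. */
    public static Map<String, List<String>> apply(
            Map<String, List<String>> metadata, Predicate<String> keep) {
        Map<String, List<String>> out = new LinkedHashMap<>();
        metadata.forEach((name, values) -> {
            if (keep.test(name)) {
                out.put(name, values);
            }
        });
        return out;
    }
}
```

For example, excludeOnly(Set.of(), List.of("tiff:")) would drop all fields
with the "tiff:" prefix while still letting through fields that are not
known in advance, which is the behaviour the include-only mode cannot give.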
You can see some examples of how I handle/exclude Tika fields from text
analysis here (a quick and dirty solution to avoid spending my analysis
resources on, and filling the search index with, technical/file-format
metadata that is irrelevant for my search use cases):
[https://github.com/opensemanticsearch/open-semantic-etl/tree/master/etc/opensemanticsearch/blacklist/textanalysis]
When I implemented this some time ago, I wished I had a structured list of
classified Tika metadata fields, to separate the many technical metadata
fields that are not relevant for my use case (like image colors) from the
metadata fields that are relevant for it and carry relevant content (like
the author name).
> Enable a metadata filter for the RecursiveParserWrapper
> -------------------------------------------------------
>
> Key: TIKA-3137
> URL: https://issues.apache.org/jira/browse/TIKA-3137
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
>
> The RecursiveParserWrapper is designed to extract all metadata from every
> embedded file. Some users may need more targeted ways of filtering the
> metadata to save on resources, e.g. memory, disc or transfer-size/bandwidth
> in tika-server.
> Some use cases that come to mind:
> * A user only wants the title, author and content fields.
> * A user doesn't want content from EMF files, but does want the content from
> a PDF embedded inside an EMF file.
> * This could be an avenue for text-based enrichment, e.g. run NER on the
> content field and add those recognized entities to the Metadata; or tika-eval
> statistics...
> The last point may require further discussion. We have some handlers that
> require buffering the full text of a document and then running extraction
> (Phone number extractor?). The downside to this is that we're storing two
> copies of the data in memory. For at least the RPW, it would be more
> efficient to do postprocessing on the one buffered copy.
> Some open questions: how do we configure the choice of filter(s), do we apply
> this to the AutoDetectParser...