[
https://issues.apache.org/jira/browse/TIKA-3695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17503661#comment-17503661
]
Tim Allison commented on TIKA-3695:
-----------------------------------
On the list, I suggested implementing this as a MetadataFilter. These are used
by the RecursiveParserWrapper (e.g. /rmeta in tika-server or the '-J' option in
tika-app) and are triggered after the file has been parsed.
In this case, the metadata would be extracted and stored in the Metadata object
within Tika, but then truncated/removed before the data is returned to the user.
This solution will not play well with the traditional XHTML output of /tika,
where whatever is in the metadata object is written out as soon as the parser
hits the first bit of content text.
I'm wondering if we need to push these protections deeper into the Metadata
object itself, so that it never stores this info only to remove it later.
Thoughts? How do we configure it... new parameters on AutoDetectParser?
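To make the proposal concrete, here is a rough, self-contained sketch of the limiting logic only. It is NOT Tika's actual MetadataFilter API; a plain Map stands in for Tika's Metadata object, and the class name, constructor parameters, and limit values are all hypothetical. It applies the three limits the issue asks for: per-name, per-value, and a global byte budget.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch, not Tika's real MetadataFilter interface.
// Limits are measured in UTF-8 bytes, as the issue requests.
public class LimitingMetadataFilterSketch {

    private final int maxNameBytes;   // drop fields whose names are too big
    private final int maxValueBytes;  // truncate oversized values
    private final int maxTotalBytes;  // global budget across all fields

    public LimitingMetadataFilterSketch(int maxNameBytes, int maxValueBytes,
                                        int maxTotalBytes) {
        this.maxNameBytes = maxNameBytes;
        this.maxValueBytes = maxValueBytes;
        this.maxTotalBytes = maxTotalBytes;
    }

    public Map<String, String> filter(Map<String, String> metadata) {
        Map<String, String> out = new LinkedHashMap<>();
        int total = 0;
        for (Map.Entry<String, String> e : metadata.entrySet()) {
            byte[] name = e.getKey().getBytes(StandardCharsets.UTF_8);
            if (name.length > maxNameBytes) {
                continue; // field name exceeds the limit: drop the entry
            }
            String value = e.getValue();
            byte[] v = value.getBytes(StandardCharsets.UTF_8);
            if (v.length > maxValueBytes) {
                // Truncate; production code should take care not to
                // split a multi-byte UTF-8 sequence here.
                value = new String(v, 0, maxValueBytes, StandardCharsets.UTF_8);
            }
            int cost = name.length
                     + value.getBytes(StandardCharsets.UTF_8).length;
            if (total + cost > maxTotalBytes) {
                break; // global budget exhausted: drop remaining fields
            }
            out.put(e.getKey(), value);
            total += cost;
        }
        return out;
    }
}
```

Running this after the parse mirrors the MetadataFilter approach described above: everything is still materialized in memory during parsing, which is exactly the concern about pushing the limits deeper into the Metadata object itself.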
> LimitingMetadataFilter
> ----------------------
>
> Key: TIKA-3695
> URL: https://issues.apache.org/jira/browse/TIKA-3695
> Project: Tika
> Issue Type: New Feature
> Components: metadata
> Affects Versions: 1.28.1, 2.3.0
> Reporter: Julien Massiera
> Priority: Major
>
> Some files may contain abnormally large metadata (several MB, whether in the
> metadata values, the metadata names, or the total amount of metadata), which
> can cause problematic memory consumption.
> It would be great to develop a new LimitingMetadataFilter so that we can
> filter out metadata according to different byte limits (on metadata names,
> metadata values, and the global amount of metadata).
>
--
This message was sent by Atlassian Jira
(v8.20.1#820001)