[jira] [Comment Edited] (TIKA-3695) LimitingMetadataFilter

Tim Allison (Jira) Wed, 09 Mar 2022 07:48:06 -0800


    [ 
https://issues.apache.org/jira/browse/TIKA-3695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17503661#comment-17503661
 ]


Tim Allison edited comment on TIKA-3695 at 3/9/22, 3:47 PM:
------------------------------------------------------------

On the list, I suggested implementing this as a MetadataFilter.  These are used 
by the RecursiveParserWrapper (e.g. /rmeta in tika-server or '-J' option in 
tika-app).  They are triggered after the parse of the file. 

On further thought, I'm not sure this is the best option, for two reasons.  
First, the metadata would still be extracted and stored in the Metadata object 
within Tika and only truncated/removed before the data is returned to the 
user, so it would stay in memory and consume Tika's memory resources until the 
file's parsing is finished.  Second, this solution will not play well with the 
traditional xhtml output of /tika, where whatever is in the metadata object is 
written when the parser hits the first bit of content text, not after the 
parse has concluded.
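
To make the first concern concrete, here is a minimal sketch of a filter that 
runs only after parsing.  The class PostParseFilterSketch and its filter() 
method are hypothetical stand-ins over a plain Map; they are not Tika's 
MetadataFilter interface or Metadata class.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a post-parse filter; NOT Tika's MetadataFilter API.
final class PostParseFilterSketch {

    /** Returns a copy of the metadata with every value cut to maxChars characters. */
    static Map<String, String> filter(Map<String, String> metadata, int maxChars) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : metadata.entrySet()) {
            String v = e.getValue();
            // The full value already sits in 'metadata'; we only shrink the copy.
            out.put(e.getKey(), v.length() > maxChars ? v.substring(0, maxChars) : v);
        }
        return out;
    }
}
```

In this sketch the map passed to filter() still holds every oversized value 
when the filter runs, so the filter can shrink what is returned but not the 
peak memory used during the parse.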

I'm wondering if we need to put these protections deeper into the Metadata 
object itself so that it isn't storing this info and then removing it.  

If a parser tries to write too much metadata, do we throw a WriteLimitException 
and stop parsing, or do we keep parsing but add a "metadata truncation" flag to 
the metadata object?  I'd be inclined to the latter.
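
As a rough sketch of the second option (keep parsing, but flag truncation), 
limits could be enforced at the moment a value is added.  Everything below is 
an assumption for illustration: LimitedMetadata, its two byte limits and 
isTruncated() are hypothetical names, not part of Tika's Metadata class.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical class for illustration only; this is NOT Tika's Metadata API.
final class LimitedMetadata {
    private final int maxValueBytes;   // cap on a single value, in UTF-8 bytes
    private final int maxTotalBytes;   // cap on stored names plus values overall
    private final Map<String, List<String>> values = new LinkedHashMap<>();
    private int totalBytes = 0;
    private boolean truncated = false; // the "metadata truncation" flag

    LimitedMetadata(int maxValueBytes, int maxTotalBytes) {
        this.maxValueBytes = maxValueBytes;
        this.maxTotalBytes = maxTotalBytes;
    }

    /** Adds a value, applying the limits before anything is stored. */
    void add(String name, String value) {
        String stored = truncateUtf8(value, maxValueBytes);
        if (stored.length() < value.length()) {
            truncated = true;          // value was cut to fit the per-value cap
        }
        // A name is only charged once, the first time it is stored.
        int cost = utf8Length(stored) + (values.containsKey(name) ? 0 : utf8Length(name));
        if (totalBytes + cost > maxTotalBytes) {
            truncated = true;          // over budget: drop the value, keep parsing
            return;
        }
        values.computeIfAbsent(name, k -> new ArrayList<>()).add(stored);
        totalBytes += cost;
    }

    List<String> get(String name) {
        return values.getOrDefault(name, Collections.emptyList());
    }

    boolean isTruncated() { return truncated; }

    int totalBytes() { return totalBytes; }

    private static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Cuts s to at most maxBytes of UTF-8 without splitting a code point. */
    private static String truncateUtf8(String s, int maxBytes) {
        int bytes = 0;
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            int len = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
            if (bytes + len > maxBytes) {
                break;
            }
            bytes += len;
            i += Character.charCount(cp);
        }
        return s.substring(0, i);
    }
}
```

The stricter alternative would be to throw from add() instead of setting the 
flag, which stops the parse at the first oversized write.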

Thoughts?  How do we configure it...new parameters on AutoDetectParser? 


> LimitingMetadataFilter
> ----------------------
>
>                 Key: TIKA-3695
>                 URL: https://issues.apache.org/jira/browse/TIKA-3695
>             Project: Tika
>          Issue Type: New Feature
>          Components: metadata
>    Affects Versions: 1.28.1, 2.3.0
>            Reporter: Julien Massiera
>            Priority: Major
>
> Some files may contain abnormally large metadata (several MB, whether in the 
> metadata values, the metadata names, or the total amount of metadata), which 
> can be problematic for memory consumption.
> It would be great to develop a new LimitingMetadataFilter so that we can 
> filter out metadata according to different byte limits (on metadata 
> names, metadata values, and the global amount of metadata).
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
