[jira] [Comment Edited] (TIKA-3695) LimitingMetadataFilter

Tim Allison (Jira) Thu, 17 Mar 2022 06:00:18 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-3695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17508181#comment-17508181
 ]


Tim Allison edited comment on TIKA-3695 at 3/17/22, 1:00 PM:
-------------------------------------------------------------

Thank you for taking a look!  I think I have an option that isn't too awful.  
I'll try to commit that in a dev branch this morning.

Two other issues:
1) If we add the filtering to the metadata object via the AutoDetectParser 
after the metadata object has been created and potentially written to, we need 
to either a) ignore what was written previously (while still counting the size 
of the already written data against the write limit) or b) remove data that 
has already been added.  I'm slightly inclined to b).
2) Some of the parsers rely on data in the metadata object, especially when 
parsing embedded files.  I'd like to add a user-configured "includeField" Set 
to allow users to limit which fields are written.  There's a stink here.  
Either we always include the fields we currently know other parsers rely on 
(content-type, length), or we let users break parser functionality if they 
forget to include these fields.  I'm inclined towards the former, but there's 
an ugly risk that future parsers might subtly rely on other fields.
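
To make option b) and the "includeField" idea concrete, here is a rough sketch.  
It is not Tika code: it uses a plain Map instead of Tika's Metadata class, and 
the class name, the method name, and the contents of ALWAYS_INCLUDE are all 
hypothetical.  It removes fields that were already written but are neither 
user-included nor on the always-include list.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class IncludeFieldSketch {

    // Assumption: fields that other parsers are known to rely on.
    // The exact names here are illustrative only.
    static final Set<String> ALWAYS_INCLUDE = Set.of("Content-Type", "Content-Length");

    /**
     * Option b): remove already-written fields that are not allowed.
     * Matching is exact and case-sensitive in this sketch.
     * Returns the names of the fields that were removed.
     */
    static List<String> retainAllowed(Map<String, List<String>> metadata,
                                      Set<String> includeFields) {
        Set<String> allowed = new HashSet<>(includeFields);
        allowed.addAll(ALWAYS_INCLUDE);

        List<String> removed = new ArrayList<>();
        metadata.keySet().removeIf(name -> {
            if (allowed.contains(name)) {
                return false; // keep this field
            }
            removed.add(name);
            return true; // drop this field from the backing map
        });
        return removed;
    }
}
```

With includeFields = {"dc:title"}, a map holding Content-Type, dc:title and 
some other field would keep the first two and drop the third.  A hard-coded 
always-include list has the risk noted above: it goes stale if a future parser 
starts relying on a field that is not on it.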


was (Author: [email protected]):
Thank you for taking a look!  I think I have an option that isn't too awful.  
I'll try to commit that in a dev branch this morning.

Two other issues:
1) If we add the filtering to the metadata object via the AutoDetectParser 
after the metadata object has been created and potentially written to, we need 
to either a) ignore what was written previously (while still counting the size 
of the already written data against the write limit) or b) remove data that 
has already been added.  I'm slightly inclined to b).
2) Some of the parsers rely on data in the metadata object, especially when 
parsing embedded files.  I'd like to add an "includeField" Set to allow users 
to limit which fields are written.  There's a stink here.  Either we always 
include the fields we currently know other parsers rely on (content-type, 
length), or we let users break parser functionality if they forget to include 
these fields.  I'm inclined towards the former, but there's an ugly risk that 
future parsers might subtly rely on other fields.

> LimitingMetadataFilter
> ----------------------
>
>                 Key: TIKA-3695
>                 URL: https://issues.apache.org/jira/browse/TIKA-3695
>             Project: Tika
>          Issue Type: New Feature
>          Components: metadata
>    Affects Versions: 1.28.1, 2.3.0
>            Reporter: Julien Massiera
>            Priority: Major
>
> Some files contain abnormally large metadata (several MB, whether in the 
> metadata values, the metadata names, or the total amount of metadata), 
> which can be a problem for memory consumption.
> It would be great to develop a new LimitingMetadataFilter so that we can 
> filter out metadata according to different byte limits (on metadata 
> names, metadata values, and the total amount of metadata).
>  
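
The three limits requested in the issue (per name, per value, and overall) 
could be enforced roughly as follows.  This is only a sketch under stated 
assumptions, not the LimitingMetadataFilter itself: the class and method names 
are made up, sizes are measured as UTF-8 bytes, fields are single-valued, and 
an entry that exceeds a limit is dropped rather than truncated.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

final class ByteLimitSketch {
    private final int maxNameBytes;
    private final int maxValueBytes;
    private final int maxTotalBytes;
    private final Map<String, String> fields = new LinkedHashMap<>();
    private long totalBytes = 0;

    ByteLimitSketch(int maxNameBytes, int maxValueBytes, int maxTotalBytes) {
        this.maxNameBytes = maxNameBytes;
        this.maxValueBytes = maxValueBytes;
        this.maxTotalBytes = maxTotalBytes;
    }

    private static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Stores the field and returns true, or returns false if a limit would be exceeded. */
    boolean set(String name, String value) {
        int nameBytes = utf8Length(name);
        int valueBytes = utf8Length(value);
        if (nameBytes > maxNameBytes || valueBytes > maxValueBytes) {
            return false;
        }
        // When overwriting, the old entry's bytes are released first.
        String old = fields.get(name);
        long oldBytes = (old == null) ? 0 : nameBytes + utf8Length(old);
        long newTotal = totalBytes - oldBytes + nameBytes + valueBytes;
        if (newTotal > maxTotalBytes) {
            return false;
        }
        fields.put(name, value);
        totalBytes = newTotal;
        return true;
    }

    long totalBytes() {
        return totalBytes;
    }

    String get(String name) {
        return fields.get(name);
    }
}
```

Counting both the name and the value against the overall budget is a design 
choice in this sketch; the issue text only says that all three quantities 
should be bounded.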



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
