[ https://issues.apache.org/jira/browse/TIKA-3695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17508181#comment-17508181 ]

Tim Allison edited comment on TIKA-3695 at 3/17/22, 7:55 PM:
-------------------------------------------------------------

Thank you for taking a look!  I think I have an option that isn't too awful.  
I'll try to commit that in a dev branch this morning.

Three other issues:
1) If we add the filtering to the metadata object via the AutoDetectParser 
after the metadata object has been created and potentially written to, we need 
to either a) ignore what was written previously (although still count the 
already-written data toward the write limit) or b) remove data that has 
already been added.  I'm slightly inclined towards b).
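Option b) could look roughly like the following standalone sketch.  This uses a plain Java map rather than Tika's Metadata class, and the field names, sizes, and limit are purely illustrative: the existing entries are counted against the write limit in insertion order, and anything past the limit is removed.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of option (b): when a write limit is applied to
// metadata that has already been written to, count the existing data
// against the limit and drop whatever exceeds it.
public class RetroactiveLimitDemo {

    // Returns a copy of 'metadata' trimmed so that the total UTF-8 size of
    // keys + values stays within 'maxBytes'.  Entries are visited in
    // insertion order; everything past the limit is removed.
    static Map<String, String> applyLimit(Map<String, String> metadata, int maxBytes) {
        Map<String, String> kept = new LinkedHashMap<>();
        int used = 0;
        for (Map.Entry<String, String> e : metadata.entrySet()) {
            int size = e.getKey().getBytes(StandardCharsets.UTF_8).length
                     + e.getValue().getBytes(StandardCharsets.UTF_8).length;
            if (used + size > maxBytes) {
                break; // limit reached: already-added data beyond here is dropped
            }
            used += size;
            kept.put(e.getKey(), e.getValue());
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, String> m = new LinkedHashMap<>();
        m.put("Content-Type", "text/plain");    // 12 + 10 = 22 bytes
        m.put("X-Huge-Field", "x".repeat(100)); // pushes the total past 50 bytes
        System.out.println(applyLimit(m, 50).keySet()); // prints [Content-Type]
    }
}
```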

2) Some of the parsers rely on data in the metadata object, especially when 
parsing embedded files.  I'd like to add a user-configured "includeField" Set 
to allow users to limit which fields are written.  There's a stink here: 
either we always include the fields we currently know other parsers rely on 
(content-type, length), or we allow users to break parser functionality if 
they forget to include those fields.  I'm inclined towards the former, but 
there's an ugly risk that future parsers might subtly rely on other fields.
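The "always include known-critical fields" option might be sketched like this.  The class, the field names, and the always-include set are hypothetical, not Tika's actual constants or writeFilter API: the user's include set narrows what is written, but fields that other parsers are known to depend on pass through regardless.

```java
import java.util.Set;

// Hypothetical sketch of the "includeField" idea: a user-supplied include
// set limits which fields are written, but fields that other parsers are
// known to rely on (content type, length) are always allowed through, so a
// forgetful configuration cannot break embedded-file parsing.
public class IncludeFieldFilter {

    // Always written regardless of the user's include set.
    // Illustrative names only, not Tika's actual property keys.
    private static final Set<String> ALWAYS_INCLUDE =
            Set.of("Content-Type", "Content-Length");

    private final Set<String> userInclude;

    IncludeFieldFilter(Set<String> userInclude) {
        this.userInclude = userInclude;
    }

    boolean accept(String field) {
        return ALWAYS_INCLUDE.contains(field) || userInclude.contains(field);
    }

    public static void main(String[] args) {
        IncludeFieldFilter f = new IncludeFieldFilter(Set.of("dc:title"));
        System.out.println(f.accept("Content-Type")); // true even if not configured
        System.out.println(f.accept("dc:creator"));   // false: in neither set
    }
}
```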

3) The current workflow sets the writeFilter via the AutoDetectParser.  It 
does not unset it at the end of the parse, so the filter will also affect 
data added to the metadata object after the parse.  I can't think of anything 
that does this now, but we may want to revisit this behavior at some point.
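One way to avoid the lingering filter would be a save/restore around the parse, roughly as below.  The Filter and MetadataSink types are stand-ins for illustration, not Tika's actual writeFilter plumbing:

```java
// Hypothetical sketch of resetting the write filter after the parse: if the
// parser installs a filter but never uninstalls it, writes made to the same
// metadata object after the parse are still filtered.  A try/finally restore
// keeps the filter scoped to the parse itself.
public class FilterResetDemo {
    interface Filter { boolean accept(String field); }

    static class MetadataSink {
        Filter filter = f -> true; // default: accept everything
    }

    // Installs 'parseFilter' for the duration of 'parse' only.
    static void parseWithFilter(MetadataSink sink, Filter parseFilter, Runnable parse) {
        Filter previous = sink.filter;
        sink.filter = parseFilter;
        try {
            parse.run();
        } finally {
            sink.filter = previous; // restore so post-parse writes are unaffected
        }
    }

    public static void main(String[] args) {
        MetadataSink sink = new MetadataSink();
        parseWithFilter(sink, f -> false, () -> {});
        System.out.println(sink.filter.accept("anything")); // true: filter was reset
    }
}
```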



> LimitingMetadataFilter
> ----------------------
>
>                 Key: TIKA-3695
>                 URL: https://issues.apache.org/jira/browse/TIKA-3695
>             Project: Tika
>          Issue Type: New Feature
>          Components: metadata
>    Affects Versions: 1.28.1, 2.3.0
>            Reporter: Julien Massiera
>            Priority: Major
>
> Some files may contain abnormally big metadata (several MB, whether in the 
> metadata values, the metadata names, or the total amount of metadata), which 
> can be problematic for memory consumption.
> It would be great to develop a new LimitingMetadataFilter so that we can 
> filter out metadata according to different byte limits (on metadata names, 
> metadata values, and the global amount of metadata).
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
