[
https://issues.apache.org/jira/browse/TIKA-3695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17508181#comment-17508181
]
Tim Allison edited comment on TIKA-3695 at 3/17/22, 1:00 PM:
-------------------------------------------------------------
Thank you for taking a look! I think I have an option that isn't too awful.
I'll try to commit that in a dev branch this morning.
Two other issues:
1) If we add the filtering to the metadata object via the AutoDetectParser
after the metadata object has been created and potentially written to, we need
to either a) ignore what was written previously (while still counting the
already-written data size toward the write limit) or b) remove data
that has already been added. I'm slightly inclined to b).
2) Some of the parsers rely on data in the metadata object, especially when
parsing embedded files. I'd like to add a user-configured "includeField" Set
to allow users to limit which fields are written. There's a stink here:
either we always include the fields we currently know other parsers rely on
(content-type, length), or we allow users to break parser functionality if they
forget to include these fields. I'm inclined towards the former, but there's an
ugly risk that future parsers might subtly rely on other fields.
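
To make the two points above concrete, here is a minimal sketch, not the
committed implementation. It assumes option b) from item 1 (already-written
data is removed) and the "always include" choice from item 2. A plain
Map<String, List<String>> stands in for the metadata object, and the class
name, the prune method, and the exact protected key names are all hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class IncludeFieldSketch {
    // Assumption: fields that other parsers are currently known to rely on.
    // The key names are illustrative stand-ins for "content-type, length".
    static final Set<String> ALWAYS_INCLUDED = Set.of("Content-Type", "Content-Length");

    // Option b): remove already-written fields that are neither requested by
    // the user nor protected. Returns the names of the removed fields.
    static List<String> prune(Map<String, List<String>> metadata, Set<String> includeFields) {
        Set<String> allowed = new HashSet<>(ALWAYS_INCLUDED);
        allowed.addAll(includeFields);
        List<String> removed = new ArrayList<>();
        Iterator<String> names = metadata.keySet().iterator();
        while (names.hasNext()) {
            String name = names.next();
            if (!allowed.contains(name)) {
                names.remove(); // also drops the entry from the backing map
                removed.add(name);
            }
        }
        return removed;
    }

    public static void main(String[] args) {
        Map<String, List<String>> metadata = new LinkedHashMap<>();
        metadata.put("Content-Type", List.of("application/pdf"));
        metadata.put("dc:title", List.of("Report"));
        metadata.put("pdf:producer", List.of("SomeTool"));
        // The user asks only for dc:title; Content-Type survives regardless.
        List<String> removed = prune(metadata, Set.of("dc:title"));
        System.out.println("kept=" + metadata.keySet() + " removed=" + removed);
    }
}
```

The protected set is the part that carries the risk mentioned above: any
parser that starts relying on a field outside it would break silently for
users with a narrow include list.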
> LimitingMetadataFilter
> ----------------------
>
> Key: TIKA-3695
> URL: https://issues.apache.org/jira/browse/TIKA-3695
> Project: Tika
> Issue Type: New Feature
> Components: metadata
> Affects Versions: 1.28.1, 2.3.0
> Reporter: Julien Massiera
> Priority: Major
>
> Some files may contain abnormally large metadata (several MB, whether in the
> metadata values, the metadata names, or the total amount of
> metadata), which can cause problems for memory consumption.
> It would be great to develop a new LimitingMetadataFilter so that we can
> filter out metadata according to different byte limits (on metadata
> names, metadata values, and the total amount of metadata).
>
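
For illustration, the three limits requested in the description could be
sketched roughly as below. This is a standalone sketch under stated
assumptions, not Tika's actual filter: a plain Map<String, List<String>>
stands in for the metadata object, sizes are counted as UTF-8 bytes, and
oversized names or values are dropped rather than truncated. The class name
and constructor parameters are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class LimitingFilterSketch {
    private final int maxNameBytes;
    private final int maxValueBytes;
    private final int maxTotalBytes;

    LimitingFilterSketch(int maxNameBytes, int maxValueBytes, int maxTotalBytes) {
        this.maxNameBytes = maxNameBytes;
        this.maxValueBytes = maxValueBytes;
        this.maxTotalBytes = maxTotalBytes;
    }

    // Assumption: limits are measured in UTF-8 encoded bytes.
    private static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Returns a filtered copy; the input map is left untouched.
    Map<String, List<String>> filter(Map<String, List<String>> metadata) {
        Map<String, List<String>> kept = new LinkedHashMap<>();
        int total = 0;
        for (Map.Entry<String, List<String>> entry : metadata.entrySet()) {
            String name = entry.getKey();
            int nameBytes = utf8Length(name);
            if (nameBytes > maxNameBytes) {
                continue; // name too large: drop the whole field
            }
            for (String value : entry.getValue()) {
                int valueBytes = utf8Length(value);
                if (valueBytes > maxValueBytes) {
                    continue; // value too large: drop just this value
                }
                // The name is charged once, when its first value is kept.
                int cost = valueBytes + (kept.containsKey(name) ? 0 : nameBytes);
                if (total + cost > maxTotalBytes) {
                    continue; // would exceed the global budget
                }
                kept.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
                total += cost;
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, List<String>> metadata = new LinkedHashMap<>();
        metadata.put("dc:title", List.of("Report"));
        metadata.put("x:blob", List.of("0123456789012345678901234567890123456789"));
        LimitingFilterSketch sketch = new LimitingFilterSketch(32, 16, 1024);
        System.out.println(sketch.filter(metadata));
    }
}
```

Whether oversized values should be dropped or truncated, and whether the
global budget should stop processing entirely, are open design choices that
the sketch does not settle.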
--
This message was sent by Atlassian Jira
(v8.20.1#820001)