[
https://issues.apache.org/jira/browse/TIKA-3695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507850#comment-17507850
]
Tim Allison edited comment on TIKA-3695 at 3/16/22, 8:08 PM:
-------------------------------------------------------------
This is messier than I'd like. Some thoughts:
* I don't want to have to change every call to new Metadata() in our codebase.
We could, for example, put a metadata filter factory/wrapper in the
ParseContext and have every parser call it instead of "new Metadata()", but
this is messy and error-prone for new parsers that handle embedded files.
* I tried subclassing Metadata and putting the limiting features in
add(String, String) and set(String, String). This kind of works, but having the
AutoDetectParser wrap the user-supplied Metadata in this subclass doesn't work:
the parser then writes to the new object, not to the user-supplied Metadata,
and the original Metadata object has no reference to the data in the new one.
* I tried subclassing Metadata with a proxy pointer to the original Metadata
object, but this gets just as complicated as the original Metadata class. To
apply the limits correctly to Properties and to handle multivalued properties
with primary and secondary keys, I would wind up duplicating the Metadata
class, just with some write limits.
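To make the second point concrete, here is a minimal sketch of why swapping in
a limiting subclass leaves the caller's object empty. PlainMetadata,
LengthLimitedMetadata and WrappingParser are hypothetical stand-ins written for
this illustration; they are not Tika's Metadata or AutoDetectParser, and the
10-character limit is arbitrary.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Stand-in for a metadata container; NOT Tika's Metadata class.
class PlainMetadata {
    private final Map<String, List<String>> values = new HashMap<>();

    public void add(String name, String value) {
        values.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    public List<String> get(String name) {
        return values.getOrDefault(name, Collections.emptyList());
    }
}

// Hypothetical subclass that silently drops values longer than maxLength chars.
class LengthLimitedMetadata extends PlainMetadata {
    private final int maxLength;

    LengthLimitedMetadata(int maxLength) {
        this.maxLength = maxLength;
    }

    @Override
    public void add(String name, String value) {
        if (value.length() <= maxLength) {
            super.add(name, value);
        }
    }
}

// Stand-in for a parser that replaces the caller's object with a limited one.
class WrappingParser {
    static PlainMetadata parse(PlainMetadata userSupplied) {
        PlainMetadata limited = new LengthLimitedMetadata(10);
        limited.add("title", "short");
        // userSupplied is never written to, so the caller sees nothing.
        return limited;
    }
}
```

After WrappingParser.parse(user), user.get("title") is still empty; only the
object created inside the parser holds the value.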
Unless anyone has objections, I feel like the cleanest thing to do is to add
the limits directly to set(String, String) and add(String, String) in the
actual Metadata class. We can configure the limits via the AutoDetectParser.
I don't like this option because it adds more complexity to the Metadata
class, but I don't see better alternatives.
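As a rough sketch of what limits inside set and add could look like, assuming
byte limits on names, values and the running total as requested in the issue:
WriteLimitedMetadata, setWriteLimits and the boolean return values are
hypothetical and only illustrate the idea; this is not Tika's Metadata class
or its API. Sizes are counted as UTF-8 bytes, and the name is counted once per
stored value.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified stand-in, NOT Tika's Metadata; all names and limits are hypothetical.
class WriteLimitedMetadata {
    private final Map<String, List<String>> values = new HashMap<>();
    private int maxNameBytes = Integer.MAX_VALUE;
    private int maxValueBytes = Integer.MAX_VALUE;
    private long maxTotalBytes = Long.MAX_VALUE;
    private long totalBytes = 0;

    // Intended to be called once by whoever owns the parse.
    public void setWriteLimits(int maxNameBytes, int maxValueBytes, long maxTotalBytes) {
        this.maxNameBytes = maxNameBytes;
        this.maxValueBytes = maxValueBytes;
        this.maxTotalBytes = maxTotalBytes;
    }

    // Appends a value; returns false (and stores nothing) if a limit would be exceeded.
    public boolean add(String name, String value) {
        if (!fits(name, value, totalBytes)) {
            return false;
        }
        values.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
        totalBytes += utf8Length(name) + utf8Length(value);
        return true;
    }

    // Replaces all values for name; bytes held by the old values are released first.
    public boolean set(String name, String value) {
        long base = totalBytes - storedBytes(name);
        if (!fits(name, value, base)) {
            return false;
        }
        List<String> single = new ArrayList<>();
        single.add(value);
        values.put(name, single);
        totalBytes = base + utf8Length(name) + utf8Length(value);
        return true;
    }

    public List<String> get(String name) {
        return values.getOrDefault(name, Collections.emptyList());
    }

    public long getTotalBytes() {
        return totalBytes;
    }

    private boolean fits(String name, String value, long baseTotal) {
        int n = utf8Length(name);
        int v = utf8Length(value);
        return n <= maxNameBytes && v <= maxValueBytes && baseTotal + n + v <= maxTotalBytes;
    }

    private long storedBytes(String name) {
        long sum = 0;
        for (String v : get(name)) {
            sum += utf8Length(name) + utf8Length(v);
        }
        return sum;
    }

    private static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }
}
```

Because the limits live in the object itself, every caller that writes to the
same instance is covered without changing how the instance is created. Whether
an over-limit write should be dropped silently, truncated, or reported is a
separate design choice that this sketch does not settle.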
Thoughts?
> LimitingMetadataFilter
> ----------------------
>
> Key: TIKA-3695
> URL: https://issues.apache.org/jira/browse/TIKA-3695
> Project: Tika
> Issue Type: New Feature
> Components: metadata
> Affects Versions: 1.28.1, 2.3.0
> Reporter: Julien Massiera
> Priority: Major
>
> Some files may contain abnormally large metadata (several MB, whether in the
> metadata values, the metadata names, or the total amount of metadata), which
> can cause excessive memory consumption.
> It would be great to develop a new LimitingMetadataFilter so that we can
> filter out metadata according to different byte limits (on metadata
> names, metadata values, and the total amount of metadata).
>