[ 
https://issues.apache.org/jira/browse/TIKA-3695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507850#comment-17507850
 ] 

Tim Allison edited comment on TIKA-3695 at 3/16/22, 8:08 PM:
-------------------------------------------------------------

This is messier than I'd like.  Some thoughts:
* I don't want to have to change every call to new Metadata() in our codebase. 
We could, e.g., put a metadata filter factory/wrapper in the ParseContext and 
have every parser call it instead of "new Metadata()", but this is messy 
and error-prone for new parsers that handle embedded files.

* I tried subclassing Metadata and putting the limiting features in 
add(String, String) and set(String, String). This kind of works, but having the 
AutoDetectParser wrap the user-supplied metadata in this subclass doesn't work: 
writes go to the new object rather than to the user-supplied Metadata, and the 
original Metadata object holds no reference to the data in the new object.

* I tried subclassing Metadata with a proxy pointer to the original Metadata 
object, but this gets just as complicated as the original Metadata class. 
To apply the limits correctly to Properties and to handle multivalued 
properties with primary and secondary keys, I would wind up 
duplicating the Metadata object, but with some write limits.
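
To illustrate the problem in the second bullet, here is a minimal sketch. 
SimpleMetadata, LimitingMetadata and WrapDemo are hypothetical, heavily 
simplified stand-ins (not Tika's actual classes), and truncating values is 
just one possible limiting policy:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, heavily simplified stand-in for Tika's Metadata class.
class SimpleMetadata {
    private final Map<String, List<String>> values = new HashMap<>();

    public void add(String name, String value) {
        values.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    public void set(String name, String value) {
        List<String> list = new ArrayList<>();
        list.add(value);
        values.put(name, list);
    }

    // Returns the first value for the name, or null if there is none.
    public String get(String name) {
        List<String> list = values.get(name);
        return (list == null || list.isEmpty()) ? null : list.get(0);
    }
}

// Hypothetical limiting subclass: truncates values longer than maxValueLength.
class LimitingMetadata extends SimpleMetadata {
    private final int maxValueLength;

    LimitingMetadata(int maxValueLength) {
        this.maxValueLength = maxValueLength;
    }

    private String limit(String value) {
        if (value == null || value.length() <= maxValueLength) {
            return value;
        }
        return value.substring(0, maxValueLength);
    }

    @Override
    public void add(String name, String value) {
        super.add(name, limit(value));
    }

    @Override
    public void set(String name, String value) {
        super.set(name, limit(value));
    }
}

class WrapDemo {
    // Mimics a parser that swaps in a limiting object internally.
    static SimpleMetadata parse(SimpleMetadata userSupplied) {
        SimpleMetadata limited = new LimitingMetadata(5);
        limited.set("title", "a very long title");
        // The write landed in 'limited'; 'userSupplied' was never touched.
        return limited;
    }
}
```

After the call to parse(), the caller's own object is still empty, because 
nothing links it to the limiting object that received the writes.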

Unless anyone has objections, I feel like the cleanest thing to do is to add 
the limits directly to set(String, String) and add(String, String) in the 
actual Metadata object.  We can set the limits via the AutoDetectParser.

I don't like this option because it adds more complexity to the Metadata 
object, but I don't see better alternatives.
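
A rough sketch of what that could look like. BoundedMetadata and setLimits 
are hypothetical names, and silently dropping an over-limit write is an 
assumed policy; the real change to Metadata may well look different (for one 
thing, the real set/add return void and also handle Property keys):

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a metadata container whose own set/add enforce limits.
class BoundedMetadata {
    private final Map<String, List<String>> values = new LinkedHashMap<>();
    private int maxNameBytes = Integer.MAX_VALUE;
    private int maxValueBytes = Integer.MAX_VALUE;
    private long maxTotalBytes = Long.MAX_VALUE;
    private long totalBytes = 0;

    // A parser (e.g. the AutoDetectParser) could call this before parsing.
    public void setLimits(int maxNameBytes, int maxValueBytes, long maxTotalBytes) {
        this.maxNameBytes = maxNameBytes;
        this.maxValueBytes = maxValueBytes;
        this.maxTotalBytes = maxTotalBytes;
    }

    private static int size(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // True if the name/value pair fits, given the bytes already in use.
    private boolean fits(int nameBytes, int valueBytes, long usedBytes) {
        return nameBytes <= maxNameBytes
                && valueBytes <= maxValueBytes
                && (long) nameBytes + valueBytes <= maxTotalBytes - usedBytes;
    }

    // Appends a value; returns false (and stores nothing) if a limit is hit.
    public boolean add(String name, String value) {
        int nameBytes = size(name);
        int valueBytes = size(value);
        if (!fits(nameBytes, valueBytes, totalBytes)) {
            return false;
        }
        values.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
        totalBytes += nameBytes + valueBytes;
        return true;
    }

    // Replaces all values for the name; a rejected write keeps the old values.
    public boolean set(String name, String value) {
        int nameBytes = size(name);
        int valueBytes = size(value);
        long released = 0;
        List<String> old = values.get(name);
        if (old != null) {
            for (String v : old) {
                released += nameBytes + size(v);
            }
        }
        if (!fits(nameBytes, valueBytes, totalBytes - released)) {
            return false;
        }
        List<String> replacement = new ArrayList<>();
        replacement.add(value);
        values.put(name, replacement);
        totalBytes = totalBytes - released + nameBytes + valueBytes;
        return true;
    }

    public String get(String name) {
        List<String> list = values.get(name);
        return (list == null || list.isEmpty()) ? null : list.get(0);
    }

    public long getTotalBytes() {
        return totalBytes;
    }
}
```

Because the limits live in the object the user passed in, parsers that keep 
calling "new Metadata()" for embedded files would only need the limits 
copied over, not a different class.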

Thoughts?


> LimitingMetadataFilter
> ----------------------
>
>                 Key: TIKA-3695
>                 URL: https://issues.apache.org/jira/browse/TIKA-3695
>             Project: Tika
>          Issue Type: New Feature
>          Components: metadata
>    Affects Versions: 1.28.1, 2.3.0
>            Reporter: Julien Massiera
>            Priority: Major
>
> Some files may contain abnormally large metadata (several MB, whether in the 
> metadata values, the metadata names, or the total amount of metadata), 
> which can cause problems with memory consumption.
> It would be great to develop a new LimitingMetadataFilter so that we can 
> filter out metadata according to different byte limits (on metadata names, 
> metadata values, and the total amount of metadata).
>  



