[ 
https://issues.apache.org/jira/browse/TIKA-3695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17510737#comment-17510737
 ] 

Tim Allison edited comment on TIKA-3695 at 3/22/22, 5:04 PM:
-------------------------------------------------------------

The cause is that the mime type 
{{application/vnd.openxmlformats-officedocument.wordprocessingml.document}} is 
71 characters/142 bytes (at 2 bytes per character).  The value gets truncated, 
so no parser is selected.
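
To make the arithmetic concrete, here is a minimal Java sketch. It assumes the 142-byte figure comes from counting 2 bytes per character (as in UTF-16); {{truncateToBytes}} and the 100-byte limit are hypothetical and are not Tika's actual code.

```java
import java.nio.charset.StandardCharsets;

class MimeLengthCheck {
    static final String DOCX_MIME =
            "application/vnd.openxmlformats-officedocument.wordprocessingml.document";

    // Hypothetical helper (not a Tika method): cut a value so that its
    // UTF-16 encoding fits in maxBytes. Assumes 2 bytes per char and does
    // not guard against splitting a surrogate pair.
    static String truncateToBytes(String value, int maxBytes) {
        int maxChars = maxBytes / 2;
        return value.length() <= maxChars ? value : value.substring(0, maxChars);
    }

    public static void main(String[] args) {
        System.out.println(DOCX_MIME.length());                                   // 71
        System.out.println(DOCX_MIME.getBytes(StandardCharsets.UTF_16BE).length); // 142

        // With an assumed 100-byte field limit, only the first 50 chars survive,
        // so the stored value is no longer the docx mime type.
        String truncated = truncateToBytes(DOCX_MIME, 100);
        System.out.println(truncated);
        System.out.println(truncated.equals(DOCX_MIME));                          // false
    }
}
```

Any field limit below 142 bytes under this counting would cut the docx mime type short in the same way.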

I changed the behavior to apply the field limit even to the alwaysSet fields 
because some of those fields can be set by parsers/user input, and I didn't 
want a DoS in those alwaysSet fields.

Recommendations for better behavior?

Should we enforce a hard minimum on the maximum field length, so that it is 
always > the longest mime type?

We could set a separate, unmodifiable field length limit on the alwaysSet 
fields, so that users could still limit the other fields to < 150, for example.

Should we not apply the field limit to the alwaysSet fields at all?
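
As a rough illustration of the second option (a separate, fixed limit for the alwaysSet fields), here is a sketch only. The class name, the field names in {{ALWAYS_SET}}, and the 1024-byte fixed limit are all assumptions for illustration and do not reflect Tika's implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

class TwoTierFieldLimit {
    // Assumed fixed limit for always-set fields; chosen only to be larger
    // than the 142 bytes needed for the docx mime type.
    static final int ALWAYS_SET_MAX_BYTES = 1024;
    // Assumed set of always-set field names, for illustration.
    static final Set<String> ALWAYS_SET = Set.of("Content-Type", "Content-Length");

    private final int userMaxBytes; // user-configurable limit for all other fields
    private final Map<String, String> fields = new LinkedHashMap<>();

    TwoTierFieldLimit(int userMaxBytes) {
        this.userMaxBytes = userMaxBytes;
    }

    void set(String name, String value) {
        // Always-set fields get the fixed limit; everything else gets the user limit.
        int maxBytes = ALWAYS_SET.contains(name) ? ALWAYS_SET_MAX_BYTES : userMaxBytes;
        int maxChars = maxBytes / 2; // assumes 2 bytes per char
        fields.put(name, value.length() <= maxChars ? value : value.substring(0, maxChars));
    }

    String get(String name) {
        return fields.get(name);
    }

    public static void main(String[] args) {
        String docx =
                "application/vnd.openxmlformats-officedocument.wordprocessingml.document";
        TwoTierFieldLimit metadata = new TwoTierFieldLimit(100); // user limit below 150
        metadata.set("Content-Type", docx);
        metadata.set("title", "x".repeat(200));
        System.out.println(metadata.get("Content-Type").equals(docx)); // true
        System.out.println(metadata.get("title").length());            // 50
    }
}
```

The always-set fields are still bounded, so a parser or user cannot inflate them without limit, while a small user limit no longer breaks parser selection.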



> LimitingMetadataFilter
> ----------------------
>
>                 Key: TIKA-3695
>                 URL: https://issues.apache.org/jira/browse/TIKA-3695
>             Project: Tika
>          Issue Type: New Feature
>          Components: metadata
>    Affects Versions: 1.28.1, 2.3.0
>            Reporter: Julien Massiera
>            Priority: Major
>             Fix For: 2.4.0
>
>         Attachments: huge-title.docx, tika-config.xml
>
>
> Some files may contain abnormally large metadata (several MB, whether in the 
> metadata values, the metadata names, or the total amount of metadata), 
> which can be problematic for memory consumption.
> It would be great to develop a new LimitingMetadataFilter so that we can 
> filter out metadata according to different byte limits (on metadata 
> names, metadata values and the total amount of metadata).
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
