[
https://issues.apache.org/jira/browse/TIKA-3695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17509041#comment-17509041
]
Julien Massiera commented on TIKA-3695:
---------------------------------------
I am not sure I understand how it works. I have configured the following in my
tika-config.xml file for my tika-server-standard built from commit
'000abdcf70112df1a2a9a433e308c1fe5db1d45e':
{code:java}
<parser class="org.apache.tika.parser.DefaultParser">
</parser>
<autoDetectParserConfig>
  <metadataWriteFilterFactory
      class="org.apache.tika.metadata.StandardWriteFilterFactory">
    <params>
      <maxEstimatedBytes>20</maxEstimatedBytes>
    </params>
  </metadataWriteFilterFactory>
</autoDetectParserConfig>
{code}
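My expectation of what maxEstimatedBytes should do, as a standalone sketch (this is my own illustration, not Tika's actual StandardWriteFilter code; the class and method names below are made up):
{code:java}
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only -- not Tika's actual implementation.
// A write filter that truncates metadata once an estimated byte
// budget is exhausted (20 bytes, matching the config above) and
// records that the limit was reached.
public class WriteFilterSketch {
    private final int maxEstimatedBytes;
    private int estimatedBytes = 0;
    private boolean limitReached = false;
    private final Map<String, String> metadata = new LinkedHashMap<>();

    public WriteFilterSketch(int maxEstimatedBytes) {
        this.maxEstimatedBytes = maxEstimatedBytes;
    }

    public void set(String name, String value) {
        if (limitReached) {
            return; // budget already spent; drop further writes
        }
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        if (estimatedBytes + bytes.length > maxEstimatedBytes) {
            // keep only what fits in the remaining budget, flag the overflow
            int remaining = Math.max(0, maxEstimatedBytes - estimatedBytes);
            value = new String(bytes, 0, remaining, StandardCharsets.UTF_8);
            limitReached = true;
        }
        estimatedBytes += value.getBytes(StandardCharsets.UTF_8).length;
        metadata.put(name, value);
    }

    public String get(String name) {
        return metadata.get(name);
    }

    public boolean isLimitReached() {
        return limitReached;
    }

    public static void main(String[] args) {
        WriteFilterSketch f = new WriteFilterSketch(20);
        f.set("dc:description", "x".repeat(50_000));
        System.out.println(f.get("dc:description").length()); // 20
        System.out.println(f.isLimitReached());               // true
    }
}
{code}
With a 50k-char description and a 20-byte budget, I would expect the stored value to be truncated to 20 bytes and the limit flag to be set -- which is not what I observe below.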
Then I sent a docx whose description metadata contains more than 50k chars,
with the following command:
{code:java}
curl -H "writeLimit:1000000" -T bigmetadata.docx \
  http://localhost:9998/rmeta/text > tika-extract.json
{code}
In the tika-extract.json result, the dc:description metadata still contains all
50k chars, and I don't see the Tika exception flag "metadata_limit_reached".
What am I doing wrong?
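For what it's worth, this is how I'm checking the /rmeta output for the flag (a crude substring check; the exact JSON key Tika would emit is my assumption):
{code:java}
import java.nio.file.Files;
import java.nio.file.Paths;

// Crude check of the /rmeta JSON output for the limit flag.
// The key name searched for is my assumption of what Tika would emit.
public class CheckLimitFlag {
    static boolean limitReached(String rmetaJson) {
        return rmetaJson.contains("metadata_limit_reached");
    }

    public static void main(String[] args) throws Exception {
        String json = Files.readString(Paths.get("tika-extract.json"));
        System.out.println(limitReached(json));
    }
}
{code}
This prints false for the extract above.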
> LimitingMetadataFilter
> ----------------------
>
> Key: TIKA-3695
> URL: https://issues.apache.org/jira/browse/TIKA-3695
> Project: Tika
> Issue Type: New Feature
> Components: metadata
> Affects Versions: 1.28.1, 2.3.0
> Reporter: Julien Massiera
> Priority: Major
> Fix For: 2.3.1
>
>
> Some files may contain abnormally large metadata (several MB, whether in the
> metadata values, the metadata names, or the total amount of metadata), which
> can be problematic for memory consumption.
> It would be great to develop a new LimitingMetadataFilter so that metadata
> can be filtered out according to different byte limits (on metadata names,
> metadata values, and the total amount of metadata).
>
--
This message was sent by Atlassian Jira
(v8.20.1#820001)