[
https://issues.apache.org/jira/browse/TIKA-3695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17510715#comment-17510715
]
Julien Massiera commented on TIKA-3695:
---------------------------------------
Indeed, I get the following output for your huge-title.docx file:
{code:java}
[
  {
    "X-TIKA:Parsed-By": "org.apache.tika.parser.EmptyParser",
    "X-TIKA:parse_time_millis": "17",
    "X-TIKA:embedded_depth": "0",
    "X-TIKA:WARN:truncated_metadata": "true",
    "Content-Type": "application/vnd.openxmlformats-officedocument.word"
  }
] {code}
This is with the following configuration in tika-config.xml:
{code:xml}
<autoDetectParserConfig>
  <metadataWriteFilterFactory
      class="org.apache.tika.metadata.writefilter.StandardWriteFilterFactory">
    <params>
      <maxKeySize>999</maxKeySize>
      <maxFieldSize>100</maxFieldSize>
      <maxTotalEstimatedBytes>500000</maxTotalEstimatedBytes>
    </params>
  </metadataWriteFilterFactory>
</autoDetectParserConfig> {code}
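For illustration only: the truncated Content-Type value above is exactly 50 characters long, which would be consistent with maxFieldSize=100 being counted in UTF-16 bytes (2 per character). That is an assumption, not something confirmed here. The sketch below is a self-contained, hypothetical model of how the three limits could interact. It is not Tika's StandardWriteFilter, and the names LimitingMetadataSketch, set, get and isTruncated are made up for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model of the three limits; NOT Tika's StandardWriteFilter.
class LimitingMetadataSketch {
    private final int maxKeySize;             // byte limit per metadata name
    private final int maxFieldSize;           // byte limit per metadata value
    private final int maxTotalEstimatedBytes; // byte limit for all names + values
    private final Map<String, String> fields = new LinkedHashMap<>();
    private int totalBytes = 0;
    private boolean truncated = false;

    LimitingMetadataSketch(int maxKeySize, int maxFieldSize, int maxTotalEstimatedBytes) {
        this.maxKeySize = maxKeySize;
        this.maxFieldSize = maxFieldSize;
        this.maxTotalEstimatedBytes = maxTotalEstimatedBytes;
    }

    // Assumption: sizes are estimated as UTF-16, i.e. 2 bytes per Java char.
    static int estimateBytes(String s) {
        return 2 * s.length();
    }

    // Cut s so its estimated size fits in maxBytes (may split a surrogate pair).
    static String truncate(String s, int maxBytes) {
        int maxChars = maxBytes / 2;
        return s.length() <= maxChars ? s : s.substring(0, maxChars);
    }

    void set(String key, String value) {
        String k = truncate(key, maxKeySize);
        String v = truncate(value, maxFieldSize);
        if (k.length() < key.length() || v.length() < value.length()) {
            truncated = true;
        }
        // Account for a previous value under the same (possibly truncated) key.
        String old = fields.get(k);
        int oldSize = old == null ? 0 : estimateBytes(k) + estimateBytes(old);
        int newSize = estimateBytes(k) + estimateBytes(v);
        if (totalBytes - oldSize + newSize > maxTotalEstimatedBytes) {
            truncated = true; // over the global budget: drop this field
            return;
        }
        fields.put(k, v);
        totalBytes += newSize - oldSize;
    }

    String get(String key) {
        return fields.get(key);
    }

    boolean isTruncated() {
        return truncated;
    }

    public static void main(String[] args) {
        // Same limits as in the tika-config.xml above.
        LimitingMetadataSketch m = new LimitingMetadataSketch(999, 100, 500000);
        m.set("Content-Type",
              "application/vnd.openxmlformats-officedocument.wordprocessingml.document");
        System.out.println(m.get("Content-Type"));
        System.out.println(m.isTruncated());
    }
}
```

Under that 2-bytes-per-character assumption, a 100-byte value limit keeps at most 50 characters, so the sketch cuts the full docx media type at the same point as the output above and raises its truncation flag.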
> LimitingMetadataFilter
> ----------------------
>
> Key: TIKA-3695
> URL: https://issues.apache.org/jira/browse/TIKA-3695
> Project: Tika
> Issue Type: New Feature
> Components: metadata
> Affects Versions: 1.28.1, 2.3.0
> Reporter: Julien Massiera
> Priority: Major
> Fix For: 2.4.0
>
> Attachments: huge-title.docx, tika-config.xml
>
>
> Some files contain abnormally large metadata (several MB), whether in the
> metadata values, the metadata names, or the total amount of metadata,
> which can be problematic for memory consumption.
> It would be great to develop a new LimitingMetadataFilter so that we can
> filter out metadata according to different byte limits (on metadata
> names, metadata values, and the total amount of metadata).
>