[
https://issues.apache.org/jira/browse/TIKA-3695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17509041#comment-17509041
]
Julien Massiera commented on TIKA-3695:
---------------------------------------
I am not sure I understand how it works. I have configured the following in my
tika-config.xml file for my tika-server-standard built from commit
'000abdcf70112df1a2a9a433e308c1fe5db1d45e':
{code:java}
<parser class="org.apache.tika.parser.DefaultParser">
</parser>
<autoDetectParserConfig>
  <metadataWriteFilterFactory
      class="org.apache.tika.metadata.StandardWriteFilterFactory">
    <params>
      <maxEstimatedBytes>20</maxEstimatedBytes>
    </params>
  </metadataWriteFilterFactory>
</autoDetectParserConfig>
{code}
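My expectation of what maxEstimatedBytes should do, as a standalone sketch (this is my own illustration, not Tika's actual StandardWriteFilter code; the class and method names below are made up):
{code:java}
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only -- not Tika's actual implementation.
// A write filter that truncates metadata once an estimated byte
// budget is exhausted (20 bytes, matching the config above) and
// records that the limit was reached.
public class WriteFilterSketch {
    private final int maxEstimatedBytes;
    private int estimatedBytes = 0;
    private boolean limitReached = false;
    private final Map<String, String> metadata = new LinkedHashMap<>();

    public WriteFilterSketch(int maxEstimatedBytes) {
        this.maxEstimatedBytes = maxEstimatedBytes;
    }

    public void set(String name, String value) {
        if (limitReached) {
            return; // budget already spent; drop further writes
        }
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        if (estimatedBytes + bytes.length > maxEstimatedBytes) {
            // keep only what fits in the remaining budget, flag the overflow
            int remaining = Math.max(0, maxEstimatedBytes - estimatedBytes);
            value = new String(bytes, 0, remaining, StandardCharsets.UTF_8);
            limitReached = true;
        }
        estimatedBytes += value.getBytes(StandardCharsets.UTF_8).length;
        metadata.put(name, value);
    }

    public String get(String name) {
        return metadata.get(name);
    }

    public boolean isLimitReached() {
        return limitReached;
    }

    public static void main(String[] args) {
        WriteFilterSketch f = new WriteFilterSketch(20);
        f.set("dc:description", "x".repeat(50_000));
        System.out.println(f.get("dc:description").length()); // 20
        System.out.println(f.isLimitReached());               // true
    }
}
{code}
With a 50k-char description and a 20-byte budget, I would expect the stored value to be truncated to 20 bytes and the limit flag to be set -- which is not what I observe below.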
Then I sent a docx whose description metadata contains more than 50k chars,
with the following command:
{code:java}
curl -H "writeLimit:1000000" -T bigmetadata.docx \
  http://localhost:9998/rmeta/text > tika-extract.json
{code}
In the tika-extract.json result, the dc:description metadata still contains all
50k chars, and I don't see the Tika exception flag "metadata_limit_reached".
What am I doing wrong?
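For what it's worth, this is how I'm checking the /rmeta output for the flag (a crude substring check; the exact JSON key Tika would emit is my assumption):
{code:java}
import java.nio.file.Files;
import java.nio.file.Paths;

// Crude check of the /rmeta JSON output for the limit flag.
// The key name searched for is my assumption of what Tika would emit.
public class CheckLimitFlag {
    static boolean limitReached(String rmetaJson) {
        return rmetaJson.contains("metadata_limit_reached");
    }

    public static void main(String[] args) throws Exception {
        String json = Files.readString(Paths.get("tika-extract.json"));
        System.out.println(limitReached(json));
    }
}
{code}
This prints false for the extract above.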
> LimitingMetadataFilter
> ----------------------
>
> Key: TIKA-3695
> URL: https://issues.apache.org/jira/browse/TIKA-3695
> Project: Tika
> Issue Type: New Feature
> Components: metadata
> Affects Versions: 1.28.1, 2.3.0
> Reporter: Julien Massiera
> Priority: Major
> Fix For: 2.3.1
>
>
> Some files may contain abnormally large metadata (several MB, whether in the
> metadata values, the metadata names, or the total amount of metadata), which
> can be problematic for memory consumption.
> It would be great to develop a new LimitingMetadataFilter so that metadata
> can be filtered out according to different byte limits (on metadata names,
> metadata values, and the total amount of metadata).
>
--
This message was sent by Atlassian Jira
(v8.20.1#820001)