[ 
https://issues.apache.org/jira/browse/TIKA-3695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17508215#comment-17508215
 ] 

Tim Allison edited comment on TIKA-3695 at 3/17/22, 3:01 PM:
-------------------------------------------------------------

Working branch: https://github.com/apache/tika/tree/TIKA-3695

Basic work is done here: 
https://github.com/apache/tika/blob/TIKA-3695/tika-core/src/main/java/org/apache/tika/metadata/StandardWriteFilter.java

Note that fields that are in ALWAYS_INCLUDE_FIELDS are always included and do 
not count towards the max size.

For an example of how to configure it, see:
https://github.com/apache/tika/blob/TIKA-3695/tika-core/src/test/resources/org/apache/tika/config/TIKA-3695.xml

You can also add something like this to write only the listed fields:
{noformat}
<includeFields><field>dc:title</field><field>dc:creator</field></includeFields>
{noformat}
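
To make the semantics above concrete, here is a rough sketch of how the three pieces (always-included fields, the includeFields allow-list, and a max size) could interact. This is not the code on the branch: the class name SketchWriteFilter, the UTF-8 name-plus-value byte accounting, and the drop-instead-of-truncate policy are all assumptions for illustration. See StandardWriteFilter for the real behavior.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical illustration only; NOT the StandardWriteFilter from the TIKA-3695 branch. */
class SketchWriteFilter {
    private final Set<String> alwaysInclude; // always stored, never counted toward the budget
    private final Set<String> includeFields; // allow-list; empty means "no allow-list"
    private final int maxTotalBytes;         // assumption: UTF-8 bytes of names plus values
    private final Map<String, String> stored = new LinkedHashMap<>();
    private int usedBytes = 0;

    SketchWriteFilter(Set<String> alwaysInclude, Set<String> includeFields, int maxTotalBytes) {
        this.alwaysInclude = alwaysInclude;
        this.includeFields = includeFields;
        this.maxTotalBytes = maxTotalBytes;
    }

    private static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Returns true if the field was stored, false if it was filtered out. */
    boolean set(String name, String value) {
        if (alwaysInclude.contains(name)) {
            stored.put(name, value); // bypasses both the allow-list and the budget
            return true;
        }
        if (!includeFields.isEmpty() && !includeFields.contains(name)) {
            return false; // not on the allow-list
        }
        // If the field is being overwritten, release the bytes of the old entry first.
        String previous = stored.get(name);
        int released = previous == null ? 0 : utf8Length(name) + utf8Length(previous);
        int cost = utf8Length(name) + utf8Length(value);
        if (usedBytes - released + cost > maxTotalBytes) {
            return false; // assumed policy: drop the write rather than truncate the value
        }
        stored.put(name, value);
        usedBytes = usedBytes - released + cost;
        return true;
    }

    Map<String, String> stored() {
        return stored;
    }

    int usedBytes() {
        return usedBytes;
    }

    public static void main(String[] args) {
        SketchWriteFilter filter = new SketchWriteFilter(
                Set.of("Content-Type"), Set.of("dc:title", "dc:creator"), 20);
        filter.set("Content-Type", "application/pdf"); // always included, costs nothing
        filter.set("dc:title", "Report");              // 8 + 6 = 14 bytes, fits in 20
        filter.set("dc:creator", "Ann");               // 10 + 3 = 13 bytes, would exceed 20
        filter.set("pdf:producer", "x");               // not on the allow-list
        System.out.println(filter.stored());
    }
}
```

In this sketch a write that would exceed the budget is dropped whole. Truncating the value to fit would be an equally plausible policy.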

Simple unit test: 
https://github.com/apache/tika/blob/TIKA-3695/tika-core/src/test/java/org/apache/tika/config/TikaConfigTest.java#L406



> LimitingMetadataFilter
> ----------------------
>
>                 Key: TIKA-3695
>                 URL: https://issues.apache.org/jira/browse/TIKA-3695
>             Project: Tika
>          Issue Type: New Feature
>          Components: metadata
>    Affects Versions: 1.28.1, 2.3.0
>            Reporter: Julien Massiera
>            Priority: Major
>
> Some files contain abnormally large metadata (several MB, whether in the 
> metadata values, the metadata names, or the total amount of metadata), 
> which can be a problem for memory consumption.
> It would be great to develop a new LimitingMetadataFilter so that we can 
> filter out metadata according to different byte limits (on metadata 
> names, metadata values, and the total amount of metadata).
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
