[ 
https://issues.apache.org/jira/browse/TIKA-3137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17160158#comment-17160158
 ] 

Hudson commented on TIKA-3137:
------------------------------

SUCCESS: Integrated in Jenkins build tika-branch-1x-jdk8 #349 (See 
[https://builds.apache.org/job/tika-branch-1x-jdk8/349/])
TIKA-3137 -- first pass, need to add unit tests for tika-batch (tallison: 
[https://github.com/apache/tika/commit/db4498d1de534f8348e94b0f27c641353a26b083])
* (add) tika-core/src/main/java/org/apache/tika/metadata/filter/ClearByMimeMetadataFilter.java
* (add) tika-core/src/test/resources/org/apache/tika/config/TIKA-3137-mimes-uc.xml
* (edit) tika-batch/src/test/java/org/apache/tika/batch/RecursiveParserWrapperFSConsumerTest.java
* (edit) tika-core/src/main/java/org/apache/tika/sax/RecursiveParserWrapperHandler.java
* (add) tika-core/src/test/resources/org/apache/tika/config/TIKA-3137-include-uc.xml
* (edit) tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java
* (add) tika-core/src/main/java/org/apache/tika/metadata/filter/CompositeMetadataFilter.java
* (edit) tika-core/src/main/java/org/apache/tika/config/TikaConfig.java
* (add) tika-core/src/main/java/org/apache/tika/metadata/filter/MetadataFilter.java
* (add) tika-core/src/main/java/org/apache/tika/metadata/filter/ExcludeFieldMetadataFilter.java
* (edit) tika-batch/src/main/java/org/apache/tika/batch/fs/builders/BasicTikaFSConsumersBuilder.java
* (add) tika-server/src/test/resources/org/apache/tika/server/TIKA-3137-include.xml
* (add) tika-core/src/main/java/org/apache/tika/metadata/filter/NoOpFilter.java
* (add) tika-parsers/src/test/resources/org/apache/tika/parser/TIKA-3137-include.xml
* (add) tika-core/src/main/java/org/apache/tika/metadata/filter/DefaultMetadataFilter.java
* (add) tika-server/src/test/java/org/apache/tika/server/RecursiveMetadataFilterTest.java
* (add) tika-core/src/test/java/org/apache/tika/metadata/filter/TestMetadataFilter.java
* (add) tika-core/src/main/resources/META-INF/services/org.apache.tika.metadata.filter.MetadataFilter
* (edit) tika-server/src/test/java/org/apache/tika/server/CXFTestBase.java
* (edit) tika-core/src/test/java/org/apache/tika/config/TikaConfigTest.java
* (add) tika-core/src/test/resources/org/apache/tika/config/TIKA-3137-include.xml
* (add) tika-core/src/main/java/org/apache/tika/metadata/filter/IncludeFieldMetadataFilter.java
* (add) tika-core/src/test/resources/org/apache/tika/config/TIKA-3137-exclude.xml
* (edit) tika-batch/src/main/java/org/apache/tika/batch/fs/RecursiveParserWrapperFSConsumer.java
* (edit) tika-parsers/src/test/java/org/apache/tika/parser/RecursiveParserWrapperTest.java
* (edit) tika-batch/src/main/java/org/apache/tika/batch/fs/StreamOutRPWFSConsumer.java
* (edit) tika-server/src/main/java/org/apache/tika/server/resource/RecursiveMetadataResource.java
* (add) tika-core/src/test/java/org/apache/tika/metadata/filter/MockUpperCaseFilter.java
TIKA-3137 add a list type for Param/configuration to avoid the (tallison: 
[https://github.com/apache/tika/commit/096a4ad7f6ca7098a513138f5fc6338858efe07f])
* (edit) tika-server/src/test/java/org/apache/tika/server/RecursiveMetadataFilterTest.java
* (edit) tika-server/src/test/resources/org/apache/tika/server/TIKA-3137-include.xml
* (edit) tika-parsers/src/test/resources/org/apache/tika/parser/TIKA-3137-include.xml


> Enable a metadata filter for the RecursiveParserWrapper
> -------------------------------------------------------
>
>                 Key: TIKA-3137
>                 URL: https://issues.apache.org/jira/browse/TIKA-3137
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Assignee: Tim Allison
>            Priority: Major
>             Fix For: 1.25
>
>
> The RecursiveParserWrapper is designed to extract all metadata from every 
> embedded file.  Some users may need more targeted ways of filtering the 
> metadata to save on resources, e.g. memory, disk, or transfer size/bandwidth 
> in tika-server.
> Some use cases that come to mind:
> * A user only wants the title, author and content fields.
> * A user doesn't want content from EMF files, but does want the content from 
> a PDF embedded inside an EMF file.
> * This could be an avenue for text-based enrichment, e.g. run NER on the 
> content field and add those recognized entities to the Metadata; or tika-eval 
> statistics...
> The last point may require further discussion.  We have some handlers that 
> buffer the full text of a document and then run extraction on it (the phone 
> number extractor?).  The downside is that we store two copies of the data in 
> memory.  For the RPW at least, it would be more efficient to do 
> postprocessing on the one buffered copy.
> Some open questions: How do we configure the choice of filter(s)? Do we apply 
> this to the AutoDetectParser? ...
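
The first two use cases in the description can be illustrated with a minimal, self-contained sketch. It is not the code from the commits above: the `FieldFilter` interface, the class names, the plain `Map<String, List<String>>` standing in for a metadata object, and the field keys (`title`, `author`, `content`, `Content-Type`) are all assumptions made for illustration. The idea it shows is that a recursive parse yields one metadata object per (embedded) document, so clearing the entry for an EMF container leaves the entry for a PDF embedded in it untouched.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical stand-in for a metadata filter; not Tika's actual interface. */
interface FieldFilter {
    /** Mutates the given field map in place. */
    void filter(Map<String, List<String>> metadata);
}

/** Use case 1: keep only the named fields, drop everything else. */
final class IncludeFields implements FieldFilter {
    private final Set<String> keep;

    IncludeFields(Set<String> keep) {
        this.keep = new HashSet<>(keep);
    }

    @Override
    public void filter(Map<String, List<String>> metadata) {
        // Removing keys from the key-set view removes the entries from the map.
        metadata.keySet().retainAll(keep);
    }
}

/** Use case 2: drop every field of a document whose content type is listed. */
final class ClearByMime implements FieldFilter {
    private final Set<String> mimes;

    ClearByMime(Set<String> mimes) {
        this.mimes = new HashSet<>(mimes);
    }

    @Override
    public void filter(Map<String, List<String>> metadata) {
        // "Content-Type" is an assumed key for this sketch.
        List<String> types =
                metadata.getOrDefault("Content-Type", Collections.<String>emptyList());
        for (String type : types) {
            if (mimes.contains(type)) {
                metadata.clear();
                return;
            }
        }
    }
}

/** Runs several filters in order on the same metadata object. */
final class FilterChain implements FieldFilter {
    private final List<FieldFilter> filters;

    FilterChain(List<FieldFilter> filters) {
        this.filters = filters;
    }

    @Override
    public void filter(Map<String, List<String>> metadata) {
        for (FieldFilter f : filters) {
            f.filter(metadata);
        }
    }
}

final class FilterDemo {
    private FilterDemo() {
    }

    /** Applies one filter to each per-document metadata map in the list. */
    static void applyToAll(FieldFilter filter, List<Map<String, List<String>>> docs) {
        for (Map<String, List<String>> doc : docs) {
            filter.filter(doc);
        }
    }
}
```

In this sketch the MIME check runs before the field whitelist, because the whitelist would otherwise remove the `Content-Type` key that the MIME check relies on. Whether the real filters are order-sensitive in the same way is not stated in the issue.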



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
