[ 
https://issues.apache.org/jira/browse/OODT-848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann resolved OODT-848.
------------------------------------
    Resolution: Won't Fix

likely worked around

> AutoDetectProductCrawler's mimeExtractorRepo argument overridden by Tika
> ------------------------------------------------------------------------
>
>                 Key: OODT-848
>                 URL: https://issues.apache.org/jira/browse/OODT-848
>             Project: OODT
>          Issue Type: Bug
>          Components: crawler, metadata container
>    Affects Versions: 0.8.1
>            Reporter: Rishi Verma
>            Assignee: Rishi Verma
>             Fix For: 1.1
>
>
> AutoDetectProductCrawler [1] is not able to leverage customized extractors 
> specified via the mimeExtractorRepo argument that use common file glob 
> patterns. In other words, if the user has a custom "mime-extractor-map.xml" 
> leveraging a custom "mime-types.xml" that maps specific glob patterns to 
> specific extractors, this mapping will be overridden by Tika's default glob 
> mappings if Tika finds a match internally. As a result, for many basic types 
> of files, such as text files, AutoDetectProductCrawler will identify the mime 
> type as "text/plain" no matter what mime type the user has specified within 
> their own mime-types.xml. This is a problem if one has multiple extractors 
> that need to filter for different types of text/plain files.
>
> I found this problem appeared when I updated from OODT 0.7 to 0.8.1, because 
> OODT 0.7 used Tika 0.8 and OODT 0.8.1 now uses Tika 1.7.
>
> Recreating the problem:
> 1. Make a custom extractor that handles a file of type text/plain
> 2. In your mime-extractor-map.xml, add a mime type for your custom extractor
> 3. In your mime-types.xml, add a glob pattern matching your file name pattern, 
> to the mime type in (2)
> 4. Run crawler_launcher using AutoDetectProductCrawler, and you'll find that 
> your text file will NOT match your extractor in OODT v0.8.1, i.e. OODT will 
> tell you:
> WARNING: No extractor specs specified for /your/text/file
>
> Tracing the flow of the problem:
> 1. AutoDetectProductCrawler calls "passesPreconditions" method
> 2. AutoDetectProductCrawler#passesPreconditions calls 
> MimeExtractorRepo#getExtractorSpecsForFile [2]
> 3. MimeExtractorRepo#getExtractorSpecsForFile calls MimeTypeUtils#getMimeType 
> [3]
> 4. MimeTypeUtils#getMimeType calls Tika#detect, where MimeTypeUtils's 
> constructor has loaded a Tika instance using DefaultDetector [4]
> 5. DefaultDetector#getDefaultDetectors [4] specifies that the user-provided 
> mime-types.xml file must take LAST precedence. Thus, Tika's default, internal 
> mime-type mappings will override mime-types.xml.
> --
> [1] 
> https://github.com/apache/oodt/blob/trunk/crawler/src/main/java/org/apache/oodt/cas/crawl/AutoDetectProductCrawler.java
> [2] 
> https://github.com/apache/oodt/blob/trunk/crawler/src/main/java/org/apache/oodt/cas/crawl/typedetection/MimeExtractorRepo.java
> [3] 
> https://github.com/apache/oodt/blob/trunk/metadata/src/main/java/org/apache/oodt/cas/metadata/util/MimeTypeUtils.java
> [4] 
> https://github.com/apache/tika/blob/trunk/tika-core/src/main/java/org/apache/tika/detect/DefaultDetector.java
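
For readers following the trace in the quoted description, the precedence 
effect in step 5 can be illustrated with a small stand-alone sketch. This is 
not OODT or Tika code and does not reproduce Tika's actual detection 
algorithm; it only models two glob tables consulted in order. The class name, 
the "obs_*.txt" glob, the "text/x-observation" mime type and the 
"ObservationExtractor" name are all hypothetical, chosen for illustration.

```java
import java.nio.file.FileSystems;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model of glob precedence; NOT the OODT or Tika implementation. */
public class GlobPrecedenceSketch {

    /** Stand-in for a library's built-in glob mappings (assumed content). */
    static Map<String, String> builtInGlobs() {
        Map<String, String> globs = new LinkedHashMap<>();
        globs.put("*.txt", "text/plain");
        return globs;
    }

    /** Stand-in for a user's custom mime-types.xml (hypothetical glob and type). */
    static Map<String, String> userGlobs() {
        Map<String, String> globs = new LinkedHashMap<>();
        globs.put("obs_*.txt", "text/x-observation");
        return globs;
    }

    /** Stand-in for mime-extractor-map.xml: mime type to extractor names. */
    static Map<String, List<String>> extractorMap() {
        Map<String, List<String>> map = new HashMap<>();
        map.put("text/x-observation", Arrays.asList("ObservationExtractor"));
        return map;
    }

    /** Consults each glob table in order; the first table with a match wins. */
    static String detect(String fileName, List<Map<String, String>> tablesInOrder) {
        for (Map<String, String> table : tablesInOrder) {
            for (Map.Entry<String, String> entry : table.entrySet()) {
                boolean matches = FileSystems.getDefault()
                        .getPathMatcher("glob:" + entry.getKey())
                        .matches(Paths.get(fileName));
                if (matches) {
                    return entry.getValue();
                }
            }
        }
        return "application/octet-stream"; // nothing matched
    }

    /** Returns the extractors registered for a mime type, or an empty list. */
    static List<String> extractorsFor(String mimeType) {
        return extractorMap().getOrDefault(mimeType, Collections.<String>emptyList());
    }

    public static void main(String[] args) {
        String file = "obs_2015.txt";

        // Built-in mappings first: the user's more specific glob is never reached.
        String defaultsFirst = detect(file, Arrays.asList(builtInGlobs(), userGlobs()));
        // User mappings first: the custom type is found before the generic one.
        String userFirst = detect(file, Arrays.asList(userGlobs(), builtInGlobs()));

        System.out.println(defaultsFirst + " -> " + extractorsFor(defaultsFirst));
        System.out.println(userFirst + " -> " + extractorsFor(userFirst));
    }
}
```

With the built-in table consulted first, the file resolves to text/plain and 
the extractor lookup comes back empty, which corresponds to the "No extractor 
specs specified" warning in the description. With the user table first, the 
custom type and its extractor are found.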



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
