Rishi Verma created OODT-848:
--------------------------------

             Summary: AutoDetectProductCrawler's mimeExtractorRepo argument 
overridden by Tika
                 Key: OODT-848
                 URL: https://issues.apache.org/jira/browse/OODT-848
             Project: OODT
          Issue Type: Bug
          Components: crawler, metadata container
    Affects Versions: 0.8.1
            Reporter: Rishi Verma
            Assignee: Rishi Verma
             Fix For: 0.9


AutoDetectProductCrawler [1] is not able to leverage customized extractors 
specified via the mimeExtractorRepo argument that use common file glob 
patterns. In other words, if the user has a custom "mime-extractor-map.xml" 
leveraging a custom "mime-types.xml" that maps specific glob patterns to 
specific extractors, this mapping will be overridden by Tika's default glob 
mappings if Tika finds a match internally. As a result, for many basic file 
types, such as text files, AutoDetectProductCrawler will identify the mime 
type as "text/plain" no matter what mime type the user has specified within 
their own mime-types.xml. This is a problem if one has multiple extractors 
that need to distinguish between different kinds of text/plain files.

I found that this problem appeared when I updated from OODT 0.7 to 0.8.1, 
because OODT 0.7 used Tika 0.8 and OODT 0.8.1 now uses Tika 1.7.

Recreating the problem:
1. Make a custom extractor that handles a file of type text/plain
2. In your mime-extractor-map.xml, add a mime type for your custom extractor
3. In your mime-types.xml, add a glob pattern matching your file name pattern 
and map it to the mime type in (2)
4. Run crawler_launcher using AutoDetectProductCrawler, and you'll find that 
your text file will NOT match your extractor in OODT v0.8.1
i.e. OODT will tell you:
WARNING: No extractor specs specified for /your/text/file
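
The effect of step 4 can be illustrated with a minimal sketch. This is a 
simplified model, not OODT's MimeExtractorRepo code: the class name, the 
"text/x-custom-report" mime type and the extractor class name are all 
hypothetical. It only shows why a lookup keyed on the detected mime type comes 
back empty once detection returns "text/plain" instead of the custom type.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified model only: NOT OODT's MimeExtractorRepo implementation.
class ExtractorLookupSketch {

    // Hypothetical stand-in for mime-extractor-map.xml:
    // mime type -> names of the extractors registered for it.
    private final Map<String, List<String>> extractorsByMimeType = new HashMap<>();

    void register(String mimeType, List<String> extractorClassNames) {
        extractorsByMimeType.put(mimeType, extractorClassNames);
    }

    // Returns the extractors registered for the detected type,
    // or an empty list when nothing is registered for it.
    List<String> extractorsFor(String detectedMimeType) {
        return extractorsByMimeType.getOrDefault(
                detectedMimeType, Collections.<String>emptyList());
    }

    public static void main(String[] args) {
        ExtractorLookupSketch repo = new ExtractorLookupSketch();
        // Hypothetical custom type and extractor (steps 1 and 2).
        repo.register("text/x-custom-report",
                Collections.singletonList("com.example.CustomReportExtractor"));

        // If detection yields the custom type, the extractor is found.
        System.out.println(repo.extractorsFor("text/x-custom-report"));

        // If detection yields text/plain instead, nothing matches, which
        // corresponds to the "No extractor specs specified" warning.
        if (repo.extractorsFor("text/plain").isEmpty()) {
            System.out.println("WARNING: No extractor specs specified for /your/text/file");
        }
    }
}
```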

Tracing the flow of the problem:
1. AutoDetectProductCrawler calls its "passesPreconditions" method
2. AutoDetectProductCrawler#passesPreconditions calls 
MimeExtractorRepo#getExtractorSpecsForFile [2]
3. MimeExtractorRepo#getExtractorSpecsForFile calls MimeTypeUtils#getMimeType 
[3]
4. MimeTypeUtils#getMimeType calls Tika#detect, where MimeTypeUtils's 
constructor has loaded a Tika instance using DefaultDetector [4]
5. DefaultDetector#getDefaultDetectors [4] specifies that the user-provided 
mime-types.xml file must take LAST precedence. Thus, Tika's default, internal 
mime-type mappings will override mime-types.xml.
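
The precedence issue in step 5 can be sketched with a toy model. This is not 
Tika's DefaultDetector or its combining logic: it assumes a simple "first 
matching table wins" rule, and the class name, glob patterns and the 
"text/x-custom-report" mime type are hypothetical. It only shows how the order 
in which glob tables are consulted decides whether a user mapping is ever 
reached.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: NOT Tika's DefaultDetector. Assumes first match wins.
class DetectorOrderSketch {

    // Hypothetical stand-in for the built-in glob mappings.
    static Map<String, String> builtInGlobs() {
        Map<String, String> globs = new LinkedHashMap<>();
        globs.put("*.txt", "text/plain");
        return globs;
    }

    // Hypothetical stand-in for a user-provided mime-types.xml.
    static Map<String, String> customGlobs() {
        Map<String, String> globs = new LinkedHashMap<>();
        globs.put("report_*.txt", "text/x-custom-report");
        return globs;
    }

    // Consults each glob table in order and returns the first match.
    static String detect(List<Map<String, String>> orderedTables, String fileName) {
        for (Map<String, String> table : orderedTables) {
            for (Map.Entry<String, String> entry : table.entrySet()) {
                PathMatcher matcher =
                        FileSystems.getDefault().getPathMatcher("glob:" + entry.getKey());
                if (matcher.matches(Paths.get(fileName))) {
                    return entry.getValue();
                }
            }
        }
        return "application/octet-stream"; // fallback when nothing matches
    }

    public static void main(String[] args) {
        String file = "report_2015.txt";
        // Built-in table first: the custom mapping is never reached.
        System.out.println(detect(Arrays.asList(builtInGlobs(), customGlobs()), file));
        // Custom table first: the user mapping wins.
        System.out.println(detect(Arrays.asList(customGlobs(), builtInGlobs()), file));
    }
}
```

In this model the first call prints text/plain and the second prints 
text/x-custom-report, which mirrors the behaviour described above when the 
user-provided mappings are consulted last.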

--
[1] 
https://github.com/apache/oodt/blob/trunk/crawler/src/main/java/org/apache/oodt/cas/crawl/AutoDetectProductCrawler.java
[2] 
https://github.com/apache/oodt/blob/trunk/crawler/src/main/java/org/apache/oodt/cas/crawl/typedetection/MimeExtractorRepo.java
[3] 
https://github.com/apache/oodt/blob/trunk/metadata/src/main/java/org/apache/oodt/cas/metadata/util/MimeTypeUtils.java
[4] 
https://github.com/apache/tika/blob/trunk/tika-core/src/main/java/org/apache/tika/detect/DefaultDetector.java



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
