[jira] [Created] (OODT-848) AutoDetectProductCrawler's mimeExtractorRepo argument overridden by Tika

Rishi Verma (JIRA) Mon, 18 May 2015 17:26:10 -0700

Rishi Verma created OODT-848:
--------------------------------

             Summary: AutoDetectProductCrawler's mimeExtractorRepo argument 
overridden by Tika
                 Key: OODT-848
                 URL: https://issues.apache.org/jira/browse/OODT-848
             Project: OODT
          Issue Type: Bug
          Components: crawler, metadata container
    Affects Versions: 0.8.1
            Reporter: Rishi Verma
            Assignee: Rishi Verma
             Fix For: 0.9

AutoDetectProductCrawler [1] is not able to leverage customized extractors
specified via the mimeExtractorRepo argument that use common file glob
patterns. In other words, if the user has a custom "mime-extractor-map.xml"
leveraging a custom "mime-types.xml" that maps specific glob patterns to
specific extractors, this mapping will be overridden by Tika's default glob
mappings if Tika finds a match internally. As a result, for many basic file
types, such as text files, AutoDetectProductCrawler will identify the mime
type as "text/plain" no matter what mime type the user has specified within
their own mime-types.xml. This is a problem if one has multiple extractors
that need to distinguish between different kinds of text/plain files.

This problem appeared when I updated from OODT 0.7 to 0.8.1: OODT 0.7 used
Tika 0.8, whereas OODT 0.8.1 uses Tika 1.7.

Recreating the problem:
1. Make a custom extractor that handles a file of type text/plain
2. In your mime-extractor-map.xml, add a mime type for your custom extractor
3. In your mime-types.xml, add a glob pattern matching your file name pattern,
and map it to the mime type in (2)
4. Run crawler_launcher using AutoDetectProductCrawler, and you'll find that
your text file will NOT match your extractor in OODT v0.8.1
i.e. OODT will tell you:
WARNING: No extractor specs specified for /your/text/file
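
The symptom in step 4 can be pictured with a minimal Java sketch. This is not
OODT code: the class name, the mime type "text/x-my-custom" and the extractor
name "MyCustomExtractor" are hypothetical, and the map is only a stand-in for
what mime-extractor-map.xml configures. It assumes extractors are looked up by
the detected mime type, so a file detected as "text/plain" finds nothing.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ExtractorLookupSketch {
    // Hypothetical stand-in for mime-extractor-map.xml: mime type -> extractor names.
    static final Map<String, List<String>> EXTRACTOR_MAP = new HashMap<>();
    static {
        EXTRACTOR_MAP.put("text/x-my-custom",
                Collections.singletonList("MyCustomExtractor"));
    }

    // Returns the extractors registered for the detected mime type (possibly none).
    static List<String> extractorsFor(String detectedMimeType) {
        List<String> specs = EXTRACTOR_MAP.get(detectedMimeType);
        return specs == null ? Collections.<String>emptyList() : specs;
    }

    // Mimics the message from step 4 when no extractor is registered.
    static String describe(String file, String detectedMimeType) {
        List<String> specs = extractorsFor(detectedMimeType);
        if (specs.isEmpty()) {
            return "WARNING: No extractor specs specified for " + file;
        }
        return "Using " + specs + " for " + file;
    }

    public static void main(String[] args) {
        // Expected by the user: the custom type is detected and the extractor is found.
        System.out.println(describe("/your/text/file", "text/x-my-custom"));
        // Observed in 0.8.1: the file is detected as text/plain, so the lookup is empty.
        System.out.println(describe("/your/text/file", "text/plain"));
    }
}
```

The second call prints the warning because nothing is registered under
"text/plain", even though the user's mime-types.xml was meant to assign the
custom type.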

Tracing the flow of the problem:
1. AutoDetectProductCrawler calls the "passesPreconditions" method
2. AutoDetectProductCrawler#passesPreconditions calls
MimeExtractorRepo#getExtractorSpecsForFile [2]
3. MimeExtractorRepo#getExtractorSpecsForFile calls MimeTypeUtils#getMimeType
[3]
4. MimeTypeUtils#getMimeType calls Tika#detect, where MimeTypeUtils's
constructor has loaded a Tika instance using DefaultDetector [4]
5. DefaultDetector#getDefaultDetectors [4] specifies that the user-provided
mime-types.xml file must take LAST precedence. Thus, Tika's default, internal
mime-type mappings will override mime-types.xml.
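
The effect of the ordering described in step 5 can be illustrated with a toy
model in Java. This is a deliberately simplified "first matching rule wins"
sketch and does not reproduce how Tika actually combines its detectors; the
GlobRule class, the "*.custom.txt" pattern and the "text/x-my-custom" type are
hypothetical. It only shows why a user rule that is consulted last never gets
a chance when a built-in "*.txt" rule also matches.

```java
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

class DetectorPrecedenceSketch {
    // A toy glob rule: file name pattern -> mime type. Not a Tika class.
    static final class GlobRule {
        final String glob;
        final String mimeType;

        GlobRule(String glob, String mimeType) {
            this.glob = glob;
            this.mimeType = mimeType;
        }

        boolean matches(Path fileName) {
            PathMatcher matcher =
                    FileSystems.getDefault().getPathMatcher("glob:" + glob);
            return matcher.matches(fileName);
        }
    }

    // Stand-in for one of Tika's built-in glob mappings.
    static final GlobRule BUILT_IN = new GlobRule("*.txt", "text/plain");
    // Hypothetical entry from a user's mime-types.xml.
    static final GlobRule USER = new GlobRule("*.custom.txt", "text/x-my-custom");

    // Simplified model: the first rule (in precedence order) that matches wins.
    static String detect(List<GlobRule> rulesInPrecedenceOrder, String file) {
        Path fileName = Paths.get(file).getFileName();
        for (GlobRule rule : rulesInPrecedenceOrder) {
            if (rule.matches(fileName)) {
                return rule.mimeType;
            }
        }
        return "application/octet-stream";
    }

    public static void main(String[] args) {
        String file = "/data/notes.custom.txt";
        // User rule last (the reported behaviour): prints text/plain
        System.out.println(detect(Arrays.asList(BUILT_IN, USER), file));
        // User rule first (the expected behaviour): prints text/x-my-custom
        System.out.println(detect(Arrays.asList(USER, BUILT_IN), file));
    }
}
```

Both rules match the same file name, so only the order decides which mime type
is reported.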

--
[1]
https://github.com/apache/oodt/blob/trunk/crawler/src/main/java/org/apache/oodt/cas/crawl/AutoDetectProductCrawler.java
[2]
https://github.com/apache/oodt/blob/trunk/crawler/src/main/java/org/apache/oodt/cas/crawl/typedetection/MimeExtractorRepo.java
[3]
https://github.com/apache/oodt/blob/trunk/metadata/src/main/java/org/apache/oodt/cas/metadata/util/MimeTypeUtils.java
[4]
https://github.com/apache/tika/blob/trunk/tika-core/src/main/java/org/apache/tika/detect/DefaultDetector.java

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
