[
https://issues.apache.org/jira/browse/OODT-848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Chris A. Mattmann resolved OODT-848.
------------------------------------
Resolution: Won't Fix
likely worked around
> AutoDetectProductCrawler's mimeExtractorRepo argument overridden by Tika
> ------------------------------------------------------------------------
>
> Key: OODT-848
> URL: https://issues.apache.org/jira/browse/OODT-848
> Project: OODT
> Issue Type: Bug
> Components: crawler, metadata container
> Affects Versions: 0.8.1
> Reporter: Rishi Verma
> Assignee: Rishi Verma
> Fix For: 1.1
>
>
> AutoDetectProductCrawler [1] is not able to leverage customized extractors
> specified via the mimeExtractorRepo argument that use common file glob
> patterns. In other words, if the user has a custom "mime-extractor-map.xml"
> leveraging a custom "mime-types.xml" that maps specific glob patterns to
> specific extractors, this mapping will be overridden by Tika's default glob
> mappings if Tika finds a match internally. As a result, for many basic
> types of files, such as text files, AutoDetectProductCrawler will identify
> the mime type as "text/plain" no matter what mime type the user has
> specified within their own mime-types.xml. This is a problem if one has
> multiple extractors which need to filter for different types of text/plain
> files.
> I found this problem appeared when I updated from OODT 0.7 to 0.8.1, because
> OODT 0.7 used Tika 0.8 and 0.8.1 now uses Tika 1.7.
> Recreating the problem:
> 1. Make a custom extractor that handles a file of type text/plain
> 2. In your mime-extractor-map.xml, add a mime type for your custom extractor
> 3. In your mime-types.xml, add a glob pattern that matches your file name
> pattern and maps it to the mime type in (2)
> 4. Run crawler_launcher using AutoDetectProductCrawler, and you'll find that
> your text file will NOT match your extractor in OODT v0.8.1
> i.e. OODT will tell you:
> WARNING: No extractor specs specified for /your/text/file
> Tracing the flow of the problem:
> 1. AutoDetectProductCrawler calls "passesPreconditions" method
> 2. AutoDetectProductCrawler#passesPreconditions calls
> MimeExtractorRepo#getExtractorSpecsForFile [2]
> 3. MimeExtractorRepo#getExtractorSpecsForFile calls MimeTypeUtils#getMimeType
> [3]
> 4. MimeTypeUtils#getMimeType calls Tika#detect, where MimeTypeUtils's
> constructor has loaded a Tika instance using DefaultDetector [4]
> 5. DefaultDetector#getDefaultDetectors [4] specifies that the user-provided
> mime-types.xml file must take LAST precedence. Thus, Tika's default, internal
> mime-type mappings will override mime-types.xml.
> --
> [1]
> https://github.com/apache/oodt/blob/trunk/crawler/src/main/java/org/apache/oodt/cas/crawl/AutoDetectProductCrawler.java
> [2]
> https://github.com/apache/oodt/blob/trunk/crawler/src/main/java/org/apache/oodt/cas/crawl/typedetection/MimeExtractorRepo.java
> [3]
> https://github.com/apache/oodt/blob/trunk/metadata/src/main/java/org/apache/oodt/cas/metadata/util/MimeTypeUtils.java
> [4]
> https://github.com/apache/tika/blob/trunk/tika-core/src/main/java/org/apache/tika/detect/DefaultDetector.java
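The precedence problem described in steps 4 and 5 of the trace can be illustrated with a small plain-Java model. This is only a sketch under stated assumptions: it does not use Tika or OODT classes, the names GlobPrecedenceSketch and GlobTable and the mime type "application/x-custom-report" are hypothetical, globs are reduced to file name suffixes, and "first match wins" is a simplification of the behaviour the report describes rather than Tika's actual detection algorithm.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

public class GlobPrecedenceSketch {

    /** Hypothetical stand-in for one source of glob-to-mime-type mappings. */
    static final class GlobTable {
        private final Map<String, String> suffixToType = new LinkedHashMap<>();

        /** Registers a suffix (a simplified glob such as "*.txt") for a mime type. */
        GlobTable map(String suffix, String mimeType) {
            suffixToType.put(suffix, mimeType);
            return this;
        }

        /** Returns the mime type of the first registered suffix the name ends with. */
        Optional<String> detect(String fileName) {
            for (Map.Entry<String, String> entry : suffixToType.entrySet()) {
                if (fileName.endsWith(entry.getKey())) {
                    return Optional.of(entry.getValue());
                }
            }
            return Optional.empty();
        }
    }

    /** Consults the tables in order; the first table that matches decides the type. */
    static String detect(List<GlobTable> tablesInPrecedenceOrder, String fileName) {
        for (GlobTable table : tablesInPrecedenceOrder) {
            Optional<String> type = table.detect(fileName);
            if (type.isPresent()) {
                return type.get();
            }
        }
        return "application/octet-stream"; // generic fallback when nothing matches
    }

    public static void main(String[] args) {
        // Assumption: the built-in mappings know only the generic "*.txt" glob.
        GlobTable builtIn = new GlobTable().map(".txt", "text/plain");
        // Assumption: the user's mime-types.xml maps a narrower file name pattern.
        GlobTable user = new GlobTable().map("_report.txt", "application/x-custom-report");

        // User table consulted last: the built-in glob matches first.
        System.out.println(detect(Arrays.asList(builtIn, user), "run1_report.txt"));
        // prints "text/plain"

        // User table consulted first: the custom mapping is honoured.
        System.out.println(detect(Arrays.asList(user, builtIn), "run1_report.txt"));
        // prints "application/x-custom-report"
    }
}
```

In this model the only difference between the two calls is the order of the tables, which mirrors the report's observation that the user-provided mime-types.xml is consulted last and therefore never gets a chance to claim a file the built-in mappings already recognise.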