[ https://issues.apache.org/jira/browse/TIKA-522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12919019#action_12919019 ]
Dennis Adler commented on TIKA-522:
-----------------------------------
We still see this reproduce intermittently. My best guess is a timing or
class-loading issue, but I have no way to confirm that. We do have a workaround
in the meantime. The following code replaces the IF at line 557 in MimeTypes.java:
// UPDATE: To work around the type.name.startsWith("audio") problem with
// HTML files, what we'll do here instead is ACCEPT THE HINT from the name
// (which is based on the extension) in lieu of the auto-detected type.
// May cause problems downstream in detectors, which will be handled on
// the CATCH we put around our parser calls. Thus we allow the extension
// to override the readMagicHeader-derived type if the extension results
// in a non-default MIME type.
//if (hint.isDescendantOf(type)) {
if (hint != null && hint.getName() != null &&
        hint.getName().length() > 0 && hint != rootMimeType) {
    type = hint;
}
Of course, the Maven unit tests now fail (until I get around to disabling the
MimeType tests aimed at catching the magic header / extension mismatch issue).
If I learn anything more about how or why this happens, I will post that info.
I'll also monitor this issue to see if anyone else can shed some light on the
problem.
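
For anyone who wants to reason about the override rule without building Tika,
here is a minimal standalone sketch of the same decision. It is not the
MimeTypes.java code: the class HintOverrideSketch, the method chooseType, and
the use of plain strings for MIME types are all hypothetical, and
"application/octet-stream" is assumed as a stand-in for rootMimeType.

```java
// Hypothetical, Tika-free sketch of the override rule in the workaround above.
final class HintOverrideSketch {
    // Assumed stand-in for the root (default) MIME type.
    static final String ROOT_TYPE = "application/octet-stream";

    /**
     * Returns the extension-derived hint when it is present, non-empty and
     * not the root type; otherwise keeps the magic-detected type.
     */
    static String chooseType(String magicType, String extensionHint) {
        if (extensionHint != null
                && !extensionHint.isEmpty()
                && !ROOT_TYPE.equals(extensionHint)) {
            return extensionHint;
        }
        return magicType;
    }

    public static void main(String[] args) {
        // Magic detection misfires, but the .htm extension hint wins.
        System.out.println(chooseType("audio/mpeg", "text/html"));
        // No useful hint: the magic-detected type is kept.
        System.out.println(chooseType("audio/mpeg", ROOT_TYPE));
    }
}
```

The trade-off is that a misnamed file is handed to the parser its extension
suggests, which is why the catch around the parser calls matters.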
> AutoDetectParser treats HTML/XML files as Audio
> -----------------------------------------------
>
> Key: TIKA-522
> URL: https://issues.apache.org/jira/browse/TIKA-522
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 0.7
> Environment: Windows 7 x64, java v6.0.170.4, jdk1.6.0_21, Eclipse
> 20100617-1415
> Reporter: Dennis Adler
> Assignee: Ken Krugler
> Attachments: Tika MimeTypes bug repro case.htm
>
>
> I am crawling an SMB share. I've used the steps outlined in Tika samples to
> initialize; given a File object in f, my code is:
> parser = new AutoDetectParser();
> context.set(Parser.class, parser);
> // Get the URL
> URL url = f.toURI().toURL();
> // Extract Metadata
> Metadata metadata = new Metadata();
> // -1 = infinite size for XML string buffer (per file)
> BodyContentHandler handler = new BodyContentHandler(-1);
> // Get the input stream
> InputStream input = MetadataHelper.getInputStream(url, metadata);
> // Parse the document
> parser.parse(input, handler, metadata, context);
> If I place a breakpoint right after the parser.parse invoke, I find the
> metadata identifying my input as an audio file. If I step through the parse
> in the debugger, it is correctly tagged as text/html. Seems like a
> timing-related problem.
> I have a half-baked workaround: I invoke Thread.sleep(5000) just after the
> context.set invoke... in 3 sequential test runs that works fine. Problem is,
> this was working fine several days ago without that (perhaps my computer was
> busy with other things and the timing issue did not pop up then).
> I have downloaded and am building today's 0.8 from svn to see if that helps,
> though I am concerned about the impacts to the rest of my testing if I have
> to switch to 0.8. Just understanding what was going on would be a huge help :)
> * UPDATE * I was able to repro this once under the debugger. MimeTypes.detect
> invokes org.apache.tika.mime.MimeTypes.getMimeType on the input stream to
> determine the Mime Type based on the first 8k of data. I did not trace into
> getMimeType, but did see it return "audio/mpeg" on an HTML file one time, and
> "text/html" most others. I can supply the HTML file if desired.
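
Since the wrong type shows up only occasionally, one way to quantify it is to
run detection many times on the same bytes and count the distinct results.
The following is a rough sketch only, assuming Java 8 or later: DetectionTally
and tally are hypothetical names, and the detector is passed in as a plain
function so that the real Tika detection call can be wrapped by the caller.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

// Hypothetical harness: none of these names come from Tika.
final class DetectionTally {
    /** Calls the detector `runs` times on the same bytes and counts each distinct result. */
    static Map<String, Integer> tally(Function<byte[], String> detector,
                                      byte[] data, int runs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (int i = 0; i < runs; i++) {
            counts.merge(detector.apply(data), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        byte[] html = "<html><body>hi</body></html>"
                .getBytes(StandardCharsets.UTF_8);
        // Stand-in detector; a real run would wrap the Tika detection call here.
        Function<byte[], String> detector = bytes -> "text/html";
        System.out.println(tally(detector, html, 100));
    }
}
```

More than one key in the resulting map for identical input would confirm that
detection is not deterministic. If timing is the suspect, each run may need a
freshly constructed detector so that initialization is exercised again.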