AutoDetectParser treats HTML/XML files as Audio
-----------------------------------------------
Key: TIKA-522
URL: https://issues.apache.org/jira/browse/TIKA-522
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 0.7
Environment: Windows 7 x64, java v6.0.170.4, jdk1.6.0_21, Eclipse
20100617-1415
Reporter: Dennis Adler
I am crawling an SMB share. I've followed the steps outlined in the Tika
samples to initialize the parser; given a File object f, my code is:
parser = new AutoDetectParser();
context.set(Parser.class, parser);
// Get the URL
URL url = f.toURI().toURL();
// Extract Metadata
Metadata metadata = new Metadata();
// -1 = infinite size for XML string buffer (per file)
BodyContentHandler handler = new BodyContentHandler(-1);
// Get the input stream
InputStream input = MetadataHelper.getInputStream(url, metadata);
// Parse the document
parser.parse(input, handler, metadata, context);
If I place a breakpoint right after the parser.parse call, the metadata
identifies my input as an audio file. If I step through the parse in the
debugger, it correctly tags the input as text/html. This looks like a
timing-related problem.
I have a half-baked workaround: I call Thread.sleep(5000) just after the
context.set call, and in 3 sequential test runs that worked fine. The problem
is that this was working fine several days ago without the sleep (perhaps my
computer was busy with other things and the timing issue did not show up then).
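One way to test the timing hypothesis is to repeat the detection step many
times and tally the content types that come back. The following is only a
minimal sketch using the JDK alone: the class name DetectionTally and the
detectOnce callback are hypothetical and are not part of Tika. The callback
stands in for opening a fresh input stream, calling parser.parse, and
returning the Content-Type entry from the metadata.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.Callable;

public class DetectionTally {

    /**
     * Calls the given detection step `runs` times and counts how often each
     * result (for example a Content-Type string) was returned.
     */
    public static Map<String, Integer> tally(Callable<String> detectOnce, int runs)
            throws Exception {
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        for (int i = 0; i < runs; i++) {
            // A null result is recorded under the key "null".
            String type = String.valueOf(detectOnce.call());
            Integer previous = counts.get(type);
            counts.put(type, previous == null ? 1 : previous + 1);
        }
        return counts;
    }

    public static void main(String[] args) throws Exception {
        // Placeholder callback (assumption): replace the body with the real
        // parse call and return the detected content type from the metadata.
        Callable<String> detectOnce = new Callable<String>() {
            public String call() {
                return "text/html";
            }
        };
        System.out.println(tally(detectOnce, 5));
    }
}
```

If the resulting map holds more than one key for the same file, the detected
type really does vary between runs. If it always holds a single key, the
difference between the normal run and the debugger run may lie in how the
input is set up and not in timing.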
I have downloaded and am building today's 0.8 from svn to see if that helps,
though I am concerned about the impact on the rest of my testing if I have to
switch to 0.8. Just understanding what is going on would be a huge help :)