[
https://issues.apache.org/jira/browse/TIKA-881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ken Krugler reassigned TIKA-881:
--------------------------------
Assignee: (was: Ken Krugler)
This looks like an InputStream issue rather than a problem in HtmlParser itself.
Input streams should get "wrapped" by Tika so that a reset() will always work.
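One way to picture such a wrapper, as a minimal sketch only: copy a bounded prefix of the stream into a byte array, sniff the encoding from that copy, then replay the prefix in front of the remaining stream. The class PrefixPeek and its methods readPrefix and rejoin are hypothetical names used for illustration; this is not Tika's actual implementation.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.util.Arrays;

// Hypothetical helper, not part of Tika.
class PrefixPeek {
    /** Reads up to limit bytes from in, stopping early only at end of stream. */
    static byte[] readPrefix(InputStream in, int limit) throws IOException {
        byte[] buf = new byte[limit];
        int total = 0;
        while (total < limit) {
            int n = in.read(buf, total, limit - total);
            if (n < 0) {
                break; // end of stream before the limit was reached
            }
            total += n;
        }
        return Arrays.copyOf(buf, total);
    }

    /** Returns a stream that replays prefix and then continues with rest. */
    static InputStream rejoin(byte[] prefix, InputStream rest) {
        return new SequenceInputStream(new ByteArrayInputStream(prefix), rest);
    }
}
```

A caller would decode only the prefix array while sniffing (any read-ahead then stays inside the copy) and hand rejoin(prefix, in) to the real parse, so no mark()/reset() on the original stream is needed.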
> HtmlParser sometimes(!) throws IOException while determining Html-Encoding
> --------------------------------------------------------------------------
>
> Key: TIKA-881
> URL: https://issues.apache.org/jira/browse/TIKA-881
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.0
> Environment: Windows7, JDK1.5, JDK1.6
> Reporter: Klaus v. Einem
> Labels: stability
> Attachments: BugfixHtmlParser.java, HtmlParser.java
>
> Original Estimate: 0.5h
> Remaining Estimate: 0.5h
>
> Attention, this is an ugly one: a nondeterministic bug. It fails only about
> 1 time out of 10.
> java.io.IOException: Resetting to invalid mark
> at java.io.BufferedInputStream.reset(Unknown Source)
> at org.apache.tika.io.ProxyInputStream.reset(ProxyInputStream.java:168)
> at org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:92)
> at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:188)
> In the getEncoding() method, the current read position is marked so that the
> input stream can be re-read, and a readlimit is given (the maximum number of
> bytes that may be read before the mark position becomes invalid).
> So far so good, but then an InputStreamReader comes into play. When you check
> the API-Doc you see this:
> * ...
> * To enable the efficient conversion of bytes to characters, more bytes may
> * be read ahead from the underlying stream than are necessary to satisfy the
> * current read operation.
> * ...
> Please note the word "may": when the reader does read ahead, the number of
> bytes read can exceed the readlimit, the mark position is invalidated, and the
> following reset() on the stream throws the exception.
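The mechanism described in the report can be reproduced without HtmlParser. The sketch below is an illustration under stated assumptions, not code from Tika: the class MarkResetDemo and the method resetSurvives are hypothetical names, explicit read() calls stand in for the decoder's read-ahead, and the 16-byte buffer and readlimit of 8 are chosen only to keep the numbers small.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical demo class, not part of Tika.
class MarkResetDemo {
    /**
     * Marks a buffered stream with a readlimit of 8 bytes, consumes
     * bytesConsumed bytes, and reports whether reset() still succeeds.
     */
    static boolean resetSurvives(int bytesConsumed) throws IOException {
        byte[] data = new byte[256];
        // Small internal buffer (16 bytes) so the effect shows up quickly.
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(data), 16);
        in.mark(8);
        for (int i = 0; i < bytesConsumed; i++) {
            in.read(); // stands in for bytes pulled by a reader's read-ahead
        }
        try {
            in.reset();
            return true;
        } catch (IOException e) {
            return false; // "Resetting to invalid mark"
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(resetSurvives(8));  // within the readlimit
        System.out.println(resetSurvives(64)); // far beyond readlimit and buffer
    }
}
```

Staying within the readlimit keeps the mark valid, while reading far past it makes reset() fail. In between, BufferedInputStream only drops the mark once it has to refill its internal buffer, which is one plausible reason the failure shows up only some of the time.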