Anthony Pessy created ANY23-441:
-----------------------------------
Summary: TikaEncodingDetector: guessEncoding may throw an
ArrayIndexOutOfBoundsException
Key: ANY23-441
URL: https://issues.apache.org/jira/browse/ANY23-441
Project: Apache Any23
Issue Type: Bug
Components: encoding
Affects Versions: 2.3
Reporter: Anthony Pessy
Using `TikaEncodingDetector.guessEncoding` may result in an
`ArrayIndexOutOfBoundsException`.
The following snippet:
{noformat}
String encoding = new TikaEncodingDetector().guessEncoding(
        new URL("https://www.streetpadel.com/overgrip-head-pro-grip-dz-negro-p-17233.html").openStream());
System.out.println(encoding);
{noformat}
Will result in the following exception:
{noformat}
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: Index -1 out of bounds for length 32768
    at org.jsoup.parser.CharacterReader.consume(CharacterReader.java:100)
    at org.jsoup.parser.TokeniserState$34.read(TokeniserState.java:556)
    at org.jsoup.parser.Tokeniser.read(Tokeniser.java:57)
    at org.jsoup.parser.TreeBuilder.runParser(TreeBuilder.java:64)
    at org.jsoup.parser.HtmlTreeBuilder.parseFragment(HtmlTreeBuilder.java:126)
    at org.jsoup.parser.Parser.parseFragment(Parser.java:140)
    at org.apache.any23.encoding.TikaEncodingDetector.parseFragment(TikaEncodingDetector.java:184)
    at org.apache.any23.encoding.TikaEncodingDetector.guessEncoding(TikaEncodingDetector.java:95)
    at org.apache.any23.encoding.TikaEncodingDetector.guessEncoding(TikaEncodingDetector.java:159)
    at org.apache.any23.encoding.TikaEncodingDetector.guessEncoding(TikaEncodingDetector.java:58)
{noformat}
Whereas the expected result is `ISO-8859-15`.
I think the issue may be caused by the block of HTML at the bottom of the page,
after the closing `</html>` tag.
Maybe `parseBodyFragment` is not happy because of this? In truth, I have had a
hard time finding the smallest reproduction case.
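One way to test that hypothesis and narrow down a reproduction case is to split the page at the last closing `</html>` tag and run the detector on each part. The sketch below uses only the standard library; `TrailingHtml` and its method names are hypothetical helpers for illustration, not part of Any23 or jsoup.

```java
/** Hypothetical helper for shrinking a reproduction case; not part of Any23 or jsoup. */
final class TrailingHtml {

    private static final String CLOSING_TAG = "</html>";

    private TrailingHtml() {
    }

    /** Returns the index of the last closing html tag (case-insensitive), or -1 if absent. */
    private static int lastClosingTag(String page) {
        for (int i = page.length() - CLOSING_TAG.length(); i >= 0; i--) {
            if (page.regionMatches(true, i, CLOSING_TAG, 0, CLOSING_TAG.length())) {
                return i;
            }
        }
        return -1;
    }

    /** Returns everything that follows the last closing html tag, or "" if there is no such tag. */
    static String afterClosingHtml(String page) {
        int idx = lastClosingTag(page);
        return idx < 0 ? "" : page.substring(idx + CLOSING_TAG.length());
    }

    /** Returns the page cut right after the last closing html tag (unchanged if there is no such tag). */
    static String withoutTrailingContent(String page) {
        int idx = lastClosingTag(page);
        return idx < 0 ? page : page.substring(0, idx + CLOSING_TAG.length());
    }
}
```

If `guessEncoding` succeeds on the output of `withoutTrailingContent` but still fails on the full page, the trailing markup returned by `afterClosingHtml` would be a reasonable starting point for a smaller test case.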
Anyhow, replacing the `TikaEncodingDetector.parseFragment` method:
{code:java}
private static Document parseFragment(String html, ParseErrorList errors) {
    Document doc = new Document("");
    // Parse the input as a fragment with no context element.
    Node[] childNodes = Parser.parseFragment(html, null, "", errors).toArray(EMPTY_NODES);
    for (Node node : childNodes) {
        // Detach each node from its current parent before moving it under the new document.
        if (node.parentNode() != null) {
            node.remove();
        }
        doc.appendChild(node);
    }
    return doc;
}
{code}
with:
{code:java}
private static Document parseFragment(String html, ParseErrorList errors) {
    HtmlTreeBuilder treeBuilder = new HtmlTreeBuilder();
    Parser parser = new Parser(treeBuilder);
    // Parse the input as a complete document instead of a fragment.
    Document doc = parser.parseInput(html, "");
    // Copy the parser's errors into the caller's list.
    errors.addAll(parser.getErrors());
    return doc;
}
{code}
fixes this issue, but I'm not sure whether I'm introducing new ones. I don't
know why this method was implemented like this in the first place.
Note: I'm not sure whether this issue belongs to jsoup or Any23, but since using
`Parser.parseInput` instead of `Parser.parseFragment` works, I created the issue
here.
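Until the root cause is settled, a caller hitting this exception could guard the detection call and fall back to a default charset. The following is only a minimal sketch under assumptions: `SafeEncodingGuess` and `EncodingGuess` are hypothetical names, the detector call is assumed to throw at most `IOException` as a checked exception, and the fallback value is whatever the caller chooses. It works around the symptom and does not fix the parser problem.

```java
import java.io.IOException;

/** Hypothetical caller-side guard; not part of Any23. */
final class SafeEncodingGuess {

    /** Stand-in for a detector call such as guessEncoding on an already opened stream. */
    @FunctionalInterface
    interface EncodingGuess {
        String guess() throws IOException;
    }

    private SafeEncodingGuess() {
    }

    /** Returns the detected encoding, or the fallback if detection fails or yields null. */
    static String guessOrDefault(EncodingGuess detector, String fallback) {
        try {
            String encoding = detector.guess();
            return encoding != null ? encoding : fallback;
        } catch (IOException | RuntimeException e) {
            // RuntimeException covers the ArrayIndexOutOfBoundsException from this report.
            return fallback;
        }
    }
}
```

A call might look like `SafeEncodingGuess.guessOrDefault(() -> detector.guessEncoding(stream), "UTF-8")`, where `detector` and `stream` are the caller's own objects.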