[ https://issues.apache.org/jira/browse/ANY23-441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Anthony Pessy updated ANY23-441:
--------------------------------
    Description:
Using `TikaEncodingDetector.guessEncoding` may result in an `ArrayIndexOutOfBoundsException`.

The following snippet:
{noformat}
String encoding = new TikaEncodingDetector().guessEncoding(new URL("https://www.streetpadel.com/overgrip-head-pro-grip-dz-negro-p-17233.html").openStream());
System.out.println(encoding);
{noformat}
Will result in the following exception:
{noformat}
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: Index -1 out of bounds for length 32768
    at org.jsoup.parser.CharacterReader.consume(CharacterReader.java:100)
    at org.jsoup.parser.TokeniserState$34.read(TokeniserState.java:556)
    at org.jsoup.parser.Tokeniser.read(Tokeniser.java:57)
    at org.jsoup.parser.TreeBuilder.runParser(TreeBuilder.java:64)
    at org.jsoup.parser.HtmlTreeBuilder.parseFragment(HtmlTreeBuilder.java:126)
    at org.jsoup.parser.Parser.parseFragment(Parser.java:140)
    at org.apache.any23.encoding.TikaEncodingDetector.parseFragment(TikaEncodingDetector.java:184)
    at org.apache.any23.encoding.TikaEncodingDetector.guessEncoding(TikaEncodingDetector.java:95)
    at org.apache.any23.encoding.TikaEncodingDetector.guessEncoding(TikaEncodingDetector.java:159)
    at org.apache.any23.encoding.TikaEncodingDetector.guessEncoding(TikaEncodingDetector.java:58)
{noformat}
Whereas the expected result is `ISO-8859-15`.

Note the bunch of HTML at the bottom of the page after the `</html>` tag.

Replacing:
{code:java}
ParseErrorList htmlErrors = ParseErrorList.tracking(Integer.MAX_VALUE);
{code}
By:
{code:java}
ParseErrorList htmlErrors = ParseErrorList.tracking(100);
{code}
Will fix the issue. Not quite sure why, maybe at one point the errors are too far and the reader cannot reset far enough...

was:
Using `TikaEncodingDetector.guessEncoding` may result in an `ArrayIndexOutOfBoundsException`.

The following snippet:
{noformat}
String encoding = new TikaEncodingDetector().guessEncoding(new URL("https://www.streetpadel.com/overgrip-head-pro-grip-dz-negro-p-17233.html").openStream());
System.out.println(encoding);
{noformat}
Will result in the following exception:
{noformat}
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: Index -1 out of bounds for length 32768
    at org.jsoup.parser.CharacterReader.consume(CharacterReader.java:100)
    at org.jsoup.parser.TokeniserState$34.read(TokeniserState.java:556)
    at org.jsoup.parser.Tokeniser.read(Tokeniser.java:57)
    at org.jsoup.parser.TreeBuilder.runParser(TreeBuilder.java:64)
    at org.jsoup.parser.HtmlTreeBuilder.parseFragment(HtmlTreeBuilder.java:126)
    at org.jsoup.parser.Parser.parseFragment(Parser.java:140)
    at org.apache.any23.encoding.TikaEncodingDetector.parseFragment(TikaEncodingDetector.java:184)
    at org.apache.any23.encoding.TikaEncodingDetector.guessEncoding(TikaEncodingDetector.java:95)
    at org.apache.any23.encoding.TikaEncodingDetector.guessEncoding(TikaEncodingDetector.java:159)
    at org.apache.any23.encoding.TikaEncodingDetector.guessEncoding(TikaEncodingDetector.java:58)
{noformat}
Whereas the expected result is `ISO-8859-15`.

I think the issue may be caused by the bunch of HTML at the bottom of the page after the `</html>` tag.

Maybe `parseBodyFragment` is not happy because of this?

In truth I have a hard time finding the smallest reproduction case.
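
Until the root cause is pinned down, a caller could guard the detection call so that an unexpected runtime exception degrades to a default encoding instead of aborting. The following is a minimal sketch under stated assumptions, not part of Any23: `SafeEncodingGuess` and `guessOrDefault` are hypothetical names, the `Callable` merely stands in for a call such as `guessEncoding(...)`, and the `UTF-8` fallback is an arbitrary choice.

```java
import java.util.concurrent.Callable;

public class SafeEncodingGuess {

    /**
     * Runs the supplied detector (hypothetical stand-in for an encoding
     * detection call) and returns the fallback if it throws or yields null.
     */
    public static String guessOrDefault(Callable<String> detector, String fallback) {
        try {
            String encoding = detector.call();
            return encoding != null ? encoding : fallback;
        } catch (Exception e) {
            // Covers checked exceptions (e.g. IOException) as well as runtime
            // ones such as ArrayIndexOutOfBoundsException.
            return fallback;
        }
    }

    public static void main(String[] args) {
        // A detector that succeeds.
        System.out.println(guessOrDefault(() -> "ISO-8859-15", "UTF-8")); // prints ISO-8859-15

        // A detector that fails the way the reported stack trace does.
        System.out.println(guessOrDefault(() -> {
            throw new ArrayIndexOutOfBoundsException("Index -1 out of bounds for length 32768");
        }, "UTF-8")); // prints UTF-8
    }
}
```

Swallowing the exception hides the underlying bug, so a real caller would probably want to log it before falling back.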

Anyhow, replacing the `TikaEncodingDetector.parseFragment` method:
{code:java}
private static Document parseFragment(String html, ParseErrorList errors) {
    Document doc = new Document("");
    Node[] childNodes = Parser.parseFragment(html, null, "", errors).toArray(EMPTY_NODES);
    for (Node node : childNodes) {
        if (node.parentNode() != null) {
            node.remove();
        }
        doc.appendChild(node);
    }
    return doc;
}
{code}
By:
{code:java}
private static Document parseFragment(String html, ParseErrorList errors) {
    HtmlTreeBuilder treeBuilder = new HtmlTreeBuilder();
    Parser parser = new Parser(treeBuilder);
    Document doc = parser.parseInput(html, "");
    errors.addAll(parser.getErrors());
    return doc;
}
{code}
Corrects this issue, but I'm not sure whether I'm introducing new ones. I don't know why this method was implemented like this in the first place.

Note: Not quite sure if this issue belongs to JSoup or any23, but since using `Parser.parseInput` instead of `Parser.parseFragment` works, I created the issue here.

> TikaEncodingDetector: guessEncoding may throws an ArrayIndexOutOfBoundsException
> --------------------------------------------------------------------------------
>
>                 Key: ANY23-441
>                 URL: https://issues.apache.org/jira/browse/ANY23-441
>             Project: Apache Any23
>          Issue Type: Bug
>          Components: encoding
>    Affects Versions: 2.3
>            Reporter: Anthony Pessy
>            Priority: Major
>
> Using `TikaEncodingDetector.guessEncoding` may result in an `ArrayIndexOutOfBoundsException`.
>
> The following snippet:
> {noformat}
> String encoding = new TikaEncodingDetector().guessEncoding(new URL("https://www.streetpadel.com/overgrip-head-pro-grip-dz-negro-p-17233.html").openStream());
> System.out.println(encoding);
> {noformat}
> Will result in the following exception:
> {noformat}
> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: Index -1 out of bounds for length 32768
>     at org.jsoup.parser.CharacterReader.consume(CharacterReader.java:100)
>     at org.jsoup.parser.TokeniserState$34.read(TokeniserState.java:556)
>     at org.jsoup.parser.Tokeniser.read(Tokeniser.java:57)
>     at org.jsoup.parser.TreeBuilder.runParser(TreeBuilder.java:64)
>     at org.jsoup.parser.HtmlTreeBuilder.parseFragment(HtmlTreeBuilder.java:126)
>     at org.jsoup.parser.Parser.parseFragment(Parser.java:140)
>     at org.apache.any23.encoding.TikaEncodingDetector.parseFragment(TikaEncodingDetector.java:184)
>     at org.apache.any23.encoding.TikaEncodingDetector.guessEncoding(TikaEncodingDetector.java:95)
>     at org.apache.any23.encoding.TikaEncodingDetector.guessEncoding(TikaEncodingDetector.java:159)
>     at org.apache.any23.encoding.TikaEncodingDetector.guessEncoding(TikaEncodingDetector.java:58)
> {noformat}
> Whereas the expected result is `ISO-8859-15`.
>
> Note the bunch of HTML at the bottom of the page after the `</html>` tag.
>
> Replacing:
> {code:java}
> ParseErrorList htmlErrors = ParseErrorList.tracking(Integer.MAX_VALUE);
> {code}
> By:
> {code:java}
> ParseErrorList htmlErrors = ParseErrorList.tracking(100);
> {code}
>
> Will fix the issue. Not quite sure why, maybe at one point the errors are too far and the reader cannot reset far enough...
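
The closing hypothesis of the description (a reader that "cannot reset far enough") can be illustrated in isolation with the JDK's own `BufferedReader`. This is only an analogy: it does not use jsoup, and the report does not confirm that jsoup's `CharacterReader` fails for this reason. The class name `MarkResetLimit`, the method `canRewindAfter`, and the small buffer sizes are all assumptions chosen for illustration.

```java
import java.io.BufferedReader;
import java.io.CharArrayReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;

public class MarkResetLimit {

    /**
     * Marks the start of a 1024-char input, reads charsToRead characters,
     * then tries to rewind to the mark. Returns true if the rewind worked.
     */
    public static boolean canRewindAfter(int charsToRead, int bufferSize, int readAheadLimit) {
        char[] data = new char[1024];
        Arrays.fill(data, 'x');
        try (BufferedReader reader = new BufferedReader(new CharArrayReader(data), bufferSize)) {
            reader.mark(readAheadLimit);
            for (int i = 0; i < charsToRead; i++) {
                if (reader.read() == -1) {
                    break;
                }
            }
            try {
                reader.reset();
                return true;
            } catch (IOException markNoLongerValid) {
                // The mark was dropped once the buffer had to be refilled
                // beyond the read-ahead limit.
                return false;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Reading 3 chars stays within the limit of 4, so the rewind succeeds.
        System.out.println(canRewindAfter(3, 8, 4));
        // Reading 20 chars goes far past the limit; the javadoc says reset
        // "may fail" here, and on OpenJDK it is expected to.
        System.out.println(canRewindAfter(20, 8, 4));
    }
}
```

If something comparable happens inside the parser, a position that is supposed to be restored could end up invalid, which would be consistent with the `Index -1` in the stack trace. That remains a guess until it is checked against the jsoup sources.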