[ 
https://issues.apache.org/jira/browse/ANY23-441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anthony Pessy updated ANY23-441:
--------------------------------
    Description: 
Using `TikaEncodingDetector.guessEncoding` may result in an 
`ArrayIndexOutOfBoundsException`.

 

The following snippet:
{noformat}
String encoding = new TikaEncodingDetector().guessEncoding(
    new URL("https://www.streetpadel.com/overgrip-head-pro-grip-dz-negro-p-17233.html").openStream());

System.out.println(encoding);
{noformat}
Will result in the following exception:
{noformat}
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: Index -1 out of bounds for length 32768
    at org.jsoup.parser.CharacterReader.consume(CharacterReader.java:100)
    at org.jsoup.parser.TokeniserState$34.read(TokeniserState.java:556)
    at org.jsoup.parser.Tokeniser.read(Tokeniser.java:57)
    at org.jsoup.parser.TreeBuilder.runParser(TreeBuilder.java:64)
    at org.jsoup.parser.HtmlTreeBuilder.parseFragment(HtmlTreeBuilder.java:126)
    at org.jsoup.parser.Parser.parseFragment(Parser.java:140)
    at org.apache.any23.encoding.TikaEncodingDetector.parseFragment(TikaEncodingDetector.java:184)
    at org.apache.any23.encoding.TikaEncodingDetector.guessEncoding(TikaEncodingDetector.java:95)
    at org.apache.any23.encoding.TikaEncodingDetector.guessEncoding(TikaEncodingDetector.java:159)
    at org.apache.any23.encoding.TikaEncodingDetector.guessEncoding(TikaEncodingDetector.java:58)
{noformat}
The expected result is `ISO-8859-15`.

Note that the page contains a block of extra HTML after the closing `</html>` tag.
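
Until this is resolved, a caller could guard the call and fall back to a default charset. The following is only a minimal sketch: the `Detector` interface is a hypothetical stand-in for `TikaEncodingDetector.guessEncoding(InputStream)` (assumed here to be allowed to throw `IOException`), and the class name, helper name and the `UTF-8` fallback are arbitrary choices made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

final class SafeEncodingGuess {

    /** Hypothetical stand-in for the guessEncoding(InputStream) call in this report. */
    interface Detector {
        String guessEncoding(InputStream input) throws IOException;
    }

    /** Returns the detector's guess, or the fallback if the detector fails as described above. */
    static String guessOrFallback(Detector detector, InputStream input, String fallback) {
        try {
            return detector.guessEncoding(input);
        } catch (ArrayIndexOutOfBoundsException e) {
            // The unchecked exception from this report: use the caller's default instead.
            return fallback;
        } catch (IOException e) {
            // Genuine I/O problems are not hidden, only rethrown unchecked.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        InputStream page = new ByteArrayInputStream(new byte[0]);
        // Simulated failure; a real caller would pass the actual detector here (assumption).
        Detector failing = in -> {
            throw new ArrayIndexOutOfBoundsException("Index -1 out of bounds for length 32768");
        };
        System.out.println(guessOrFallback(failing, page, "UTF-8"));
    }
}
```

This only hides the symptom: the page would then be read with the fallback charset rather than the expected `ISO-8859-15`.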

 

Replacing:
{code:java}
ParseErrorList htmlErrors = ParseErrorList.tracking(Integer.MAX_VALUE);
{code}
By:
{code:java}
ParseErrorList htmlErrors = ParseErrorList.tracking(100);
{code}
 

This fixes the issue. I am not quite sure why; maybe at some point the tracked 
errors lie too far back and the reader cannot rewind far enough...
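
The idea of a reader that cannot rewind far enough can be illustrated with the JDK alone. The sketch below is not jsoup's `CharacterReader`; it uses `java.io.BufferedReader` purely as an analogy, and the class name `MarkResetSketch`, the 4-character buffer and the helper `canRewind` are made up for illustration. It shows a mark becoming invalid once more characters are consumed than the read-ahead limit allows.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

final class MarkResetSketch {

    // Deliberately tiny buffer so the reader has to refill while a mark is set.
    private static final int BUFFER_SIZE = 4;

    /** Marks the start of the text, reads charsToRead characters, then tries to rewind. */
    static boolean canRewind(String text, int readAheadLimit, int charsToRead) {
        try (BufferedReader reader = new BufferedReader(new StringReader(text), BUFFER_SIZE)) {
            reader.mark(readAheadLimit);
            for (int i = 0; i < charsToRead; i++) {
                reader.read();
            }
            try {
                reader.reset();
                return true;
            } catch (IOException markInvalid) {
                // The mark was dropped because the reader went past its read-ahead limit.
                return false;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Reading within the limit, then far beyond it.
        System.out.println(canRewind("abcdefghij", 2, 2));
        System.out.println(canRewind("abcdefghij", 2, 8));
    }
}
```

Whether the jsoup failure follows the same pattern is unconfirmed; the snippet only shows why a bounded rewind window can break after reading far ahead.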

 

 

  was:
Using `TikaEncodingDetector.guessEncoding` may result in an 
`ArrayIndexOutOfBoundsException`.

 

The following snippet:
{noformat}
String encoding = new TikaEncodingDetector().guessEncoding(
    new URL("https://www.streetpadel.com/overgrip-head-pro-grip-dz-negro-p-17233.html").openStream());

System.out.println(encoding);
{noformat}
Will result in the following exception:
{noformat}
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: Index -1 out of bounds for length 32768
    at org.jsoup.parser.CharacterReader.consume(CharacterReader.java:100)
    at org.jsoup.parser.TokeniserState$34.read(TokeniserState.java:556)
    at org.jsoup.parser.Tokeniser.read(Tokeniser.java:57)
    at org.jsoup.parser.TreeBuilder.runParser(TreeBuilder.java:64)
    at org.jsoup.parser.HtmlTreeBuilder.parseFragment(HtmlTreeBuilder.java:126)
    at org.jsoup.parser.Parser.parseFragment(Parser.java:140)
    at org.apache.any23.encoding.TikaEncodingDetector.parseFragment(TikaEncodingDetector.java:184)
    at org.apache.any23.encoding.TikaEncodingDetector.guessEncoding(TikaEncodingDetector.java:95)
    at org.apache.any23.encoding.TikaEncodingDetector.guessEncoding(TikaEncodingDetector.java:159)
    at org.apache.any23.encoding.TikaEncodingDetector.guessEncoding(TikaEncodingDetector.java:58)
{noformat}
The expected result is `ISO-8859-15`.

 

I think the issue may be caused by the bunch of HTML at the bottom of the page 
after the `</html>` tag.


Maybe `parseBodyFragment` is not happy because of this? In truth I have a hard 
time finding the smallest reproduction case.

 

Anyhow, replacing the `TikaEncodingDetector.parseFragment` method:
{code:java}
private static Document parseFragment(String html, ParseErrorList errors) {
    Document doc = new Document("");
    Node[] childNodes = Parser.parseFragment(html, null, "", errors).toArray(EMPTY_NODES);
    for (Node node : childNodes) {
        if (node.parentNode() != null) {
            node.remove();
        }
        doc.appendChild(node);
    }
    return doc;
}
{code}
By:
{code:java}
private static Document parseFragment(String html, ParseErrorList errors) {
    HtmlTreeBuilder treeBuilder = new HtmlTreeBuilder();
    Parser parser = new Parser(treeBuilder);

    Document doc = parser.parseInput(html, "");

    errors.addAll(parser.getErrors());

    return doc;
}
{code}
This corrects the issue, but I'm not sure whether I'm introducing new ones. I 
don't know why this method was implemented like this in the first place.

 

Note: Not quite sure if this issue belongs to JSoup or any23 but since using 
`Parser.parseInput` instead of `Parser.parseFragment` works I created the issue 
here.

 


> TikaEncodingDetector: guessEncoding may throw an 
> ArrayIndexOutOfBoundsException
> --------------------------------------------------------------------------------
>
>                 Key: ANY23-441
>                 URL: https://issues.apache.org/jira/browse/ANY23-441
>             Project: Apache Any23
>          Issue Type: Bug
>          Components: encoding
>    Affects Versions: 2.3
>            Reporter: Anthony Pessy
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
