[jira] [Created] (TIKA-2328) HtmlParser fails when DOCTYPE has unbalanced quotes

Shai Erera (JIRA) Tue, 18 Apr 2017 07:46:00 -0700

Shai Erera created TIKA-2328:
--------------------------------

             Summary: HtmlParser fails when DOCTYPE has unbalanced quotes
                 Key: TIKA-2328
                 URL: https://issues.apache.org/jira/browse/TIKA-2328
             Project: Tika
          Issue Type: Bug
          Components: parser
            Reporter: Shai Erera



When attempting to parse HTML documents that start like this:

{noformat}
<!DOCTYPE HTML PUBLIC ">
<head>
        <HEAD>
        <title>PolClub - Polish Page on VicNet - Australia</title>
{noformat}

I receive the following exception:

{noformat}
Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: -1
        at java.lang.String.substring(String.java:1967)
        at org.ccil.cowan.tagsoup.Parser.trimquotes(Parser.java:881)
        at org.ccil.cowan.tagsoup.Parser.decl(Parser.java:856)
        at org.ccil.cowan.tagsoup.HTMLScanner.scan(HTMLScanner.java:557)
        at org.ccil.cowan.tagsoup.Parser.parse(Parser.java:449)
        at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:122)
{noformat}

The problem seems to be in TagSoup's {{Parser.trimquotes}}:

{code}
        private static String trimquotes(String in) {
                if (in == null) return in;
                int length = in.length();
                if (length == 0) return in;
                char s = in.charAt(0);
                char e = in.charAt(length - 1);
                if (s == e && (s == '\'' || s == '"')) {
                        in = in.substring(1, in.length() - 1);
                        }
                return in;
                }
{code}

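Instead of returning early only for a string of length 0, the method should
check {{if (length <= 1) return in;}}. A string of length 1 that consists of a
single quote character passes the {{s == e}} test, so the method calls
{{in.substring(1, 0)}}, which throws the exception above. There is nothing to
trim from a one-character string anyway. Alternatively, if the desired behavior
is to strip a lone leading quote, the method should guard against this case
explicitly.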
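
A minimal sketch of the suggested guard follows. It is a standalone
illustration and not the actual TagSoup source: the class name
{{TrimQuotesSketch}} and the method name {{trimQuotes}} are hypothetical, and
the assumption is that the parser passes the lone {{"}} token from the DOCTYPE
to this method.

```java
// Hypothetical standalone sketch; not the actual TagSoup source.
class TrimQuotesSketch {

    // Strips one pair of matching surrounding quotes, if present.
    static String trimQuotes(String in) {
        if (in == null) return null;
        int length = in.length();
        // A string of length 0 or 1 cannot hold a balanced pair of quotes,
        // so return it unchanged instead of calling substring(1, 0).
        if (length <= 1) return in;
        char s = in.charAt(0);
        char e = in.charAt(length - 1);
        if (s == e && (s == '\'' || s == '"')) {
            return in.substring(1, length - 1);
        }
        return in;
    }

    public static void main(String[] args) {
        // A lone quote is returned as-is rather than throwing.
        System.out.println(trimQuotes("\""));
        // A balanced pair of quotes is still stripped.
        System.out.println(trimQuotes("\"-//W3C//DTD HTML 4.01//EN\""));
        // Mismatched first and last characters leave the string untouched.
        System.out.println(trimQuotes("'a"));
    }
}
```

With this guard, strings with balanced quotes behave as before, and only the
zero- and one-character inputs take the early return.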

I know the bug is in TagSoup, but it looks like that code hasn't been touched in
6 years. I hope it's OK to report the bug here.


