[ https://issues.apache.org/jira/browse/TIKA-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15974545#comment-15974545 ]
Tim Allison commented on TIKA-2328:
-----------------------------------
[~shaie] thank you for raising this. Yes, I don't think there's much we can do
except perhaps migrate to a more active project (TIKA-1599)?
> HtmlParser fails when DOCTYPE has unbalanced quotes
> ---------------------------------------------------
>
> Key: TIKA-2328
> URL: https://issues.apache.org/jira/browse/TIKA-2328
> Project: Tika
> Issue Type: Bug
> Components: parser
> Reporter: Shai Erera
>
> When attempting to parse HTML documents that start like this:
> {noformat}
> <!DOCTYPE HTML PUBLIC ">
> <head>
> <HEAD>
> <title>PolClub - Polish Page on VicNet - Australia</title>
> {noformat}
> I receive the following exception:
> {noformat}
> Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: -1
>     at java.lang.String.substring(String.java:1967)
>     at org.ccil.cowan.tagsoup.Parser.trimquotes(Parser.java:881)
>     at org.ccil.cowan.tagsoup.Parser.decl(Parser.java:856)
>     at org.ccil.cowan.tagsoup.HTMLScanner.scan(HTMLScanner.java:557)
>     at org.ccil.cowan.tagsoup.Parser.parse(Parser.java:449)
>     at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:122)
> {noformat}
> The problem seems to be in Tagsoup's {{Parser.trimquotes}}:
> {code}
> private static String trimquotes(String in) {
>     if (in == null) return in;
>     int length = in.length();
>     if (length == 0) return in;
>     char s = in.charAt(0);
>     char e = in.charAt(length - 1);
>     if (s == e && (s == '\'' || s == '"')) {
>         in = in.substring(1, in.length() - 1);
>     }
>     return in;
> }
> {code}
> Instead of checking only for a string of length 0, it should check
> {{if (length <= 1) return in;}}. A string of length 1 that is a single quote
> character passes the {{s == e}} test, and {{substring(1, 0)}} then throws.
> There is no point trimming quotes from such a string anyway. Or, if the
> desired behavior is to remove a leading quote on its own, the code should
> handle that case explicitly.
> I know the bug is in tagsoup, but it looks like the code hasn't been touched
> in 6 years. I hope it's OK to report the bug here.
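
A minimal sketch of the guard suggested in the issue description, assuming the desired behavior is to return a lone quote character unchanged. The class name {{TrimQuotesSketch}} and the method name {{trimQuotes}} are hypothetical and chosen for illustration; this is not the TagSoup source, nor a patch that has been applied to it.

```java
// Hypothetical standalone sketch; not the actual TagSoup code or an applied fix.
final class TrimQuotesSketch {

    // Strips one pair of matching surrounding quotes (' or "), if present.
    static String trimQuotes(String in) {
        if (in == null) return null;
        int length = in.length();
        // Fewer than 2 characters cannot hold a matching pair of quotes,
        // so return early instead of calling substring(1, 0).
        if (length <= 1) return in;
        char s = in.charAt(0);
        char e = in.charAt(length - 1);
        if (s == e && (s == '\'' || s == '"')) {
            return in.substring(1, length - 1);
        }
        return in;
    }

    public static void main(String[] args) {
        // A lone double quote, as in the unbalanced DOCTYPE case, comes back as-is.
        System.out.println(trimQuotes("\""));
        // A properly quoted identifier loses its surrounding quotes.
        System.out.println(trimQuotes("\"-//W3C//DTD HTML 4.01//EN\""));
    }
}
```

With the original {{length == 0}} check, the first call would reach {{substring(1, 0)}} and throw the {{StringIndexOutOfBoundsException}} shown in the stack trace.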