Shai Erera created TIKA-2328:
--------------------------------
Summary: HtmlParser fails when DOCTYPE has unbalanced quotes
Key: TIKA-2328
URL: https://issues.apache.org/jira/browse/TIKA-2328
Project: Tika
Issue Type: Bug
Components: parser
Reporter: Shai Erera
When attempting to parse HTML documents that start like this:
{noformat}
<!DOCTYPE HTML PUBLIC ">
<head>
<HEAD>
<title>PolClub - Polish Page on VicNet - Australia</title>
{noformat}
I receive the following exception:
{noformat}
Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: -1
    at java.lang.String.substring(String.java:1967)
    at org.ccil.cowan.tagsoup.Parser.trimquotes(Parser.java:881)
    at org.ccil.cowan.tagsoup.Parser.decl(Parser.java:856)
    at org.ccil.cowan.tagsoup.HTMLScanner.scan(HTMLScanner.java:557)
    at org.ccil.cowan.tagsoup.Parser.parse(Parser.java:449)
    at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:122)
{noformat}
The problem seems to be in Tagsoup's {{Parser.trimquotes}}:
{code}
private static String trimquotes(String in) {
    if (in == null) return in;
    int length = in.length();
    if (length == 0) return in;
    char s = in.charAt(0);
    char e = in.charAt(length - 1);
    if (s == e && (s == '\'' || s == '"')) {
        in = in.substring(1, in.length() - 1);
    }
    return in;
}
{code}
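The method only guards against an empty string. For a one-character string
consisting of a single quote character, {{s}} and {{e}} are the same character,
so the method calls {{in.substring(1, 0)}}, which throws. Instead of checking
only for length 0, it should check {{if (length <= 1) return in;}}, since a
string of length 1 cannot contain a pair of quotes to trim. Alternatively, if
the desired behavior is to remove a lone leading quote, that case needs its own
explicit handling.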
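The guard suggested above could look roughly like the following. This is only a sketch: {{TrimQuotesSketch}}, {{trimQuotesOriginal}} and {{trimQuotesGuarded}} are hypothetical names for a standalone copy of the logic, not a patch against the TagSoup sources.

```java
// Sketch only: standalone copies of the logic, not the actual TagSoup code.
class TrimQuotesSketch {

    // Same logic as the quoted trimquotes method; fails on a one-character quote.
    static String trimQuotesOriginal(String in) {
        if (in == null) return in;
        int length = in.length();
        if (length == 0) return in;
        char s = in.charAt(0);
        char e = in.charAt(length - 1);
        if (s == e && (s == '\'' || s == '"')) {
            // For length == 1 this is substring(1, 0), which throws.
            in = in.substring(1, in.length() - 1);
        }
        return in;
    }

    // Assumed fix: a string shorter than two characters cannot hold a quote pair.
    static String trimQuotesGuarded(String in) {
        if (in == null || in.length() <= 1) return in;
        char s = in.charAt(0);
        char e = in.charAt(in.length() - 1);
        if (s == e && (s == '\'' || s == '"')) {
            return in.substring(1, in.length() - 1);
        }
        return in;
    }

    public static void main(String[] args) {
        // A lone quote is returned unchanged instead of throwing.
        System.out.println(trimQuotesGuarded("\""));
        // A balanced pair is still stripped.
        System.out.println(trimQuotesGuarded("\"-//W3C//DTD HTML 4.01//EN\""));
        try {
            trimQuotesOriginal("\"");
        } catch (StringIndexOutOfBoundsException ex) {
            System.out.println("original fails: " + ex.getMessage());
        }
    }
}
```

With this guard a lone quote is passed through untouched. Whether it should instead be dropped is the open question about the desired behavior.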
I know the bug is in tagsoup, but it looks like the code hasn't been touched in
6 years. I hope it's OK to report the bug here.