[
https://issues.apache.org/jira/browse/TIKA-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17760147#comment-17760147
]
ASF GitHub Bot commented on TIKA-2328:
--------------------------------------
kkrugler commented on code in PR #1310:
URL: https://github.com/apache/tika/pull/1310#discussion_r1309367221
##########
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-html-module/src/main/java/org/apache/tika/parser/html/HtmlParser.java:
##########
@@ -146,7 +146,11 @@ metadata, getEncodingDetector(context))) {
parser.setContentHandler(new XHTMLDowngradeHandler(
new HtmlHandler(mapper, handler, metadata, context,
extractScripts)));
- parser.parse(reader.asInputSource());
+ try {
+ parser.parse(reader.asInputSource());
+ } catch (StringIndexOutOfBoundsException e) {
Review Comment:
@tballison - any thoughts on Tika's general strategy for which exceptions
parsers should throw? E.g. is the trend towards always throwing a
`TikaException` rather than more specific exceptions?
> HtmlParser fails when DOCTYPE has unbalanced quotes
> ---------------------------------------------------
>
> Key: TIKA-2328
> URL: https://issues.apache.org/jira/browse/TIKA-2328
> Project: Tika
> Issue Type: Bug
> Components: parser
> Reporter: Shai Erera
> Priority: Major
>
> When attempting to parse HTML documents that start like this:
> {noformat}
> <!DOCTYPE HTML PUBLIC ">
> <head>
> <HEAD>
> <title>PolClub - Polish Page on VicNet - Australia</title>
> {noformat}
> I receive the following exception:
> {noformat}
> Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: -1
> at java.lang.String.substring(String.java:1967)
> at org.ccil.cowan.tagsoup.Parser.trimquotes(Parser.java:881)
> at org.ccil.cowan.tagsoup.Parser.decl(Parser.java:856)
> at org.ccil.cowan.tagsoup.HTMLScanner.scan(HTMLScanner.java:557)
> at org.ccil.cowan.tagsoup.Parser.parse(Parser.java:449)
> at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:122)
> {noformat}
> The problem seems to be in Tagsoup's {{Parser.trimquotes}}:
> {code}
> private static String trimquotes(String in) {
>     if (in == null) return in;
>     int length = in.length();
>     if (length == 0) return in;
>     char s = in.charAt(0);
>     char e = in.charAt(length - 1);
>     if (s == e && (s == '\'' || s == '"')) {
>         in = in.substring(1, in.length() - 1);
>     }
>     return in;
> }
> {code}
> Instead of checking only for a string of length 0, it should check
> {{if (length <= 1) return in;}}, since a string of length 1 cannot hold a
> matching pair of quotes to trim. Or, if the desired behavior is to remove
> only the leading quote, the code should guard against this case explicitly.
> I know the bug is in tagsoup, but it looks like the code hasn't been touched
> in 6 years. I hope it's OK to report the bug here.
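The guard proposed in the description can be sketched as a stand-alone helper. This is only an illustration under stated assumptions: the class name `TrimQuotesSketch` and the method name `trimQuotes` are hypothetical, and the code is neither TagSoup's implementation nor the change made in the PR. With the original `length == 0` check, a one-character input consisting of a single quote reaches `substring(1, 0)`, which throws `StringIndexOutOfBoundsException`; returning early for `length <= 1` avoids that call.

```java
// Sketch only: a stand-alone copy of the helper with the proposed guard.
// TrimQuotesSketch and trimQuotes are hypothetical names, not TagSoup's API.
final class TrimQuotesSketch {

    private TrimQuotesSketch() {}

    // Strips one pair of matching surrounding quotes, if present.
    static String trimQuotes(String in) {
        if (in == null) return in;
        int length = in.length();
        // A string of length 0 or 1 cannot hold a matching pair of quotes,
        // so return it unchanged instead of calling substring(1, 0).
        if (length <= 1) return in;
        char s = in.charAt(0);
        char e = in.charAt(length - 1);
        if (s == e && (s == '\'' || s == '"')) {
            in = in.substring(1, length - 1);
        }
        return in;
    }
}
```

With this guard a lone quote is returned as-is, while properly quoted input is still trimmed.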
--
This message was sent by Atlassian Jira
(v8.20.10#820010)