[jira] [Commented] (TIKA-2328) HtmlParser fails when DOCTYPE has unbalanced quotes

ASF GitHub Bot (Jira) Tue, 29 Aug 2023 14:21:06 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17760147#comment-17760147
 ]


ASF GitHub Bot commented on TIKA-2328:
--------------------------------------

kkrugler commented on code in PR #1310:
URL: https://github.com/apache/tika/pull/1310#discussion_r1309367221


##########
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-html-module/src/main/java/org/apache/tika/parser/html/HtmlParser.java:
##########
@@ -146,7 +146,11 @@ metadata, getEncodingDetector(context))) {
             parser.setContentHandler(new XHTMLDowngradeHandler(
                     new HtmlHandler(mapper, handler, metadata, context, extractScripts)));
 
-            parser.parse(reader.asInputSource());
+            try {
+                parser.parse(reader.asInputSource());
+            } catch (StringIndexOutOfBoundsException e) {

Review Comment:
   @tballison - any thoughts on Tika's general strategy for which exceptions 
parsers should throw? E.g. is the trend towards always throwing a 
`TikaException`, versus more specific exceptions?





> HtmlParser fails when DOCTYPE has unbalanced quotes
> ---------------------------------------------------
>
>                 Key: TIKA-2328
>                 URL: https://issues.apache.org/jira/browse/TIKA-2328
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Shai Erera
>            Priority: Major
>
> When attempting to parse HTML documents that start like this:
> {noformat}
> <!DOCTYPE HTML PUBLIC ">
> <head>
>               <HEAD>
>         <title>PolClub - Polish Page on VicNet - Australia</title>
> {noformat}
> I receive the following exception:
> {noformat}
> Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: -1
>       at java.lang.String.substring(String.java:1967)
>       at org.ccil.cowan.tagsoup.Parser.trimquotes(Parser.java:881)
>       at org.ccil.cowan.tagsoup.Parser.decl(Parser.java:856)
>       at org.ccil.cowan.tagsoup.HTMLScanner.scan(HTMLScanner.java:557)
>       at org.ccil.cowan.tagsoup.Parser.parse(Parser.java:449)
>       at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:122)
> {noformat}
> The problem seems to be in Tagsoup's {{Parser.trimquotes}}:
> {code}
>       private static String trimquotes(String in) {
>               if (in == null) return in;
>               int length = in.length();
>               if (length == 0) return in;
>               char s = in.charAt(0);
>               char e = in.charAt(length - 1);
>               if (s == e && (s == '\'' || s == '"')) {
>                       in = in.substring(1, in.length() - 1);
>                       }
>               return in;
>               }
> {code}
> Instead of checking only for a string of length 0, it should check 
> {{if (length <= 1) return in;}}, since a string of length 1 has no pair of 
> quotes to trim. Or, if the desired behavior is to remove only the leading 
> quote, the code should guard against this case explicitly.
> I know the bug is in TagSoup, but it looks like the code hasn't been touched 
> in 6 years. I hope it's OK to report the bug here.
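
For illustration, the guard proposed in the issue description could be sketched as below. This is a standalone sketch, not TagSoup's actual code or the change made in the PR: the class name `TrimQuotesSketch` and the method name `trimQuotes` are hypothetical, and the only behavioral change from the quoted `trimquotes` is the `length <= 1` check.

```java
// Hypothetical standalone sketch; not part of TagSoup or Tika.
final class TrimQuotesSketch {

    // Strips one matching pair of surrounding quotes (' or ") if present.
    static String trimQuotes(String in) {
        // A null, empty, or single-character string cannot hold a quote pair.
        // Without this guard, a lone quote reaches substring(1, 0) and throws
        // StringIndexOutOfBoundsException.
        if (in == null || in.length() <= 1) {
            return in;
        }
        char first = in.charAt(0);
        char last = in.charAt(in.length() - 1);
        if (first == last && (first == '\'' || first == '"')) {
            return in.substring(1, in.length() - 1);
        }
        return in;
    }

    public static void main(String[] args) {
        System.out.println(trimQuotes("\""));        // lone quote, returned unchanged
        System.out.println(trimQuotes("\"HTML\""));  // balanced pair, quotes removed
        System.out.println(trimQuotes("'x"));        // unbalanced, returned unchanged
    }
}
```

With this guard, a DOCTYPE fragment consisting of a single quote character is returned as-is instead of raising the exception shown in the stack trace above.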



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
