[ 
https://issues.apache.org/jira/browse/TIKA-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15975418#comment-15975418
 ] 

Tim Allison commented on TIKA-2328:
-----------------------------------

Yes, thank you for this data point!  My initial comparisons suggested six of one, 
half a dozen of the other.  We now have a "common words" metric in Tika eval, and 
that _might_ give some indication of which one is better in general on 
CommonCrawl html.  However, for individual users whose data streams have 
specific characteristics, one will be better than the other...

I'm really hoping we don't need a TikaSoup. :)  

> HtmlParser fails when DOCTYPE has unbalanced quotes
> ---------------------------------------------------
>
>                 Key: TIKA-2328
>                 URL: https://issues.apache.org/jira/browse/TIKA-2328
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Shai Erera
>
> When attempting to parse HTML documents that start like this:
> {noformat}
> <!DOCTYPE HTML PUBLIC ">
> <head>
>               <HEAD>
>         <title>PolClub - Polish Page on VicNet - Australia</title>
> {noformat}
> I receive the following exception:
> {noformat}
> Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String 
> index out of range: -1
>       at java.lang.String.substring(String.java:1967)
>       at org.ccil.cowan.tagsoup.Parser.trimquotes(Parser.java:881)
>       at org.ccil.cowan.tagsoup.Parser.decl(Parser.java:856)
>       at org.ccil.cowan.tagsoup.HTMLScanner.scan(HTMLScanner.java:557)
>       at org.ccil.cowan.tagsoup.Parser.parse(Parser.java:449)
>       at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:122)
> {noformat}
> The problem seems to be in Tagsoup's {{Parser.trimquotes}}:
> {code}
>       private static String trimquotes(String in) {
>               if (in == null) return in;
>               int length = in.length();
>               if (length == 0) return in;
>               char s = in.charAt(0);
>               char e = in.charAt(length - 1);
>               if (s == e && (s == '\'' || s == '"')) {
>                       in = in.substring(1, in.length() - 1);
>                       }
>               return in;
>               }
> {code}
> Instead of checking only for a string of length 0, it should check 
> {{if (length <= 1) return in;}}, since even if the string is of length 1, 
> there's no point trimming the quotes. Or, if the desired behavior is to remove 
> a lone leading quote, the code should guard against this case explicitly.
> I know the bug is in tagsoup, but it looks like the code hasn't been touched 
> in 6 years. I hope it's OK to report the bug here.
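
For reference, the guard proposed in the description could look roughly like the sketch below. It is a standalone illustration, not a patch to TagSoup: the class name TrimQuotesSketch and the main method are hypothetical, and the method is package-private here only so it can be called directly. The only behavioral change from the quoted code is the {{length <= 1}} check.

```java
// Hypothetical standalone class; in TagSoup the method lives in
// org.ccil.cowan.tagsoup.Parser and is private.
class TrimQuotesSketch {

    // Strips one pair of matching surrounding quotes, if present.
    static String trimquotes(String in) {
        if (in == null) return in;
        int length = in.length();
        // A string of length 0 or 1 cannot hold a matching pair of quotes.
        // With the original "length == 0" check, a lone quote character
        // reaches substring(1, 0) and throws StringIndexOutOfBoundsException.
        if (length <= 1) return in;
        char s = in.charAt(0);
        char e = in.charAt(length - 1);
        if (s == e && (s == '\'' || s == '"')) {
            in = in.substring(1, length - 1);
        }
        return in;
    }

    public static void main(String[] args) {
        // A lone double quote, as left over from <!DOCTYPE HTML PUBLIC ">
        System.out.println(trimquotes("\""));
        // A normally quoted value
        System.out.println(trimquotes("\"-//W3C//DTD HTML 4.01//EN\""));
    }
}
```

With this guard a lone quote is returned unchanged, which is the first of the two behaviors suggested in the description; dropping the stray quote instead would need a separate branch.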



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
