[ https://issues.apache.org/jira/browse/TIKA-608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13001720#comment-13001720 ]
Geoff Jarrad commented on TIKA-608:
-----------------------------------
This is one of the known problems with tagsoup. It is caused (if I recall
correctly) by reading two characters from the stream but attempting to push
back three.
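For anyone who wants to see this failure mode in isolation, here is a minimal
sketch using only java.io.PushbackReader. The class name, the method name, the
sample input and the pushback capacity of 2 are all assumptions chosen for
illustration; this is not TagSoup's code, and TagSoup's actual buffer size may
differ. It only shows that pushing back more characters than the buffer can
hold raises the same IOException seen in the stack trace.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;

public class PushbackOverflowDemo {

    // Hypothetical helper: reads two characters from a reader whose pushback
    // buffer holds `capacity` characters, then tries to push back `count`
    // characters one at a time.
    // Returns null on success, or the exception message if unread() fails.
    static String readTwoThenUnread(int capacity, int count) throws IOException {
        // "a\rb" is arbitrary sample input (assumption, not the attached test.html).
        PushbackReader reader = new PushbackReader(new StringReader("a\rb"), capacity);
        reader.read();
        reader.read();
        try {
            for (int i = 0; i < count; i++) {
                reader.unread('x');
            }
            return null;
        } catch (IOException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) throws IOException {
        // Pushing back two characters fits in a buffer of capacity 2.
        System.out.println(readTwoThenUnread(2, 2));
        // Pushing back three characters exceeds the capacity and throws.
        System.out.println(readTwoThenUnread(2, 3));
    }
}
```

The point of the sketch is that the overflow depends on the capacity passed to
the PushbackReader constructor: any code path that unreads more characters than
that capacity will fail, whatever the input was.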
Another bug relates to interpreting but not producing some entities of the form
&#number. There is also a third bug I know of, the nature of which I cannot
recall.
My own solution currently is to (repeatedly!) remove tagsoup from my local Tika
build and replace it with a bug-fixed version.
Last year I contacted the author, John Cowan, about an update to tagsoup. He
indicated that one might be forthcoming when he had the time.
Perhaps the leaders of Tika could impress upon Mr Cowan the importance of
having a bug-fixed update, given the wide use of Tika?
> IOException from tagsoup
> ------------------------
>
> Key: TIKA-608
> URL: https://issues.apache.org/jira/browse/TIKA-608
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 0.9
> Reporter: Erik Hetzner
> Priority: Minor
> Attachments: test.html
>
>
> The attached HTML file causes an IOException from tagsoup.
> (Changing CR to LF fixes the problem.)
> Exception in thread "main" org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.html.HtmlParser@22b6d6ab
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:203)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
> at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
> at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:107)
> at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:288)
> at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:94)
> Caused by: java.io.IOException: Pushback buffer overflow
> at java.io.PushbackReader.unread(PushbackReader.java:138)
> at org.ccil.cowan.tagsoup.HTMLScanner.unread(HTMLScanner.java:274)
> at org.ccil.cowan.tagsoup.HTMLScanner.scan(HTMLScanner.java:487)
> at org.ccil.cowan.tagsoup.Parser.parse(Parser.java:449)
> at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:198)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
> ... 5 more
--
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira