[ 
https://issues.apache.org/jira/browse/TIKA-434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12872388#action_12872388
 ] 

Andrew Khoury commented on TIKA-434:
------------------------------------

Here are some related posts on the TagSoup user groups:
http://groups.google.com/group/tagsoup-friends/browse_thread/thread/751d271c107a24a9#

http://webcache.googleusercontent.com/search?q=cache:M2F_jS2hLVwJ:tech.groups.yahoo.com/group/tagsoup-friends/message/1250+%22Yes,+it+should+be+handled+%28and+returned+as+a+raw+%26,+to+be+escaped%22+on+output+as+%26amp%3B&cd=1&hl=en&ct=clnk&gl=us

Evidently the bug occurs when the document contains the sequence '&' followed 
by [CR].  When all CRs are converted to LFs, TagSoup parses the document properly.
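
For anyone who needs a stopgap, the CR-to-LF conversion described above could be done with a small Reader wrapper. This is only a sketch: the class name CrToLfReader is made up for illustration, and it is an assumption that you can get at the character stream before TagSoup sees it.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

/** Hypothetical helper: replaces every CR with LF as characters are read. */
class CrToLfReader extends FilterReader {

    CrToLfReader(Reader in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int c = super.read();
        // -1 (end of stream) passes through unchanged
        return c == '\r' ? '\n' : c;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        // n is -1 at end of stream, so the loop body is skipped
        for (int i = off; i < off + n; i++) {
            if (buf[i] == '\r') {
                buf[i] = '\n';
            }
        }
        return n;
    }

    /** Small usage helper: reads a whole string through the filter. */
    static String convert(String html) throws IOException {
        StringBuilder out = new StringBuilder();
        try (Reader r = new CrToLfReader(new StringReader(html))) {
            int c;
            while ((c = r.read()) != -1) {
                out.append((char) c);
            }
        }
        return out.toString();
    }
}
```

Note that a CRLF pair becomes two LFs here, which should be harmless for HTML text extraction. If TagSoup is called directly, the wrapped reader can be handed over through an org.xml.sax.InputSource; where to hook it in when going through Tika is something I have not worked out.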

As TagSoup has no official bug tracker or release tracking system, there is no 
way to know when this bug will be fixed.  That is why I'm submitting it here, 
as it is causing a failure in Apache Tika. 

> Bug in TagSoup causes IOException
> ---------------------------------
>
>                 Key: TIKA-434
>                 URL: https://issues.apache.org/jira/browse/TIKA-434
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.6
>            Reporter: Andrew Khoury
>         Attachments: breezycove.html
>
>
> When uploading documents to a Jackrabbit 2.1 repository, the following 
> exception was received.  It looks like a bug in TagSoup 1.2 (if you search 
> the TagSoup Yahoo group you can see that it may be caused by '&' characters 
> in the HTML being parsed):
> 27.05.2010 14:57:18 *WARN * LazyTextExtractorField: Failed to extract text from a binary property (LazyTextExtractorField.java, line 180)
> org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.html.htmlpar...@eba477
>        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:126)
>        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:101)
>        at org.apache.jackrabbit.core.query.lucene.LazyTextExtractorField$ParsingTask.run(LazyTextExtractorField.java:174)
>        at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
>        at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
>        at java.util.concurrent.FutureTask.run(Unknown Source)
>        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(Unknown Source)
>        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
>        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
>        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>        at java.lang.Thread.run(Unknown Source)
> Caused by: java.io.IOException: Pushback buffer overflow
>        at java.io.PushbackReader.unread(Unknown Source)
>        at org.ccil.cowan.tagsoup.HTMLScanner.unread(HTMLScanner.java:274)
>        at org.ccil.cowan.tagsoup.HTMLScanner.scan(HTMLScanner.java:487)
>        at org.ccil.cowan.tagsoup.Parser.parse(Parser.java:449)
>        at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:177)
>        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:120)
>        ... 10 more
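
For reference, the "Pushback buffer overflow" at the bottom of the trace is what java.io.PushbackReader throws when more characters are unread than its buffer can hold. The sketch below reproduces only that JDK behaviour in isolation. The class name, the input string and the buffer size of 1 are chosen for illustration and are not taken from the TagSoup source.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;

class PushbackOverflowDemo {

    /** Returns true if a second unread on a one-char pushback buffer fails. */
    static boolean overflows() throws IOException {
        // Illustrative input and buffer size; not TagSoup's actual settings.
        PushbackReader in = new PushbackReader(new StringReader("&\rx"), 1);
        int first = in.read();   // '&'
        int second = in.read();  // '\r'
        in.unread(second);       // fills the one-character pushback buffer
        try {
            in.unread(first);    // no room left in the buffer
            return false;
        } catch (IOException e) {
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("overflow reproduced: " + overflows());
    }
}
```

This suggests the scanner pushes back more characters than its buffer allows on the '&' plus CR path, though I have not confirmed that against HTMLScanner itself.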

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
