I have had good experiences with the NekoHTML parser.

Otis
--- Leo Galambos <[EMAIL PROTECTED]> wrote:
> > I'm not sure this is a solution to your problem. However, it seems that the
> > HTMLParser used by the IndexHTML class has problems parsing the document
> > (there is a test class included in the jar):
> >
> > >java -cp C:\projects\lucene\jakarta-lucene\bin\lucene-demos.jar org.apache.lucene.demo.html.Test f01529.txt
> > Title: Webcz.cz - Power of search
> > Parse Aborted: Encountered "\'" at line 106, column 27.
> > Was expecting one of:
> >     <ArgName> ...
> >     <TagEnd> ...
> >
> > /Ronnie
>
> Hi Ronnie!
>
> I know about it, and the exception is handled well (see the log file below).
> I have found a better example than 1529; try this:
> http://com-os2.ms.mff.cuni.cz/bugs/f00034.txt
> This file cannot get through the Lucene HTML parser (I have tried 1.2 and
> IBM JDK 1.3.1r3). The file is unusual, i.e. it has two titles, two base
> tags, etc.
>
> I have no debugger here, so I cannot find the line where the bug is. If
> you try your magic, please let me know about the patch. :) THX
>
> -g-
>
> adding save/d00320/f01516.html
> Parse Aborted: Lexical error at line 68, column 11. Encountered: "\u0178" (376), after : ""
> :
> adding save/d00320/f01527.html
> Parse Aborted: Encountered "=" at line 83, column 48.
> Was expecting one of:
>     <ArgName> ...
>     <TagEnd> ...
>
> adding save/d00320/f01528.html
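
The failures quoted above (a stray quote character, an unexpected "=", a file with two titles) are the kind of input an error-tolerant parser is meant to survive. As a rough illustration only, here is a sketch that pulls text out of malformed HTML. Note that it does not use NekoHTML, which needs an extra jar; it swaps in the lenient HTML parser that ships with the JDK (javax.swing.text.html) so that it runs on its own. The class name LenientHtmlText and the method extractText are made up for this example and are not part of Lucene or NekoHTML.

```java
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper, not part of Lucene or NekoHTML.
class LenientHtmlText {

    // Collects the text chunks the JDK's tolerant HTML parser reports,
    // separated by single spaces. Markup errors are reported to the
    // callback's handleError (ignored here) instead of aborting the parse.
    static String extractText(Reader in) throws IOException {
        final StringBuilder sb = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                sb.append(data).append(' ');
            }
        };
        // true = ignore any charset declared inside the document
        new ParserDelegator().parse(in, callback, true);
        return sb.toString().trim();
    }

    public static void main(String[] args) throws IOException {
        // Usage: java LenientHtmlText some-file.html
        try (Reader in = new FileReader(args[0])) {
            System.out.println(extractText(in));
        }
    }
}
```

The returned string could then be handed to whatever builds the Lucene Document in place of the demo HTMLParser output. Whether this parser copes with every file in the collection is untested here; treat it as a starting point for comparison with NekoHTML.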
