> I'm not sure this is a solution to your problem. However, it seems that the > HTMLParser used by the IndexHTML class has problems parsing the document > (there is a test class included in the jar): > > > >java -cp C:\projects\lucene\jakarta-lucene\bin\lucene-demos.jar > org.apache.lucene.demo.html.Test f01529.txt > Title: Webcz.cz - Power of search > Parse Aborted: Encountered "\'" at line 106, column 27. > Was expecting one of: > <ArgName> ... > <TagEnd> ... > /Ronnie
Hi Ronnie! I know about it and the exception is handled well (see log file below). I have found a better example than 1529, try this: http://com-os2.ms.mff.cuni.cz/bugs/f00034.txt This file cannot go throught Lucene HTML parser (I have tried 1.2 and IBM JDK 1.3.1r3). The file is specific, i.e. it has two titles, two base tags etc. I have not debugger here, so I cannot find the line where is the bug. If you try your magic, please, let me know about the patch. :) THX -g- adding save/d00320/f01516.html Parse Aborted: Lexical error at line 68, column 11. Encountered: "\u0178" (376), after : "" : adding save/d00320/f01527.html Parse Aborted: Encountered "=" at line 83, column 48. Was expecting one of: <ArgName> ... <TagEnd> ... adding save/d00320/f01528.html -- To unsubscribe, e-mail: <mailto:[EMAIL PROTECTED]> For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>