...wonder what happened with the attachments... Here they go again.

> -----Original message-----
> From: Ronnie Kolehmainen [mailto:[EMAIL PROTECTED]]
> Sent: 30 January 2003 14:15
> To: [EMAIL PROTECTED]
> Subject: Re: <no-index> or <index>
>
>
> Michael,
>
> the HtmlDocument class supports ignoring tags, i.e. all text inside the
> specified tag names is ignored. Look at the
> setIgnoreTags(String[] ignoredtags) method. Remember to also include
> "script" and "style" in this array along with your custom tag names.
>
> Hope this is of some help to you.
>
> See below for the message from an old thread.
>
> /Ronnie
>
>
> >Hi
> >
> >I am looking for an HTMLParser which skips text tagged by
> >
> ><no-index> or something similar. This way I could exclude for
> >instance a "global navigation section" within the HTML
> >
> ><no-index>
> >International<br>
> >Business<br>
> >Science<br>
> >...
> ></no-index>
> >
> >It seems that the current demo/HTMLParser
> >(http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi?file=chapter.indexing&toc=faq#q11)
> >is not capable of doing something like that.
> >
> >Any pointers are very welcome.
> >
> >Thanks a lot
> >
> >Michael
>
>
> Message sent on Dec 9 2002:
>
>
> Hi,
>
> these are the classes I use. I only use them to extract the text, so
> they don't have methods for getting the document title and such. However,
> text extraction has worked fine for me.
>
> The HtmlParser main method takes a file path as argument and outputs the
> contents to a file named html.txt - useful when testing.
>
> /Ronnie
>
>
> > -----Original message-----
> > From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]]
> > Sent: 7 December 2002 17:12
> > To: Lucene Users List
> > Subject: Re: SV: Indexing HTML
> >
> >
> > I have had good experiences with nekoHTML parser.
> >
> > Otis
> >
> > --- Leo Galambos <[EMAIL PROTECTED]> wrote:
> > > > I'm not sure this is a solution to your problem. However, it seems
> > > > that the HTMLParser used by the IndexHTML class has problems
> > > > parsing the document (there is a test class included in the jar):
> > > >
> > > >
> > > > >java -cp C:\projects\lucene\jakarta-lucene\bin\lucene-demos.jar
> > > > org.apache.lucene.demo.html.Test f01529.txt
> > > > Title: Webcz.cz - Power of search
> > > > Parse Aborted: Encountered "\'" at line 106, column 27.
> > > > Was expecting one of:
> > > > <ArgName> ...
> > > > <TagEnd> ...
> > > >
> > > > /Ronnie
> > >
> > >
> > > Hi Ronnie!
> > >
> > > I know about it and the exception is handled well (see log file
> > > below). I have found a better example than 1529, try this:
> > > http://com-os2.ms.mff.cuni.cz/bugs/f00034.txt This file cannot go
> > > through the Lucene HTML parser (I have tried 1.2 and IBM JDK 1.3.1r3).
> > > The file is specific, i.e. it has two titles, two base tags etc.
> > >
> > > I have no debugger here, so I cannot find the line where the bug is.
> > > If you try your magic, please, let me know about the patch. :) THX
> > >
> > > -g-
> > >
> > >
> > >
> > > adding save/d00320/f01516.html
> > > Parse Aborted: Lexical error at line 68, column 11. Encountered:
> > > "\u0178" (376), after : ""
> > > :
> > > adding save/d00320/f01527.html
> > > Parse Aborted: Encountered "=" at line 83, column 48.
> > > Was expecting one of:
> > > <ArgName> ...
> > > <TagEnd> ...
> > >
> > > adding save/d00320/f01528.html
> > >
> > >
> > >
> > > --
> > > To unsubscribe, e-mail:
> > > <mailto:[EMAIL PROTECTED]>
> > > For additional commands, e-mail:
> > > <mailto:[EMAIL PROTECTED]>
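
For readers without the attachments: the idea behind ignoring tags (skip all text inside a given set of tag names, including "script" and "style") can be sketched roughly as below. This is only a minimal illustration and not the attached HtmlDocument class; the class name IgnoreTagsSketch and the method extractText are made up for this example. It is also naive: it assumes well-formed tags and will be confused by a bare "<" inside script text or by HTML comments.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch, not the HtmlDocument class from the attachments.
class IgnoreTagsSketch {

    /** Returns the text of html, skipping everything inside the ignored tag names. */
    public static String extractText(String html, String[] ignoredTags) {
        Set<String> ignored = new HashSet<>();
        for (String t : ignoredTags) {
            ignored.add(t.toLowerCase(Locale.ROOT));
        }
        StringBuilder out = new StringBuilder();
        int depth = 0; // greater than 0 while inside an ignored element
        int i = 0;
        while (i < html.length()) {
            char c = html.charAt(i);
            if (c != '<') {
                if (depth == 0) {
                    out.append(c); // plain text outside ignored elements
                }
                i++;
                continue;
            }
            int close = html.indexOf('>', i);
            if (close < 0) {
                break; // unterminated tag: stop scanning
            }
            String tag = html.substring(i + 1, close).trim();
            boolean isEndTag = tag.startsWith("/");
            if (isEndTag) {
                tag = tag.substring(1).trim();
            }
            // The tag name runs up to the first whitespace or '/'.
            int n = 0;
            while (n < tag.length()
                    && !Character.isWhitespace(tag.charAt(n))
                    && tag.charAt(n) != '/') {
                n++;
            }
            String name = tag.substring(0, n).toLowerCase(Locale.ROOT);
            if (ignored.contains(name)) {
                if (isEndTag) {
                    if (depth > 0) {
                        depth--;
                    }
                } else if (!tag.endsWith("/")) {
                    depth++; // self-closing tags do not open an ignored region
                }
            } else if (depth == 0) {
                out.append(' '); // other tags act as word separators
            }
            i = close + 1;
        }
        return out.toString().replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String html = "<html><body><no-index>International<br>Business<br></no-index>"
                + "<script>var x = 1;</script><p>Real content</p></body></html>";
        System.out.println(extractText(html, new String[] {"no-index", "script", "style"}));
    }
}
```

With the sample input in main, only the paragraph text outside <no-index> and <script> is kept. For real-world pages a proper parser is the safer choice; this hand-rolled scan only shows the bookkeeping.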
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
