RE: or

Ronnie Kolehmainen Thu, 30 Jan 2003 05:22:36 -0800

..wonder what happened with the attachements...here they go again.


> -----Ursprungligt meddelande-----
> Fran: Ronnie Kolehmainen [mailto:[EMAIL PROTECTED]]
> Skickat: den 30 januari 2003 14:15
> Till: [EMAIL PROTECTED]
> Amne: Re: <no-index> or <index>
>
>
> Michael,
>
> the HtmlDocument class supports ignoring tags, ie all text inside
> specified
> tag names is ignored. Look at the setIgnoreTags(String [] ignoredtags)
> method. Remember to also include "script" and "style" in this array along
> with your custom tag names.
>
> Hope this is any help for you.
>
> See below for the message from an old thread.
>
> /Ronnie
>
>
> >Hi
> >
> >I am looking for an HTMLParser which skips text tagged by
> >
> ><no-index>  or something similar. This way I could exclude for
> >instance a "global navigation section" within the HTML
> >
> ><no-index>
> >International<br>
> <Business<br>
> >Science<br>
> >...
> ></no-index>
> <
> >It seems that the current demo/HTMLParser
> >(http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi?file=ch
> apter.inde
> xing&toc=faq#q11)
> >is not capable of doing something like that.
> >
> >Any pointers are very welcome.
> >
> >Thanks a lot
> >
> >Michael
> >
>
> Message sent on dec 9 2002:
>
>
> HI,
>
> these are the classes i use. I only use them to extract the text stuff, so
> they don't have methods for getting document title and such. However text
> extraction has worked fine for me.
>
> The HtmlParser main method takes a file path as argument and outputs the
> contents to a file named html.txt - useful when testing.
>
> /Ronnie
>
>
> > -----Ursprungligt meddelande-----
> > Fran: Otis Gospodnetic [mailto:[EMAIL PROTECTED]]
> > Skickat: den 7 december 2002 17:12
> > Till: Lucene Users List
> > Amne: Re: SV: Indexing HTML
> >
> >
> > I have had good experiences with nekoHTML parser.
> >
> > Otis
> >
> > --- Leo Galambos <[EMAIL PROTECTED]> wrote:
> > > > I'm not sure this is a solution to your problem. However, it seems
> > > that the
> > > > HTMLParser used by the IndexHTML class has problems parsing the
> > > document
> > > > (there is a test class included in the jar):
> > > >
> > > >
> > > > >java -cp C:\projects\lucene\jakarta-lucene\bin\lucene-demos.jar
> > > > org.apache.lucene.demo.html.Test f01529.txt
> > > > Title: Webcz.cz - Power of search
> > > > Parse Aborted: Encountered "\'" at line 106, column 27.
> > > > Was expecting one of:
> > > >     <ArgName> ...
> > > >     <TagEnd> ...
> > > > /Ronnie
> > >
> > > Hi Ronnie!
> > >
> > > I know about it and the exception is handled well (see log file
> > > below). I
> > > have found a better example than 1529, try this:
> > > http://com-os2.ms.mff.cuni.cz/bugs/f00034.txt This file cannot go
> > > throught
> > > Lucene HTML parser (I have tried 1.2 and IBM JDK 1.3.1r3). The file
> > > is
> > > specific, i.e. it has two titles, two base tags etc.
> > >
> > > I have not debugger here, so I cannot find the line where is the bug.
> > > If
> > > you try your magic, please, let me know about the patch. :) THX
> > >
> > > -g-
> > >
> > >
> > >
> > > adding save/d00320/f01516.html
> > > Parse Aborted: Lexical error at line 68, column 11.  Encountered:
> > > "\u0178"
> > > (376), after : ""
> > > :
> > > adding save/d00320/f01527.html
> > > Parse Aborted: Encountered "=" at line 83, column 48.
> > > Was expecting one of:
> > >     <ArgName> ...
> > >     <TagEnd> ...
> > >
> > > adding save/d00320/f01528.html
> > >
> > >
> > >
> > > --
> > > To unsubscribe, e-mail:
> > > <mailto:[EMAIL PROTECTED]>
> > > For additional commands, e-mail:
> > > <mailto:[EMAIL PROTECTED]>
> > >
> >
> >
> > __________________________________________________
> > Do you Yahoo!?
> > Yahoo! Mail Plus - Powerful. Affordable. Sign up now.
> > http://mailplus.yahoo.com
> >
> > --
> > To unsubscribe, e-mail:
> > <mailto:[EMAIL PROTECTED]>
> > For additional commands, e-mail:
> > <mailto:[EMAIL PROTECTED]>
> >
> >
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

RE: or

Reply via email to