Ronnie Kolehmainen wrote:

Michael,

The HtmlDocument class supports ignoring tags, i.e. all text inside the
specified tags is ignored. Look at the setIgnoreTags(String [] ignoredtags)
method. Remember to also include "script" and "style" in this array along
with your custom tag names.
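
As a rough illustration of the idea only (this is not the HtmlDocument class
mentioned above, whose source is not included in this thread), a minimal Java
sketch could drop everything inside a list of ignored tags before extracting
the text. The class name IgnoreTagStripper and the method extractText are
hypothetical, and the regex approach assumes the ignored tags are well-formed
and not nested:

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene or the sandbox.
class IgnoreTagStripper {

    /** Removes each ignored element (tags plus enclosed text), then all remaining markup. */
    static String extractText(String html, String[] ignoredTags) {
        String result = html;
        for (String tag : ignoredTags) {
            String name = Pattern.quote(tag);
            // (?is): case-insensitive, '.' also matches newlines.
            // The reluctant .*? stops at the first closing tag, so nesting is not handled.
            Pattern element = Pattern.compile(
                "(?is)<" + name + "(\\s[^>]*)?>.*?</" + name + "\\s*>");
            result = element.matcher(result).replaceAll(" ");
        }
        // Replace the remaining tags with spaces and collapse whitespace.
        result = result.replaceAll("(?s)<[^>]*>", " ");
        return result.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String html = "<html><head><style>p {color:red}</style></head><body>"
            + "<no-index>International<br>Business<br>Science<br></no-index>"
            + "<p>Indexed text</p><script>var x = 1;</script></body></html>";
        // "script" and "style" are listed next to the custom tag, as advised above.
        System.out.println(extractText(html, new String[] {"no-index", "script", "style"}));
    }
}
```

A real HTML parser will be more robust than regular expressions for malformed
pages; the sketch only shows why "script" and "style" belong in the array.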

I am not able to find the method setIgnoreTags() (I have updated my
jakarta-lucene and jakarta-lucene-sandbox). Or would that have been within the
attachment? I guess the attachments are skipped by the mailing list server.

I am now using Erik's code from the sandbox.

Anyway, thanks a lot for your help.

Michael


Hope this is of some help to you.

See below for the message from an old thread.

/Ronnie



Hi

I am looking for an HTMLParser which skips text tagged by
<no-index> or something similar. This way I could exclude, for
instance, a "global navigation section" within the HTML:

<no-index>
International<br>
Business<br>
Science<br>
...
</no-index>

It seems that the current demo/HTMLParser
(http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi?file=chapter.indexing&toc=faq#q11)
is not capable of doing something like that.

Any pointers are very welcome.

Thanks a lot

Michael


Message sent on Dec 9, 2002:


Hi,

These are the classes I use. I only use them to extract the text, so
they don't have methods for getting the document title and such. However, text
extraction has worked fine for me.

The HtmlParser main method takes a file path as argument and outputs the
contents to a file named html.txt - useful when testing.

/Ronnie



-----Original message-----
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]]
Sent: 7 December 2002 17:12
To: Lucene Users List
Subject: Re: SV: Indexing HTML


I have had good experiences with the NekoHTML parser.

Otis

--- Leo Galambos <[EMAIL PROTECTED]> wrote:

I'm not sure this is a solution to your problem. However, it seems
that the HTMLParser used by the IndexHTML class has problems parsing
the document (there is a test class included in the jar):



java -cp C:\projects\lucene\jakarta-lucene\bin\lucene-demos.jar org.apache.lucene.demo.html.Test f01529.txt
Title: Webcz.cz - Power of search
Parse Aborted: Encountered "\'" at line 106, column 27.
Was expecting one of:
<ArgName> ...
<TagEnd> ...
/Ronnie

Hi Ronnie!

I know about it and the exception is handled well (see log file
below). I have found a better example than 1529, try this:
http://com-os2.ms.mff.cuni.cz/bugs/f00034.txt This file cannot get
through the Lucene HTML parser (I have tried 1.2 and IBM JDK 1.3.1r3).
The file is specific, i.e. it has two titles, two base tags etc.

I have no debugger here, so I cannot find the line where the bug is.
If you try your magic, please let me know about the patch. :) THX

-g-



adding save/d00320/f01516.html
Parse Aborted: Lexical error at line 68, column 11. Encountered: "\u0178" (376), after : ""
:
adding save/d00320/f01527.html
Parse Aborted: Encountered "=" at line 83, column 48.
Was expecting one of:
<ArgName> ...
<TagEnd> ...

adding save/d00320/f01528.html



--
To unsubscribe, e-mail:
<mailto:[EMAIL PROTECTED]>
For additional commands, e-mail:
<mailto:[EMAIL PROTECTED]>







