Erik Hatcher wrote:
If you look at the contributions/ant area of the Lucene sandbox in CVS you'll see my HtmlDocument class which uses JTidy.
Rather than making up some invalid HTML tag, I'd recommend you separate your navigation section with a <div> or <span> with a special class="navigation" or something like that. Then use JTidy to ignore such tags that have that class. Then you get valid, clean HTML and the ability to filter it for indexing.
Well, I haven't found out how to make JTidy ignore tags that have such a class, so I just
added some code to the getBodyText method of your HtmlDocument class:
    if (child.getNodeName().equals("span")) {
        org.w3c.dom.Attr attribute = ((Element) child).getAttributeNode("class");
        if (attribute != null) {
            if (attribute.getValue().equals("lucene-no-index")) {
                System.out.println("HtmlDocument.getBodyText(): ignore span!");
                break;
            }
        }
        System.out.println("HtmlDocument.getBodyText(): accept span!");
    }
This way, any text within <span class="lucene-no-index">...</span> is ignored.
It's not "perfect", but it's working very well for the moment.
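For anyone who wants to try the idea without JTidy or the sandbox class, here is a minimal, self-contained sketch. It assumes the input is already well-formed XHTML (the real HtmlDocument gets its DOM from JTidy instead), and the names NoIndexFilter, SKIP_CLASS and parse are made up for illustration. Only the span/class check mirrors the snippet above; like that snippet, it compares the whole class attribute, so class="lucene-no-index other" would not match.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical stand-alone class, not part of the Lucene sandbox.
class NoIndexFilter {
    // Class value that marks a span as "do not index" (taken from the mail above).
    public static final String SKIP_CLASS = "lucene-no-index";

    // Collects the text below a node, skipping <span class="lucene-no-index">.
    public static String getBodyText(Node node) {
        StringBuilder buffer = new StringBuilder();
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            switch (child.getNodeType()) {
                case Node.ELEMENT_NODE: {
                    Element element = (Element) child;
                    // getAttribute returns "" when the attribute is absent.
                    if ("span".equals(element.getNodeName())
                            && SKIP_CLASS.equals(element.getAttribute("class"))) {
                        break; // leaves the switch: the span is not descended into
                    }
                    buffer.append(getBodyText(child)).append(' ');
                    break;
                }
                case Node.TEXT_NODE:
                    buffer.append(((Text) child).getData());
                    break;
                default:
                    break; // comments, processing instructions, etc. are ignored
            }
        }
        return buffer.toString();
    }

    // Assumption: input is well-formed XHTML, so the JDK XML parser can read it.
    public static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse input", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<body>"
                + "<span class=\"lucene-no-index\">International Business Science</span>"
                + "<p>Real content</p>"
                + "</body>";
        String text = getBodyText(parse(xhtml).getDocumentElement());
        System.out.println(text.trim()); // prints "Real content"
    }
}
```

Note that the break only skips the span because it sits inside a switch. If the check were placed directly in the for loop, break would end the loop and drop all following siblings as well, so continue would be needed there.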
Two remarks:
1) I noticed that demo/HTMLDocument (or rather demo/html/HTMLParser) sets
   contents = title + body
whereas your class HtmlDocument sets
   contents = body
2) I got two Javadoc warnings because @return was empty within HtmlDocument (for getDocument() and Document()).
Thanks very much for your help
Michael
Erik
On Thursday, January 30, 2003, at 04:56 AM, Michael Wechner wrote:
Hi
I am looking for an HTMLParser which skips text tagged by
<no-index> or something similar. This way I could exclude for
instance a "global navigation section" within the HTML
<no-index>
International<br>
Business<br>
Science<br>
...
</no-index>
It seems that the current demo/HTMLParser (http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi?file=chapter.indexing&toc=faq#q11)
is not capable of doing something like that.
Any pointers are very welcome.
Thanks a lot
Michael
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
