Heuristics on searching HTML documents?

Mailing Lists Account Mon, 30 Dec 2002 03:00:06 -0800

Hi,

We use Lucene to index and search HTML documents. We extract
all text content from the HTML documents and index it.
While searching, we found in several instances that the matched
search terms are in the navbar section. Since the navbar appears on
almost every page of the site, almost all of its pages end up in the
search results.


I was wondering if there are any documented methods/heuristics to avoid
searching certain portions of an HTML document, such as navbars and footers.

Technically, it is all HTML, so I assume that there is no straightforward
method to do that. I observed that search engines like Google do not do
anything like the above and end up searching the navbar and footer
portions of the page too.

I also understand that even if there are some heuristics, they are not
likely to work with all HTML pages.

Since navbar items are typically links, is it feasible to attach a
weight to different fragments of the text as it is extracted (e.g., if a
text fragment is part of a link, give it a lower priority than other
fragments) and index accordingly?
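
To make the idea concrete, here is a rough sketch in Python (not Lucene
code) of what I have in mind. It splits the extracted text into link text
and everything else, then scores a term with a lower weight for matches
inside links. The names LinkAwareExtractor, split_fields, score and the
weights in FIELD_WEIGHTS are all made up for illustration, and the scoring
is a plain term count, not what Lucene actually does.

```python
from html.parser import HTMLParser


class LinkAwareExtractor(HTMLParser):
    """Collect text into two buckets: inside <a> elements and everywhere else."""

    def __init__(self):
        super().__init__()
        self.link_depth = 0   # > 0 while inside an <a> element
        self.skip_depth = 0   # > 0 while inside <script> or <style>
        self.link_text = []
        self.body_text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.link_depth += 1
        elif tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.link_depth > 0:
            self.link_depth -= 1
        elif tag in ("script", "style") and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text or self.skip_depth:
            return  # ignore whitespace and script/style contents
        bucket = self.link_text if self.link_depth else self.body_text
        bucket.append(text)


def split_fields(html):
    """Return the page text as two strings: 'body' and 'links'."""
    parser = LinkAwareExtractor()
    parser.feed(html)
    parser.close()
    return {
        "body": " ".join(parser.body_text),
        "links": " ".join(parser.link_text),
    }


# Hypothetical weights: a match inside a link counts for much less.
FIELD_WEIGHTS = {"body": 1.0, "links": 0.2}


def score(fields, term):
    """Weighted count of exact (case-insensitive) token matches per field."""
    term = term.lower()
    return sum(
        FIELD_WEIGHTS[name] * text.lower().split().count(term)
        for name, text in fields.items()
    )
```

I assume the two strings could be indexed as separate fields and the link
field given less importance at query time, but I have not tried this.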

Any pointers/clues? Has any research been done on this subject?

thanks
Ramesh




