Hi,

We use Lucene to index and search HTML documents. We extract all the text content from the HTML documents and index it. While searching, we found several instances where the matched search terms are in the navbar section. Because the navbar appears on nearly every page, almost all pages of that site end up in the search results.
I was wondering whether there are any documented methods or heuristics for excluding certain portions of an HTML document, such as navbars and footers, from the search. Technically it is all HTML, so I assume there is no straightforward way to do that. From what I have observed, search engines like Google do not do anything like this and end up searching the navbar and footer portions of the page too. I also understand that even if some heuristics exist, they are unlikely to work for all HTML pages.

Since navbar items are typically links, is it feasible to attach a weight to each text fragment as it is extracted (e.g., give a fragment that is part of a link lower priority than other fragments) and index accordingly?

Any pointers or clues? Has any research been done on this subject?

Thanks,
Ramesh
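
P.S. To make the link-weighting idea concrete, here is a rough sketch of the extraction step only, not of the indexing. It is written in Python with the standard-library `html.parser` purely for illustration; the names `LinkAwareTextExtractor` and `split_text` are made up for this example, and it assumes that "text inside an `<a>` element" is a good enough stand-in for navbar text, which will not hold for every page.

```python
from html.parser import HTMLParser


class LinkAwareTextExtractor(HTMLParser):
    """Collects text inside <a> elements separately from all other text."""

    # Content of these elements is never visible text, so it is dropped.
    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.link_depth = 0   # > 0 while inside at least one <a>
        self.skip_depth = 0   # > 0 while inside <script> or <style>
        self.body_parts = []  # text fragments outside links
        self.link_parts = []  # text fragments inside links

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.link_depth += 1
        elif tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.link_depth > 0:
            self.link_depth -= 1
        elif tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text or self.skip_depth > 0:
            return
        # Route the fragment by whether it sits inside a link.
        if self.link_depth > 0:
            self.link_parts.append(text)
        else:
            self.body_parts.append(text)


def split_text(html):
    """Return (body_text, link_text) for one HTML document."""
    parser = LinkAwareTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.body_parts), " ".join(parser.link_parts)
```

The two strings could then go into separate index fields (say, hypothetical `body` and `links` fields) so that matches in link text can be weighted lower, or ignored, at query time. The obvious drawback is that links inside the main content are demoted along with the navbar links.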
