I suppose you're talking about content indexed from web
crawling. It's a messy problem. Extraneous junk needs to be filtered
out before indexing, so some form of header/footer/sidebar detection
and exclusion definitely makes searching crawled pages much better.
When possible, index clean content to begin with. In the case of
Wikipedia, you can get full dumps (dumps.wikimedia.org) containing
just the article content, without the surrounding templates.
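For pages you can't get clean dumps of, here's a rough sketch of the
strip-then-index idea. This is just my illustration, not anything
built into Lucene: it assumes jsoup for the HTML cleanup and the
Lucene 2.x Field API, and the CSS selectors are placeholders you'd
tune for whatever markup identifies the chrome on your pages.

import org.jsoup.Jsoup;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class CleanPageIndexer {
    // writer is an already-opened Lucene IndexWriter (assumed)
    static void indexPage(IndexWriter writer, String url, String html)
            throws Exception {
        org.jsoup.nodes.Document page = Jsoup.parse(html);
        // Remove template chrome (nav, sidebars, footers) before
        // extracting text; these selectors are guesses per site.
        page.select("#p-lang, .navbox, #footer").remove();
        // Keep only the main article body (id is an assumption).
        String body = page.select("#bodyContent").text();

        Document doc = new Document();
        doc.add(new Field("url", url,
                Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("contents", body,
                Field.Store.NO, Field.Index.ANALYZED));
        writer.addDocument(doc);
    }
}

With something like that, the sidebar language links never make it
into the "contents" field, so they can't match queries.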
Erik
On May 2, 2009, at 6:48 AM, Aditya wrote:
Hi,
New to this group.
Question:
Generally, sites like Wikipedia have a template that every page
follows. These templates contain words that occur on every
page.
For example, the Wikipedia template has the list of languages in the
left panel. These words get indexed every time, since they are not
(and cannot be) stop words.
If a user searches for "Galego", for example, every Wikipedia page
will be in the search results, which is wrong, since not every
Wikipedia page is actually about "Galego".
Any thoughts on how to solve this problem?
Best Regards,
Aditya