[ 
https://issues.apache.org/jira/browse/NUTCH-574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12540875
 ] 

Dennis Kubes commented on NUTCH-574:
------------------------------------

I am OK with ignoring it if it doesn't appear in the page text.  My goal here is 
to keep irrelevant sites from showing up in search results.  I will create a 
patch that handles this by checking the parse text and ignoring the anchor text 
if it doesn't appear in the page.  Sound good?
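
For illustration only, here is a minimal sketch of the kind of check described 
above. It is plain Java and does not use the Nutch API. The class and method 
names (AnchorFilter, keepRelevantAnchors) are hypothetical, and it assumes that 
"appears in the page" means every term of the anchor occurs somewhere in the 
lowercased parse text. The actual patch may define the match differently.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Nutch: keeps only the inlink anchors
// whose terms all occur in the page's own parse text.
class AnchorFilter {

    // Lowercase the text and split it on runs of non-letter, non-digit characters.
    static Set<String> terms(String text) {
        Set<String> out = new HashSet<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    // Assumption: an anchor is relevant only if every one of its terms
    // appears in the parse text. Anchors with no terms are dropped.
    static List<String> keepRelevantAnchors(String[] anchors, String parseText) {
        Set<String> pageTerms = terms(parseText);
        List<String> kept = new ArrayList<>();
        for (String anchor : anchors) {
            Set<String> anchorTerms = terms(anchor);
            if (!anchorTerms.isEmpty() && pageTerms.containsAll(anchorTerms)) {
                kept.add(anchor);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String parseText = "Google Search I'm Feeling Lucky";
        String[] anchors = {"dallas hotels", "Google", "google search"};
        // "dallas hotels" is dropped because neither term is on the page.
        System.out.println(keepRelevantAnchors(anchors, parseText));
    }
}
```

With a check like this, the "dallas hotels" anchors pointing at the Google home 
page would not be indexed for that page, while anchors whose words do occur on 
the page would still be indexed.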

> Including inlink anchor text in index can create irrelevant search results.
> ---------------------------------------------------------------------------
>
>                 Key: NUTCH-574
>                 URL: https://issues.apache.org/jira/browse/NUTCH-574
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>         Environment: All, basic indexing filter
>            Reporter: Dennis Kubes
>            Assignee: Dennis Kubes
>             Fix For: 1.0.0
>
>         Attachments: NUTCH-574-1.patch
>
>
> Currently the basic indexing filter includes inbound anchor text for a given 
> URL in the index.  This sometimes allows pages to show up in search results 
> where they may not be relevant.  An example of this is a search for "dallas 
> hotels" in our production index (www.visvo.com).  Google would show up first 
> in this example although there is no text matching either dallas or hotels on 
> the google home page.  What is happening here is there are inlinks into 
> google with the words dallas and hotels which get included in the index for 
> google.com and because google would have a very high boost due to inlinks, 
> google shows up first for these search terms.  I propose we add an option to 
> allow/prevent inlink anchor text from being included in the index and set the 
> default for this option to NOT include inbound link anchor text.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
