[ https://issues.apache.org/jira/browse/NUTCH-574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12541565 ]
Doğacan Güney commented on NUTCH-574:
-------------------------------------
Dennis, it seems you forgot to add the Java files for the index-anchor
plugin :) Also, I am not sure why you are modifying build.xml...
Btw, it may be a good time to remove "throws IOException" from
inlinks.getAnchors, since no path in getAnchors throws an IOException.
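
To illustrate the "throws IOException" point, here is a minimal sketch. It is
not the actual Nutch class: InlinksSketch and its add method are hypothetical
stand-ins, assuming only that getAnchors walks an in-memory collection and
does no I/O. In that case the checked exception can be dropped, and callers
no longer need a try/catch around the call.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an inlink container; NOT the Nutch Inlinks class.
class InlinksSketch {
    private final List<String> anchors = new ArrayList<>();

    // Hypothetical helper for the example: records one anchor text.
    void add(String anchor) {
        anchors.add(anchor);
    }

    // No "throws IOException": the method only reads an in-memory list,
    // so no code path can raise that checked exception.
    String[] getAnchors() {
        List<String> results = new ArrayList<>();
        for (String anchor : anchors) {
            // Skip missing or empty anchor text.
            if (anchor != null && !anchor.isEmpty()) {
                results.add(anchor);
            }
        }
        return results.toArray(new String[0]);
    }
}
```

Removing a checked exception from a signature will make any existing caller
that catches IOException only for this call fail to compile, so those callers
would need to be cleaned up in the same patch.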
> Including inlink anchor text in index can create irrelevant search results.
> ---------------------------------------------------------------------------
>
> Key: NUTCH-574
> URL: https://issues.apache.org/jira/browse/NUTCH-574
> Project: Nutch
> Issue Type: Bug
> Components: indexer
> Environment: All, basic indexing filter
> Reporter: Dennis Kubes
> Assignee: Dennis Kubes
> Fix For: 1.0.0
>
> Attachments: NUTCH-574-1.patch, NUTCH-574-2.patch
>
>
> Currently the basic indexing filter includes inbound anchor text for a given
> URL in the index. This sometimes allows pages to show up in search results
> where they may not be relevant. An example is a search for "dallas hotels"
> in our production index (www.visvo.com). Google shows up first in this
> example, although no text on the Google home page matches either "dallas"
> or "hotels". What happens is that inlinks into Google contain the words
> "dallas" and "hotels", and these get included in the index for google.com.
> Because Google has a very high boost due to its inlinks, it shows up first
> for these search terms. I propose we add an option to allow/prevent inlink
> anchor text from being included in the index, and set the default for this
> option to NOT include inbound link anchor text.
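
The option proposed in the description could be sketched roughly as follows.
This is only an illustration under stated assumptions, not the code from the
attached patches: the class name, the fieldsToIndex method, the plain Map
used as configuration, and the property name "indexer.anchor.include" are all
hypothetical. The point it shows is the default: anchor text is left out
unless the option is explicitly switched on.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a config-gated anchor field; NOT the Nutch filter.
class AnchorOptionSketch {
    // Assumed property name, chosen for this example only.
    static final String INCLUDE_ANCHORS_KEY = "indexer.anchor.include";

    // Returns "field:value" strings that would be added to the index document.
    static List<String> fieldsToIndex(String content, String[] anchors,
                                      Map<String, String> conf) {
        // Default is "false": inbound anchor text is NOT indexed unless enabled.
        boolean includeAnchors =
            Boolean.parseBoolean(conf.getOrDefault(INCLUDE_ANCHORS_KEY, "false"));

        List<String> fields = new ArrayList<>();
        fields.add("content:" + content);
        if (includeAnchors) {
            for (String anchor : anchors) {
                fields.add("anchor:" + anchor);
            }
        }
        return fields;
    }
}
```

With an empty configuration, a page whose only link to "dallas hotels" is
inbound anchor text gets just its content field, so it would no longer match
that query on anchor text alone.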