[ http://issues.apache.org/jira/browse/NUTCH-47?page=comments#action_63313 ]

Doug Cutting commented on NUTCH-47:
-----------------------------------
Google will also show you all of these sites when searching for "redhat". But perhaps we need a site-normalizer plugin. This could, e.g., return "redhat.com" for all of the above sites and would be used to determine what site each page is indexed as.

> Configure host filter to do wildcard prefixes - *.redhat.com
> ------------------------------------------------------------
>
>          Key: NUTCH-47
>          URL: http://issues.apache.org/jira/browse/NUTCH-47
>      Project: Nutch
>         Type: Improvement
>   Components: searcher
>  Environment: Linux
>     Reporter: byron miller
>     Priority: Minor
>
> Right now you can configure the max results per host for query response, but
> that seems limited to exact host matches such as "www.redhat.com".
> In many ways it would be nice to include the capability to match hosts by
> wildcard.
> For example, search for redhat on mozdex.com:
> http://www.mozdex.com/search.jsp?query=redhat
> And you will see:
> www.apac.redhat.com
> www.europe.redhat.com
> www.in.redhat.com
> Could this be fixed so that *.redhat.com is under "find more sources under
> redhat.com" or something like that?
> I may be able to tweak the other processes, but I can envision a problem of
> people creating www1 www2 www3 or using other country codes for the
> same/similar content filling up pages of SERPs for what could be other
> relevant information.
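
To make the site-normalizer idea from the comment concrete, here is a minimal sketch in Java. It is not Nutch code and does not use any Nutch plugin interface; the class name `SiteNormalizerSketch` and the method `normalizeSite` are hypothetical. It assumes the "site" is simply the last two labels of the host name, which works for the redhat.com hosts listed in the issue but would be wrong for suffixes such as "co.uk" and for raw IP addresses. A real plugin would probably need a public-suffix list or configurable rules.

```java
import java.util.Locale;

// Hypothetical sketch only; not part of Nutch and not tied to its plugin API.
final class SiteNormalizerSketch {

    // Returns the last two dot-separated labels of a host name, lower-cased.
    // Assumption: the site is always the last two labels. This is wrong for
    // multi-label suffixes such as "co.uk" and for IP addresses.
    static String normalizeSite(String host) {
        String h = host.toLowerCase(Locale.ROOT);
        // Drop a single trailing dot from a fully qualified name.
        if (h.endsWith(".")) {
            h = h.substring(0, h.length() - 1);
        }
        int last = h.lastIndexOf('.');
        if (last < 0) {
            return h; // single label, e.g. "localhost"
        }
        int prev = h.lastIndexOf('.', last - 1);
        // With only one dot the host is already a two-label site.
        return prev < 0 ? h : h.substring(prev + 1);
    }

    public static void main(String[] args) {
        // Hosts taken from the issue description.
        String[] hosts = {
            "www.apac.redhat.com",
            "www.europe.redhat.com",
            "www.in.redhat.com"
        };
        for (String host : hosts) {
            System.out.println(host + " -> " + normalizeSite(host));
        }
    }
}
```

If the normalized value were used as the key for the per-host result limit, the three hosts from the issue would fall under one site and be collapsed together, which is the behavior the reporter asks for.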
