[ https://issues.apache.org/jira/browse/NUTCH-47?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Jelsma updated NUTCH-47:
-------------------------------

Bulk close of legacy issues:
http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

> Configure host filter to do wildcard prefixes - *.redhat.com
> ------------------------------------------------------------
>
>                 Key: NUTCH-47
>                 URL: https://issues.apache.org/jira/browse/NUTCH-47
>             Project: Nutch
>          Issue Type: Improvement
>          Components: searcher
>         Environment: Linux
>            Reporter: byron miller
>            Priority: Minor
>
> Right now you can configure the max results per host for a query response,
> but that seems limited to exact host matches such as "www.redhat.com".
> In many ways it would be nice to include the capability to match hosts by
> wildcard.
> For example, search for redhat on mozdex.com:
> http://www.mozdex.com/search.jsp?query=redhat
> And you will see:
> www.apac.redhat.com
> www.europe.redhat.com
> www.in.redhat.com
> Could this be fixed so that *.redhat.com is grouped under "find more sources
> under redhat.com" or something like that?
> I may be able to tweak the other processes, but I can envision a problem of
> people creating www1, www2, www3 or using other country codes for the
> same/similar content, filling up pages of SERPs that could otherwise show
> other relevant information.
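
To make the request concrete, here is a minimal sketch of the kind of wildcard host grouping the issue asks for. It is not Nutch code and does not reflect how Nutch's searcher deduplicates hits per host; the class HostWildcard and its methods matches and group are hypothetical names chosen for illustration. It assumes a pattern is either an exact host name or has the form "*.suffix", and that the bare domain (redhat.com) should also fall under "*.redhat.com".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Nutch: groups result hosts by wildcard pattern.
class HostWildcard {

    /** True if host equals the pattern, or matches a pattern of the form "*.suffix". */
    static boolean matches(String pattern, String host) {
        String p = pattern.toLowerCase(Locale.ROOT);
        String h = host.toLowerCase(Locale.ROOT);
        if (p.startsWith("*.")) {
            String dotSuffix = p.substring(1);   // e.g. ".redhat.com"
            String bareDomain = p.substring(2);  // e.g. "redhat.com"
            // Requiring the leading dot keeps "notredhat.com" from matching.
            return h.endsWith(dotSuffix) || h.equals(bareDomain);
        }
        return h.equals(p);
    }

    /** Groups each host under the first matching pattern; unmatched hosts form their own group. */
    static Map<String, List<String>> group(List<String> patterns, List<String> hosts) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (String host : hosts) {
            String key = host;
            for (String pattern : patterns) {
                if (matches(pattern, host)) {
                    key = pattern;
                    break;
                }
            }
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(host);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> hosts = Arrays.asList(
                "www.apac.redhat.com", "www.europe.redhat.com",
                "www.in.redhat.com", "www.mozdex.com");
        // The three redhat.com hosts collapse into one group keyed by the pattern.
        System.out.println(group(Arrays.asList("*.redhat.com"), hosts));
    }
}
```

With grouping like this, a per-host result limit could be applied per group key rather than per exact host name, so that www1/www2/www3 variants of one site count against a single limit.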