Configure host filter to do wildcard prefixes - *.redhat.com
------------------------------------------------------------

         Key: NUTCH-47
         URL: http://issues.apache.org/jira/browse/NUTCH-47
     Project: Nutch
        Type: Improvement
  Components: searcher  
 Environment: Linux
    Reporter: byron miller
    Priority: Minor


Right now you can configure the max results per host for query response, but 
that seems limited to exact host matches such as "www.redhat.com".

In many ways it would be nice to include the capability to match hosts by 
wildcard.

For example, search for redhat on mozdex.com:

http://www.mozdex.com/search.jsp?query=redhat

And you will see:

www.apac.redhat.com 
www.europe.redhat.com 
www.in.redhat.com 

Could this be changed so that all *.redhat.com hosts are grouped under a 
"find more sources under redhat.com" link, or something like that?

I may be able to tweak the other processes, but I can envision a problem: 
people creating www1, www2, www3 hosts, or using other country codes for the 
same or similar content, would fill up pages of SERPs that could otherwise 
show other relevant information.
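
As a rough illustration of the requested behaviour, here is a minimal sketch in 
Python. It is not Nutch code: the function names, the pattern syntax (a leading 
"*." meaning "this domain and any subdomain"), and the host www.example.org are 
all assumptions made for the example. It caps the number of hits shown per 
wildcard group and collects the rest for a "more from this domain" link.

```python
from collections import defaultdict


def matches_wildcard(host, pattern):
    """Return True if host matches pattern.

    Assumption: a leading '*.' matches the bare domain and any subdomain;
    any other pattern must match the host exactly.
    """
    host = host.lower().rstrip(".")
    pattern = pattern.lower()
    if pattern.startswith("*."):
        suffix = pattern[2:]
        # Require a dot boundary so 'notredhat.com' does not match.
        return host == suffix or host.endswith("." + suffix)
    return host == pattern


def group_key(host, patterns):
    """Return the first matching pattern, or the host itself if none match."""
    for pattern in patterns:
        if matches_wildcard(host, pattern):
            return pattern
    return host


def limit_hits_per_group(hosts, patterns, max_per_group):
    """Split ranked hosts into those shown and those held back per group."""
    shown = []
    overflow = defaultdict(list)
    counts = defaultdict(int)
    for host in hosts:
        key = group_key(host, patterns)
        if counts[key] < max_per_group:
            counts[key] += 1
            shown.append(host)
        else:
            overflow[key].append(host)
    return shown, dict(overflow)


if __name__ == "__main__":
    ranked = [
        "www.apac.redhat.com",
        "www.europe.redhat.com",
        "www.in.redhat.com",
        "www.example.org",  # hypothetical unrelated host
    ]
    shown, more = limit_hits_per_group(ranked, ["*.redhat.com"], 1)
    print(shown)
    print(more)
```

With a limit of one hit per group, only the first redhat.com host stays on the 
results page and the other two would sit behind the "find more sources under 
redhat.com" link. Hosts such as www1, www2 and www3 under one domain would 
collapse in the same way.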

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira
