Configure host filter to do wildcard prefixes - *.redhat.com
------------------------------------------------------------
Key: NUTCH-47
URL: http://issues.apache.org/jira/browse/NUTCH-47
Project: Nutch
Type: Improvement
Components: searcher
Environment: Linux
Reporter: byron miller
Priority: Minor
Right now you can configure the maximum number of results per host in a query
response, but that seems limited to exact host matches such as "www.redhat.com".
It would be useful to be able to match hosts by wildcard as well.
For example, search for "redhat" on mozdex.com:
http://www.mozdex.com/search.jsp?query=redhat
And you will see:
www.apac.redhat.com
www.europe.redhat.com
www.in.redhat.com
Could this be fixed so that all *.redhat.com hosts are grouped under "find more
sources under redhat.com" or something like that?
I may be able to tweak the other processes, but I can envision a problem with
people creating www1, www2, www3 or using other country codes for the
same or similar content, filling up pages of search results that could otherwise
show other relevant information.
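To make the request concrete, here is a minimal sketch in Java of what wildcard host grouping could look like. It is not Nutch code: the class WildcardHostGrouper, its methods, and the host www.example.org are all hypothetical and used only for illustration. It assumes a pattern of the form "*.redhat.com" should match any subdomain of redhat.com, and that the per-host result limit would then apply to the whole group.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Nutch: groups hosts under wildcard patterns.
public class WildcardHostGrouper {

    // Returns true if host matches pattern. "*.redhat.com" matches any
    // subdomain of redhat.com; a pattern without "*." must match exactly.
    static boolean matches(String pattern, String host) {
        String p = pattern.toLowerCase(Locale.ROOT);
        String h = host.toLowerCase(Locale.ROOT);
        if (p.startsWith("*.")) {
            // Keep the leading dot so that "notredhat.com" does not match.
            return h.endsWith(p.substring(1));
        }
        return h.equals(p);
    }

    // Returns the key under which a host's hits are counted: the first
    // matching pattern, or the lowercased host when no pattern matches.
    static String groupKey(List<String> patterns, String host) {
        for (String pattern : patterns) {
            if (matches(pattern, host)) {
                return pattern;
            }
        }
        return host.toLowerCase(Locale.ROOT);
    }

    // Keeps at most maxPerGroup hosts per group, preserving input order.
    static List<String> limitPerGroup(List<String> patterns,
                                      List<String> hosts,
                                      int maxPerGroup) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        List<String> kept = new ArrayList<>();
        for (String host : hosts) {
            String key = groupKey(patterns, host);
            int seen = counts.merge(key, 1, Integer::sum);
            if (seen <= maxPerGroup) {
                kept.add(host);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> patterns = List.of("*.redhat.com");
        List<String> hosts = List.of(
            "www.apac.redhat.com",
            "www.europe.redhat.com",
            "www.in.redhat.com",
            "www.example.org"); // illustrative host, not from the report
        // prints [www.apac.redhat.com, www.example.org]
        System.out.println(limitPerGroup(patterns, hosts, 1));
    }
}
```

With a limit of one result per group, the three redhat.com subdomains collapse to a single entry, and the remaining ones could be offered behind a "more from redhat.com" link. Note that in this sketch the bare host "redhat.com" does not match "*.redhat.com"; whether it should is a design choice.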