1) When will you add a way to detect clone sites (sites with the same content at different URLs) and to group results by site like Google does (e.g. [ More results from www.site.com ])?
The 'dedup' command removes pages with identical content. The 'crawl' command uses this automatically. But slightly different pages at different sites are not removed. This is a harder problem, although viable techniques have been published, e.g.:
http://decweb.ethz.ch/WWW6/Technical/Paper205/Paper205.html
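To make the near-duplicate problem concrete, here is a minimal sketch of one commonly published approach: compare pages by the overlap of their word "shingles" (short runs of consecutive words). This is not how the 'dedup' command works; it only illustrates why slightly different pages need a similarity measure rather than an exact content match. The function names, the shingle size of 4 and the 0.9 threshold are all assumptions chosen for illustration.

```python
import re


def shingles(text, k=4):
    """Return the set of k-word shingles (contiguous word windows) in text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Short texts yield a single shingle holding all their words.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def resemblance(a, b, k=4):
    """Jaccard similarity of the two shingle sets, between 0.0 and 1.0."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0  # two empty pages are treated as identical
    return len(sa & sb) / len(sa | sb)


def near_duplicates(pages, threshold=0.9, k=4):
    """pages maps URL -> text. Return URL pairs whose resemblance >= threshold."""
    urls = sorted(pages)
    pairs = []
    for i, u in enumerate(urls):
        for v in urls[i + 1:]:
            if resemblance(pages[u], pages[v], k) >= threshold:
                pairs.append((u, v))
    return pairs
```

The pairwise loop is quadratic in the number of pages, so it would not scale to a real crawl as written; it is only meant to show the similarity measure itself.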
Group-by-site is a feature we'd like to add. This has been in our bug database for nearly a year:
http://sourceforge.net/tracker/index.php?func=detail&aid=752168&group_id=59548&atid=491356
Would you like to help add it?
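For anyone wanting to pick this up, the idea of group-by-site can be sketched roughly as follows. This is only an illustration in Python, not Nutch code: the function name, the `per_site` parameter and the shape of the result are assumptions. It takes hits in ranked order, keeps the first few per host, and reports how many more were held back (the "More results from ..." count).

```python
from urllib.parse import urlsplit


def group_by_site(hits, per_site=2):
    """Group ranked hit URLs by host.

    Returns a list of (host, shown_urls, more_count) tuples, ordered by the
    rank of each host's best hit.
    """
    groups = {}  # dicts preserve insertion order, so the best-ranked host comes first
    for url in hits:
        host = urlsplit(url).hostname or ""
        groups.setdefault(host, []).append(url)
    return [
        (host, urls[:per_site], max(0, len(urls) - per_site))
        for host, urls in groups.items()
    ]
```

Grouping on the full host name is the simplest choice; whether www.site.com and site.com should count as the same site is a separate decision this sketch does not make.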
2) When will you add more search operators, such as:
site: to search within a site, and links: to check how many people are linking to your site?
I'm currently in the process of making it much easier to do such things. This should be done in a month or so.
3) I think it would be great to add a way for the spider to look for zip codes too, so searchers can limit their search to an area, e.g. ZIP:97004 hosting
That would indeed be great, but first one would need to associate a zip code with each page. Do you know how to do this?
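One naive starting point, offered only as a sketch: scan the page text for things that look like US postal addresses and collect the zip codes. The pattern below assumes a two-letter state abbreviation followed by a five-digit code (optionally ZIP+4); the function name and the pattern are illustrative assumptions. It will miss pages that carry no address and will produce false positives (for example "PO 12345"), so it does not really solve the association problem.

```python
import re

# Assumed pattern: two capital letters (a state abbreviation), whitespace,
# then a 5-digit zip code with an optional ZIP+4 suffix.
ZIP_PATTERN = re.compile(r"\b[A-Z]{2}\s+(\d{5})(?:-\d{4})?\b")


def zip_candidates(text):
    """Return the distinct 5-digit zip code candidates found in text, in order."""
    seen = []
    for code in ZIP_PATTERN.findall(text):
        if code not in seen:
            seen.append(code)
    return seen
```

A page tagged this way could then be indexed with its candidates in a separate field, which is what a query such as ZIP:97004 would search.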
Cheers,
Doug
_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
