Dear Massimo,

I have a problem with this: by the time I discover a new 'bad site', its URLs already exist in the database, and I can't remove them from the db. How can I detect these URLs before they are inserted into the db? Do you have any example regex to detect repeated dirs?
Thanks,
Ferenc

> Hi,
> By removing URL loops with a regex added to regex-urlfilter.txt, my crawler's
> speed increased by 50%: from 40 pages/second with 120 threads to 74 pages/second.
> Spider traps are really a big problem for all web crawlers. For now, the only
> solution is to observe the URLs inserted in the db and create the appropriate
> regex.
> Massimo
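
For what it's worth, here is a minimal sketch of the kind of rule Massimo describes (the exact pattern he used is not quoted above, so this is an assumption). In regex-urlfilter.txt an exclusion line such as

    -.*(/[^/]+)/[^/]+\1/[^/]+\1/

skips any URL in which the same path segment repeats three or more times, which is the usual signature of a spider-trap loop (the leading '-' marks the rule as an exclusion). The small Java test below shows the same idea with java.util.regex; the class name and the URLs are only for illustration:

    import java.util.regex.Pattern;

    public class RepeatedDirCheck {
        // A path segment captured in group 1 must reappear (via the
        // back-reference \1) two more times, i.e. the same directory
        // occurs at least three times in the URL path.
        private static final Pattern REPEATED_DIR =
            Pattern.compile("(/[^/]+)/[^/]+\\1/[^/]+\\1/");

        public static boolean looksLikeTrap(String url) {
            return REPEATED_DIR.matcher(url).find();
        }

        public static void main(String[] args) {
            // Hypothetical URLs, for illustration only.
            System.out.println(looksLikeTrap("http://example.com/a/b/a/b/a/b/index.html")); // true
            System.out.println(looksLikeTrap("http://example.com/a/b/index.html"));          // false
        }
    }

Checking URLs this way in the URL filter means a looping URL is rejected before it ever reaches the db, rather than having to be cleaned out afterwards.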
