Dear Massimo,

I have a problem with that: by the time I discover a new 'bad site', its URLs already exist in the database, and I can't remove them from the db. How can I detect these URLs before inserting them into the db? Do you have an example regex to detect repeated dirs?
Thanks,
Ferenc

>Hi,
>By removing URL loops with regexes added to regex-urlfilter.txt, my crawler's
>speed increased by more than 50%: from 40 pages/second with 120 threads to
>74 pages/second. Spider traps are really a big problem for all web crawlers.
>For now the only solution is to observe the URLs inserted in the db and
>create the appropriate regex.
>Massimo
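
A minimal sketch of the kind of rule Massimo describes, assuming this is Nutch's regex-urlfilter.txt (patterns use java.util.regex syntax; a line starting with '-' rejects matching URLs, and the first matching rule wins). The rule below rejects any URL whose path repeats the same directory segment three or more times in a row; the RepeatedDirCheck class is only an illustrative standalone test of the same pattern, not part of Nutch or Massimo's setup.

  # regex-urlfilter.txt: reject URLs that repeat a path segment 3+ times in a row
  -.*(/[^/]+)\1{2,}

  // RepeatedDirCheck.java -- hypothetical standalone test of the same pattern
  import java.util.regex.Pattern;

  public class RepeatedDirCheck {
      // A captured path segment followed by two or more immediate repetitions,
      // e.g. /news/news/news/...
      private static final Pattern REPEATED_DIR = Pattern.compile("(/[^/]+)\\1{2,}");

      public static void main(String[] args) {
          String[] urls = {
              "http://example.com/a/b/c/page.html",            // normal URL: accepted
              "http://example.com/news/news/news/news/x.html"  // looping path: rejected
          };
          for (String url : urls) {
              boolean trap = REPEATED_DIR.matcher(url).find();
              System.out.println((trap ? "REJECT " : "ACCEPT ") + url);
          }
      }
  }

Note that this only catches consecutive repeats; traps that interleave other segments (e.g. /a/x/a/y/a/) need a looser pattern, which is why observing the URLs that actually land in your db, as Massimo suggests, is still the way to tune the rule.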
