Dear Massimo,

I have a problem with this: by the time I discover a new 'bad site', its URLs already exist in the database, and I can't remove them from the db. How can I detect these URLs before they are inserted into the db? Do you have any example regex to detect repeated dirs?
Thanks,
Ferenc

> Hi,
> By removing URL loops with a regex added to regex-urlfilter.txt, my crawler's
> speed increased by 50%: from 40 pages/second with 120 threads to 74 pages/second.
> Spider traps are really a big problem for all web crawlers. For now, the only
> solution is to observe the URLs inserted in the db and create the appropriate
> regex.
> Massimo
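
For what it's worth, here is a minimal sketch of the kind of rule Massimo describes (the exact pattern he used is not quoted above, so this is an assumption). In regex-urlfilter.txt an exclusion line such as

    -.*(/[^/]+)/[^/]+\1/[^/]+\1/

skips any URL in which the same path segment repeats three or more times, which is the usual signature of a spider-trap loop (the leading '-' marks the rule as an exclusion). The small Java test below shows the same idea with java.util.regex; the class name and the URLs are only for illustration:

    import java.util.regex.Pattern;

    public class RepeatedDirCheck {
        // A path segment captured in group 1 must reappear (via the
        // back-reference \1) two more times, i.e. the same directory
        // occurs at least three times in the URL path.
        private static final Pattern REPEATED_DIR =
            Pattern.compile("(/[^/]+)/[^/]+\\1/[^/]+\\1/");

        public static boolean looksLikeTrap(String url) {
            return REPEATED_DIR.matcher(url).find();
        }

        public static void main(String[] args) {
            // Hypothetical URLs, for illustration only.
            System.out.println(looksLikeTrap("http://example.com/a/b/a/b/a/b/index.html")); // true
            System.out.println(looksLikeTrap("http://example.com/a/b/index.html"));          // false
        }
    }

Checking URLs this way in the URL filter means a looping URL is rejected before it ever reaches the db, rather than having to be cleaned out afterwards.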
