Dear Massimo,

I have a problem with that: by the time I discover a new 'bad site', its URLs already exist in the database, and I can't remove them from the db. How can I detect these URLs before inserting them into the db? Do you have an example regex to detect repeated dirs?
Thanks,
Ferenc

>Hi,
>By removing URL loops with regexes added to regex-urlfilter.txt, my crawler's
>speed increased by more than 50%: from 40 pages/second with 120 threads to
>74 pages/second. Spider traps are really a big problem for all web crawlers.
>For now the only solution is to observe the URLs inserted in the db and
>create the appropriate regex.
>Massimo
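
A minimal sketch of the kind of rule Massimo describes, assuming this is Nutch's regex-urlfilter.txt (patterns use java.util.regex syntax; a line starting with '-' rejects matching URLs, and the first matching rule wins). The rule below rejects any URL whose path repeats the same directory segment three or more times in a row; the RepeatedDirCheck class is only an illustrative standalone test of the same pattern, not part of Nutch or Massimo's setup.

  # regex-urlfilter.txt: reject URLs that repeat a path segment 3+ times in a row
  -.*(/[^/]+)\1{2,}

  // RepeatedDirCheck.java -- hypothetical standalone test of the same pattern
  import java.util.regex.Pattern;

  public class RepeatedDirCheck {
      // A captured path segment followed by two or more immediate repetitions,
      // e.g. /news/news/news/...
      private static final Pattern REPEATED_DIR = Pattern.compile("(/[^/]+)\\1{2,}");

      public static void main(String[] args) {
          String[] urls = {
              "http://example.com/a/b/c/page.html",            // normal URL: accepted
              "http://example.com/news/news/news/news/x.html"  // looping path: rejected
          };
          for (String url : urls) {
              boolean trap = REPEATED_DIR.matcher(url).find();
              System.out.println((trap ? "REJECT " : "ACCEPT ") + url);
          }
      }
  }

Note that this only catches consecutive repeats; traps that interleave other segments (e.g. /a/x/a/y/a/) need a looser pattern, which is why observing the URLs that actually land in your db, as Massimo suggests, is still the way to tune the rule.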
