Re: How does generate work ?

2009-12-03 Thread Andrzej Bialecki
MilleBii wrote: Oops, continuing the previous mail. So I wonder if there would be a better 'generate' algorithm which would maintain a constant ratio of hosts per 100 URLs ... Below a certain threshold it stops, or better, starts including URLs with lower scores. That's exactly how the max.urls.per.host ...
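For reference, the per-host cap being discussed is normally set in conf/nutch-site.xml. A minimal sketch, assuming the Nutch 1.0 property name generate.max.per.host (-1 means unlimited):

  <property>
    <name>generate.max.per.host</name>
    <!-- allow at most 600 URLs from any single host in one generate round -->
    <value>600</value>
  </property>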

FATAL crawl.LinkDb - LinkDb: java.io.IOException: lock file crawl/linkdb/.locked already exists

2009-12-03 Thread BELLINI ADAM
Hi, I'm performing a RECRAWL using the recrawl.sh script, and I got this error when inverting the links: FATAL crawl.LinkDb - LinkDb: java.io.IOException: lock file crawl/linkdb/.locked already exists echo "- Invert Links (Step 4 of $steps) -" $NUTCH_HOME/bin/nutch invertlinks $crawl/ ...
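This usually means an earlier invertlinks run was interrupted and left its lock behind. A minimal recovery sketch, assuming the lock really is stale (no Nutch job is still writing to the linkdb) and that the linkdb lives at crawl/linkdb:

  # confirm no other Nutch/Hadoop job is using the linkdb, then remove the stale lock
  rm crawl/linkdb/.locked
  # re-run the failed step
  $NUTCH_HOME/bin/nutch invertlinks crawl/linkdb -dir crawl/segments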

Re: How does generate work ?

2009-12-03 Thread MilleBii
Hum... I use the max URLs option and set it to 600... because in the worst case you have 6 s (measured in the logs) between URLs of the same host: so 6 x 600 = 3600 s = 1 hour. In the worst case the long tail shouldn't last longer than 1 hour... Unfortunately that is not what I see. I also tried the "by.ip" opti...
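The 6 s gap corresponds to the per-host politeness delay, and the "by.ip" variant groups the cap by IP rather than hostname. A sketch of the relevant nutch-site.xml knobs, assuming the standard Nutch 1.0 property names:

  <property>
    <name>fetcher.server.delay</name>
    <!-- seconds to wait between successive requests to the same host -->
    <value>5.0</value>
  </property>
  <property>
    <name>generate.max.per.host.by.ip</name>
    <!-- count the per-host limit per resolved IP address instead of per hostname -->
    <value>true</value>
  </property>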

Re: How does generate work ?

2009-12-03 Thread Julien Nioche
> Hum... I use the max URLs option and set it to 600... because in the worst case you have 6 s (measured in the logs) between URLs of the same host: so 6 x 600 = 3600 s = 1 hour. In the worst case the long tail shouldn't last longer than 1 hour... Unfortunately that is not what I see. That's assuming that a...

Re: How does generate work ?

2009-12-03 Thread MilleBii
Well, I dramatically increased the number of threads; empirically the best I have found is around 1200 threads. This actually means 2400, because I have two mappers running at once (looking at the Hadoop logs). The bandwidth still takes an 'L' shape... although a lot higher and a bit thicker. On the run...
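For context, the fetcher thread count is a per-map-task setting, so the effective total is multiplied by the number of concurrent map tasks. A sketch of the nutch-site.xml entry, assuming the standard fetcher.threads.fetch property:

  <property>
    <name>fetcher.threads.fetch</name>
    <!-- fetcher threads per map task; with 2 concurrent mappers this gives 2400 in total -->
    <value>1200</value>
  </property>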

db.fetch.interval.default

2009-12-03 Thread BELLINI ADAM
Hi, I'm crawling my intranet, and I have set db.fetch.interval.default to 5 hours, but it seems that it doesn't work correctly: db.fetch.interval.default 18000 The number of seconds between re-fetches of a page (5 hours). The first crawl when the crawl directory $cra...
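The flattened property above corresponds to a nutch-site.xml override along these lines (18000 seconds = 5 hours):

  <property>
    <name>db.fetch.interval.default</name>
    <value>18000</value>
    <description>The number of seconds between re-fetches of a page (5 hours).</description>
  </property>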

Re: db.fetch.interval.default

2009-12-03 Thread reinhard schwab
Hi, I have identified one source of such a problem and opened an issue in Jira. You can apply this patch and check whether it solves your problem: https://issues.apache.org/jira/browse/NUTCH-774 BTW, you can also check your crawldb for such items - the retry interval is set to 0. Just dump the cr...
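The dump-and-check step can be done with the readdb tool. A sketch, with placeholder paths (crawl/crawldb and crawldump):

  # dump the crawldb to plain text
  $NUTCH_HOME/bin/nutch readdb crawl/crawldb -dump crawldump
  # look for entries whose retry interval is not the configured 18000 seconds
  grep "Retry interval" crawldump/part-* | grep -v 18000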

RE: db.fetch.interval.default

2009-12-03 Thread BELLINI ADAM
Hi, I dumped the database, and this is what I found:

  Status: 1 (db_unfetched)
  Fetch time: Thu Dec 03 15:53:24 EST 2009
  Modified time: Wed Dec 31 19:00:00 EST 1969
  Retries since fetch: 0
  Retry interval: 18000 seconds (0 days)
  Score: 2.0549393
  Signature: null
  Metadata:

So if meeting this URL se...

Re: db.fetch.interval.default

2009-12-03 Thread reinhard schwab
The crawl datum here has state db_unfetched; it has not been fetched. Are you sure that you don't have crawl datums with a retry interval of 0 seconds? grep Retry crawldump | grep -v "18000" BELLINI ADAM schrieb: > Hi, I dumped the database, and this is what I found: > Status: 1 (db_unfetched) > Fet...

Why does a url with a fetch status of 'fetch_gone' show up as 'db_unfetched'?

2009-12-03 Thread J.G.Konrad
Why does a URL with a fetch status of 'fetch_gone' show up as 'db_unfetched'? Shouldn't the crawldb entry have a status of 'db_gone'? This is happening in nutch-1.0. Here is one example of what I'm talking about: [jkon...@rampage search]$ ./bin/nutch readseg...
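To make this kind of check, the segment's recorded fetch status and the crawldb status can be compared with the standard readseg and readdb tools. A sketch with placeholder paths (the segment name 20091203123456 is made up):

  # dump what the fetcher recorded for the segment
  ./bin/nutch readseg -dump crawl/segments/20091203123456 segdump
  # summarize crawldb statuses (db_unfetched, db_gone, ...)
  ./bin/nutch readdb crawl/crawldb -stats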