MilleBii wrote:
Oops continuing previous mail.
So I wonder if there would be a better 'generate' algorithm which
would maintain a constant ratio of hosts per 100 URLs ... Below a certain
threshold it would stop, or better, start including URLs with lower scores.
That's exactly how the max.urls.per.host
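For reference, the per-host cap being referred to is a generate-time setting
in Nutch 1.0; a minimal sketch of how one might look up the relevant knobs,
assuming a standard checkout (overrides then go into conf/nutch-site.xml):

# shows generate.max.per.host and its by.ip variant from the default config
grep -A 3 "generate.max.per.host" conf/nutch-default.xml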
Hi,
I'm performing a RECRAWL using the recrawl.sh script, and I had this error
when inverting the links:
FATAL crawl.LinkDb - LinkDb: java.io.IOException: lock file
crawl/linkdb/.locked already exists
echo "- Invert Links (Step 4 of $steps) -"
$NUTCH_HOME/bin/nutch invertlinks $crawl/
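That message usually means a previous job left a stale lock behind. A minimal
sketch of one way to recover, assuming the paths from the error above and that
no other job is currently writing to the linkdb:

# remove the stale lock (use "hadoop fs -rm" instead if the crawl lives on HDFS)
rm crawl/linkdb/.locked
# re-run the inversion step; linkdb and segments paths are assumed here
$NUTCH_HOME/bin/nutch invertlinks crawl/linkdb -dir crawl/segments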
Hum... I use the max urls setting and set it to 600... because in the worst
case you have 6 s (measured in the logs) between URLs of the same host: so
6 x 600 = 3600 s = 1 hour. In the worst case the long tail shouldn't last
longer than 1 hour... Unfortunately that is not what I see.
I also tried the " by.ip" opti
> Hum... I use the max urls setting and set it to 600... because in the worst
> case you have 6 s (measured in the logs) between URLs of the same host: so
> 6 x 600 = 3600 s = 1 hour. In the worst case the long tail shouldn't last
> longer than 1 hour... Unfortunately that is not what I see.
that's assuming that a
Well, I dramatically increased the number of threads; empirically the
best I have found is around 1200 threads. This actually means 2400,
because I have two mappers running at once (looking at the Hadoop logs).
The bandwidth still shows an 'L' shape... although a lot higher and a bit
thicker.
On the run
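For context, the thread count is normally passed to the fetch step (or set via
fetcher.threads.fetch); a sketch, with the segment path purely illustrative:

# -threads sets the number of fetcher threads per fetch task
$NUTCH_HOME/bin/nutch fetch crawl/segments/20091203153524 -threads 1200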
Hi,
I'm crawling my intranet, and I have set db.fetch.interval.default to
5 hours, but it seems that it doesn't work correctly:
<property>
  <name>db.fetch.interval.default</name>
  <value>18000</value>
  <description>The number of seconds between re-fetches of a page (5 hours).</description>
</property>
the first crawl when the crawl directory $cra
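A quick way to sanity-check whether the crawldb actually picked up the 5-hour
interval is to read it back; a sketch, with paths and the URL purely
illustrative:

# summary of how many entries are in each status
$NUTCH_HOME/bin/nutch readdb crawl/crawldb -stats
# print a single entry, including its fetch time and retry interval
$NUTCH_HOME/bin/nutch readdb crawl/crawldb -url http://intranet.example.com/somepage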
Hi,
I have identified one source of such a problem and opened an issue in Jira.
You can apply this patch and check whether it solves your problem:
https://issues.apache.org/jira/browse/NUTCH-774
Btw, you can also check your crawldb for such items - the retry interval
is set to 0.
Just dump the cr
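Dumping the crawldb is typically done with readdb; the dump can then be
searched for entries whose retry interval is not the configured 18000 seconds.
A sketch, with paths assumed:

# dump the crawldb as text (produces part-* files inside "crawldump")
$NUTCH_HOME/bin/nutch readdb crawl/crawldb -dump crawldump
# look for entries whose retry interval differs from 18000 s
grep "Retry interval" crawldump/part-* | grep -v "18000"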
Hi,
I dumped the database, and this is what I found:
Status: 1 (db_unfetched)
Fetch time: Thu Dec 03 15:53:24 EST 2009
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 18000 seconds (0 days)
Score: 2.0549393
Signature: null
Metadata:
so if meeting this url se
The crawl datum here has state db_unfetched.
It has not been fetched.
Are you sure that you don't have crawl datums with a retry interval of 0 seconds?
grep Retry crawldump | grep -v "18000"
BELLINI ADAM wrote:
> Hi,
> I dumped the database, and this is what I found:
>
>
> Status: 1 (db_unfetched)
> Fet
Why does a url with a fetch status of 'fetch_gone' show up as
'db_unfetched'? Shouldn't the crawldb entry have a status of
'db_gone'? This is happening in nutch-1.0
Here is one example of what I'm talking about
[jkon...@rampage search]$ ./bin/nutch readseg
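Comparing what a segment recorded for a URL with what the crawldb holds for it
can be done along these lines; paths and the URL here are purely illustrative:

# what the segment recorded for this URL (fetch status, protocol status, ...)
./bin/nutch readseg -get crawl/segments/20091203153524 http://www.example.com/page -nocontent
# what the crawldb now holds for the same URL (db status, retry interval, ...)
./bin/nutch readdb crawl/crawldb -url http://www.example.com/page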