Ken:

Thank you very much for the info, I applied it my testing enviornment
and I could see big changes in my bandwidth utilization. I have tried
it on a simple server and i could get a rather constant 25-29
pages/sec in a vertical crawl. Previously I was getting about 5-7
pages/sec.

Cheers
Zaheed


On 7/11/06, Ken Krugler <[EMAIL PROTECTED]> wrote:
>On 6/28/06, Ken Krugler <[EMAIL PROTECTED]> wrote:
>>Hi Doug,
>>
>>>Did you ever resolve your 0.8 vs 0.7 crawling performance question? I'm
>>>running into a similar problem.
>>
>>We wound up dramatically increasing the number of threads, which
>>seemed to help solve the bandwidth utilization problem. With Nutch
>>0.7 we were running about 200 threads per crawler, and with Nutch 0.8
>>it's more like 2000+ threads...though you have to reduce the thread
>>stack size in this type of configuration.
>
>Hi Ken
>
>Could you please give me some clue regarding the stack size you are
>seeing the best bandwidth utilization...

Note that stack size twiddling is only done to allow for increasing
the number of fetcher threads without running of out JVM or OS memory.

>  I have the following
>
>core file size          (blocks, -c) 0
>data seg size           (kbytes, -d) unlimited
>max nice                        (-e) 20
>file size               (blocks, -f) unlimited
>pending signals                 (-i) unlimited
>max locked memory       (kbytes, -l) unlimited
>max memory size         (kbytes, -m) unlimited
>open files                      (-n) 1024
>pipe size            (512 bytes, -p) 8
>POSIX message queues     (bytes, -q) unlimited
>max rt priority                 (-r) unlimited
>stack size              (kbytes, -s) 8192
>cpu time               (seconds, -t) unlimited
>max user processes              (-u) unlimited
>virtual memory          (kbytes, -v) unlimited
>file locks                      (-x) unlimited
>
>What stack size should I play with the default seems to be 8192kb ?

We use something like ulimit -s 512 to set a 512K stack size at the OS level.

>also any onther parameters I should tweak?

We specify -Xss512K when running the fetch map-reduce task to set the
stack size in the JVM. But I don't remember off the top of my head
which of the many different config files this gets set in. Stefan?
>
>I often get too many open
>files problem

That's a separate issue.

>and I never could use my full bandwidth.. I am using
>about 10% of my bandwidth. I have played around with ulimit -n "very
>high number" which solves the "too many open files" but its not
>utilizing all my bandwidth, any help will be very much appreciated.

Try increasing the number of fetcher threads and reducing the stack
size. With 10 high-end servers in a cluster, we were able to max out
a 100mbs connection for brief periods, though as our crawl converged
(because it's a vertical crawl) the max rate drops eventually to
about 50mps.

-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"

Reply via email to