Thank you for your expertise; I will have to rethink the architecture based on your much appreciated help!
Regards

--- On Fri, 11/28/08, Dennis Kubes <[EMAIL PROTECTED]> wrote:

> From: Dennis Kubes <[EMAIL PROTECTED]>
> Subject: Re: Nutch generate and fetch very slow after a few crawls (results)
> To: [email protected]
> Date: Friday, November 28, 2008, 11:35 AM
>
> ML mail wrote:
> > Dear Dennis,
> >
> > Thanks for pointing out the many sources of problems on our side. I just
> > have a few thoughts about the points you mentioned.
> >
> > 1) I was thinking NFS would be useful because then we don't need, for
> > example, to run the crawler on the same machine as the web frontend. The
> > NFS share is in our case simply mounted on the crawler machine and on the
> > Tomcat web server machine. Also, the Linux NFS server is dedicated to
> > just this task, with a Gbit ethernet connection, RAID 6, and fast Seagate
> > enterprise SATA hard disks totaling 8 TB of storage. So what would be the
> > best architecture for Nutch crawling? One huge, very good machine with
> > lots of CPU power, lots of memory and lots of hard disk space? And how do
> > you separate the web server frontend from the crawler?
>
> IMHO the best architecture for Nutch crawling is a Hadoop setup to
> distribute the load. If you are using only a single machine then I would
> definitely have the file system be local.
>
> > 2) I totally agree 512 MB isn't much; that's why I do small crawling
> > sessions with only 50 threads for the fetcher.
>
> Just FYI: the fetcher.threads.fetch config variable sets it per task. So
> if you have 10 tasks running on your one machine and fetcher.threads.fetch
> is set to 50, you would actually have 500 threads. But even so, you said
> your slowdown was in generating, not fetching.
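To make the per-task arithmetic concrete, here is roughly what that setting looks like when overridden in conf/nutch-site.xml (a sketch only; the value 50 mirrors the discussion above, so 10 concurrent tasks would give 500 fetcher threads in total):

---------- conf/nutch-site.xml (sketch) -------

<?xml version="1.0"?>
<!-- Properties placed here override the defaults in conf/nutch-default.xml. -->
<configuration>
  <property>
    <name>fetcher.threads.fetch</name>
    <!-- Fetcher threads PER TASK, not per machine: 10 tasks x 50 = 500. -->
    <value>50</value>
  </property>
</configuration>

---------- conf/nutch-site.xml (sketch) -------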
> > 3) Our filter is really pretty much the same as the default filter, so I
> > don't think it's the filter, unless the default filter is no good :-(
>
> Depends on the URLs being fetched. I have seen it loop out before. I
> don't think this is your problem though. Without actually seeing the
> system, I think the problem is 1) NFS 2) memory.
>
> Dennis
>
> > Regards
> >
> > --- On Thu, 11/27/08, Dennis Kubes <[EMAIL PROTECTED]> wrote:
> >
> >> From: Dennis Kubes <[EMAIL PROTECTED]>
> >> Subject: Re: Nutch generate and fetch very slow after a few crawls (results)
> >> To: [email protected]
> >> Date: Thursday, November 27, 2008, 2:22 PM
> >>
> >> I think you have more than one problem going on here to create the
> >> situation you are facing.
> >>
> >> 1) The biggest problem I see is the NFS mount. The generate step is
> >> still going to write out intermediate output and is going to be doing
> >> lots of input reads. With NFS it is having to do lots of network
> >> roundtrips. Even when using DFS, data is downloaded to local disk first
> >> before being processed in MapReduce.
> >>
> >> 2) 512M is low. I don't think that is the biggest problem, but it could
> >> definitely be causing some slowdown because of the amount of data that
> >> can be read and processed at one time.
> >>
> >> 3) CPU is at 100%, and if it stays that way for a long time there may
> >> be other issues going on besides just processing data. These could be
> >> filter regex related or something else. I don't think this is the main
> >> problem though.
> >>
> >> Dennis
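Points 1 and 2 translate into two quick changes on a single-machine setup: keep the crawl directory on local disk while crawling, and raise the JVM heap via the NUTCH_HEAPSIZE variable that the bin/nutch script reads. A sketch only, using the one-shot crawl command with illustrative paths; the same idea applies to the recrawl script discussed further down the thread:

---------- local-disk crawl (sketch) -------

# Give the Nutch JVM more heap, in MB; bin/nutch picks this up.
export NUTCH_HEAPSIZE=1000

# Crawl against a local directory instead of the NFS mount...
bin/nutch crawl urls -dir /local/disk/crawl -threads 50 -depth 4 -topN 25000

# ...then publish the finished crawl to the NAS for the Tomcat frontend.
rsync -a /local/disk/crawl/ /mnt/nas/crawl/

---------- local-disk crawl (sketch) -------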
> >>
> >> ML mail wrote:
> >>> Dear Dennis,
> >>>
> >>> Here are a few lines to explain our current configuration better:
> >>> Nutch 0.9 is running in an OpenVZ Virtual Private Server with 512 MB
> >>> RAM allocated to it and full CPU usage (a Pentium IV Xeon 2.8 GHz; the
> >>> machine itself has 1 GB physical RAM). NUTCH_HEAPSIZE in the nutch
> >>> shell script is set to 350 MB instead of the original 1000 MB.
> >>>
> >>> Right now it has been running the generate step for approx. 3 hours
> >>> for a topN of 40000 pages, and usually it takes between 4 and 5 hours.
> >>> The command top shows the following:
> >>>
> >>>     PID  USER   PR  NI  VIRT  RES   SHR  S  %CPU  %MEM      TIME+  COMMAND
> >>>   16268  nutch  18   0  471m  159m  5808 S   100  31.1  171:03.82  java
> >>>
> >>> So it looks like the CPU is the limiting factor in our case, if I
> >>> interpret this correctly. This is quite strange because a Pentium IV
> >>> 2.8 GHz is not so old...
> >>>
> >>> Regarding swapping, I can't see any swapping in the VPS, as there is
> >>> really no swap in OpenVZ; here is the output of "free -m" during the
> >>> generate step:
> >>>
> >>>                       total   used   free   shared   buffers   cached
> >>>   Mem:                  512    477     34        0         0        0
> >>>   -/+ buffers/cache:           477     34
> >>>   Swap:                   0      0      0
> >>>
> >>> The Nutch binaries are stored on, and run from, a Linux/MD software
> >>> RAID 1 set with two Seagate ATA 7200 rpm hard disks. The crawl
> >>> directory and all its data are stored on an 8 TB Linux RAID 6 NAS
> >>> mounted via NFS with the following NFS mount options:
> >>>
> >>>   rw,tcp,rsize=8192,wsize=8192
> >>>
> >>> But if I understand correctly, during the generate step of Nutch there
> >>> is only very low I/O activity because it is only querying the crawldb;
> >>> am I correct?
> >>>
> >>> About the URL filter, we use pretty much the default
> >>> crawl-urlfilter.txt file, except that we added maybe 2-3 more
> >>> extensions to skip and changed the "accept hosts" rule to only index
> >>> those ending in .be. You will find our crawl-urlfilter.txt file at the
> >>> end of this mail.
> >>>
> >>> So I hope I have provided you enough information; if you need more,
> >>> just let me know.
> >>>
> >>> Thanks again and best regards
> >>>
> >>> ---------- crawl-urlfilter.txt -------
> >>>
> >>> # skip file:, ftp:, & mailto: urls
> >>> -^(file|ftp|mailto):
> >>>
> >>> # skip image and other suffixes we can't yet parse
> >>> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|js)$
> >>>
> >>> # skip URLs containing certain characters as probable queries, etc.
> >>> -[?*!@=]
> >>>
> >>> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> >>> -.*(/.+?)/.*?\1/.*?\1/
> >>>
> >>> # accept hosts in MY.DOMAIN.NAME
> >>> +^http://([a-z0-9]*\.).*\.be/
> >>>
> >>> # skip everything else
> >>> -.
> >>>
> >>> ---------- crawl-urlfilter.txt -------
> >>>
> >>> --- On Wed, 11/26/08, Dennis Kubes <[EMAIL PROTECTED]> wrote:
> >>>
> >>>> From: Dennis Kubes <[EMAIL PROTECTED]>
> >>>> Subject: Re: Nutch generate and fetch very slow after a few crawls (results)
> >>>> To: [email protected]
> >>>> Date: Wednesday, November 26, 2008, 4:46 AM
> >>>>
> >>>> Well, generate will still have to go through all of the urls,
> >>>> although skipping after 25 per domain should be really quick. It
> >>>> really depends on your hardware and any regexes you may be running in
> >>>> urlfilters for the generate step. A 2.8 GHz Xeon with 1 GB RAM should
> >>>> be pretty quick. 100,000 pages on a Core 2 Duo on my laptop (4 GB
> >>>> RAM) takes less than an hour if I remember correctly. What type of
> >>>> hard drive (speed) are you using, and are you swapping a lot during
> >>>> generate? It may be the amount of RAM.
> >>>>
> >>>> Dennis
> >>>>
> >>>> ML mail wrote:
> >>>>> Dear Richard and others interested,
> >>>>>
> >>>>> Just wanted to post the results of reducing generate.max.per.host
> >>>>> to 25 (instead of -1: unlimited) as recommended by Richard.
> >>>>>
> >>>>> So, to summarize, the fetch step has been greatly reduced to 1 hour
> >>>>> instead of 6 hours (for topN set at 25000), but unfortunately the
> >>>>> generate step is still quite slow and takes around 4 hours (for the
> >>>>> same topN amount).
> >>>>>
> >>>>> Is it normal for the generate step to still be so slow? The whole
> >>>>> index is only around 170'000 pages big. Is there maybe also an
> >>>>> option in the nutch-default.xml config file where one can optimize
> >>>>> the generate process?
> >>>>>
> >>>>> Best regards
> >>>>>
> >>>>> --- On Fri, 11/21/08, Richard Cyganiak <[EMAIL PROTECTED]> wrote:
> >>>>>
> >>>>>> From: Richard Cyganiak <[EMAIL PROTECTED]>
> >>>>>> Subject: Re: Nutch generate and fetch very slow after a few crawls
> >>>>>> To: [email protected]
> >>>>>> Date: Friday, November 21, 2008, 2:42 AM
> >>>>>>
> >>>>>> On 21 Nov 2008, at 09:47, ML mail wrote:
> >>>>>>> What would then be the solution to this problem? Shall I simply
> >>>>>>> set generate.max.per.host to something like 5? Or is there
> >>>>>>> another way to make Nutch run at a good speed again?
> >>>>>>
> >>>>>> I spent some time trying different values for
> >>>>>> generate.max.per.host and I found that this is a good rule of
> >>>>>> thumb:
> >>>>>>
> >>>>>>   generate.max.per.host = topN / numberOfNodes / 1000
> >>>>>>
> >>>>>> where topN is the size of your segments and numberOfNodes is the
> >>>>>> number of machines in your cluster. This keeps the fetch rate
> >>>>>> close to maximum.
> >>>>>>
> >>>>>> Check the log of the fetch job -- if the last few pages consist of
> >>>>>> requests to just one or a few hosts, then your value for
> >>>>>> generate.max.per.host is too large. You want to fetch from many
> >>>>>> hosts in parallel throughout the entire fetch job. On the other
> >>>>>> hand, if you set it too low, then you will never make progress on
> >>>>>> these large sites. I fetched the same segment repeatedly to find
> >>>>>> out what values work best.
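Worked through for the setup in this thread (topN = 25000 on a single machine), the rule gives 25000 / 1 / 1000 = 25, which is exactly the value reported above. A sketch of the corresponding nutch-site.xml entry:

---------- generate.max.per.host (sketch) -------

<property>
  <name>generate.max.per.host</name>
  <!-- Rule of thumb: topN / numberOfNodes / 1000 = 25000 / 1 / 1000 = 25.
       The default of -1 means unlimited URLs per host per segment. -->
  <value>25</value>
</property>

---------- generate.max.per.host (sketch) -------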
> >>>>>>
> >>>>>> Hope that helps,
> >>>>>> Richard
> >>>>>>
> >>>>>>> Best regards
> >>>>>>>
> >>>>>>> --- On Thu, 11/20/08, Dennis Kubes <[EMAIL PROTECTED]> wrote:
> >>>>>>>
> >>>>>>>> From: Dennis Kubes <[EMAIL PROTECTED]>
> >>>>>>>> Subject: Re: Nutch generate and fetch very slow after a few crawls
> >>>>>>>> To: [email protected]
> >>>>>>>> Date: Thursday, November 20, 2008, 2:40 PM
> >>>>>>>>
> >>>>>>>> Off the top of my head, I would guess that you hit a patch of
> >>>>>>>> urls all from the same domain, and that would slow down fetching
> >>>>>>>> on a single host because only one thread would be active. The
> >>>>>>>> generate.max.per.host config variable can limit that.
> >>>>>>>>
> >>>>>>>> But that is just a guess. What job is it slowing down on? Yes,
> >>>>>>>> Nutch will take more time with more data, but that is too much
> >>>>>>>> of a difference.
> >>>>>>>>
> >>>>>>>> Dennis
> >>>>>>>>
> >>>>>>>> ML mail wrote:
> >>>>>>>>> Hello,
> >>>>>>>>>
> >>>>>>>>> I am currently using the recrawl script from the Nutch Wiki for
> >>>>>>>>> crawling all websites from a specific small top-level domain
> >>>>>>>>> and have configured the recrawl script to run with THREADS=50,
> >>>>>>>>> DEPTH=4, TOPN=25000, which means that each time I run the
> >>>>>>>>> script 100'000 pages will get crawled.
> >>>>>>>>>
> >>>>>>>>> The first time I ran the script it took 6 hours for the whole
> >>>>>>>>> process with mergesegs, invertlinks, index, merge and so on.
> >>>>>>>>> The second time it took just 3 hours more, so 9 hours, then the
> >>>>>>>>> third time 12 hours, but now the fourth time it is actually
> >>>>>>>>> still running after 22 hours and it's only at the 64'000th page
> >>>>>>>>> to be crawled. It looks like it is especially the fetch step
> >>>>>>>>> and the index step which are running much more slowly; the
> >>>>>>>>> other steps look normal.
> >>>>>>>>>
> >>>>>>>>> So is this actually normal behavior for Nutch? I would expect
> >>>>>>>>> Nutch to be a tiny little bit slower each time due to updating
> >>>>>>>>> an always-growing database/index/segment, but never so much
> >>>>>>>>> slower as I am currently experiencing, especially when right
> >>>>>>>>> now there are only 144'915 pages indexed and the whole crawl
> >>>>>>>>> directory with everything is only around 2 GB big.
> >>>>>>>>>
> >>>>>>>>> Nutch is running on a quite good Pentium 4 Xeon computer at
> >>>>>>>>> 2.8 GHz with 1 GB RAM and not much else running on it; also I
> >>>>>>>>> didn't change much in the config of Nutch itself, so it's
> >>>>>>>>> pretty much default.
> >>>>>>>>>
> >>>>>>>>> Does anyone have an idea? I can provide more info if you
> >>>>>>>>> desire; just let me know what you need.
> >>>>>>>>>
> >>>>>>>>> Many thanks in advance and best regards
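For reference, the knobs mentioned above sit at the top of the Wiki recrawl script as plain shell variables; a sketch of just that block, using the values from this thread (the rest of the script is assumed unchanged):

---------- recrawl script variables (sketch) -------

# Values as reported in this thread; see also generate.max.per.host above.
THREADS=50    # fetcher threads passed to bin/nutch fetch
DEPTH=4       # generate/fetch/updatedb rounds per run
TOPN=25000    # pages per round, so up to DEPTH x TOPN = 100'000 pages per run

---------- recrawl script variables (sketch) -------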
