Hi Volos. The reason I ask is that I am not familiar with Nutch or with crawling web sites. Specifically, I do not understand how to perform the crawling multiple times. Do you mean that I should simply crawl the same public site listed in urllist.txt multiple times to get a larger index?
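For example, does it just come down to re-running the crawl command with a larger -depth and -topN, something along these lines (the numbers below are only placeholders I made up)?

   # $HADOOP_HOME/bin/nutch crawl urlsdir -dir crawl -depth 10 -topN 50000

As far as I understand, -depth controls the number of generate/fetch/update rounds and -topN caps the number of pages fetched per round, but please correct me if that is wrong. Or do you literally mean repeating the whole crawl several times?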
I crawled the wikipedia.org website, but the index and segment sizes are only 24MB and 156MB, respectively. Here is what I did:

1) First, I modified urllist.txt and put it into HDFS:

   # echo http://www.apache.org > $HADOOP_HOME/urlsdir/urllist.txt
   # $HADOOP_HOME/bin/hadoop dfs -put urlsdir urlsdir

2) Second, I updated conf/crawl-urlfilter.txt to cover *.wikipedia.org (the filter line I used is spelled out in the P.S. at the bottom of this mail):

   # vim conf/crawl-urlfilter.txt

3) Finally, I launched the crawler:

   # $HADOOP_HOME/bin/nutch crawl urlsdir -dir crawl -depth 3

Are there any problems with how I am trying to increase the index and segment sizes?

- Jeongseob

2014-06-16 18:59 GMT+09:00 Volos Stavros <[email protected]>:

> Hi Jeongseob,
>
> Exactly. You need to perform the crawling phase multiple times so that
> you get a larger index.
>
> You don't really need to crawl the same public sites we crawled, nor
> use the same terms_en.out. In any case, Wikipedia was one of them.
>
> You need to have enough clients to saturate your CPU while maintaining
> quality of service.
>
> Hope this helps.
>
> -Stavros.
>
> On Jun 8, 2014, at 8:27 PM, J Ahn wrote:
>
> I am just wondering how to increase the size of the crawled index and
> segments. It seems that we need to crawl a larger data set again.
> Is this right?
>
> In addition, I would like to reproduce the experimental results reported
> in the paper *Clearing the Clouds*. The paper used an index size of 2GB
> and a data segment size of 23GB of content crawled from the public web.
> Could you tell me which public sites you crawled?
>
> Next, I have a question about configuring the clients. How many clients
> were used in the experiments, and which terms_en.out was used?
>
> - Jeongseob
>
>
> 2013-06-09 16:16 GMT+09:00 Hailong Yang <[email protected]>:
>
>> Hi Zacharias,
>>
>> Have you tried increasing the size of your crawled index and segments?
>> For example, the Clearing the Clouds paper says they used a 2GB index
>> and 23GB segments.
>>
>> Best
>>
>> Hailong
>>
>>
>> On Fri, May 31, 2013 at 10:24 PM, zhadji01 <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> I have a web-search benchmark setup with 4 machines: 1 client, 1
>>> front-end, 1 search server, and 1 segment server for fetching the
>>> summaries.
>>>
>>> All machines are two-socket Xeon E5620 @ 2.4GHz with 32GB RAM, and
>>> they are connected with 1Gb Ethernet. My crawled data is a 400MB index
>>> and 4GB of segments.
>>>
>>> My problem is that the servers' CPU utilization is very low. The max
>>> throughput I managed to get using the Faban client or ApacheBench was
>>> ~400-450 queries/sec, with user CPU utilizations of ~5% on the
>>> front-end, ~10% on the search server, and ~35-39% on the segment server.
>>>
>>> I'm sure that the network is not the bottleneck because I'm not even
>>> close to filling the bandwidth.
>>>
>>> Can you give any suggestions on how to better utilize the servers, or
>>> any thoughts on what the problem might be?
>>>
>>> Thanks, Zacharias Hadjilambrou
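P.S. For completeness, the filter line I added to conf/crawl-urlfilter.txt is my best guess at the right pattern; I adapted the MY.DOMAIN.NAME example that ships in the default file:

   +^http://([a-z0-9]*\.)*wikipedia.org/

Please let me know if this regular expression would exclude pages I actually need.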
