Sorry, there is a typo; actually, I wrote 'www.wikipedia.org' in urllist.txt.
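As a quick sanity check (just a suggestion, using the same urlsdir path as in the commands below), the file that actually ended up in HDFS can be printed with:

*# $HADOOP_HOME/bin/hadoop dfs -cat urlsdir/urllist.txt*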
1) Firstly, I modified urllist.txt to crawl Wikipedia.

*# echo http://www.wikipedia.org > $HADOOP_HOME/urlsdir/urllist.txt*
*# $HADOOP_HOME/bin/hadoop dfs -put urlsdir urlsdir*

2014-06-19 18:22 GMT+09:00 J Ahn <[email protected]>:

> Hi Volos,
>
> The reason why I ask is that I am not familiar with Nutch and crawling
> web sites. Specifically, I do not understand how to perform the crawling
> multiple times. Do you mean that I should just crawl the same public site,
> the one described in *urllist.txt*, multiple times to get a larger index?
>
> I crawled the *wikipedia.org* website, but the index and segment sizes
> are 24MB and 156MB, respectively.
>
> 1) Firstly, I modified urllist.txt to crawl Wikipedia.
> *# echo http://www.apache.org > $HADOOP_HOME/urlsdir/urllist.txt*
> *# $HADOOP_HOME/bin/hadoop dfs -put urlsdir urlsdir*
>
> 2) Secondly, I updated the conf/crawl-urlfilter.txt file to cover
> *.wikipedia.org.
> # *vim conf/crawl-urlfilter.txt*
>
> 3) Finally, I just launched the crawler:
> # *$HADOOP_HOME/bin/nutch crawl urlsdir -dir crawl -depth 3*
>
> Is there anything wrong with these steps for increasing the index and
> segment sizes?
>
> - Jeongseob
>
>
> 2014-06-16 18:59 GMT+09:00 Volos Stavros <[email protected]>:
>
>> Hi Jeongseob,
>>
>> Exactly. You need to perform the crawling phase multiple times so that
>> you get a larger index.
>>
>> You don't really need to crawl the same public sites we have crawled,
>> nor use the same terms_en.out. In any case, Wikipedia was one of them.
>>
>> You need to have enough clients to saturate your CPU while maintaining
>> quality-of-service.
>>
>> Hope this helps.
>>
>> -Stavros.
>>
>> On Jun 8, 2014, at 8:27 PM, J Ahn wrote:
>>
>> I am just wondering how to increase the size of the crawled index and
>> segments. It seems that we need to crawl a larger data set again.
>> Is this right?
>>
>> In addition, I would like to reproduce the experimental results that
>> appeared in the paper, *Clearing the Clouds*. The paper used an index
>> size of 2GB and a data segment size of 23GB of content crawled from the
>> public web. Could you tell me which public sites you crawled?
>>
>> Next, I have a question about configuring clients. How many clients were
>> used in the experiments, and which terms_en.out was used?
>>
>> - Jeongseob
>>
>>
>> 2013-06-09 16:16 GMT+09:00 Hailong Yang <[email protected]>:
>>
>>> Hi Zacharias,
>>>
>>> Have you tried to increase the size of your crawled index and
>>> segments? For example, the Clearing the Clouds paper says they used a
>>> 2GB index and 23GB segments.
>>>
>>> Best,
>>>
>>> Hailong
>>>
>>>
>>> On Fri, May 31, 2013 at 10:24 PM, zhadji01 <[email protected]>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I have a web-search benchmark setup with 4 machines: 1 client, 1
>>>> front-end, 1 search server, and 1 segment server for fetching the
>>>> summaries.
>>>>
>>>> All machines are two-socket Xeon E5620 @ 2.4GHz with 32GB RAM, and
>>>> they are connected with 1Gb Ethernet. My crawled data is a 400MB index
>>>> and 4GB of segments.
>>>>
>>>> My problem is that the servers' CPU utilization is very low. The max
>>>> throughput I managed to get using the Faban client or Apache benchmark
>>>> was ~400-450 queries/sec, with user CPU utilizations of: frontend ~5%,
>>>> search server ~10%, segment server ~35-39%.
>>>>
>>>> I'm sure that the network is not the bottleneck because I'm not even
>>>> close to filling the bandwidth.
>>>>
>>>> Can you give any suggestions on how to better utilize the servers, or
>>>> any thoughts on what the problem might be?
>>>>
>>>> Thanks,
>>>> Zacharias Hadjilambrou
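P.S. For reference, the filter rule I meant in step 2 roughly follows the stock Nutch crawl-urlfilter.txt example, with only the domain changed. I am sketching it from memory, so the exact lines in my file may differ slightly:

# accept hosts in the wikipedia.org domain
+^http://([a-z0-9]*\.)*wikipedia.org/

# skip everything else
-.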

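P.P.S. If I understand the advice to crawl multiple times correctly, the idea would be to re-run the crawl with a larger depth and topN so that more pages end up in the index and segments. Something like the following (the depth and topN values here are only an illustration, not values from the paper):

*# $HADOOP_HOME/bin/nutch crawl urlsdir -dir crawl -depth 5 -topN 50000*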