Hi,

You can follow the commands in section 3.2 of the Nutch tutorial: http://wiki.apache.org/nutch/NutchTutorial

You can use the individual commands to crawl multiple times. The way it works is
that the first time you create a crawl database (a list of links to act as roots)
using the domains of interest (e.g., wikipedia.org).
After you crawl using those roots (you can define the number of pages and the depth),
you update your crawldb with all of the links fetched into the new segment and
repeat the process.
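
Roughly, one round of that loop with the individual commands looks like the
following (the paths, the -topN value, and the segment name are just placeholders
to illustrate the sequence; adapt them to your setup):

# $HADOOP_HOME/bin/nutch inject crawl/crawldb urlsdir
# $HADOOP_HOME/bin/nutch generate crawl/crawldb crawl/segments -topN 1000
# SEGMENT=crawl/segments/20140619123456   (the newest directory that generate just created)
# $HADOOP_HOME/bin/nutch fetch $SEGMENT
# $HADOOP_HOME/bin/nutch parse $SEGMENT
# $HADOOP_HOME/bin/nutch updatedb crawl/crawldb $SEGMENT

The inject step is needed only once; repeat generate/fetch/parse/updatedb until
the crawldb and segments are as large as you need.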

After you have fetched enough pages, you can create the inverted index.
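
With the Lucene-based Nutch releases (the ones that still use conf/crawl-urlfilter.txt),
that last step is roughly:

# $HADOOP_HOME/bin/nutch invertlinks crawl/linkdb -dir crawl/segments
# $HADOOP_HOME/bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*

(If the segments live in HDFS, list the segment directories explicitly instead of
using the shell glob.) Newer Nutch releases index into Solr with solrindex instead.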

Hope this helps.

Regards,
-Stavros.

On Jun 19, 2014, at 11:22 AM, J Ahn wrote:

Hi Volos.

The reason I ask is that I am not familiar with Nutch or with crawling web sites.
Specifically, I do not understand how to perform the crawl multiple times.
Do you mean I should just crawl the same public site, the one described in
urllist.txt, multiple times to get a larger index?

I crawled the wikipedia.org website, but the index and segment sizes are 24 MB and
156 MB, respectively.

1) First, I modified urllist.txt to crawl Wikipedia:
# echo http://www.apache.org > $HADOOP_HOME/urlsdir/urllist.txt
# $HADOOP_HOME/bin/hadoop dfs -put urlsdir urlsdir

2) Second, I updated the conf/crawl-urlfilter.txt file to cover *.wikipedia.org
(the filter line is shown after the command below).
# vim conf/crawl-urlfilter.txt
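
The accept pattern looks something like this (adapted from the tutorial's example filter):

+^http://([a-z0-9]*\.)*wikipedia.org/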

3) Finally, I launched the crawler:
# $HADOOP_HOME/bin/nutch crawl urlsdir -dir crawl -depth 3


Are there any problems with these steps? How can I increase the index and segment sizes?

- Jeongseob


2014-06-16 18:59 GMT+09:00 Volos Stavros <[email protected]>:
Hi Jeongseob,

Exactly. You need to perform the crawling phase multiple times so that you get 
a larger index.

You don't really need to crawl the same public sites we crawled, nor use the same
terms_en.out. In any case, Wikipedia was one of them.

You need to have enough clients to saturate your CPU while maintaining 
quality-of-service.

Hope this helps.

-Stavros.

On Jun 8, 2014, at 8:27 PM, J Ahn wrote:

I am just wondering how to increase the size of the crawled index and segments. It
seems that we need to crawl a large data set again.
Is this right?

In addition, I would like to reproduce the experimental results that appeared in
the "Clearing the Clouds" paper. The paper used an index size of 2 GB and a data
segment size of 23 GB of content crawled from the public web. Could you explain
which public sites you crawled?

Next, I have a question about configuring clients. How many clients were used in
the experiments, and which terms_en.out was used?

- Jeongseob


2013-06-09 16:16 GMT+09:00 Hailong Yang <[email protected]>:
Hi Zacharias,

Have you tried increasing the size of your crawled index and segments? For
example, the "Clearing the Clouds" paper says they used a 2 GB index and 23 GB of segments.

Best

Hailong


On Fri, May 31, 2013 at 10:24 PM, zhadji01 <[email protected]> wrote:
Hi,

I have a web-search benchmark setup with 4 machines: 1 client, 1 front-end, 1
search server, and 1 segment server for fetching the summaries.

All machines are two-socket Xeon E5620 @ 2.4 GHz with 32 GB of RAM, and they are
connected with 1 Gb Ethernet. My crawled data is a 400 MB index and 4 GB of segments.

My problem is that the servers' CPU utilization is very low. The maximum throughput
I managed to get using the Faban client or Apache Bench was ~400-450
queries/sec, with user CPU utilizations of: front-end ~5%, search server ~10%,
segment server ~35-39%.

I'm sure that the network is not the bottleneck, because I'm not even close to
filling the bandwidth.

Can you give any suggestions on how to better utilize the servers, or any thoughts
on what the problem might be?

Thanks,
Zacharias Hadjilambrou




