Hi Volos. The reason I ask is that I am not familiar with Nutch or with crawling web sites. Specifically, I do not understand how to perform the crawling multiple times. Do you mean that I should simply crawl the same public site listed in urllist.txt multiple times to get a larger index?
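For example, does it just come down to re-running the crawl command with a larger -depth and -topN, something along these lines (the numbers below are only placeholders I made up)?

   # $HADOOP_HOME/bin/nutch crawl urlsdir -dir crawl -depth 10 -topN 50000

As far as I understand, -depth controls the number of generate/fetch/update rounds and -topN caps the number of pages fetched per round, but please correct me if that is wrong. Or do you literally mean repeating the whole crawl several times?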
I crawled the wikipedia.org website, but the index and segment sizes are only 24MB and 156MB, respectively. Here is what I did:

1) First, I modified urllist.txt and put it into HDFS:

   # echo http://www.apache.org > $HADOOP_HOME/urlsdir/urllist.txt
   # $HADOOP_HOME/bin/hadoop dfs -put urlsdir urlsdir

2) Second, I updated conf/crawl-urlfilter.txt to cover *.wikipedia.org (the filter line I used is spelled out in the P.S. at the bottom of this mail):

   # vim conf/crawl-urlfilter.txt

3) Finally, I launched the crawler:

   # $HADOOP_HOME/bin/nutch crawl urlsdir -dir crawl -depth 3

Are there any problems with how I am trying to increase the index and segment sizes?

- Jeongseob

2014-06-16 18:59 GMT+09:00 Volos Stavros <[email protected]>:

> Hi Jeongseob,
>
> Exactly. You need to perform the crawling phase multiple times so that
> you get a larger index.
>
> You don't really need to crawl the same public sites we crawled, nor
> use the same terms_en.out. In any case, Wikipedia was one of them.
>
> You need to have enough clients to saturate your CPU while maintaining
> quality of service.
>
> Hope this helps.
>
> -Stavros.
>
> On Jun 8, 2014, at 8:27 PM, J Ahn wrote:
>
> I am just wondering how to increase the size of the crawled index and
> segments. It seems that we need to crawl a larger data set again.
> Is this right?
>
> In addition, I would like to reproduce the experimental results reported
> in the paper *Clearing the Clouds*. The paper used an index size of 2GB
> and a data segment size of 23GB of content crawled from the public web.
> Could you tell me which public sites you crawled?
>
> Next, I have a question about configuring the clients. How many clients
> were used in the experiments, and which terms_en.out was used?
>
> - Jeongseob
>
>
> 2013-06-09 16:16 GMT+09:00 Hailong Yang <[email protected]>:
>
>> Hi Zacharias,
>>
>> Have you tried increasing the size of your crawled index and segments?
>> For example, the Clearing the Clouds paper says they used a 2GB index
>> and 23GB segments.
>>
>> Best
>>
>> Hailong
>>
>>
>> On Fri, May 31, 2013 at 10:24 PM, zhadji01 <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> I have a web-search benchmark setup with 4 machines: 1 client, 1
>>> front-end, 1 search server, and 1 segment server for fetching the
>>> summaries.
>>>
>>> All machines are two-socket Xeon E5620 @ 2.4GHz with 32GB RAM, and
>>> they are connected with 1Gb Ethernet. My crawled data is a 400MB index
>>> and 4GB of segments.
>>>
>>> My problem is that the servers' CPU utilization is very low. The max
>>> throughput I managed to get using the Faban client or ApacheBench was
>>> ~400-450 queries/sec, with user CPU utilizations of ~5% on the
>>> front-end, ~10% on the search server, and ~35-39% on the segment server.
>>>
>>> I'm sure that the network is not the bottleneck because I'm not even
>>> close to filling the bandwidth.
>>>
>>> Can you give any suggestions on how to better utilize the servers, or
>>> any thoughts on what the problem might be?
>>>
>>> Thanks, Zacharias Hadjilambrou
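P.S. For completeness, the filter line I added to conf/crawl-urlfilter.txt is my best guess at the right pattern; I adapted the MY.DOMAIN.NAME example that ships in the default file:

   +^http://([a-z0-9]*\.)*wikipedia.org/

Please let me know if this regular expression would exclude pages I actually need.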
