Sorry, there is a typo in my previous mail: I actually wrote
'www.wikipedia.org' in urllist.txt.

1) Firstly, I modified urllist.txt to crawl Wikipedia.
*# echo http://www.wikipedia.org > $HADOOP_HOME/urlsdir/urllist.txt*
*# $HADOOP_HOME/bin/hadoop dfs -put urlsdir urlsdir*
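
As a sanity check (this is just how I verify it on my side; the paths are
the same ones used above), the uploaded seed file can be listed and printed
back from HDFS:

# $HADOOP_HOME/bin/hadoop dfs -ls urlsdir
# $HADOOP_HOME/bin/hadoop dfs -cat urlsdir/urllist.txt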


2014-06-19 18:22 GMT+09:00 J Ahn <[email protected]>:

> Hi Volos.
>
> The reason I ask is that I am not familiar with Nutch or with crawling
> web sites.
> Specifically, I do not understand how to perform the crawling multiple
> times. Do you mean that I should just crawl the same public site, the one
> described in *urllist.txt*, multiple times to get a larger index?
>
> I crawled the *wikipedia.org* website, but the index and segment sizes
> are 24 MB and 156 MB, respectively.
>
> 1) Firstly, I modified urllist.txt to crawl Wikipedia.
> *# echo http://www.apache.org > $HADOOP_HOME/urlsdir/urllist.txt*
>
> *# $HADOOP_HOME/bin/hadoop dfs -put urlsdir urlsdir*
> 2) Secondly, I updated the conf/crawl-urlfilter.txt file to cover
> *.wikipedia.org.
> # *vim conf/crawl-urlfilter.txt*
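>
> (For reference, following the stock MY.DOMAIN.NAME example in that file,
> the domain line would be roughly of this form; please correct me if a
> different pattern is expected:)
>
> +^http://([a-z0-9]*\.)*wikipedia.org/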
>
> 3) Finally, I just launched the crawler
> # *$HADOOP_HOME/bin/nutch crawl urlsdir -dir crawl -depth 3*
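>
> (My understanding is that a deeper crawl just means raising -depth and
> -topN on this command, e.g. roughly the line below, but please correct me
> if that is not the intended way:)
>
> # $HADOOP_HOME/bin/nutch crawl urlsdir -dir crawl -depth 10 -topN 100000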
>
>
> Is there any problem with these steps that prevents the index and segment
> sizes from increasing?
>
> - Jeongseob
>
>
> 2014-06-16 18:59 GMT+09:00 Volos Stavros <[email protected]>:
>
>>  Hi Jeongseob,
>>
>>  Exactly. You need to perform the crawling phase multiple times so that
>> you get a larger index.
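>>
>>  (Concretely, one extra crawl round can be done with the step-by-step
>> Nutch tools, roughly as in the Nutch tutorial, assuming the crawl/ layout
>> produced by the crawl command; the segment name below is just an example
>> timestamp, and -topN is whatever batch size you want:)
>>
>> # $HADOOP_HOME/bin/nutch generate crawl/crawldb crawl/segments -topN 100000
>> # $HADOOP_HOME/bin/nutch fetch crawl/segments/20140619123456
>> # $HADOOP_HOME/bin/nutch updatedb crawl/crawldb crawl/segments/20140619123456
>>
>>  After a few such rounds, rebuild the linkdb and the index the same way
>> the crawl command does.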
>>
>>  You don't really need to crawl the same public sites we have crawled
>> nor use the same terms_en.out. In any case, wikipedia was one of them.
>>
>>  You need to have enough clients to saturate your CPU while maintaining
>> quality-of-service.
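>>
>>  (One quick way to push load, independent of Faban, is ApacheBench with
>> higher concurrency against the front end; the URL and numbers below are
>> only placeholders for your setup:)
>>
>> # ab -n 200000 -c 256 "http://frontend:8080/search.jsp?query=test"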
>>
>>  Hope this helps.
>>
>>  -Stavros.
>>
>>  On Jun 8, 2014, at 8:27 PM, J Ahn wrote:
>>
>>   I am just wondering how to increase the size of the crawled index and
>> segments. It seems that we need to crawl a larger data set again.
>>  Is this right?
>>
>>  In addition, I would like to reproduce the experimental results that
>> appeared in the paper *Clearing the Clouds*. The paper used an index size
>> of 2GB and a data segment size of 23GB of content crawled from the public
>> web. Could you tell me which public sites you crawled?
>>
>>  Next, I have a question about configuring the clients. How many clients
>> were used in the experiments, and which terms_en.out file was used?
>>
>>  - Jeongseob
>>
>>
>> 2013-06-09 16:16 GMT+09:00 Hailong Yang <[email protected]>:
>>
>>> Hi Zacharias,
>>>
>>>  Have you tried increasing the size of your crawled index and
>>> segments? For example, the Clearing the Clouds paper says they used a
>>> 2GB index and 23GB segments.
>>>
>>>  Best
>>>
>>>  Hailong
>>>
>>>
>>> On Fri, May 31, 2013 at 10:24 PM, zhadji01 <[email protected]>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I have a web-search benchmark setup with 4 machines: 1 client, 1
>>>> front-end, 1 search server, and 1 segment server for fetching the summaries.
>>>>
>>>> All machines are two-socket Xeon E5620 @ 2.4 GHz with 32 GB RAM, and
>>>> they are connected with 1 Gb Ethernet. My crawled data is a 400 MB index
>>>> and 4 GB of segments.
>>>>
>>>> My problem is that the servers' CPU utilization is very low. The max
>>>> throughput I managed to get using the Faban client or ApacheBench was
>>>> ~400-450 queries/sec, with user CPU utilizations of: front-end ~5%,
>>>> search server ~10%, segment server ~35-39%.
>>>>
>>>> I'm sure that the network is not the bottleneck, because I'm not even
>>>> close to filling the bandwidth.
>>>>
>>>> Can you give any suggestions on how to better utilize the servers, or
>>>> any thoughts on what the problem might be?
>>>>
>>>> Thanks, Zacharias Hadjilambrou
>>>>
>>>
>>>
>>
>>
>
