Hi Jeongseob,

Exactly. You need to perform the crawling phase multiple times so that you get a larger index. You don't need to crawl the same public sites we crawled, nor use the same terms_en.out; in any case, Wikipedia was one of them. You need enough clients to saturate your CPU while maintaining quality of service.

Hope this helps.

-Stavros

On Jun 8, 2014, at 8:27 PM, J Ahn wrote:

I am just wondering how to increase the size of the crawled index and segments. It seems that we need to crawl a large data set again. Is this right?

In addition, I would like to reproduce the experimental results that appeared in the "Clearing the Clouds" paper. The paper used an index size of 2GB and a data segment size of 23GB of content crawled from the public web. Could you explain which public sites you crawled?

Next, I have a question about configuring clients. How many clients were used in the experiments, and which terms_en.out was used?

- Jeongseob

2013-06-09 16:16 GMT+09:00 Hailong Yang <[email protected]>:

Hi Zacharias,

Have you tried increasing the size of your crawled index and segments? For example, the "Clearing the Clouds" paper says they used a 2GB index and 23GB segments.

Best,
Hailong

On Fri, May 31, 2013 at 10:24 PM, zhadji01 <[email protected]> wrote:

Hi,

I have a web-search benchmark setup with 4 machines: 1 client, 1 front-end, 1 search server, and 1 segment server for fetching the summaries. All machines are two-socket Xeon E5620 @ 2.4GHz with 32GB RAM, connected with 1Gb Ethernet. My crawled data is a 400MB index and 4GB segments.

My problem is that the servers' CPU utilization is very low. The maximum throughput I managed to get using the Faban client or Apache Bench was ~400-450 queries/sec, with user CPU utilizations of ~5% on the front-end, ~10% on the search server, and ~35-39% on the segment server. I'm sure the network is not the bottleneck, because I'm not even close to filling the bandwidth.

Can you give any suggestions on how to utilize the servers well, or any thoughts on what the problem might be?

Thanks,
Zacharias Hadjilambrou
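The two pieces of advice in the thread (crawl more rounds to grow the index, and run enough clients to saturate the CPU) can be sketched as a back-of-envelope calculation. The per-round index growth and the mean request latency below are purely hypothetical assumptions for illustration; only the 400MB/2GB index sizes and the ~450 q/s at ~10% CPU figures come from the thread.

```python
import math

def crawl_rounds_needed(current_gb, target_gb, growth_per_round_gb):
    """Extra crawling rounds needed to grow the index from current_gb
    to target_gb, assuming each additional round adds roughly
    growth_per_round_gb of new index data (an assumption, not a
    measured figure)."""
    if current_gb >= target_gb:
        return 0
    return math.ceil((target_gb - current_gb) / growth_per_round_gb)

def clients_for_throughput(target_qps, mean_latency_s):
    """Little's law: concurrent clients ~= throughput x latency."""
    return math.ceil(target_qps * mean_latency_s)

# Zacharias has a 0.4GB index; the paper used 2GB. If each crawl
# round adds ~0.4GB (assumption), he needs 4 more rounds:
print(crawl_rounds_needed(0.4, 2.0, 0.4))   # -> 4

# At ~450 q/s the search server sits at ~10% CPU, so saturating it
# needs roughly 9x the load, ~4000 q/s. With a hypothetical 50 ms
# mean request latency, that is ~200 concurrent clients:
print(clients_for_throughput(4000, 0.05))   # -> 200
```

The concurrency estimate also explains the low-utilization symptom: a single client machine driving too few concurrent requests leaves the servers mostly idle regardless of network headroom.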
