Michael Nebel wrote:
Hi Byron,

For myself, I wanted to know how many connections my server can handle, so I ran some tests of my own. The observations were surprising but, in hindsight, logical. I used some old desktop PCs under Linux (Pentium III 800 MHz, ~256 MB RAM, IDE disks), so I didn't have to generate much load to see the bottlenecks :-)

The response time of Nutch depends not only on the number of parallel requests, but also on the number of results Nutch returns (hit rate). The reason is simple: for each hit, the summary is loaded from disk. This causes a lot of disk I/O, which slowed my server down much more than the slow CPU and the small amount of RAM did.

This is slightly incorrect. Summaries are only retrieved for the page of results being displayed, not for all hits. So, no matter how many hits a query produces, only the currently displayed page needs its summaries loaded from disk.
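Roughly, the search front-end does something like this (just a sketch to illustrate the point - the class and method names below are made up, not the real Nutch API):

  import java.util.List;

  // Hypothetical stand-ins for the search API - names are illustrative only.
  interface Searcher {
      List<Long> topDocIds(String query, int maxHits); // index-only, no segment data touched
      String summary(long docId, String query);        // one random read of segment data
  }

  class ResultPage {
      // Render page 'page' of results, 'perPage' hits per page.
      static void render(Searcher s, String query, int page, int perPage) {
          // The ranked hit list comes from the Lucene index alone.
          List<Long> hits = s.topDocIds(query, 1000);

          // Summaries are fetched ONLY for the slice actually shown, so their
          // cost depends on the page size, not on the total number of hits.
          int start = Math.min(page * perPage, hits.size());
          int end = Math.min(start + perPage, hits.size());
          for (long id : hits.subList(start, end)) {
              System.out.println(s.summary(id, query));
          }
      }
  }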



So I would suggest using a static set of queries and an identical set of segments to generate the numbers.

If you repeat the same query twice, the results will of course come back faster the second time, because the relevant data will already be in the OS disk cache.



An interesting number is the response time per hit and per parallel request. I would expect that the size of the index influences both the number of hits returned and how long it takes to locate the summary on disk.

Summaries are located in nearly constant time - each data file in a segment is accompanied by a small "index" file (note: this is NOT the Lucene index!), which is loaded entirely into memory. The time to seek to the right position in the data file is then bounded, in the worst case, by the time it takes to seek between two consecutive positions recorded in that index. IIRC an index entry is created for every 128 data entries.
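In other words, the lookup works more or less like this (a toy sketch of the idea, not the actual MapFile code):

  import java.util.Arrays;

  // Toy model of a MapFile-style lookup: a sparse in-memory index over a
  // sorted on-disk data file. Not Nutch's actual code, just the idea.
  class SparseIndex {
      static final int INDEX_INTERVAL = 128;  // one index entry per 128 data entries

      final long[] indexedKeys;  // every 128th key, kept fully in RAM
      final long[] fileOffsets;  // byte offset of that key in the data file

      SparseIndex(long[] indexedKeys, long[] fileOffsets) {
          this.indexedKeys = indexedKeys;
          this.fileOffsets = fileOffsets;
      }

      // Returns the offset at which to start scanning for 'key'. The sequential
      // scan that follows is bounded by INDEX_INTERVAL entries, so the lookup
      // cost does not grow with the size of the data file.
      long seekPosition(long key) {
          int i = Arrays.binarySearch(indexedKeys, key);
          if (i < 0) i = -i - 2;   // position of the largest indexed key <= key
          if (i < 0) i = 0;        // key is smaller than the first indexed key
          return fileOffsets[i];
      }
  }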



One question I still have is: how do the number and size of segments per search server influence the response time? Which is better: many small segments or one big one? Looking at the servers I use for the tests -
you can imagine my problem running this kind of test :-)

Random access to MapFiles is more or less independent of the data file size, as explained above. So, I think it's better to have one big segment than many small ones. What is much more important is to make sure that the "index" files (mentioned above) are not corrupted - in such a case everything will appear to work correctly, but seeking performance will be terrible.


Related to this, it is also better to use a single merged Lucene index than many per-segment indexes - the latter will work too, but performance will be lower, and there might also be weird problems with scoring.
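With plain Lucene the merge itself is only a few lines - something like the following (written from memory against the Lucene 1.4-era API, so double-check it against the version you actually run):

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.store.Directory;
  import org.apache.lucene.store.FSDirectory;

  // Merge several per-segment Lucene indexes into one index.
  public class MergeIndexes {
      public static void main(String[] args) throws Exception {
          // args: one directory per per-segment index
          Directory merged = FSDirectory.getDirectory("index-merged", true);
          IndexWriter writer = new IndexWriter(merged, new StandardAnalyzer(), true);

          Directory[] parts = new Directory[args.length];
          for (int i = 0; i < args.length; i++) {
              parts[i] = FSDirectory.getDirectory(args[i], false);
          }
          writer.addIndexes(parts);  // copy and merge the per-segment indexes
          writer.optimize();         // collapse everything into one optimized index
          writer.close();
      }
  }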


--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web, Embedded Unix, System Integration
http://www.sigram.com
Contact: info at sigram dot com


