> From: Dmitry Serebrennikov [mailto:[EMAIL PROTECTED]]
>
> Doug, what about scaling with the number of concurrent queries? In other
> words, assuming that the queries are similar, how is the search
> performance affected by an increase in the number of concurrent users?
On a single-CPU, single-disk machine, throughput will likely be best with a few simultaneous queries. For example, if each query were to spend half of its time on the CPU and half waiting for i/o, and queries take two seconds total, then the ideal would be to have queries arrive every second, so that while one was using the CPU the other could use the disk. Things don't usually work out quite so neatly, but you get the idea.

Multi-CPU and/or multi-disk systems can provide greater parallelism and hence greater query throughput. However, Lucene's FSDirectory serializes reads to a given file (since it keeps only a single file descriptor per file), which limits i/o parallelism. Someone with a large disk array would be better served by a Directory implementation that uses Java 1.4's new i/o classes. In particular, the FileChannel class supports reads that do not move the file pointer, so that multiple reads on the same file can be in progress at the same time.

Instead of building big boxes, another approach to scaling performance is to use lots of little boxes: split the index into sub-indexes, broadcast each search to all of the sub-indexes, and merge the results. To minimize the number of boxes, you want to give each box the largest index it can search quickly, typically one that mostly fits in RAM. It would be fairly easy to implement such a distributed search system with Lucene by writing a multi-threaded version of MultiSearcher that uses RPC to send queries to remote systems.

Deciding whether adding CPUs, disks, or RAM is most cost-effective is complex. In short, the best approach is to benchmark various configurations.

The big internet search engines do all of the above, plus other tricks. For example, they can keep two indexes: one containing one million "popular" pages, and another containing 100 million less popular pages.
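As a quick aside, the CPU/disk overlap arithmetic above can be made concrete. This is just a back-of-the-envelope model using the email's hypothetical numbers (one second of CPU plus one second of i/o per query), not measurements:

```java
public class Throughput {
    // Queries served strictly one at a time: each occupies the machine
    // for its full cpu + io duration.
    static double serialRate(double cpuSec, double ioSec) {
        return 1.0 / (cpuSec + ioSec);
    }

    // Perfectly pipelined: while one query computes, another waits on
    // disk, so the bottleneck is whichever resource is busier.
    static double overlappedRate(double cpuSec, double ioSec) {
        return 1.0 / Math.max(cpuSec, ioSec);
    }

    public static void main(String[] args) {
        System.out.println(serialRate(1.0, 1.0));     // 0.5 queries/sec
        System.out.println(overlappedRate(1.0, 1.0)); // 1.0 queries/sec
    }
}
```

With one query in flight at a time you get 0.5 queries/sec; with arrivals every second overlapping CPU and disk, 1.0 queries/sec, i.e. double the throughput.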
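The positional read mentioned above looks like this in practice. FileChannel.read(ByteBuffer, long) is the 1.4-era call that reads at an absolute offset; the file setup here uses the later Files/Path convenience APIs purely to keep the sketch self-contained:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class PositionalRead {
    // read(buf, position) reads at an absolute offset without moving the
    // channel's file pointer, so multiple threads can share one channel
    // and issue reads on the same file concurrently.
    static String readAt(FileChannel ch, long position, int len) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(len);
        ch.read(buf, position);
        return new String(buf.array());
    }

    public static void main(String[] args) throws IOException {
        Path p = Files.createTempFile("demo", ".bin");
        Files.write(p, "hello world".getBytes());
        try (FileChannel ch = FileChannel.open(p, StandardOpenOption.READ)) {
            System.out.println(readAt(ch, 6, 5)); // world
            System.out.println(ch.position());    // 0 -- pointer unmoved
        } finally {
            Files.delete(p);
        }
    }
}
```

Contrast with RandomAccessFile-style seek-then-read, where every read moves the shared file pointer and so must be serialized behind a lock.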
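The broadcast-and-merge pattern above can be sketched with plain threads. Here each "shard" is an in-memory stand-in for a remote sub-index that a real system would reach over RPC, and searchShard is a hypothetical substring match rather than Lucene scoring; the fan-out/merge structure is the point:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class FanOutSearch {
    // Hypothetical per-shard search: stands in for a remote sub-index.
    static List<String> searchShard(List<String> shard, String term) {
        List<String> hits = new ArrayList<>();
        for (String doc : shard)
            if (doc.contains(term)) hits.add(doc);
        return hits;
    }

    static List<String> search(List<List<String>> shards, String term)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(shards.size());
        try {
            // Broadcast the query to every shard in parallel...
            List<Future<List<String>>> futures = new ArrayList<>();
            for (List<String> shard : shards)
                futures.add(pool.submit(() -> searchShard(shard, term)));
            // ...then merge the per-shard result lists.
            List<String> merged = new ArrayList<>();
            for (Future<List<String>> f : futures)
                merged.addAll(f.get());
            return merged;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<List<String>> shards = List.of(
                List.of("lucene index", "java nio"),
                List.of("lucene search", "disk array"));
        System.out.println(search(shards, "lucene"));
        // [lucene index, lucene search]
    }
}
```

A multi-threaded MultiSearcher would follow the same shape, with the pool.submit calls replaced by remote search invocations and the merge step combining score-ordered hit lists instead of concatenating.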
If they don't let folks see past the 1000th hit, and they find 1,500 hits in the popular index, then they can return just the hits from the popular index and report that there are 150,000 hits total. They don't know how many there really are, but that's a good estimate, and no one can prove them wrong. If they can thus answer the vast majority of queries from a small, "popular" index, then they don't have to make the big index as scalable, and can save lots of money.

Doug

_______________________________________________
Lucene-dev mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/lucene-dev
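For the curious, the 150,000 figure above is just the popular-index hit count scaled by the size ratio of the two indexes (1,500 x 100,000,000 / 1,000,000). A minimal sketch of that extrapolation, with names chosen here for illustration:

```java
public class HitEstimate {
    // Assume hit density in the full index matches the popular index,
    // and scale the observed hit count by the index-size ratio.
    static long estimate(long popularHits, long popularSize, long fullSize) {
        return popularHits * (fullSize / popularSize);
    }

    public static void main(String[] args) {
        System.out.println(estimate(1_500, 1_000_000, 100_000_000)); // 150000
    }
}
```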
