On 10/10/06, David Balmain <[EMAIL PROTECTED]> wrote:
I did set maxBufferedDocs to 1000 and optimized both indexes at the end, but I didn't use the non-compound format. I think it is better to use the compound file format since it is the default in both libraries, and the penalty will be similar in both cases.
When people care about performance, I always advise using the non-compound format. If the number of files gets too large, it's better (in general) to decrease mergeFactor rather than switch to the compound format. Since we are benchmarking performance, we should use the recommendations we would give to others who are trying to get the best performance (like using the latest JVM, using -server, using a big enough heap, increasing maxBufferedDocs, etc.).
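For example, here is a minimal sketch of that kind of setup against the 2.0-era IndexWriter API; the index path, analyzer choice, and exact numbers are placeholders rather than the settings used in the benchmark above, and the JVM would be started with something like -server and a large -Xmx:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class FastIndexing {
    public static void main(String[] args) throws Exception {
        // create a new index; path and analyzer are placeholders
        IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), true);
        writer.setUseCompoundFile(false);  // non-compound format: faster, but more files
        writer.setMaxBufferedDocs(1000);   // buffer more docs in RAM before flushing a segment
        writer.setMergeFactor(10);         // keep mergeFactor modest to limit open files

        // ... add documents here ...

        writer.optimize();                 // merge down to a single segment at the end
        writer.close();
    }
}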
If you really like I can tell you what the difference is for my tests. Please feel free to tell me where else I can improve the Lucene benchmarker.
Where is the source? I'd be most interested in testing it on the current version of Lucene in the SVN trunk.
> So is Ferret faster for searching too? The absence of stats suggests
> that it's not :-)

:-) Well, I'd like to think the absence of stats for searching has nothing to do with Lucene being faster.
Does that mean it's an unknown? You haven't tested it?
For starters, the indexing time is a lot more noticeable to the user.
In general, I would have assumed the opposite. I guess it depends a lot on the usage patterns, but at CNET, indexing time is relatively unimportant for most of our collections... as long as it keeps up with document changes, it isn't too much of an issue.

Searches, on the other hand, are very important. Some searches are even done as part of the dynamic generation of page content, so the latency of the search adds to the latency of the page as a whole! In other collections, throughput is most important, as long as most searches take less than 1 second. But since our searches are normally CPU bound, there is normally a rather direct correlation between the latency of a single request and the throughput of the system as a whole.

When I've had to do performance work in the past, it's *always* been on the search side.
And benchmarking searching is a little more difficult. There are numerous Queries, Filters and Sorts to test and it's important to test with optimized and unoptimized indexes. Anyway, I'll attempt to put a search benchmark out tomorrow.
It doesn't have to be all or nothing... we could just start out with some of the most common:
- some single term queries
- some multi term queries
- some phrase queries

No filters, sort by relevance, take the top 50, don't retrieve stored fields. Assuming there is no caching, putting these queries in a loop to get a run that lasts several minutes would be good. In the future, it would be nice to test multiple clients (threads) at once since it more closely simulates a server environment.

One could also think about automating the creation of queries... find the top terms in the corpus and use those terms to create random queries. Certainly not as realistic as using a real query log, but it can be used for any corpus.

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server
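A minimal sketch of the kind of query loop described above, assuming the 2.0-era searcher API and made-up field and term names ("body", "apache", etc.) that would need to be replaced with ones drawn from the actual corpus:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

public class SearchBench {
    public static void main(String[] args) throws Exception {
        IndexSearcher searcher = new IndexSearcher("/path/to/index");

        // a single term query
        Query single = new TermQuery(new Term("body", "apache"));

        // a multi term query (OR of two terms)
        BooleanQuery multi = new BooleanQuery();
        multi.add(new TermQuery(new Term("body", "apache")), BooleanClause.Occur.SHOULD);
        multi.add(new TermQuery(new Term("body", "lucene")), BooleanClause.Occur.SHOULD);

        // a phrase query
        PhraseQuery phrase = new PhraseQuery();
        phrase.add(new Term("body", "open"));
        phrase.add(new Term("body", "source"));

        Query[] queries = { single, multi, phrase };

        // loop for several minutes: no filter, relevance order, top 50,
        // and stored fields are never retrieved
        long stop = System.currentTimeMillis() + 5 * 60 * 1000L;
        long searches = 0;
        long hits = 0;
        while (System.currentTimeMillis() < stop) {
            for (int i = 0; i < queries.length; i++) {
                TopDocs top = searcher.search(queries[i], null, 50);
                hits += top.totalHits;
                searches++;
            }
        }
        System.out.println("searches completed: " + searches + ", total hits: " + hits);

        searcher.close();
    }
}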