Hi Everyone,
I've been lurking on this list for a couple of weeks now. I thought I
would contribute my experiences (with timings) of using lucene.
The main issue we have is why does performance decrease significantly
when searching with multiple threads?
I hope this helps people starting out with lucene to compare with their
performance.
Hamish
BTW: Optimizing took 3.5 minutes at 500,000 documents and 4.7 minutes at
1,000,000 documents. Sorry I don't have memory usage figures.
<benchmark>
Hardware environment
--------------------
Dedicated machine for indexing (yes/no): yes
CPU (Type, Speed and Quantity): Intel x86 P4 1.5Ghz
RAM: 512 DDR
Drive configuration (IDE, SCSI, RAID-1, RAID-5): IDE 7200rpm Raid-1
Software environment
--------------------
Java Version: 1.3.1 IBM JITC Enabled
OS Version: Debian Linux 2.4.18-686
Location of index directory (local/network): local
Lucene indexing variables
-------------------------
Number of source documents: Random generator. Set to make 1M documents
in 2x500,000 batches.
Total filesize of source documents: > 1Gb if stored.
Average filesize of source documents (in KB/MB): 1kb
Source documents storage location (filesystem, DB, http,etc): fs
File type of source documents: generated.
Parser(s) used, if any: default
Analyzer(s) used: default
Number of fields per document: 11
Type of fields: 1 date, 1 id, 9 text
Index persistence (FSDirectory, SqlDirectory, etc): FSDirectory
Time taken (in ms/s as an average of at least 3 indexing runs):
Time taken / 1000 docs indexed: 49seconds
Memory consumption: unsure
Notes (any special tuning/strategies):
--------------------------------------
A windows client ran a random document generator which created
documents based on some arrays of values and an excerpt (approx 1kb)
from a text file of the bible (King James version).
These were submitted via a socket connection (open throughout
indexing process).
The index writer was not closed between index calls.
This created a 400Mb index in 23 files (after optimization).
Query details:
--------------
Set up a threaded class to start x number of simultaneous threads to
search the above created index.
Query: +Domain:sos +(+((Name:goo*^2.0 Name:plan*^2.0) (Teaser:goo* Tea
ser:plan*) (Details:goo* Details:plan*)) -Cancel:y)
+DisplayStartDate:[mkwsw2jk0
-mq3dj1uq0] +EndDate:[mq3dj1uq0-ntlxuggw0]
This query counted 34000 documents and I limited the returned documents
to 5.
This is using Peter Halacsy's IndexSearcherCache slightly modified to
be a singleton returned cached searchers for a given directory. This
solved an initial problem with too many files open and running out of
linux handles for them.
Threads|Avg Time per query (ms)
1 1009ms
2 2043ms
3 3087ms
4 4045ms
. .
. .
10 10091ms
I removed the two date range terms from the query and it made a HUGE
difference in performance. With 4 threads the avg time dropped to 900ms!
Other query optimizations made little difference.
</benchmark>
--
To unsubscribe, e-mail: <mailto:[EMAIL PROTECTED]>
For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>
- RE: Conclusion: Escaping Does *not* Work Spencer, Dave
- basic query and parser question. Raj
- Re: Benchmarking Results Hamish Carpenter
- Re: Benchmarking Results Kelvin Tan
