Re: Two possible solutions on Parallel Searching

Doug Cutting Thu, 13 Nov 2003 10:20:41 -0800

First, note that the approaches you describe will only improve performance if you have multiple CPUs and/or multiple disks holding the indexes.

Second, MultiSearcher is currently implemented to search indexes serially, not each in a separate thread. To implement multi-threaded searching one could subclass MultiSearcher with, e.g., ParallelSearcher, and override the search() methods with multi-threaded implemenations. This would be a great contribution if someone is interested!

The parallel approach I prefer is to maintain a set of indexes, each on a separate machine, then use something like a ParallelSearcher of RemoteSearchables to search them all.

Doug

Jie Yang wrote:

I had a thought on my earlier post on "Poor Performance when searching for 500+ terms".
The problem is on how to improve the performance when
searching for 500+ OR search terms. i.e. enter a
search string of :
W1 OR W2 OR W3 OR ...... OR w500.
I thought I could rewrite the MultiSearcher class so
that it can initiate multiple parallel IndexSearchers
to perform the search.
Solution 1 would be divide the query string of "500 OR
conditions terms" into 25 "20 OR conditions terms",
and then pass them to MultiSearcher, MultiSearcher
then initiate 25 threads to search on a single index
directory.
Solution 2 would be when building an index of 1 million docs, instead of building one single index containing 1 million docs, build 10 index directory eaching containing 100K records. then I pass a single query string of "500 OR conditions terms" to MultiSearcher, MultiSearcher then initiate 10 threads to search for 10 different index directories.

Has anyone tried something similar, which solution would be a better one. Also is using multiple threads on a single directory a good ideal? Are there any bottlenecks for threads acessing resources, or I better pass requests into different processes.

Thanks a lot
 --- Jie Yang <[EMAIL PROTECTED]> wrote: > Thanks
Julian
I am not using RAMDirectory due to the large size of
index file. the index generated on hard disc is
1.57G
for 1 million documents, each document has average
500
terms. I am using Field.UnStored(fieldName, terms),
so
i beliece I am not storing the documents, just the
index. (is that right?) is there anyway to reduce
the
index size created? also What is the maximum size of
data can be stored in RAMDirectory? I suppose I
could
get a 10G RAM solaris box, but would that be
advisable
say storing 2-3G of index data in memory? Also, what
is the performance boost factor when RAMDirectory
comparing to FSDirectory. Are we taling about > 100%
here?
On your 2nd and 3rd suggestion, I probably run the
latest code that includes the fix by Dmitry
Serebrennikov, the build was checked out from CVS
yesterday. and I used a QueryParser similar to the
one
used in the demo code.
Again, I still feel a bit curious and want to find
out
does lucene do (or in the future) pre-filter on "AND
join conditions". For example, A AND (B OR C OR D).
if
A finds 100 docs out of 1 million, can lucene
restrict
the searchs on B,C,D only within the 100 docs found?
Thanks a lot.

Response to: Poor Performance when searching for
500+

terms (Jie Yang)

From: Julien Nioche <[EMAIL PROTECTED]>
Subject: Poor Performance when searching for 500+
terms
Date: Thu, 13 Nov 2003 12:45:50 +0100
Content-Type: text/plain; charset="iso-8859-1"
Hello,

Since there are a lot of Term objects in your
Query,

your application must spend a lot of time collecting information about those Terms.

1/ Do you use RAMDirectory? Loading the whole Directory into memory will increase speed - your index must not be too big
though

2/ You are probably not using the QueryParser - so when you are building the Query you could sort the Term objects inside a BooleanQuery. Sorting the Terms will reduce jumps on disk. I have no
benchmarks
for this, but
logically, it should have some positive effect when
using FSDirectory. Am I wrong?

3/ There was a patch submitted by Dmitry
Serebrennikov

(http://www.mail-archive.com/[EMAIL PROTECTED]/msg02762.html)

which reduced garbage collecting by limiting the creation of temporary Term objects. This patch has not been included in Lucene code (a bug in it?).

Hope it helps.

Julien
________________________________________________________________________

Want to chat instantly with your online friends? Get the FREE Yahoo! Messenger http://mail.messenger.yahoo.co.uk
________________________________________________________________________
Want to chat instantly with your online friends?  Get the FREE Yahoo!
Messenger http://mail.messenger.yahoo.co.uk
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: Two possible solutions on Parallel Searching

Reply via email to