Re: Two possible solutions on Parallel Searching

Jie Yang Thu, 13 Nov 2003 14:54:36 -0800

In this case, probably using a single RAMDirectory
would allow me to run parallel searching without worry
about disk access. Well, anyone tried to have a
RAMDirectory of 5G in size?


 --- Otis Gospodnetic <[EMAIL PROTECTED]>
wrote: > Multiple threads against the same index or
multiple
> indices - no
> advantage - think about the mechanical parts
> involved (disk head).
> 
> Multiple threads against indices on different disks
> (not just
> paritions!) - yes, that would be faster.
> 
> Reading the index from the disk is the bottleneck,
> not the CPU.
> So if you split the index over multiple disks and
> access them in
> parallel, you have a good solution.
> 
> When you hit the CPU limit, then you spread this
> over multiple search
> servers then you again hit in parallel.
> 
> If each of them has multiple disks, you can search
> them in parallel.
> 
> And so on.... :)
> 
> Otis
> 
> 
> --- Jie Yang <[EMAIL PROTECTED]> wrote:
> >  --- Doug Cutting <[EMAIL PROTECTED]> wrote: >
> First,
> > note that the approaches you describe will
> > > only improve 
> > > performance if you have multiple CPUs and/or
> > > multiple disks holding the 
> > > indexes.
> > > 
> > > Second, MultiSearcher is currently implemented
> to
> > > search indexes 
> > > serially, not each in a separate thread.  To
> > > implement multi-threaded 
> > > searching one could subclass MultiSearcher with,
> > > e.g., ParallelSearcher, 
> > > and override the search() methods with
> > > multi-threaded implemenations. 
> > > This would be a great contribution if someone is
> > > interested!
> > 
> > That's the solution I have in mind, since I do not
> > have multiple machines avaiable to me. Would
> > multi-threads on a single disk produces any
> > performance advantage? since i don't see CPU is
> > utilised 100% normally. In this case, does the
> > bottleneck happens on the mutli-thread concurrent
> > process sharing or on the single disk read? if its
> the
> > latter, probably, I can get away by putting
> multiple
> > indexes on seperate hard disks, but still runs on
> > single CPU. 
> > 
> > Are you going to have a try on building a
> > multi-threaded MultiSearcher? if so, what's the
> > timescale you have in mind? I may have to look
> into
> > myself if I don't find out any better solutions on
> my
> > 500+ terms search problems.   
> > 
> > > 
> > > The parallel approach I prefer is to maintain a
> set
> > > of indexes, each on 
> > > a separate machine, then use something like a
> > > ParallelSearcher of 
> > > RemoteSearchables to search them all.
> > > 
> > > Doug
> > > 
> > > Jie Yang wrote:
> > > > I had a thought on my earlier post on "Poor
> > > > Performance when searching for 500+ terms". 
> > > > 
> > > > The problem is on how to improve the
> performance
> > > when
> > > > searching for 500+ OR search terms. i.e. enter
> a
> > > > search string of :
> > > > 
> > > > W1 OR W2 OR W3 OR ...... OR w500.
> > > > 
> > > > I thought I could rewrite the MultiSearcher
> class
> > > so
> > > > that it can initiate multiple parallel
> > > IndexSearchers
> > > > to perform the search.
> > > > 
> > > > Solution 1 would be divide the query string of
> > > "500 OR
> > > > conditions terms" into 25 "20 OR conditions
> > > terms",
> > > > and then pass them to MultiSearcher,
> MultiSearcher
> > > > then initiate 25 threads to search on a single
> > > index
> > > > directory.
> > > > 
> > > > Solution 2 would be when building an index of
> 1
> > > > million docs, instead of building one single
> index
> > > > containing 1 million docs,
> > > > build 10 index directory eaching containing
> 100K
> > > > records. then I pass a single query string of
> "500
> > > OR
> > > > conditions terms" to
> > > > MultiSearcher, MultiSearcher then initiate 10
> > > threads
> > > > to search for 10 different index directories. 
> > > > 
> > > > Has anyone tried something similar, which
> solution
> > > > would be a better one. Also is using multiple
> > > threads
> > > > on a single directory a good ideal? Are there
> any
> > > > bottlenecks for threads acessing resources, or
> > > > I better pass requests into different
> processes. 
> > > > 
> > > > Thanks a lot
> > > > 
> > > > 
> > > > 
> > > > 
> > > >  --- Jie Yang <[EMAIL PROTECTED]> wrote:
> >
> > > Thanks
> > > > Julian
> > > > 
> > > >>I am not using RAMDirectory due to the large
> size
> > > of
> > > >>index file. the index generated on hard disc
> is
> > > >>1.57G
> > > >>for 1 million documents, each document has
> average
> > > >>500
> > > >>terms. I am using Field.UnStored(fieldName,
> > > terms),
> > > >>so
> > > >>i beliece I am not storing the documents, just
> the
> > > >>index. (is that right?) is there anyway to
> reduce
> > > >>the
> > > >>index size created? also What is the maximum
> size
> > > of
> > > >>data can be stored in RAMDirectory? I suppose
> I
> > > >>could
> > > >>get a 10G RAM solaris box, but would that be
> > > >>advisable
> > > >>say storing 2-3G of index data in memory?
> Also,
> > > what
> > > >>is the performance boost factor when
> RAMDirectory
> > > >>comparing to FSDirectory. Are we taling about
> >
> > > 100%
> > > >>here?
> > > >>
> > > >>On your 2nd and 3rd suggestion, I probably run
> the
> > > >>latest code that includes the fix by Dmitry
> > > >>Serebrennikov, the build was checked out from
> CVS
> > > >>yesterday. and I used a QueryParser similar to
> the
> > > >>one
> > > >>used in the demo code.
> > > >>
> > > >>Again, I still feel a bit curious and want to
> find
> > > >>out
> > > >>does lucene do (or in the future) pre-filter
> on
> > > "AND
> > > >>join conditions". For example, A AND (B OR C
> OR
> > > D).
> > > >>if
> > > >>A finds 100 docs out of 1 million, can lucene
> > > >>restrict
> 
=== message truncated === 

________________________________________________________________________
Want to chat instantly with your online friends?  Get the FREE Yahoo!
Messenger http://mail.messenger.yahoo.co.uk

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: Two possible solutions on Parallel Searching

Reply via email to