I was reading about scaling in Lucene with Remote Parallel Multisearcher. I
have not tried this beast yet and would be interested in hearing from
anyone who has attempted to use it. I see that there were some posts
about it a couple of years back. I think that if something like this
works, the cluster arrangement I asked about may be possible.
Regards,
David
Pete wrote:
On Thursday April 5 2007 10:43 am, David Pratt wrote:
Hi Pete. Many thanks for this advice. It would seem that a cluster,
spread over some number of lower-end servers, would best solve this.
From what I have read on large indexes, this seems to be the usual
approach (but with as much RAM as possible per server). I am looking at
costs, so the lower-end 2GB RAM servers are attractive; I would just use
more of them.
I have only used PyLucene for tests on smaller indexes. Is a cluster
arrangement possible using PyLucene? I am not a Java programmer, so I
would like to stay with what I know. Many thanks.
For indexing? Not really sure how that'd work. If you want to serve all
searches for all of the documents off one box, you're gonna have to move all
of the indexes together at some point. It's possible to use multiple servers
to create indexes, ship them to a single box and then merge them.
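To make the "index separately, ship, then merge" idea concrete, here is a
minimal pure-Python sketch. It is a toy inverted index, not Lucene or
PyLucene code: build_index and merge_indexes are hypothetical helper names
invented for this illustration, and it assumes document ids are already
unique across all the servers.

```python
from collections import defaultdict


def build_index(docs):
    """Build a toy inverted index: term -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def merge_indexes(indexes):
    """Merge partial indexes by unioning the posting set of each term.

    Assumes document ids do not collide between the partial indexes.
    """
    merged = defaultdict(set)
    for index in indexes:
        for term, doc_ids in index.items():
            merged[term] |= doc_ids
    return merged


# Each "server" indexes its own slice of the collection.
part_a = build_index({"a1": "distributed search with lucene",
                      "a2": "python bindings"})
part_b = build_index({"b1": "lucene index merge",
                      "b2": "search cluster"})

# The partial indexes are shipped to one box and merged there.
merged = merge_indexes([part_a, part_b])
```

In Lucene itself the merge step is, as far as I know, what
IndexWriter.addIndexes is for; the sketch only shows the shape of the
workflow, not that API.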
As for searching a collection this large, your options are either Big Iron or
distribution. Google's pretty convincingly demonstrated that the latter is
the way to go. Hadoop (http://lucene.apache.org/hadoop/about.html) is a
Lucene subproject for doing exactly this, but it's a) Java and b) nowhere
near done. I believe http://hyperestraier.sourceforge.net/ has support for
distribution (and Python bindings) but I haven't tried it.
The short version: if you can partition your index into logically distinct
chunks and have no need to perform searches across these chunks, distribution
is pretty straightforward - it's really just setting up a bunch of small
servers. If you can't partition your data this way, the problem is much
harder. AFAIK (and I've done quite a lot of research), there is no mature
OSS package to do this in any language (and certainly not Python). There are
a number of commercial solutions, including http://www.dieselpoint.com/
(Java, but interoperable).
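As a rough illustration of the easy case above (logically distinct chunks,
no searches across chunks), the following pure-Python sketch routes every
document and every query to exactly one partition. Shard and
PartitionedSearch are hypothetical names made up for this example, each
Shard stands in for one small search server, and the partition keys are
invented; none of this is PyLucene API.

```python
from collections import defaultdict


class Shard:
    """Toy stand-in for one small server holding a single partition."""

    def __init__(self):
        self.index = defaultdict(set)  # term -> set of document ids

    def add(self, doc_id, text):
        for term in text.lower().split():
            self.index[term].add(doc_id)

    def search(self, term):
        # .get() avoids creating empty entries for unknown terms.
        return sorted(self.index.get(term.lower(), set()))


class PartitionedSearch:
    """Route each request to the one shard that owns its partition."""

    def __init__(self):
        self.shards = defaultdict(Shard)

    def add(self, partition, doc_id, text):
        self.shards[partition].add(doc_id, text)

    def search(self, partition, term):
        shard = self.shards.get(partition)
        return shard.search(term) if shard is not None else []


cluster = PartitionedSearch()
cluster.add("customer-1", "d1", "quarterly sales report")
cluster.add("customer-2", "d2", "sales forecast")

# A query only ever touches the shard for its own partition.
hits = cluster.search("customer-1", "sales")
```

Because no query needs results from more than one shard, there is no score
merging or cross-server coordination, which is why this case stays simple.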
See my message titled "Distributed Indexes, Pycon, was Re: [pylucene-dev] Is
there PyNutch?" from February 19 in the archives for a discussion of some of
these issues.