I was reading about scaling in Lucene with Remote Parallel Multisearcher. I have not tried this beast yet and would be interested in hearing from anyone who has attempted to use it. I see that there were some posts about it a couple of years back. I think if something like this could work, a cluster arrangement may be possible.

Regards,
David


Pete wrote:
On Thursday April 5 2007 10:43 am, David Pratt wrote:
Hi Pete. Many thanks for this advice. It would seem that a cluster
spread over some number of lower end servers would best solve this.
From what I have read on large indexing, this seems to be the usual
approach (but with as much RAM as possible per server). I am looking at
costs, so the lower end 2GB RAM servers are attractive; I would just
use more of them.

I have only used pylucene for tests on smaller indexes. Is a cluster
arrangement possible using pylucene? I am not a java programmer so would
like to stay with what I know. Many thanks.

For indexing? Not really sure how that'd work. If you want to serve all searches for all of the documents off one box, you're gonna have to move all of the indexes together at some point. It's possible to use multiple servers to create indexes, ship them to a single box and then merge them.
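To make the build-separately-then-merge idea concrete, here is a minimal pure-Python sketch. It uses a toy inverted index (term -> document ids) rather than real Lucene index files, and the names `build_index` and `merge_indexes` are made up for illustration; they are not PyLucene calls. It assumes document ids are unique across all the machines that build partial indexes.

```python
from collections import defaultdict


def build_index(docs):
    """Build a toy inverted index: term -> sorted list of doc ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def merge_indexes(partials):
    """Merge partial indexes into one.

    Assumption: doc ids are globally unique, so postings can
    simply be unioned term by term.
    """
    merged = defaultdict(set)
    for partial in partials:
        for term, ids in partial.items():
            merged[term].update(ids)
    return {term: sorted(ids) for term, ids in merged.items()}


# Two "servers" each index their own slice of the collection.
index_a = build_index({1: "lucene scales out", 2: "python bindings"})
index_b = build_index({3: "lucene in python"})

# The partial indexes are shipped to one box and merged there.
merged = merge_indexes([index_a, index_b])
```

A real index carries much more than postings (term frequencies, norms, stored fields), so this only shows the shape of the workflow, not what a Lucene merge actually does internally.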

As for searching a collection this large, your options are either Big Iron or distribution. Google's pretty convincingly demonstrated that the latter is the way to go. Hadoop (http://lucene.apache.org/hadoop/about.html) is a Lucene subproject that provides a platform for this kind of distributed processing, but it's a) Java b) nowhere near done. I believe http://hyperestraier.sourceforge.net/ has support for distribution (and Python bindings) but I haven't tried it.

The short version: if you can partition your index into logically distinct chunks and have no need to perform searches across these chunks, distribution is pretty straightforward - it's really just setting up a bunch of small servers. If you can't partition your data this way, the problem is much harder. AFAIK (and I've done quite a lot of research), there is no mature OSS package to do this in any language (and certainly not Python). There are a number of commercial solutions, including http://www.dieselpoint.com/ (Java, but interoperable).
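As a rough sketch of the two cases, assuming a hypothetical set of shard hosts and a made-up partition key such as a customer id: the easy case routes each query to exactly one shard, while the hard case has to ask every shard and merge the hits. None of the names below come from Lucene or PyLucene.

```python
import heapq
import zlib

# Hypothetical shard hosts, one small server per index partition.
SHARDS = ["search1.example.com", "search2.example.com", "search3.example.com"]


def shard_for(partition_key, shards=SHARDS):
    """Easy case: map a partition key to the one shard that holds it.

    crc32 is used because it is stable across processes and machines.
    """
    digest = zlib.crc32(partition_key.encode("utf-8")) & 0xFFFFFFFF
    return shards[digest % len(shards)]


def merge_hits(per_shard_hits, k):
    """Hard case: combine (score, doc_id) hits from every shard, keep the top k.

    Caveat: scores computed against separate indexes are not strictly
    comparable, since term statistics differ from shard to shard.
    """
    all_hits = (hit for hits in per_shard_hits for hit in hits)
    return heapq.nlargest(k, all_hits)


# Every query for this key goes to the same box.
target = shard_for("customer-42")

# A cross-partition search must gather results from all boxes.
top = merge_hits([[(0.9, "a1"), (0.2, "a2")], [(0.7, "b1")], []], k=2)
```

The routing function is the whole of the easy case. The merge step is where the real difficulty starts: ranking across shards, handling slow or failed shards, and keeping the partitions balanced are all left out here.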

See my message titled "Distributed Indexes, Pycon, was Re: [pylucene-dev] Is there PyNutch?" from February 19 in the archives for a discussion of some of these issues.
_______________________________________________
pylucene-dev mailing list
[email protected]
http://lists.osafoundation.org/mailman/listinfo/pylucene-dev
