On Thursday April 5 2007 10:43 am, David Pratt wrote:
> Hi Pete. Many thanks for this advice. It would seem that perhaps a
> cluster would best solve this and then spread over some number of lower
> end servers. From what i read on large indexing, this seems to be the
> approach (but with as much RAM as possible per server). I am looking at
> costs so the lower end 2GB RAM servers are attractive but just use more
> of them.
>
> I have only used pylucene for tests on smaller indexes. Is a cluster
> arrangement possible using pylucene? I am not a java programmer so would
> like to stay with what I know. Many thanks.

For indexing?  Not really sure how that'd work.  If you want to serve all 
searches for all of the documents off one box, you're gonna have to move all 
of the indexes together at some point.  It's possible to use multiple servers 
to create indexes, ship them to a single box and then merge.
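
To make the build-then-merge idea concrete, here is a toy sketch in plain 
Python.  It does not use the PyLucene API: the inverted index is just a dict, 
and build_index and merge_indexes are hypothetical names made up for 
illustration.  Each call to build_index stands in for one indexing server; 
merge_indexes is what the single serving box would do once the partial 
indexes have been shipped to it.

```python
from collections import defaultdict


def build_index(docs):
    """Build a toy inverted index (term -> sorted doc ids) on one server.

    `docs` maps a document id to its text. This is an illustration only,
    not how Lucene stores an index.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def merge_indexes(partials):
    """Merge partial indexes on the single box that will serve searches."""
    merged = defaultdict(set)
    for partial in partials:
        for term, ids in partial.items():
            merged[term].update(ids)
    return {term: sorted(ids) for term, ids in merged.items()}


# Two "servers" index disjoint sets of documents, then one box merges.
part_a = build_index({1: "big iron", 2: "small servers"})
part_b = build_index({3: "big cluster"})
merged = merge_indexes([part_a, part_b])
```

The merge only works this simply because the document ids are assumed to be 
unique across servers; a real setup would have to guarantee that.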

As for searching a collection this large, your options are either Big Iron or 
distribution.  Google's pretty convincingly demonstrated that the latter is 
the way to go.  Hadoop (http://lucene.apache.org/hadoop/about.html) is a 
Lucene subproject for this kind of distributed work, but it's a) Java and b) nowhere 
near done.  I believe http://hyperestraier.sourceforge.net/ has support for 
distribution (and Python bindings) but I haven't tried it.

The short version: if you can partition your index into logically distinct 
chunks and have no need to perform searches across these chunks, distribution 
is pretty straightforward - it's really just setting up a bunch of small 
servers.  If you can't partition your data this way, the problem is much 
harder.  AFAIK (and I've done quite a lot of research), there is no mature 
OSS package to do this in any language (and certainly not Python).  There are 
a number of commercial solutions, including http://www.dieselpoint.com/ 
(Java, but interoperable).
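
The two cases can be sketched roughly as follows, again in plain Python 
rather than any real search library.  Shard, search_partitioned and 
search_everywhere are hypothetical names, and the "score" is a made-up term 
count.  In the easy case a query goes to exactly one shard.  In the hard case 
every shard is queried and the per-shard hit lists are merged, which is only 
meaningful if scores from different shards are comparable.

```python
import heapq


class Shard:
    """Hypothetical stand-in for one small search server."""

    def __init__(self, docs):
        self.docs = docs  # doc_id -> text

    def search(self, term, k):
        # Toy score: how many times the term occurs in the document.
        hits = []
        for doc_id, text in self.docs.items():
            score = text.lower().split().count(term.lower())
            if score:
                hits.append((score, doc_id))
        return heapq.nlargest(k, hits)


def search_partitioned(shards, partition, term, k=10):
    """Easy case: the caller knows which chunk to ask, so ask only that one."""
    return shards[partition].search(term, k)


def search_everywhere(shards, term, k=10):
    """Hard case: ask every shard, then merge the per-shard top-k lists."""
    all_hits = []
    for shard in shards.values():
        all_hits.extend(shard.search(term, k))
    return heapq.nlargest(k, all_hits)


shards = {
    "2006": Shard({"a": "lucene lucene index", "b": "python"}),
    "2007": Shard({"c": "lucene cluster"}),
}
one_chunk = search_partitioned(shards, "2007", "lucene")
all_chunks = search_everywhere(shards, "lucene", k=2)
```

The toy count is comparable across shards by construction.  With a real 
relevance score that depends on per-index term statistics, that is not 
automatically true, which is part of what makes the unpartitioned case hard.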

See my message titled "Distributed Indexes, Pycon, was Re: [pylucene-dev] Is 
there PyNutch?" from February 19 in the archives for a discussion of some of 
these issues. 

-- 
Peter Fein   ||   773-575-0694   ||   [EMAIL PROTECTED]
http://www.pobox.com/~pfein/   ||   PGP: 0xCCF6AE6B
irc: [EMAIL PROTECTED]   ||   jabber: [EMAIL PROTECTED]
_______________________________________________
pylucene-dev mailing list
[email protected]
http://lists.osafoundation.org/mailman/listinfo/pylucene-dev
