On Wednesday February 14 2007 1:06 pm, Jack L wrote:
> Hello Brian,
>
> Thanks for the reply. (I'm not sure if this discussion is interesting
> to the PyLucene dev list. If it's considered OT, I shall take the next
> email offline.)
I consider this on-topic, largely because I'm interested and there's nowhere else to discuss it. ;)

> I looked at the first link you sent. It's not actually what I'm
> looking for. In our setup, we have multiple crawler/indexer/searcher
> boxes talking to one merger/web-server front-end using Nutch IPC.
> The front-end box sends queries to multiple back-end searchers,
> merges the results it has received, and presents them in a web page.
> I'm hoping to find a way to replace the front-end Java implementation
> with Python. So, the piece I'm looking for does not touch the
> segments. Instead, it speaks Nutch IPC, parses the query
> strings, issues queries to the back-ends, merges the results, and puts
> them in a web page.

I've been kicking this sort of idea around with my coworkers recently. While I haven't used Nutch/Solr, we've used techniques from the latter.

Some background: we're a Python shop [0]. In general, we work with relatively small data sets, but we run rather complex queries and pre-indexing analysis. We use an in-house spider and a Python web server. The front-end runs on a local copy of a PyLucene index updated via the Solr in-process technique [1].

We're starting to push up against the capacity limits of querying on a single server and are thinking about how to partition the index across multiple boxes. In Java, this appears to be done using a MultiSearcher and a RemoteSearchable. The latter is implemented on Java RMI [2], which is not in PyLucene [3]. Here, the fun begins. As best I can tell, MultiSearcher/RemoteSearchable require multiple calls to the slave machines per query. The general thought would be to re-implement such a thing in Python, using something like Perspective Broker [4]. I don't really want to do this, however, as it just doesn't sound like my idea of a good time. I'm starting to formulate some thoughts on alternate approaches, but haven't totally sorted them out.
So, the question on everyone's mind is: for all you folks using PyLucene for *queries*, how do you scale beyond a single machine?

Anyone going to PyCon? Want to have a Birds of a Feather on Lucene/Text Search/Distributed Computing? [5]

--Pete

[0] Interested? We're hiring. This message only hints at the sorts of problems we're trying to solve. Contact me off-list.
[1] http://wiki.apache.org/solr/CollectionDistribution . This was not the easiest thing in the world to get working acceptably with PyLucene, though that appears to have more to do with the Boehm GC. It also requires about 2x the RAM during the switchover period and beats on the disk.
[2] http://lucene.apache.org/java/docs/api/org/apache/lucene/search/MultiSearcher.html
    http://lucene.apache.org/java/docs/api/org/apache/lucene/search/RemoteSearchable.html
    http://java.sun.com/j2se/1.4.2/docs/api/java/rmi/server/UnicastRemoteObject.html
[3] http://www.archivesat.com/pylucene_developers/thread323504.htm
[4] http://twistedmatrix.com/projects/core/documentation/howto/pb-intro.html
[5] http://us.pycon.org/TX2007/BoF

_______________________________________________
pylucene-dev mailing list
[email protected]
http://lists.osafoundation.org/mailman/listinfo/pylucene-dev
