On Wednesday February 14 2007 1:06 pm, Jack L wrote:
> Hello Brian,
>
> Thanks for the reply. (I'm not sure if this discussion is interesting
> to the PyLucene dev list. If it's considered OT, I shall take the next
> email offline.)
I consider this on-topic, largely because I'm interested and there's nowhere else to discuss it. ;)

> I looked at the first link you sent. It's not actually what I'm
> looking for. In our setup, we have multiple crawler/indexer/searcher
> boxes talking to one merger/web-server front-end using Nutch IPC.
> The front-end box sends queries to multiple back-end searchers,
> merges the results it has received, and presents them in a web page.
> I'm hoping to find a way to replace the front-end Java implementation
> with Python. So, the piece I'm looking for does not touch the
> segments. Instead, it speaks Nutch IPC, parses the query
> strings, issues queries to the back-ends, merges the results, and puts
> them in a web page.

I've been kicking this sort of idea around with my coworkers recently. While I haven't used Nutch/Solr, we've used techniques from the latter.

Some background: we're a Python shop [0]. In general, we work with relatively small data sets, but we run rather complex queries and pre-indexing analysis. We use an in-house spider and a Python web server. The front-end runs on a local copy of a PyLucene index updated via the Solr in-process technique [1].

We're starting to push up against the capacity limits of querying on a single server and are thinking about how to partition the index across multiple boxes. In Java, this appears to be done using a MultiSearcher and a RemoteSearchable. The latter is implemented on Java RMI [2], which is not in PyLucene [3]. Here, the fun begins. As best I can tell, MultiSearcher/RemoteSearchable require multiple calls to the slave machines per query. The general thought would be to re-implement such a thing in Python, using something like Perspective Broker [4]. I don't really want to do this, however, as it just doesn't sound like my idea of a good time. I'm starting to formulate some thoughts on alternate approaches, but haven't totally sorted them out.
So, the question on everyone's mind is: for all you folks using PyLucene for *queries*, how do you scale beyond a single machine?

Anyone going to PyCon? Want to have a Birds of a Feather on Lucene/Text Search/Distributed Computing? [5]

--Pete

[0] Interested? We're hiring. This message only hints at the sorts of problems we're trying to solve. Contact me off-list.
[1] http://wiki.apache.org/solr/CollectionDistribution . This was not the easiest thing in the world to get working acceptably with PyLucene, though that appears to have more to do with the Boehm GC. It also requires about 2x the RAM during the switchover period and beats on the disk.
[2] http://lucene.apache.org/java/docs/api/org/apache/lucene/search/MultiSearcher.html
    http://lucene.apache.org/java/docs/api/org/apache/lucene/search/RemoteSearchable.html
    http://java.sun.com/j2se/1.4.2/docs/api/java/rmi/server/UnicastRemoteObject.html
[3] http://www.archivesat.com/pylucene_developers/thread323504.htm
[4] http://twistedmatrix.com/projects/core/documentation/howto/pb-intro.html
[5] http://us.pycon.org/TX2007/BoF

_______________________________________________
pylucene-dev mailing list
[email protected]
http://lists.osafoundation.org/mailman/listinfo/pylucene-dev
