Over the past couple of weeks or so I've been looking into switching our search back end to use PyLucene. I've now got a working prototype which passes the test suite and I've been experimenting with it recently to check out its performance. After all that, I'm not sure which direction makes sense going forward, whether to make the switch or instead try to improve our existing back end.
The one-sentence summary is that PyLucene is more flexible and offers functionality that would take substantial effort for us to engineer, but it has RAM and disk footprints that are heavier than the current implementation and doesn't offer overwhelming speed improvements.

If we went with PyLucene, I could work on making search return the entire action and on updating the APIs to use that ability as best they can. If we stay with the current approach, I would work on speeding up update and laying the groundwork to handle critical features like boolean queries and structured search (which would give us the ability to search against versions and, with a bit more extension, against incorporations).

What I'm looking for from everyone is some views on whether the footprints I'm seeing from PyLucene are just too heavy or not. I have some ideas about how to reduce the footprint of PyLucene, at least a small amount, but I don't expect substantial changes, especially not for the memory growth during search.

In detail, here's what I've found.

Reasons for switching to PyLucene:

- A large variety of the queries we want already exist, including boolean and structured queries, which would need to be implemented in the other engine in the near future and which are not trivial to do.
- Somewhat faster searching locally (roughly 1.0 secs vs 1.4).
- It already correctly handles locking indexes and having readers update on the fly.
- Multiple readers can have the same index open at the same time.
- Easier control of RAM/time tradeoffs.
- Depot RAM usage not dependent on size of index.
- It's likely to scale better in terms of speed for local search, and possibly for remote search as well.

Reasons for sticking with existing approach:

- Smaller indexes, at least so far (40M vs 240M on my local system, 272M vs 4.2G on ipkg, as reported by du).
- Constant depot memory usage for all searches. Using PyLucene makes the depot grow when searches are done for things like p* (up to 710 size, 650M rss).
- Faster search for things like p* (30 seconds vs 2 minutes), though on normal queries times seem comparable.
- More predictable behavior for queries. PyLucene pre-expands wildcard queries and requires a max clause count to be set. Even at 100000, a search against ipkg for '(1.6.0_06)*' broke this limit. Turning this number higher had negative effects on performance from what I observed.

On the subject of faster index update, I think the jury is out. If PyLucene doesn't optimize the index after each install, then it's substantially faster than the current implementation. But I think a fairly simple adjustment to the current implementation, so that it also didn't optimize the index after each installation, would make it at least as fast.

Thanks for your time, I'm looking forward to hearing what everyone thinks.

Brock
_______________________________________________
pkg-discuss mailing list
http://mail.opensolaris.org/mailman/listinfo/pkg-discuss
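
For anyone who wants a feel for the wildcard pre-expansion issue mentioned above, here is a rough sketch in plain Python. It does not use the PyLucene API at all: expand_prefix, TooManyClauses, and the sorted term list are hypothetical stand-ins for the engine's term dictionary and clause limit. It only shows why a broad prefix like p* turns into one clause per matching term and trips a max clause count.

```python
from bisect import bisect_left


class TooManyClauses(Exception):
    """Raised when a wildcard expands past the configured clause limit."""


def expand_prefix(sorted_terms, prefix, max_clauses):
    """Expand a prefix query such as 'p*' into one clause per matching term.

    sorted_terms is a sorted list standing in for the index's term
    dictionary (hypothetical; not a PyLucene structure).
    """
    clauses = []
    # Binary-search to the first term >= prefix, then walk forward.
    start = bisect_left(sorted_terms, prefix)
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break  # sorted order means no later term can match
        if len(clauses) >= max_clauses:
            raise TooManyClauses(
                "prefix %r matched more than %d terms" % (prefix, max_clauses))
        clauses.append(term)
    return clauses


# Toy term dictionary; a real index would hold far more terms.
terms = sorted(["package", "patch", "pkg", "python", "search", "zlib"])

print(expand_prefix(terms, "p", max_clauses=10))
try:
    expand_prefix(terms, "p", max_clauses=3)
except TooManyClauses as exc:
    print("query rejected:", exc)
```

The cost of the expanded query grows with the number of matching terms, which is consistent with raising the limit hurting performance, though the toy model says nothing about how much.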
