Over the past couple of weeks or so I've been looking into switching our 
search back end to use PyLucene. I've now got a working prototype which 
passes the test suite and I've been experimenting with it recently to 
check out its performance. After all that, I'm not sure which direction 
makes sense going forward, whether to make the switch or instead try to 
improve our existing back end.

The one-sentence summary is that PyLucene is more flexible and offers 
functionality that would take substantial effort for us to engineer, but 
its RAM and disk footprints are heavier than the current 
implementation's and it doesn't offer overwhelming speed improvements. If 
we went with PyLucene, I could work on making search return 
the entire action and on updating the APIs to use that ability as best 
they can. If we stay with the current approach, I would work on 
speeding up index update and laying the groundwork for the critical 
features like boolean queries and structured search (which would give us 
the ability to search against versions and, with a bit more extension, 
against incorporations).

What I'm looking for from everyone is some views on whether the 
footprints I'm seeing from PyLucene are simply too heavy. I have 
some ideas about how to reduce the footprint of PyLucene, at least by a 
small amount, but I don't expect substantial changes, especially not for 
the memory growth during search.

In detail, here's what I've found.

Reasons for switching to PyLucene:
A large variety of the queries we want already exist, including boolean 
and structured queries, which we would otherwise need to implement in 
the current engine in the near future and which are not trivial to do.

Somewhat faster searching locally (roughly 1.0 secs vs 1.4 secs).

It already correctly handles locking indexes and having readers update 
on the fly. Multiple readers can have the same index open at the same time.

Easier control of RAM/time tradeoffs.

Depot RAM usage not dependent on size of index.

It's likely to scale better in terms of speed for local search, and 
possibly for remote search as well.
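To make "boolean and structured queries" concrete, here is a toy sketch in plain Python. It is not our engine's code and not Lucene's; the `build_index` and `term` helpers, the field names, and the sample packages are all made up for illustration. The set operations are the easy part. The query parsing, ranking, and on-disk layout around them are where I'd expect the real effort to go.

```python
from collections import defaultdict

def build_index(docs):
    """Build a toy inverted index mapping (field, token) -> set of doc ids.

    `docs` is {doc_id: {field_name: text}}. Hypothetical layout, for
    illustration only.
    """
    index = defaultdict(set)
    for doc_id, fields in docs.items():
        for field, text in fields.items():
            for token in text.lower().split():
                index[(field, token)].add(doc_id)
    return index

def term(index, field, token):
    """Return the set of doc ids whose `field` contains `token`."""
    return set(index.get((field, token.lower()), set()))

# Made-up sample data, not real package metadata.
docs = {
    "apache": {"summary": "http server", "version": "2.2.9"},
    "lighttpd": {"summary": "light http server", "version": "1.4.19"},
    "python": {"summary": "scripting language", "version": "2.4.4"},
}
index = build_index(docs)

# Boolean AND, OR, and NOT are set intersection, union, and difference.
http_and_light = term(index, "summary", "http") & term(index, "summary", "light")
server_or_language = term(index, "summary", "server") | term(index, "summary", "language")
http_not_light = term(index, "summary", "http") - term(index, "summary", "light")

# A structured query restricts a term to one field, e.g. the version.
at_version = term(index, "version", "2.2.9")
```

Searching against versions then falls out of keying the index by field as well as by token, which is the kind of groundwork I mean.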




Reasons for sticking with existing approach:
Smaller indexes, at least so far. (40M vs 240M on my local system, 272M 
vs 4.2G on ipkg as reported by du)

Constant depot memory usage for all searches. Using PyLucene makes the 
depot grow when searches are done for things like p* (up to 710 size, 
650M RSS).

Faster search for things like p* (30 seconds vs 2 minutes), though on 
normal queries, times seem comparable.

More predictable behavior for queries. PyLucene preexpands wildcard 
queries and requires a max clause count to be set. Even at 100000, 
a search against ipkg for '(1.6.0_06)*' broke this limit. Raising the 
limit further had negative effects on performance from what I observed.
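For anyone unfamiliar with why a clause limit exists at all, here is a rough model of wildcard preexpansion in plain Python. It is a simplification under my own assumptions, not Lucene's implementation: `expand_prefix`, `TooManyClauses`, and the term list are invented for this sketch. In Lucene itself I believe the corresponding knob is `BooleanQuery.setMaxClauseCount`.

```python
class TooManyClauses(Exception):
    """Raised when a wildcard expands to more terms than the limit allows."""

def expand_prefix(terms, prefix, max_clauses):
    """Expand `prefix*` into the explicit list of matching index terms.

    Every match becomes one clause of a big OR query, so a short prefix
    on a large index can produce a very large query before any search
    has been run.
    """
    matches = []
    for t in terms:
        if t.startswith(prefix):
            matches.append(t)
            if len(matches) > max_clauses:
                raise TooManyClauses(
                    "%r expands to more than %d terms" % (prefix, max_clauses))
    return matches

# Made-up term dictionary standing in for the index's vocabulary.
terms = ["package", "path", "pkg", "python", "ruby"]

clauses = expand_prefix(terms, "p", max_clauses=10)

try:
    expand_prefix(terms, "p", max_clauses=3)
    hit_limit = False
except TooManyClauses:
    hit_limit = True
```

The memory and time cost scales with the number of matching terms rather than with the number of results, which fits the growth I saw for p*.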



On the subject of faster index update, I think the jury is still out. If 
PyLucene doesn't optimize the index after each install, then it's 
substantially faster than the current implementation. However, I don't 
think it would be faster than the current implementation after a fairly 
simple adjustment so that it also didn't optimize the index after each 
installation.
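A toy model of why skipping the per-install optimize matters so much, in plain Python. This is only a sketch under assumptions: `ToyIndex` and `install` are hypothetical, "optimize" is modeled as rewriting every document into a single segment, and the deferred case here still optimizes once at the end, which is my choice for the sketch rather than something either engine does today.

```python
class ToyIndex:
    """Toy index that counts how much work optimize passes do."""

    def __init__(self):
        self.segments = []        # each segment is a list of documents
        self.optimize_calls = 0
        self.docs_rewritten = 0   # total documents copied by optimize

    def add(self, doc):
        # Each addition lands in its own small segment.
        self.segments.append([doc])

    def optimize(self):
        # Merge all segments into one, rewriting every document.
        merged = [d for seg in self.segments for d in seg]
        self.optimize_calls += 1
        self.docs_rewritten += len(merged)
        self.segments = [merged] if merged else []

def install(index, packages, optimize_each):
    """Index `packages`, optimizing after each one or once at the end."""
    for pkg in packages:
        index.add(pkg)
        if optimize_each:
            index.optimize()
    if not optimize_each:
        index.optimize()

packages = ["pkg%d" % i for i in range(100)]

eager = ToyIndex()
install(eager, packages, optimize_each=True)

deferred = ToyIndex()
install(deferred, packages, optimize_each=False)
```

In this model the eager version rewrites 1 + 2 + ... + 100 documents while the deferred version rewrites each document once, so the gap grows quadratically with the number of installs regardless of which engine is underneath.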

Thanks for your time. I'm looking forward to hearing what everyone thinks.

Brock
_______________________________________________
pkg-discuss mailing list
http://mail.opensolaris.org/mailman/listinfo/pkg-discuss