I hit send before I meant to, so here's one more piece of performance data.
Indexing a repo on ipkg takes about 40 minutes with the existing methods,
with a size of 951M and an RSS of 947M. Indexing with pylucene takes roughly
2 hours, whether it has a size of 155M or a size of 1100M. (Specifically, the
155M run took 2 hours and 1 minute while the 1100M run took 1 hour and 49
minutes.)

Brock

Brock Pytlik wrote:
> Over the past couple of weeks or so I've been looking into switching our
> search back end to use PyLucene. I've now got a working prototype which
> passes the test suite and I've been experimenting with it recently to
> check out its performance. After all that, I'm not sure which direction
> makes sense going forward, whether to make the switch or instead try to
> improve our existing back end.
>
> The one sentence summary is that PyLucene is more flexible and offers
> functionality that would take substantial effort for us to engineer but
> has RAM and disk footprints that are heavier than the current
> implementations and doesn't offer overwhelming speed improvements. If
> we went with PyLucene I could work on making search so that it returns
> the entire action and updating the APIs to use that ability as best
> they could. If we stay with the current approach, then I would work on
> speeding update and laying the groundwork to handle the critical
> features like boolean queries and structured search (which would give us
> the ability to search against versions, and with a bit more extension,
> against incorporations).
>
> What I'm looking for from everyone is some views on whether the
> footprints I'm seeing from PyLucene are just too heavy or not. I have
> some ideas about how to reduce the footprint of PyLucene, at least a
> small amount, but I don't expect substantial changes, especially not for
> the memory growth during search.
>
> In detail, here's what I've found.
>
> Reasons for switching to PyLucene:
> Large variety of desired queries preexisting, including boolean and
> structured queries which would need to be implemented in the other
> engine in the near future and which are not trivial to do.
>
> Somewhat faster searching locally (1.0 secs vs 1.4 roughly).
>
> It already correctly handles locking indexes and having readers update
> on the fly. Multiple readers can have the same index open at the same time.
>
> Easier control of RAM/time tradeoffs.
>
> Depot RAM usage not dependent on size of index.
>
> It's likely to scale better in terms of speed for local search, and
> possibly for remote search as well.
>
>
> Reasons for sticking with existing approach:
> Smaller indexes, at least so far. (40M vs 240M on my local system, 272M
> vs 4.2G on ipkg as reported by du)
>
> Constant depot memory usage for all searches. Using pylucene makes the
> depot grow when searches are done for things like p* (up to 710 size,
> 650M rss).
>
> Faster search for things like p* (30 seconds vs 2 minutes), though on
> normal queries, times seem comparable.
>
> More predictable behavior for queries. PyLucene preexpands wildcard
> queries and requires a max clause count number to be set. Even at 100000
> a search against ipkg for '(1.6.0_06)*' broke this limit. Turning this
> number higher had negative effects on performance from what I observed.
>
>
> On the subject of faster index update, I think the jury is out. If
> pylucene doesn't optimize the index after each install, then it's
> substantially faster than the current implementation, but not faster
> than I think the current implementation would be after a fairly simple
> adjustment so that it also didn't optimize the index after each
> installation.
>
> Thanks for your time, I'm looking forward to hearing what everyone thinks.
>
> Brock
> _______________________________________________
> pkg-discuss mailing list
> [email protected]
> http://mail.opensolaris.org/mailman/listinfo/pkg-discuss
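
To make the wildcard point in the quoted message concrete, here is a minimal
Python sketch of what "pre-expanding" a wildcard query against a clause limit
looks like. It is not PyLucene code: the function name, the exception class,
and the sample terms are all hypothetical, and the matching uses Python's
fnmatch module rather than Lucene's own wildcard rules. It only illustrates
why a broad pattern such as p* can blow past a configured maximum.

```python
import fnmatch


class TooManyClauses(Exception):
    """Raised when a wildcard expands to more terms than the limit allows.

    Hypothetical class for this sketch, not an import from PyLucene.
    """


def expand_wildcard(pattern, terms, max_clause_count):
    """Pre-expand a wildcard pattern into the list of matching index terms.

    Each matching term would become one clause of an OR query, so the
    expansion is rejected as soon as it exceeds max_clause_count.
    """
    clauses = []
    for term in terms:
        if fnmatch.fnmatchcase(term, pattern):
            clauses.append(term)
            if len(clauses) > max_clause_count:
                raise TooManyClauses(
                    "%r expands to more than %d terms"
                    % (pattern, max_clause_count))
    return clauses


# Hypothetical index terms, for illustration only.
terms = ["pkg", "python", "perl", "bash", "zsh"]

print(expand_wildcard("p*", terms, max_clause_count=10))

try:
    expand_wildcard("p*", terms, max_clause_count=2)
except TooManyClauses as err:
    print("query rejected:", err)
```

The cost of the expansion grows with the number of matching terms in the
index rather than with the length of the query string, which is consistent
with the memory growth and slower p* searches described above.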
