I hit send before I meant to, so here's one more piece of performance data.

Indexing a repo on ipkg takes about 40 minutes with the existing methods, 
with a size of 951M and an RSS of 947M. Indexing with pylucene takes roughly 
2 hours, whether it has a size of 155M or a size of 1100M. 
(Specifically, the 155M run took 2 hours and 1 minute while the 1100M 
run took 1 hour and 49 minutes.)

Brock

Brock Pytlik wrote:
> Over the past couple of weeks or so I've been looking into switching our 
> search back end to use PyLucene. I've now got a working prototype which 
> passes the test suite and I've been experimenting with it recently to 
> check out its performance. After all that, I'm not sure which direction 
> makes sense going forward, whether to make the switch or instead try to 
> improve our existing back end.
>
> The one sentence summary is that PyLucene is more flexible and offers 
> functionality that would take substantial effort for us to engineer but 
> has RAM and disk footprints that are heavier than the current 
> implementations and doesn't offer overwhelming speed improvements. If 
> we went with PyLucene, I could work on making search return 
> the entire action and updating the APIs to use that ability as best 
> they could. If we stay with the current approach, then I would work on 
> speeding up updates and laying the groundwork to handle the critical 
> features like boolean queries and structured search (which would give us 
> the ability to search against versions, and with a bit more extension, 
> against incorporations).
>
> What I'm looking for from everyone is some views on whether the 
> footprints I'm seeing from PyLucene are just too heavy or not. I have 
> some ideas about how to reduce the footprint of PyLucene, at least a 
> small amount, but I don't expect substantial changes, especially not for 
> the memory growth during search.
>
> In detail, here's what I've found.
>
> Reasons for switching to PyLucene:
> A large variety of the queries we want already exist, including boolean 
> and structured queries, which would need to be implemented in the other 
> engine in the near future and which are not trivial to do.
>
> Somewhat faster searching locally (1.0 secs vs 1.4 roughly).
>
> It already correctly handles locking indexes and having readers update 
> on the fly. Multiple readers can have the same index open at the same time.
>
> Easier control of RAM/time tradeoffs.
>
> Depot RAM usage not dependent on size of index.
>
> It's likely to scale better in terms of speed for local search, and 
> possibly for remote search as well.
>
>
>
>
> Reasons for sticking with existing approach:
> Smaller indexes, at least so far. (40M vs 240M on my local system, 272M 
> vs 4.2G on ipkg as reported by du)
>
> Constant depot memory usage for all searches. Using pylucene makes the 
> depot grow when searches are done for things like p* (up to 710M size, 
> 650M rss).
>
> Faster search for things like p* (30 seconds vs 2 minutes), though on 
> normal queries, times seem comparable.
>
> More predictable behavior for queries. PyLucene preexpands wildcard 
> queries and requires a max clause count number to be set. Even at 100000 
> a search against ipkg for '(1.6.0_06)*' broke this limit. Turning this 
> number higher had negative effects on performance from what I observed.
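
To make the pre-expansion point concrete, here's a toy sketch of what I understand to be going on. It's plain Python, not PyLucene code, and every name in it (expand_prefix, TooManyClauses, max_clauses, the sample terms) is made up for illustration. The assumption is that a wildcard query gets rewritten into one clause per matching term in the index, and the query is rejected once the clause count passes the configured maximum.

```python
# Toy model only: this is NOT PyLucene code. All names are hypothetical.

class TooManyClauses(Exception):
    """Raised when a wildcard expands to more terms than the limit allows."""


def expand_prefix(terms, prefix, max_clauses):
    """Expand a prefix query such as 'p*' into one clause per matching term."""
    clauses = []
    for term in sorted(terms):
        if term.startswith(prefix):
            # Refuse the query as soon as it would exceed the clause limit.
            if len(clauses) >= max_clauses:
                raise TooManyClauses(
                    "prefix %r needs more than %d clauses" % (prefix, max_clauses))
            clauses.append(term)
    return clauses


index_terms = {"pkg", "perl", "python", "postfix", "zsh"}

# A narrow prefix matches few terms and stays under the limit.
print(expand_prefix(index_terms, "py", max_clauses=2))

# A broad prefix matches four terms here, which exceeds a limit of two.
try:
    expand_prefix(index_terms, "p", max_clauses=2)
except TooManyClauses as err:
    print("query rejected:", err)
```

In this model the cost of a wildcard query depends on how many index terms it matches rather than on the query text itself, which would fit with a short query like p* being the expensive case.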
>
>
>
> On the subject of faster index update, I think the jury is out. If 
> pylucene doesn't optimize the index after each install, then it's 
> substantially faster than the current implementation, but I don't think 
> it's faster than the current implementation would be after a fairly simple 
> adjustment so that it also didn't optimize the index after each installation.
>
> Thanks for your time, I'm looking forward to hearing what everyone thinks.
>
> Brock
> _______________________________________________
> pkg-discuss mailing list
> [email protected]
> http://mail.opensolaris.org/mailman/listinfo/pkg-discuss
>   

