On Tue, Dec 16, 2008 at 04:30:22PM -0800, Brock Pytlik wrote:
> Over the past couple of weeks or so I've been looking into switching our
> search back end to use PyLucene. I've now got a working prototype which
> passes the test suite and I've been experimenting with it recently to
> check out its performance. After all that, I'm not sure which direction
> makes sense going forward, whether to make the switch or instead try to
> improve our existing back end.
>
> The one sentence summary is that PyLucene is more flexible and offers
> functionality that would take substantial effort for us to engineer but
> has RAM and disk footprints that are heavier than the current
> implementation's and doesn't offer overwhelming speed improvements. If
> we went with PyLucene I could work on making search so that it returns
> the entire action and updating the APIs to use that ability as best
> they could. If we stay with the current approach, then I would work on
> speeding update and laying the groundwork to handle the critical
> features like boolean queries and structured search (which would give us
> the ability to search against versions, and with a bit more extension,
> against incorporations).
>
> What I'm looking for from everyone is some views on whether the
> footprints I'm seeing from PyLucene are just too heavy or not. I have
> some ideas about how to reduce the footprint of PyLucene, at least a
> small amount, but I don't expect substantial changes, especially not for
> the memory growth during search.
>
> In detail, here's what I've found.
>
> Reasons for switching to PyLucene:
> Large variety of desired queries preexisting, including boolean and
> structured queries which would need to be implemented in the other
> engine in the near future and which are not trivial to do.
>
> Somewhat faster searching locally (1.0 secs vs 1.4 roughly).
>
> It already correctly handles locking indexes and having readers update
> on the fly. Multiple readers can have the same index open at the same time.
>
> Easier control of RAM/time tradeoffs.
>
> Depot RAM usage not dependent on size of index.
>
> It's likely to scale better in terms of speed for local search, and
> possibly for remote search as well.

I think the index size/etc would change quite a bit if we looked at
searching (for some cases) a little bit differently. Lucene is really
good with free text search on documents, and it sounds like we may be
giving it lots of small documents (actions?) which it adds a lot of
metadata to. If we thought of a document as "(fmri, short description,
long description)" and indexed something like that, then the index size
(and mem usage) should shrink considerably. Indexing time should also
improve quite a bit :)

The current search is very useful for finding binaries or include files
(the stuff in the packages), so I don't want to see that change. But it
would also be quite useful to have a way to search only descriptions and
get a list of packages matching the keywords specified.

An example use case is something like "pkg search irc", which I want to
show me all packages that mention irc in their description. I want
descriptions containing irc and a list of the corresponding packages
"irssi, bitchx, ircII, xchat, pidgin, ircd, ...". Then if we were
looking for specific files we could do "pkg search -f bin/irc".

Maybe some combination of pylucene (or some other python search engine)
for text search on the "documents" mentioned above, and the current
search for files/actions, would work well.

Alternatively maybe we can group actions for an fmri and index the group
as a document with pylucene, and then filter on the way out. Might
shrink index size but could be too expensive.

Just my $0.02, YMMV.

Thanks,
Brad
_______________________________________________
pkg-discuss mailing list
http://mail.opensolaris.org/mailman/listinfo/pkg-discuss
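
A minimal sketch of the package-level "document" idea above, for illustration only. It indexes one document per package, built from (fmri, short description, long description), so a description search returns matching packages rather than individual actions. It uses a plain in-memory inverted index as a stand-in and is not PyLucene or the existing pkg search code. The sample FMRIs and descriptions, and the names tokenize, build_index and search, are all hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical sample data: (fmri, short description, long description).
PACKAGES = [
    ("pkg:/network/chat/irssi", "irssi IRC client",
     "A terminal based IRC client."),
    ("pkg:/network/chat/xchat", "XChat",
     "Graphical IRC client for X."),
    ("pkg:/editor/vim", "vim",
     "Vi IMproved text editor."),
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(packages):
    """Map each description token to the set of FMRIs that mention it.

    One document per package, so the index grows with the number of
    packages and description words, not with the number of actions.
    """
    index = defaultdict(set)
    for fmri, short_desc, long_desc in packages:
        for token in tokenize(short_desc + " " + long_desc):
            index[token].add(fmri)
    return index


def search(index, query):
    """Return the sorted FMRIs whose descriptions contain every query token."""
    tokens = tokenize(query)
    if not tokens:
        return []
    matches = set.intersection(*(index.get(t, set()) for t in tokens))
    return sorted(matches)


if __name__ == "__main__":
    idx = build_index(PACKAGES)
    for fmri in search(idx, "irc"):
        print(fmri)
```

With this split, a description query such as "irc" would go to the package-level index, while a file query like "pkg search -f bin/irc" would still go to the existing action-level search.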
