Brad Hall wrote:
> On Tue, Dec 16, 2008 at 04:30:22PM -0800, Brock Pytlik wrote:
>
>> Over the past couple of weeks or so I've been looking into switching our
>> search back end to use PyLucene. I've now got a working prototype which
>> passes the test suite and I've been experimenting with it recently to
>> check out its performance. After all that, I'm not sure which direction
>> makes sense going forward, whether to make the switch or instead try to
>> improve our existing back end.
>>
>> The one sentence summary is that PyLucene is more flexible and offers
>> functionality that would take substantial effort for us to engineer but
>> has RAM and disk footprints that are heavier than the current
>> implementations and doesn't offer overwhelming speed improvements. If
>> we went with PyLucene I could work on making search so that it returns
>> the entire action and updating the APIs to use that ability as best
>> they could. If we stay with the current approach, then I would work on
>> speeding update and laying the groundwork to handle the critical
>> features like boolean queries and structured search (which would give us
>> the ability to search against versions, and with a bit more extension,
>> against incorporations).
>>
>> What I'm looking for from everyone is some views on whether the
>> footprints I'm seeing from PyLucene are just too heavy or not. I have
>> some ideas about how to reduce the footprint of PyLucene, at least a
>> small amount, but I don't expect substantial changes, especially not for
>> the memory growth during search.
>>
>> In detail, here's what I've found.
>>
>> Reasons for switching to PyLucene:
>> Large variety of desired queries preexisting, including boolean and
>> structured queries which would need to be implemented in the other
>> engine in the near future and which are not trivial to do.
>>
>> Somewhat faster searching locally (1.0 secs vs 1.4 roughly).
>>
>> It already correctly handles locking indexes and having readers update
>> on the fly. Multiple readers can have the same index open at the same time.
>>
>> Easier control of RAM/time tradeoffs.
>>
>> Depot RAM usage not dependent on size of index.
>>
>> It's likely to scale better in terms of speed for local search, and
>> possibly for remote search as well.
>>
>
> I think the index size/etc would change quite a bit if we looked at
> searching (for some cases) a little bit differently. Lucene is really
> good with free text search on documents, and it sounds like we may be
> giving it lots of small documents (actions?) which it adds a lot of
> metadata to.
>
> If we thought of a document as "(fmri, short description, long
> description)" and indexed something like that, then the index size (and
> mem usage) should shrink considerably. Indexing time should also
> improve quite a bit :)
>
> The current search is very useful for finding binaries or include files
> (the stuff in the packages), so I don't want to see that change. But it
> would also be quite useful to have a way to search only descriptions and
> get a list of packages matching the keywords specified.
>
> An example use case is something like "pkg search irc", which I want to
> show me all packages that mention irc in their description. I want
> descriptions containing irc and a list of the corresponding packages
> "irssi, bitchx, ircII, xchat, pidgin, ircd, ...". Then if we were
> looking for specific files we could do "pkg search -f bin/irc".
>
> Maybe some combination of pylucene (or some other python search engine)
> for text search on the "documents" mentioned above, and the current
> search for files/actions, would work well. Alternatively maybe we can
> group actions for an fmri and index the group as a document with
> pylucene, and then filter on the way out. Might shrink index size but
> could be too expensive.
>
> Just my $0.02, YMMV.
>

That's a good point.
Actually, right now the indexing is done very naively (because I wanted to change as little code as possible to get the prototype off the ground), so I think it's possible to improve performance somewhat. I'll give it some thought, but if requiring a JRE is as much of a non-starter as it seems to be, then I'm not sure it's worth pursuing beyond an initial pass right now. Down the line, I could definitely see what you're proposing as a search plugin that someone could install as an optional package. Or perhaps it would make sense to run it on the server, where minimization concerns are less important (I think?).
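
For illustration only, here is a rough sketch of the "one document per package" idea discussed above, where each document is (fmri, short description, long description) and only the description text is indexed. This is not PyLucene and not code from the prototype; it is a toy in-memory inverted index in plain Python, and the package names, descriptions, and function names are all made up for the example.

```python
import re
from collections import defaultdict

# Hypothetical sample data: one "document" per package, in the form
# (fmri, short description, long description). Not real package metadata.
PACKAGES = [
    ("pkg:/irssi", "terminal IRC client", "A modular text-mode IRC client."),
    ("pkg:/xchat", "graphical IRC client", "A GTK+ chat program for IRC networks."),
    ("pkg:/vim", "text editor", "Vi IMproved, a programmer's text editor."),
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(packages):
    """Map each description token to the set of fmris that mention it."""
    index = defaultdict(set)
    for fmri, short_desc, long_desc in packages:
        for token in tokenize(short_desc + " " + long_desc):
            index[token].add(fmri)
    return index


def search(index, query):
    """Return the sorted fmris whose descriptions contain every query token."""
    tokens = tokenize(query)
    if not tokens:
        return []
    # index.get avoids creating empty entries for unknown tokens.
    hits = set.intersection(*(index.get(t, set()) for t in tokens))
    return sorted(hits)


if __name__ == "__main__":
    idx = build_index(PACKAGES)
    print(search(idx, "irc"))
```

The index here holds one posting per (token, package) pair rather than one per action, which is the reason the per-package approach would be expected to be smaller; file and action lookups would still go through the existing search.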
Brock

> Thanks,
> Brad
