On Tue, Dec 16, 2008 at 04:30:22PM -0800, Brock Pytlik wrote:
> Over the past couple of weeks or so I've been looking into switching our 
> search back end to use PyLucene. I've now got a working prototype which 
> passes the test suite and I've been experimenting with it recently to 
> check out its performance. After all that, I'm not sure which direction 
> makes sense going forward, whether to make the switch or instead try to 
> improve our existing back end.
> 
> The one sentence summary is that PyLucene is more flexible and offers 
> functionality that would take substantial effort for us to engineer but 
> has RAM and disk footprints that are heavier than the current 
> implementations and doesn't offer overwhelming speed improvements. If 
> we went with PyLucene I could work on making search so that it returns 
> the entire action and updating the APIs to use that ability as best 
> they could. If we stay with the current approach, then I would work on 
> speeding update and laying the groundwork to handle the critical 
> features like boolean queries and structured search (which would give us 
> the ability to search against versions, and with a bit more extension, 
> against incorporations).
> 
> What I'm looking for from everyone is some views on whether the 
> footprints I'm seeing from PyLucene are just too heavy or not. I have 
> some ideas about how to reduce the footprint of PyLucene, at least a 
> small amount, but I don't expect substantial changes, especially not for 
> the memory growth during search.
> 
> In detail, here's what I've found.
> 
> Reasons for switching to PyLucene:
> Large variety of desired queries preexisting, including boolean and 
> structured queries which would need to be implemented in the other 
> engine in the near future and which are not trivial to do.
> 
> Somewhat faster searching locally (1.0 secs vs 1.4 roughly).
> 
> It already correctly handles locking indexes and having readers update 
> on the fly. Multiple readers can have the same index open at the same time.
> 
> Easier control of RAM/time tradeoffs.
> 
> Depot RAM usage not dependent on size of index.
> 
> It's likely to scale better in terms of speed for local search, and 
> possibly for remote search as well.

I think the index size, etc. would change quite a bit if we looked at
searching (for some cases) a little bit differently.  Lucene is really
good with free text search on documents, and it sounds like we may be
giving it lots of small documents (actions?) which it adds a lot of
metadata to.

If we thought of a document as "(fmri, short description, long
description)" and indexed something like that, then the index size (and
mem usage) should shrink considerably.  Indexing time should also
improve quite a bit :)
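
A minimal sketch of the one-document-per-package idea, in plain Python
rather than PyLucene.  The FMRIs and descriptions are made-up sample
data, and the tiny inverted index only stands in for whatever engine
would really be used; the point is that the number of documents equals
the number of packages, not the number of actions.

```python
import re
from collections import defaultdict

# Hypothetical sample data: one (fmri, short description, long description)
# tuple per package. None of these records come from a real repository.
PACKAGES = [
    ("pkg:/network/chat/irssi", "irssi IRC client",
     "A terminal based IRC client."),
    ("pkg:/network/chat/pidgin", "Pidgin messenger",
     "Multi-protocol chat client with IRC support."),
    ("pkg:/editor/vim", "Vim editor",
     "Vi IMproved text editor."),
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(packages):
    """Map each token to the set of FMRIs whose descriptions contain it."""
    index = defaultdict(set)
    for fmri, short_desc, long_desc in packages:
        for token in tokenize(short_desc + " " + long_desc):
            index[token].add(fmri)
    return index


def search(index, keyword):
    """Return the sorted FMRIs whose descriptions mention the keyword."""
    return sorted(index.get(keyword.lower(), ()))


if __name__ == "__main__":
    print(search(build_index(PACKAGES), "irc"))
```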

The current search is very useful for finding binaries or include files
(the stuff in the packages), so I don't want to see that change.  But it
would also be quite useful to have a way to search only descriptions and
get a list of packages matching the keywords specified.

An example use case is something like "pkg search irc", which I want to
show me all packages that mention irc in their description.  I want
descriptions containing irc and a list of the corresponding packages
"irssi, bitchx, ircII, xchat, pidgin, ircd, ...".  Then if we were
looking for specific files we could do "pkg search -f bin/irc".
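
Roughly, the two invocations could be told apart like this.  The "-f"
flag is only the proposal from this mail, not an existing pkg option,
and parse_search_args is a hypothetical helper name used for
illustration.

```python
import argparse


def parse_search_args(argv):
    """Return (mode, query) for a hypothetical "pkg search" command line.

    Without -f the query is matched against package descriptions; with -f
    it is matched against the files/actions delivered by packages.
    """
    parser = argparse.ArgumentParser(prog="pkg search")
    parser.add_argument("-f", dest="files", action="store_true",
                        help="search files/actions instead of descriptions")
    parser.add_argument("query")
    args = parser.parse_args(argv)
    mode = "files" if args.files else "descriptions"
    return mode, args.query


if __name__ == "__main__":
    print(parse_search_args(["irc"]))
    print(parse_search_args(["-f", "bin/irc"]))
```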

Maybe some combination of PyLucene (or some other Python search engine)
for text search on the "documents" mentioned above, and the current
search for files/actions, would work well.  Alternatively, maybe we can
group the actions for an fmri, index the group as a single document with
PyLucene, and then filter on the way out.  That might shrink the index
size but could be too expensive.
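
Here is a rough sketch of the "group, then filter on the way out" idea.
The actions are invented sample data, and a plain substring test stands
in for the real index lookup.  The second pass over each matching
package's actions is the extra cost I'm worried about.

```python
from collections import defaultdict

# Hypothetical (fmri, action) pairs; not taken from a real repository.
ACTIONS = [
    ("pkg:/network/chat/irssi", "file path=usr/bin/irssi"),
    ("pkg:/network/chat/irssi", "file path=usr/share/man/man1/irssi.1"),
    ("pkg:/network/chat/ircii", "file path=usr/bin/irc"),
    ("pkg:/editor/vim", "file path=usr/bin/vim"),
]


def group_by_fmri(actions):
    """Collect every action of a package under its FMRI."""
    groups = defaultdict(list)
    for fmri, action in actions:
        groups[fmri].append(action)
    return groups


def search_actions(groups, term):
    """Match per package first, then narrow down to individual actions."""
    hits = []
    for fmri, actions in groups.items():
        # Coarse step: the whole group is treated as one document.
        document = "\n".join(actions)
        if term not in document:
            continue
        # Filter on the way out: keep only the actions that really match.
        hits.extend((fmri, action) for action in actions if term in action)
    return hits


if __name__ == "__main__":
    print(search_actions(group_by_fmri(ACTIONS), "bin/irc"))
```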

Just my $0.02, YMMV.

Thanks,
Brad
_______________________________________________
pkg-discuss mailing list
[email protected]
http://mail.opensolaris.org/mailman/listinfo/pkg-discuss
