Brock,
PyLucene is a front end for Lucene, which is written in Java. So would
using this mean that a JRE would be required to run pkg(5)?
Tom
Brock Pytlik wrote:
I hit send before I meant to, so here's one more piece of performance data.
Indexing a repo on ipkg takes about 40 minutes with the existing methods
and has a size of 951M and an RSS of 947M. Indexing with PyLucene takes
roughly 2 hours, whether it has a size of 155M or a size of 1100M.
(Specifically, the 155M run took 2 hours and 1 minute while the 1100M
run took 1 hour and 49 minutes.)
Brock
Brock Pytlik wrote:
Over the past couple of weeks or so I've been looking into switching our
search back end to use PyLucene. I've now got a working prototype which
passes the test suite and I've been experimenting with it recently to
check out its performance. After all that, I'm not sure which direction
makes sense going forward, whether to make the switch or instead try to
improve our existing back end.
The one-sentence summary is that PyLucene is more flexible and offers
functionality that would take substantial effort for us to engineer, but
has RAM and disk footprints that are heavier than the current
implementation's and doesn't offer overwhelming speed improvements. If
we went with PyLucene, I could work on making search return the entire
action and on updating the APIs to use that ability as best they could.
If we stay with the current approach, then I would work on speeding up
index update and laying the groundwork to handle the critical features
like boolean queries and structured search (which would give us the
ability to search against versions, and with a bit more extension,
against incorporations).
What I'm looking for from everyone is some views on whether the
footprints I'm seeing from PyLucene are just too heavy or not. I have
some ideas about how to reduce the footprint of PyLucene, at least a
small amount, but I don't expect substantial changes, especially not for
the memory growth during search.
In detail, here's what I've found.
Reasons for switching to PyLucene:
A large variety of the queries we want already exist, including boolean
and structured queries, which would need to be implemented in the other
engine in the near future and which are not trivial to do.
Somewhat faster searching locally (roughly 1.0 secs vs 1.4 secs).
It already correctly handles locking indexes and having readers update
on the fly. Multiple readers can have the same index open at the same time.
Easier control of RAM/time tradeoffs.
Depot RAM usage not dependent on size of index.
It's likely to scale better in terms of speed for local search, and
possibly for remote search as well.
Reasons for sticking with existing approach:
Smaller indexes, at least so far. (40M vs 240M on my local system, 272M
vs 4.2G on ipkg as reported by du)
Constant depot memory usage for all searches. Using PyLucene makes the
depot grow when searches are done for things like p* (up to 710M size,
650M RSS).
Faster search for things like p* (30 seconds vs 2 minutes), though on
normal queries, times seem comparable.
More predictable behavior for queries. PyLucene preexpands wildcard
queries and requires a max clause count to be set. Even at 100000, a
search against ipkg for '(1.6.0_06)*' broke this limit. Raising this
number further had negative effects on performance, from what I observed.
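
To make the clause-count issue concrete, here is a minimal pure-Python
sketch of wildcard pre-expansion. It is not PyLucene code. It assumes
that a wildcard query is rewritten into one clause per matching indexed
term and is rejected once the clause count passes a configured limit.
The `expand_wildcard` helper, the `ClauseLimitExceeded` exception, and
the term list are all hypothetical names made up for illustration.

```python
import fnmatch


class ClauseLimitExceeded(Exception):
    """Hypothetical error: the expanded query has too many clauses."""


def expand_wildcard(pattern, indexed_terms, max_clauses):
    """Expand a wildcard pattern into one clause per matching term.

    Raises ClauseLimitExceeded as soon as more than max_clauses
    terms match, mimicking a hard cap on query size.
    """
    clauses = []
    for term in indexed_terms:
        if fnmatch.fnmatchcase(term, pattern):
            clauses.append(term)
            if len(clauses) > max_clauses:
                raise ClauseLimitExceeded(
                    "%r expands to more than %d clauses"
                    % (pattern, max_clauses))
    return clauses


# Made-up terms standing in for the tokens of a real index.
terms = ["package", "pkg", "python", "perl", "sunos", "zfs"]

# A generous limit lets the expansion through.
print(expand_wildcard("p*", terms, max_clauses=10))

# A tight limit rejects the same query outright.
try:
    expand_wildcard("p*", terms, max_clauses=3)
except ClauseLimitExceeded as err:
    print("query rejected:", err)
```

Under this assumption the cost of a wildcard query depends on how many
indexed terms match the pattern, not on how many results come back,
which is why a short pattern against a large index can hit the limit.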
On the subject of faster index update, I think the jury is out. If
PyLucene doesn't optimize the index after each install, then it's
substantially faster than the current implementation, but I don't think
it would be faster than the current implementation after a fairly simple
adjustment so that it, too, didn't optimize the index after each
installation.
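
As a rough illustration of the tradeoff in skipping the optimize step,
here is a toy sketch. It assumes an index stored as independent segments
that an optimize pass merges into one. It does not claim to reflect how
either the current back end or PyLucene is actually implemented. The
`SegmentedIndex` class and the package names are hypothetical.

```python
class SegmentedIndex:
    """Toy inverted index: a list of segments, each mapping
    token -> set of package names."""

    def __init__(self):
        self.segments = []

    def add_package(self, name, tokens, optimize=False):
        # Each install writes one small new segment, which is cheap.
        self.segments.append({tok: {name} for tok in tokens})
        if optimize:
            self.optimize()

    def optimize(self):
        # Merge every segment into one; the cost grows with the
        # total size of the index, not the size of the update.
        merged = {}
        for seg in self.segments:
            for tok, names in seg.items():
                merged.setdefault(tok, set()).update(names)
        self.segments = [merged]

    def search(self, token):
        # Every segment must be consulted, so an unoptimized index
        # trades slower searches for faster updates.
        found = set()
        for seg in self.segments:
            found |= seg.get(token, set())
        return found


idx = SegmentedIndex()
idx.add_package("example-zfs", ["zfs", "kernel"])
idx.add_package("example-python", ["python", "lang"])
idx.add_package("example-ipkg", ["pkg", "python"])
print(len(idx.segments))             # one segment per install so far
idx.optimize()                       # deferred merge, done once
print(len(idx.segments))
print(sorted(idx.search("python")))
```

In this model, optimizing after every install repeats the full merge
each time, while deferring it pays that cost once at the price of
searching more segments in the meantime.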
Thanks for your time, I'm looking forward to hearing what everyone thinks.
Brock
_______________________________________________
pkg-discuss mailing list
[email protected]
http://mail.opensolaris.org/mailman/listinfo/pkg-discuss