On Jun 1, 2006, at 6:15 AM, David Balmain wrote: > >>> This proved a lot more difficult so I decided to >>> take a different route. Marvin Humphrey (author of KinoSearch, a >>> perl >>> port of lucene) and I are about to start a new project at Apache >>> called Lucy (http://wiki.apache.org/jakarta-lucene/LucyProposal) >>> which >>> will aim to create a C port of Lucene that can be used as a >>> backend in >>> all dynamic languages. This time around, portability will be a much >>> higher priority. >> I'm sure you've considered this, but what does that add compared to a >> GCJ+SWIG approach, as with PyLucene? Without having looked at it, is >> there anything which prevents that method from being applied to Ruby? > > It can be done but it's still a lot of work and I just didn't feel up > to the task. Plus we get better performance this way with a much > smaller download.
Java Lucene is built on the assumption, quite reasonable for Java as a compiled language[1], that method calls are cheap and object creation and destruction are cheap. The fact that they are much more expensive in an interpreted language is the main reason the pure-Perl port of Lucene, Plucene, runs so slowly (<http://www.rectangular.com/ kinosearch/benchmarks.html>). Lack of access to primitive data types such as int is another reason, but it's actually not that great a factor compared to the OO overhead (I did extensive hacking on Plucene before deciding I had no choice but to start from scratch, and rewriting the IO classes in C didn't help as much as anyone expected). Presumably similar factors are at work slowing down the pure-Ruby Ferret. The OO overhead problems are mitigated by going the GCJ route, but not eliminated. Say you want to subclass Analyzer -- which most significant deployments of Lucene will want to do eventually. The way a TokenStream works in Lucene, several method calls are required for each and every token -- one for each Analyzer the token passes through. That gets extremely expensive in an interpreted language. Furthermore, none of Perl's native string manipulation tools work with UTF-16 strings. So if you wanted to, say, insert a custom Perl TokenFilter into a Lucene Analysis chain, you'd have to translate between UTF-8 and UTF-16 each time you cross the Perl/Java boundary, making the TokenStream concept a double disaster. An alternate way of processing Tokens is to have each link in the Analyzer chain accept a "TokenBatch" instead of a TokenStream: an array of Tokens, rather than a stream of Tokens. That way, each Analyzer can iterate over all the Tokens in a tight loop, either natively or in C. The downside of this technique is that it's not possible to feed it directly from a filehandle/Reader, but that's small potatoes. It would be possible to graft the TokenBatch concept onto a GCJ'd Lucene: create a native full analysis chain which spits out a TokenBatch, then have the TokenBatch pretend it's a TokenStream, feeding Tokens to Lucene using a C version of next(). That would perform OK -- but you couldn't ever mix and match Java Lucene Analyzers with native Analyzers, only prepend the native onto the front. Therefore, you'd have to rewrite the entire org.apache.lucene.analysis package anyway -- it's the only way you're going to get both full flexibility and performance. And once you've started down the path of rewriting large portions of Lucene, it's hard to see why you'd put up with the headache of the GCJ approach. There are many other areas where Lucene's architecture is poorly suited for use with an interpreted language. Dave has solved those problems mainly by rewriting the whole thing in C. KinoSearch has taken that approach in some cases, but more often than Ferret, it uses modified algorithms instead. TokenBatch is one example; the best one, which is harder to explain here, is how KinoSearch merges together inverted documents during indexing. (In summary, it's faster, simpler, and requires far, far fewer objects.) It would be possible to port some of these algorithm changes to Lucene, but they would be pretty disruptive. Lucene's a mature, heavily-used library and changing anything at all requires a lot of consideration. Some of the changes I would like to see, I don't think I could lobby for in good conscience. The bytecounts-as-string- headers patch is a good example. For Ferret and KinoSearch it's adoption would yield a very significant benefit, as it would open the door to using Luke to browse indexes. For Java Lucene, though, it can only be justified by further changes which build upon it. The downside of the full-port approach that Dave and I have taken is that it's a lot of work to build and maintain. However, we've already done the vast majority of the up-front work once. Re-doing it for Lucy will be a cakewalk in comparison. The maintenance problem that KinoSearch and Ferret currently face, we're addressing by sharing the C core. We would not be surprised if others join us -- I know of at least one other person who rewrote Lucene in C: Robert Kirchgessner, who did a partial PHP/C port. Heck, it will presumably be easier to maintain a Python port against Lucy than against GCJ'd Lucene, provided that we achieve what we've set out to achieve. The only question remaining, I think, is whether the project will actually be hosted at Apache. When Dave and I approached Doug Cutting about it, he specifically requested that development take place there -- before Dave or I had had a chance to indicate that that was our preference as well. However, we've been waiting for approval by the Lucene PMC for a couple weeks now, and I'm not sure its coming. I'm guessing that Erik "One Lucene To Rule Them All" Hatcher hasn't cast his +1. ;) IMO, it would be best for everybody if we did this within the Lucene family, but we'll just have to see. Marvin Humphrey Rectangular Research http://www.rectangular.com/ [1] What constitutes a compiled vs. a dynamic language is debatable -- see <http://en.wikipedia.org/wiki/Interpreted_language>. It might be more accurate to describe Java as a "more compiled" language. _______________________________________________ Ferret-talk mailing list [email protected] http://rubyforge.org/mailman/listinfo/ferret-talk

