Re: [Ferret-talk] Windows progress

Marvin Humphrey Thu, 01 Jun 2006 11:01:03 -0700

On Jun 1, 2006, at 6:15 AM, David Balmain wrote:
>
>>> This proved a lot more difficult so I decided to
>>> take a different route. Marvin Humphrey (author of KinoSearch, a perl
>>> port of lucene) and I are about to start a new project at Apache
>>> called Lucy (http://wiki.apache.org/jakarta-lucene/LucyProposal) which
>>> will aim to create a C port of Lucene that can be used as a backend in
>>> all dynamic languages. This time around, portability will be a much
>>> higher priority.
>> I'm sure you've considered this, but what does that add compared to a
>> GCJ+SWIG approach, as with PyLucene?  Without having looked at it, is
>> there anything which prevents that method from being applied to Ruby?
>
> It can be done but it's still a lot of work and I just didn't feel up
> to the task. Plus we get better performance this way with a much
> smaller download.


Java Lucene is built on the assumption, quite reasonable for Java as  
a compiled language[1], that method calls are cheap and object  
creation and destruction are cheap.  The fact that they are much more  
expensive in an interpreted language is the main reason the pure-Perl
port of Lucene, Plucene, runs so slowly
(<http://www.rectangular.com/kinosearch/benchmarks.html>).  Lack of
access to primitive data types such as int is another reason, but it's
actually not that great a factor compared to the OO overhead (I did
extensive hacking on Plucene before deciding I had no choice but to
start from scratch, and rewriting the IO classes in C didn't help as
much as anyone expected).  Presumably similar factors are at work
slowing down the pure-Ruby Ferret.

The OO overhead problems are mitigated by going the GCJ route, but  
not eliminated.  Say you want to subclass Analyzer -- which most  
significant deployments of Lucene will want to do eventually.  The  
way a TokenStream works in Lucene, several method calls are required  
for each and every token -- one for each Analyzer the token passes  
through.  That gets extremely expensive in an interpreted language.  
Furthermore, none of Perl's native string manipulation tools work  
with UTF-16 strings.  So if you wanted to, say, insert a custom Perl  
TokenFilter into a Lucene Analysis chain, you'd have to translate  
between UTF-8 and UTF-16 each time you cross the Perl/Java boundary,  
making the TokenStream concept a double disaster.
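
To make the per-token cost concrete, here is a rough sketch of a
TokenStream-style chain in Python (standing in for any interpreted
host language).  The class names echo Lucene's, but the code is purely
illustrative: tokens are plain strings, and the "calls" counter is
something I've added just to show how many method calls the chain
makes.

```python
class WhitespaceTokenizer:
    """Head of the chain: hands out one whitespace-delimited token per call."""

    def __init__(self, text, counter):
        self._words = iter(text.split())
        self._counter = counter

    def next(self):
        self._counter["calls"] += 1
        return next(self._words, None)  # None marks end of stream


class LowerCaseFilter:
    """Pulls one token from its source and lowercases it."""

    def __init__(self, source, counter):
        self._source = source
        self._counter = counter

    def next(self):
        self._counter["calls"] += 1
        token = self._source.next()
        return None if token is None else token.lower()


class StopFilter:
    """Pulls tokens from its source until it finds one that is not a stop word."""

    def __init__(self, source, stop_words, counter):
        self._source = source
        self._stop_words = stop_words
        self._counter = counter

    def next(self):
        self._counter["calls"] += 1
        while True:
            token = self._source.next()
            if token is None or token not in self._stop_words:
                return token


def analyze_stream(text, stop_words=frozenset({"the"})):
    """Drain the chain one token at a time; return (tokens, method calls made)."""
    counter = {"calls": 0}
    stream = StopFilter(
        LowerCaseFilter(WhitespaceTokenizer(text, counter), counter),
        stop_words,
        counter,
    )
    tokens = []
    while True:
        token = stream.next()
        if token is None:
            break
        tokens.append(token)
    return tokens, counter["calls"]
```

Every token costs at least one next() call per link in the chain, so
the call count grows with (number of tokens) x (number of links).
That is cheap in Java and expensive when each call is dispatched by an
interpreter.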

An alternate way of processing Tokens is to have each link in the  
Analyzer chain accept a "TokenBatch" instead of a TokenStream: an  
array of Tokens, rather than a stream of Tokens.  That way, each  
Analyzer can iterate over all the Tokens in a tight loop, either  
natively or in C.  The downside of this technique is that it's not  
possible to feed it directly from a filehandle/Reader, but that's  
small potatoes.
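
A minimal sketch of the batch idea, again in Python and again only
illustrative (this is not KinoSearch's actual API; TokenBatch here is
a bare wrapper around a list, and the function names are made up):

```python
class TokenBatch:
    """Holds every token for one field at once, instead of streaming them."""

    def __init__(self, texts):
        self.texts = list(texts)


def tokenize(text):
    """First link: split the whole input up front into a batch."""
    return TokenBatch(text.split())


def lowercase(batch):
    """One call processes the entire batch in a tight loop."""
    batch.texts = [t.lower() for t in batch.texts]
    return batch


def remove_stop_words(batch, stop_words):
    """Again one call per batch, not one call per token."""
    batch.texts = [t for t in batch.texts if t not in stop_words]
    return batch


def analyze_batch(text, stop_words=frozenset({"the"})):
    """Run the chain: each link is invoked exactly once per input."""
    batch = tokenize(text)
    batch = lowercase(batch)
    batch = remove_stop_words(batch, stop_words)
    return batch.texts
```

The number of method calls now depends only on the number of links in
the chain, not on the number of tokens, and any link's inner loop can
be pushed down into C without changing the interface.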

It would be possible to graft the TokenBatch concept onto a GCJ'd  
Lucene: create a native full analysis chain which spits out a  
TokenBatch, then have the TokenBatch pretend it's a TokenStream,  
feeding Tokens to Lucene using a C version of next().  That would  
perform OK -- but you couldn't ever mix and match Java Lucene  
Analyzers with native Analyzers, only prepend the native onto the  
front.  Therefore, you'd have to rewrite the entire  
org.apache.lucene.analysis package anyway -- it's the only way you're  
going to get both full flexibility and performance.  And once you've  
started down the path of rewriting large portions of Lucene, it's  
hard to see why you'd put up with the headache of the GCJ approach.
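
The grafting trick described above would look roughly like this.  It
is a Python sketch under the same assumptions as before: TokenBatch is
a bare list wrapper, BatchAsStream and consume() are hypothetical
names, and consume() stands in for whatever downstream code only
knows how to call next().

```python
class TokenBatch:
    """A fully analyzed batch of tokens produced by a native chain."""

    def __init__(self, texts):
        self.texts = list(texts)


class BatchAsStream:
    """Adapter: lets a finished TokenBatch pretend to be a TokenStream."""

    def __init__(self, batch):
        self._texts = batch.texts
        self._pos = 0

    def next(self):
        # Hand out tokens one at a time; None marks end of stream.
        if self._pos >= len(self._texts):
            return None
        token = self._texts[self._pos]
        self._pos += 1
        return token


def consume(stream):
    """Stand-in for a consumer that pulls tokens via next() until exhausted."""
    tokens = []
    while True:
        token = stream.next()
        if token is None:
            break
        tokens.append(token)
    return tokens
```

Note that the adapter can only sit at the head of the stream: all
batch processing has to be finished before the first next() call,
which is why native links can be prepended but not interleaved with
stream-based ones.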

There are many other areas where Lucene's architecture is poorly  
suited for use with an interpreted language.  Dave has solved those  
problems mainly by rewriting the whole thing in C.  KinoSearch has  
taken that approach in some cases, but more often than Ferret, it  
uses modified algorithms instead.  TokenBatch is one example; the  
best one, which is harder to explain here, is how KinoSearch merges  
together inverted documents during indexing.  (In summary, it's  
faster, simpler, and requires far, far fewer objects.)

It would be possible to port some of these algorithm changes to  
Lucene, but they would be pretty disruptive.  Lucene's a mature,  
heavily-used library and changing anything at all requires a lot of  
consideration.  Some of the changes I would like to see, I don't
think I could lobby for in good conscience.  The
bytecounts-as-string-headers patch is a good example.  For Ferret and
KinoSearch its adoption would yield a very significant benefit, as it
would open the door to using Luke to browse indexes.  For Java Lucene,
though, it can only be justified by further changes which build upon it.
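
For anyone unfamiliar with the issue, here is a sketch of the two
string-header styles.  It assumes the patch means prefixing each
string with its length in bytes rather than in UTF-16 code units,
with a variable-length integer as the header.  The function names
are made up, and plain UTF-8 is used for the payload to keep things
simple; this is not code from any of the libraries.

```python
import io


def write_vint(out, value):
    """Write a non-negative int as 7-bit groups, low-order group first."""
    while value > 0x7F:
        out.write(bytes([(value & 0x7F) | 0x80]))  # high bit: more bytes follow
        value >>= 7
    out.write(bytes([value]))


def read_vint(inp):
    """Inverse of write_vint."""
    result = 0
    shift = 0
    while True:
        byte = inp.read(1)[0]
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result
        shift += 7


def write_string_charcount(out, s):
    """Header counts UTF-16 code units (what a Java String's length reports)."""
    write_vint(out, len(s.encode("utf-16-le")) // 2)
    out.write(s.encode("utf-8"))


def write_string_bytecount(out, s):
    """Header counts the bytes that actually follow it on disk."""
    data = s.encode("utf-8")
    write_vint(out, len(data))
    out.write(data)


def skip_string_bytecount(inp):
    """With a byte count, a reader can skip a string without decoding it."""
    inp.seek(read_vint(inp), io.SEEK_CUR)
```

With a character-count header, a reader has to decode every character
just to find where the string ends.  With a byte count, a host
language whose native strings are UTF-8 can read or skip the bytes in
one step.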

The downside of the full-port approach that Dave and I have taken is  
that it's a lot of work to build and maintain.  However, we've  
already done the vast majority of the up-front work once.  Re-doing  
it for Lucy will be a cakewalk in comparison.  The maintenance  
problem that KinoSearch and Ferret currently face, we're addressing  
by sharing the C core.  We would not be surprised if others join us  
-- I know of at least one other person who rewrote Lucene in C:  
Robert Kirchgessner, who did a partial PHP/C port.  Heck, it will  
presumably be easier to maintain a Python port against Lucy than  
against GCJ'd Lucene, provided that we achieve what we've set out to  
achieve.

The only question remaining, I think, is whether the project will  
actually be hosted at Apache.  When Dave and I approached Doug  
Cutting about it, he specifically requested that development take  
place there -- before Dave or I had had a chance to indicate that  
that was our preference as well.  However, we've been waiting for  
approval by the Lucene PMC for a couple weeks now, and I'm not sure
it's coming.  I'm guessing that Erik "One Lucene To Rule Them All"
Hatcher hasn't cast his +1.  ;)  IMO, it would be best for everybody  
if we did this within the Lucene family, but we'll just have to see.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

[1] What constitutes a compiled vs. a dynamic language is debatable  
-- see <http://en.wikipedia.org/wiki/Interpreted_language>.  It might  
be more accurate to describe Java as a "more compiled" language.





_______________________________________________
Ferret-talk mailing list
http://rubyforge.org/mailman/listinfo/ferret-talk
