Heh, Cool :-)

In principle we might be able to, but it will be a while, as our legal and biz dev will be involved. However, I do believe everything I did was referred to by Dave at some point. Most of the changes are pretty obvious if you run through the code.
I'm about to do a bunch of benchmarking (maybe 2 weeks?) on Linux and Solaris, of Texis and Lucene, in 4 different configurations (weighted and unweighted, sloppy phrase match and conjunctive). I'll post a summary :)

A lot of optimizing Lucene is about taming GC, and I would say that using RAMDirectory is a huge saving.

Minimize fields -- have one indexed, tokenized, not stored, and one with the "content" as a monolithic stored field (parse it afterwards).
Write a custom HitCollector if appropriate.
Minimize classes, stick with Java builtins as much as possible.

There are other considerations in choosing between Texis and Lucene -- cost(!!) and caching (as I said). Memory maxes out at 4GB on most normal boxes, so if you can't fit your document base and index in <4GB, then you need the caching.

Winton

>Hello,
>
>Funny, I was just wondering how Lucene compares to Texis the other day.
>Yes, I guess Lucene doesn't have any caching. Perhaps this could
>easily be added by making use of one of many caching projects that seem
>to be popping up under Jakarta (jakarta.apache.org).
>
>Winton, if appropriate, could you share some of the changes you made
>to Lucene to support the query rate that you mentioned?
>
>Thanks,
>Otis
>
>
>--- Winton Davies <[EMAIL PROTECTED]> wrote:
>> Hi,
>>
>> We're (Overture/Goto) evaluating Lucene ... email me specific
>> questions.
>>
>> In general I would say Lucene is very efficient. It is only about
>> 30% slower than Thunderstone Texis (which is a native C code base).
>> Main difference is that Lucene doesn't handle caching as well as
>> Texis does.
>>
>> Basically the Index is on Disk or in RAM (ie can take up 400-500 MB
>> in our application). Texis for example is able to buffer what it can
>> of the Index in memory without explicit setting of memory limits.
>>
>> Out of the box we couldn't use Phrase Matching for very high volume
>> transactions (we're looking at 1000s of queries/sec) and had to
>> customize it to our needs, but because it's Open Source, guess what,
>> you can write any kind of optimizations you want. Actually that
>> isn't fair -- just be careful that you understand the performance
>> parameters involved in text retrieval and the various types of
>> queries that are possible. Do you need Text Retrieval or are you
>> doing an unranked "Text Search"?
>>
>> Oh, and it's free :)
>>
>> Reliable? Well, I've never had a problem someone couldn't answer,
>> and it never crashes (ie it's pretty bug-free as far as I can tell).
>
>--
>To unsubscribe, e-mail: <mailto:[EMAIL PROTECTED]>
>For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>

--
Winton Davies
Lead Engineer, Overture (NSDQ: OVER)
1820 Gateway Drive, Suite 360
San Mateo, CA 94404
work: (650) 403-2259
cell: (650) 867-1598
http://www.overture.com/
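
To make the "custom HitCollector" and "stick with Java builtins" advice in the reply above a bit more concrete, here is a rough sketch in plain Java. It is not the code used at Overture and it does not use Lucene's own classes: TopNCollector, Hit, and the collect(doc, score) callback shape are hypothetical stand-ins chosen for illustration, and the scores in main are made up. The idea shown is simply to keep only the best N hits in a small heap, and to allocate an object only for hits that are actually competitive, which is one way to keep GC pressure down.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch: a bounded "top N" hit collector built on JDK classes only.
final class TopNCollector {

    // Hypothetical stand-in for a search hit: a document id and its score.
    static final class Hit {
        final int doc;
        final float score;

        Hit(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    private final int limit;
    // Min-heap ordered by score, so the weakest retained hit is at the head.
    private final PriorityQueue<Hit> heap;

    TopNCollector(int limit) {
        if (limit < 1) {
            throw new IllegalArgumentException("limit must be >= 1");
        }
        this.limit = limit;
        this.heap = new PriorityQueue<>(limit, (a, b) -> Float.compare(a.score, b.score));
    }

    // Assumed to be called once per matching document by the search loop.
    void collect(int doc, float score) {
        if (heap.size() < limit) {
            heap.add(new Hit(doc, score));
        } else if (score > heap.peek().score) {
            // Only competitive hits cause an allocation; the rest are dropped.
            heap.poll();
            heap.add(new Hit(doc, score));
        }
    }

    // Returns the retained hits, best score first.
    List<Hit> topHits() {
        List<Hit> hits = new ArrayList<>(heap);
        hits.sort((a, b) -> Float.compare(b.score, a.score));
        return hits;
    }

    public static void main(String[] args) {
        TopNCollector collector = new TopNCollector(2);
        collector.collect(7, 0.30f);
        collector.collect(3, 0.90f);
        collector.collect(9, 0.55f);
        collector.collect(4, 0.10f);
        for (Hit hit : collector.topHits()) {
            System.out.println("doc=" + hit.doc + " score=" + hit.score);
        }
    }
}
```

With a limit of 2, the example in main keeps documents 3 and 9 and discards the other two. Whether something like this helps in a real deployment depends on the query mix, so treat it as a starting point for measurement rather than a recommendation.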
