Heh,

  Cool :-) In principle we might be able to, but it will be a while, 
as our legal and biz dev will be involved.
  However, I do believe everything I did was referred to by Dave at 
some point. Most of the changes are pretty obvious if you run through 
the code.

  I'm about to do a bunch of benchmarking (maybe 2 weeks?) on Linux 
and Solaris, of Texis and Lucene, in 4 different configurations 
(weighted and unweighted, sloppy phrase match and conjunctive). I'll 
post a summary :)

  A lot of optimizing Lucene involves taming GC with RAMDirectory. 
I would say that using RAMDirectory is a huge saving.
Minimize fields -- have one indexed, tokenized, not stored, and one with 
the "content" as a monolithic field (parse it afterwards). Write a 
custom HitCollector if appropriate. Minimize classes, and stick with 
Java builtins as much as possible.
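
To make the collector and "stick with builtins" advice concrete, here is 
a minimal sketch in plain Java. It deliberately does not use Lucene's own 
classes: HitSink and TopNCollector are hypothetical names invented for 
this illustration, and the callback shape (doc id plus score) is only an 
assumption about what a hit collector receives. The idea is to keep the 
N best hits in two primitive arrays arranged as a min-heap, so no object 
is allocated per hit and the GC has nothing extra to clean up.

```java
import java.util.Arrays;

/** Hypothetical stand-in for a search library's hit callback; not Lucene's API. */
interface HitSink {
    void collect(int doc, float score);
}

/** Keeps the n best-scoring hits in primitive arrays arranged as a min-heap. */
final class TopNCollector implements HitSink {
    private final int[] docs;
    private final float[] scores;
    private int size;

    TopNCollector(int n) {
        if (n <= 0) throw new IllegalArgumentException("n must be positive");
        docs = new int[n];
        scores = new float[n];
    }

    @Override
    public void collect(int doc, float score) {
        if (size < docs.length) {
            // Heap not full yet: append and restore heap order.
            docs[size] = doc;
            scores[size] = score;
            siftUp(docs, scores, size++);
        } else if (score > scores[0]) {
            // Better than the current worst kept hit: replace the root.
            docs[0] = doc;
            scores[0] = score;
            siftDown(docs, scores, 0, size);
        }
    }

    int size() { return size; }

    /** Returns doc ids ordered best score first; the collector is left unchanged. */
    int[] topDocs() {
        int[] d = Arrays.copyOf(docs, size);
        float[] s = Arrays.copyOf(scores, size);
        int[] out = new int[size];
        // Pop the minimum repeatedly, filling the result from the back.
        for (int end = size; end > 0; end--) {
            out[end - 1] = d[0];
            d[0] = d[end - 1];
            s[0] = s[end - 1];
            siftDown(d, s, 0, end - 1);
        }
        return out;
    }

    private static void siftUp(int[] d, float[] s, int i) {
        while (i > 0) {
            int parent = (i - 1) / 2;
            if (s[parent] <= s[i]) break;
            swap(d, s, parent, i);
            i = parent;
        }
    }

    private static void siftDown(int[] d, float[] s, int i, int n) {
        while (true) {
            int left = 2 * i + 1, right = left + 1, smallest = i;
            if (left < n && s[left] < s[smallest]) smallest = left;
            if (right < n && s[right] < s[smallest]) smallest = right;
            if (smallest == i) return;
            swap(d, s, smallest, i);
            i = smallest;
        }
    }

    private static void swap(int[] d, float[] s, int a, int b) {
        int td = d[a]; d[a] = d[b]; d[b] = td;
        float ts = s[a]; s[a] = s[b]; s[b] = ts;
    }
}
```

In a real setup the search code would call collect() once per matching 
document, and the caller would read topDocs() at the end. Whether this 
beats the stock result handling depends on the query mix, so it is worth 
benchmarking rather than assuming.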

There are other considerations in choosing between Texis and Lucene -- 
cost(!!) and caching (as I said). Memory maxes out at 4GB on most 
normal boxes, so if you can't fit your document base and index in 
<4GB, then you need the caching.

  Winton



>Hello,
>
>Funny, I was just wondering how Lucene compares to Texis the other day.
>Yes, I guess Lucene doesn't have any caching.  Perhaps this could
>easily be added by making use of one of many caching projects that seem
>to be popping up under Jakarta (jakarta.apache.org).
>
>Winton, if appropriate, could you share some of the changes you made
>to Lucene to support the query rate that you mentioned?
>
>Thanks,
>Otis
>
>
>--- Winton Davies <[EMAIL PROTECTED]> wrote:
>>  Hi,
>>
>>    We're (Overture/Goto) evaluating Lucene ... email me specific
>>  questions.
>>
>>    In general I would say Lucene is very efficient. It is only about
>>  30% slower than Thunderstone Texis (which is a native C code base).
>>  Main difference is that Lucene doesn't handle Caching as well as
>>  Texis does.
>>
>>    Basically the Index is on Disk or in RAM (ie can take up 400-500 MB
>>  in our application).  Texis for example is able to buffer what it
>>  can of the Index in memory without explicit setting of memory limits.
>>
>>    Out of the box we couldn't use Phrase Matching for very high volume
>>  transactions (we're looking at 1000s of queries/sec) and had to
>>  customize it to our needs, but because it's Open Source, guess what,
>>  you can write any kind of optimizations you want. Actually that isn't
>>  fair -- just be careful that you understand the performance
>>  parameters involved in text retrieval and the various types of
>>  queries that are possible. Do you need Text Retrieval, or are you
>>  doing an unranked "Text Search"?
>>
>>
>>    Oh, and it's free :)
>>
>>    Reliable? Well, I've never had a problem someone couldn't answer,
>>  and it never crashes (i.e. it's pretty bug-free as far as I can tell).
>>
>
>
>
>--
>To unsubscribe, e-mail:   <mailto:[EMAIL PROTECTED]>
>For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>

-- 

Winton Davies
Lead Engineer, Overture (NSDQ: OVER)
1820 Gateway Drive, Suite 360
San Mateo, CA 94404
work: (650) 403-2259
cell: (650) 867-1598
http://www.overture.com/
