[ 
https://issues.apache.org/jira/browse/LUCENE-2843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979366#action_12979366
 ] 

Earwin Burrfoot commented on LUCENE-2843:
-----------------------------------------

bq. Nope, havent looked at their code... i think i stopped at the documentation 
when i saw how they analyzed text!
All my points are contained within their documentation. No need to look at the 
code (it's as shady as Lucene's).
In the same manner, Lucene had crappy analysis for years, until you took hold 
of the (Unicode) police baton.
So let's not let the differences between our analyzers color our judgement of 
the other parts : )

bq. In other words, Test2BTerms in src/test should pass on my 32-bit windows 
machine with whatever we default to.
I'm questioning whether there is any legitimate, adequate reason to have that 
many terms.
I do agree with the mmap+32bit/mmap+Windows point for a reasonable number of 
terms, though :/

A hybrid solution, with the term dict loaded completely into memory (either 
via mmap or into arrays) on a per-field basis, is probably best in the end, 
however sad that may be.

> Add variable-gap terms index impl.
> ----------------------------------
>
>                 Key: LUCENE-2843
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2843
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 4.0
>
>         Attachments: LUCENE-2843.patch, LUCENE-2843.patch
>
>
> PrefixCodedTermsReader/Writer (used by all "real" core codecs) already
> supports pluggable terms index impls.
> The only impl we have now is FixedGapTermsIndexReader/Writer, which
> picks every Nth (default 32) term and holds it in efficient packed
> int/byte arrays in RAM.  This is already an enormous improvement (RAM
> reduction, init time) over 3.x.
> This patch adds another impl, VariableGapTermsIndexReader/Writer,
> which lets you specify an arbitrary IndexTermSelector to pick which
> terms are indexed, and then uses an FST to hold the indexed terms.
> This is typically even more memory efficient than packed int/byte
> arrays, though, it does not support ord() so it's not quite a fair
> comparison.
> I had to relax the terms index plugin api for
> PrefixCodedTermsReader/Writer to not assume that the terms index impl
> supports ord.
> I also did some cleanup of the FST/FSTEnum APIs and impls, and broke
> out separate seekCeil and seekFloor in FSTEnum.  Eg we need seekFloor
> when the FST is used as a terms index but seekCeil when it's holding
> all terms in the index (ie which SimpleText uses FSTs for).
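
To make the quoted description concrete, here is a minimal sketch of a 
fixed-gap terms index and of the difference between seekFloor and seekCeil. 
It is not Lucene's implementation: the class and method names 
(TermsIndexSketch, fixedGapIndex) are hypothetical, a sorted list with binary 
search stands in for the packed arrays or FST, and terms are compared as Java 
Strings rather than as bytes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Illustrative only; these are not Lucene classes. */
public class TermsIndexSketch {

    /** Keeps every gap-th term of a sorted term list (terms 0, gap, 2*gap, ...). */
    static List<String> fixedGapIndex(List<String> sortedTerms, int gap) {
        List<String> indexed = new ArrayList<>();
        for (int i = 0; i < sortedTerms.size(); i += gap) {
            indexed.add(sortedTerms.get(i));
        }
        return indexed;
    }

    /** Largest entry <= target, or null if every entry is greater than target. */
    static String seekFloor(List<String> sorted, String target) {
        int pos = Collections.binarySearch(sorted, target);
        if (pos >= 0) {
            return sorted.get(pos);
        }
        int insertion = -pos - 1; // index of the first entry > target
        return insertion == 0 ? null : sorted.get(insertion - 1);
    }

    /** Smallest entry >= target, or null if every entry is smaller than target. */
    static String seekCeil(List<String> sorted, String target) {
        int pos = Collections.binarySearch(sorted, target);
        if (pos >= 0) {
            return sorted.get(pos);
        }
        int insertion = -pos - 1; // index of the first entry > target
        return insertion == sorted.size() ? null : sorted.get(insertion);
    }

    public static void main(String[] args) {
        List<String> terms = List.of(
            "apple", "banana", "cherry", "date", "eggplant", "fig", "grape");
        List<String> index = fixedGapIndex(terms, 3);

        // Terms index: find the indexed term at or before the target,
        // then scan the term dict forward from there.
        System.out.println(seekFloor(index, "cherry"));

        // Full term list: find the first term at or after the target.
        System.out.println(seekCeil(terms, "cat"));
    }
}
```

With a terms index, a lookup has to start from the last indexed term at or 
before the target and scan forward, hence the floor. When the structure holds 
every term, the useful answer is the first term at or after the target, hence 
the ceiling.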

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

