[ 
https://issues.apache.org/jira/browse/LUCENE-2492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12876732#action_12876732
 ] 

Michael McCandless commented on LUCENE-2492:
--------------------------------------------

bq. We can encode whether the posting is embedded or not by storing a byte or a 
negative pointer for example. There are ways to do it with minimal to no more 
space.

Remember that vInt/vLong don't handle negative numbers well (they take the max # 
bytes, I think).
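
To make the cost concrete, here is a minimal sketch of the usual 7-bits-per-byte variable-length encoding, assuming values are stored as 32-bit or 64-bit two's complement. The function name `vint_bytes` and the `bits` parameter are made up for this illustration; this is not Lucene's actual code.

```python
def vint_bytes(value: int, bits: int = 32) -> bytes:
    """Encode value with 7 payload bits per byte; the high bit is set on
    every byte except the last. Hypothetical helper, for illustration only."""
    # Reinterpret the value as an unsigned two's-complement number.
    u = value & ((1 << bits) - 1)
    out = bytearray()
    while u & ~0x7F:                   # more than 7 significant bits remain
        out.append((u & 0x7F) | 0x80)  # low 7 bits plus continuation flag
        u >>= 7
    out.append(u)                      # final byte, continuation flag clear
    return bytes(out)


for v in (5, 127, 128, -1):
    print(v, len(vint_bytes(v)), len(vint_bytes(v, bits=64)))
```

Under these assumptions a small positive value fits in one byte, while any negative value has its top bit set and so needs the full 5 bytes (32-bit) or 10 bytes (64-bit).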

bq. The thing is - there is a performance penalty to storing too many bytes in 
the terms dict because it may affect terms lookup. docFreq may not be a very 
good decision.

True, but I'd expect that rare terms (occurring in 1 or 2 docs across 
the corpus) typically also have low frequency within those documents.  
Hmm, or maybe not -- maybe there's only a single article about Dr. Froobalaz, 
but in that article Froobalaz is mentioned many many times.

bq. For example, a term may have one posting element with a huge payload. 

True, though such apps (the exception not the rule) could override the codec.

Fixed #bytes might also allow for faster scanning, i.e. if we always leave a 20 
byte slot we know we can then seek +20 bytes ahead, vs. the pulsing codec, which 
must decode the postings for the term when scanning over it.  (Though if we 
thought this mattered we could also write the #bytes up front.)
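
A rough sketch of the two skipping strategies mentioned above, assuming a toy on-disk layout invented for this example (one length byte followed by the payload). `SLOT`, `write_fixed`, `write_prefixed` and the other names are hypothetical and do not correspond to any Lucene API.

```python
import io

SLOT = 20  # hypothetical fixed slot size, from the "20 byte slot" example


def write_fixed(entries):
    """Each entry: 1 length byte + payload, zero-padded to exactly SLOT bytes."""
    out = bytearray()
    for e in entries:
        if len(e) > SLOT - 1:
            raise ValueError("entry does not fit in a slot")
        out += bytes([len(e)]) + e + bytes(SLOT - 1 - len(e))
    return bytes(out)


def write_prefixed(entries):
    """Each entry: 1 length byte + payload, with no padding."""
    out = bytearray()
    for e in entries:
        if len(e) > 255:
            raise ValueError("entry too long for a 1-byte length")
        out += bytes([len(e)]) + e
    return bytes(out)


def skip_fixed(buf, n):
    """Skip n entries with a single seek; nothing is read or decoded."""
    buf.seek(n * SLOT, io.SEEK_CUR)


def skip_prefixed(buf, n):
    """Skip n entries by reading each length byte and seeking past the payload."""
    for _ in range(n):
        length = buf.read(1)[0]
        buf.seek(length, io.SEEK_CUR)


def read_entry(buf):
    """Read one entry's payload at the current position."""
    length = buf.read(1)[0]
    return buf.read(length)


entries = [b"a", b"bcd", b"ef"]
fixed = io.BytesIO(write_fixed(entries))
skip_fixed(fixed, 2)
print(read_entry(fixed))

prefixed = io.BytesIO(write_prefixed(entries))
skip_prefixed(prefixed, 2)
print(read_entry(prefixed))
```

The fixed layout trades padding space for a constant-time skip; the length-prefixed layout stays compact but still has to touch one byte per entry skipped.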

Net/net I think we should pursue this; we should probably keep both options 
available and then we can test.


> Make PulsingCodec (wrapping StandardCodec) the default codec
> ------------------------------------------------------------
>
>                 Key: LUCENE-2492
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2492
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 4.0
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 4.0
>
>
> PulsingCodec can provide good gains, by inlining the postings into the terms 
> dict for rare terms.  This is especially helpful for primary key like fields, 
> since every term is rare and batch lookups are common (see 
> http://chbits.blogspot.com/2010/06/lucenes-pulsingcodec-on-primary-key.html 
> for a simple perf test), but it should also be a gain for ordinary fields, 
> thanks to Zipf's law.
> I think we should make it the default....

