[GitHub] [lucene] mikemccand commented on issue #12513: Try out a tantivy's term dictionary format

via GitHub Tue, 05 Sep 2023 02:54:57 -0700


mikemccand commented on issue #12513:
URL: https://github.com/apache/lucene/issues/12513#issuecomment-1706307210


   > 2. random-addressing term information given an ordinal. again no 
additional scan;
   
   Hmm indeed this would require a fixed block size for every term's metadata.
   
   Does Tantivy do pulsing (inlining the postings for singleton terms into the 
terms dictionary)?
   
   Another option might be to have each term block have some sort of header 
array to quickly map a term ordinal (within the block) to its corresponding 
file pointer location?
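
A minimal sketch of that header-array idea, assuming a made-up block layout (a term count, then one `int` offset per term, then the variable-length metadata entries). The class name `TermBlockSketch` and the layout are hypothetical, purely for illustration; this is not Lucene's or Tantivy's actual on-disk format:

```java
import java.nio.ByteBuffer;
import java.util.List;

// Hypothetical block layout, for illustration only:
//   [int termCount][int offset_0] ... [int offset_{n-1}][variable-length metadata entries]
// Each offset is relative to the start of the metadata area.
final class TermBlockSketch {
  private final ByteBuffer block;
  private final int termCount;
  private final int dataStart;

  TermBlockSketch(byte[] bytes) {
    this.block = ByteBuffer.wrap(bytes);
    this.termCount = block.getInt(0);
    this.dataStart = Integer.BYTES * (1 + termCount);
  }

  // Serializes the per-term metadata entries behind a header of offsets.
  static byte[] write(List<byte[]> entries) {
    int total = 0;
    for (byte[] e : entries) {
      total += e.length;
    }
    ByteBuffer out = ByteBuffer.allocate(Integer.BYTES * (1 + entries.size()) + total);
    out.putInt(entries.size());
    int offset = 0;
    for (byte[] e : entries) {
      out.putInt(offset);
      offset += e.length;
    }
    for (byte[] e : entries) {
      out.put(e);
    }
    return out.array();
  }

  // Jumps straight to the metadata for the given ordinal within the block: no scanning.
  byte[] metadata(int ordInBlock) {
    if (ordInBlock < 0 || ordInBlock >= termCount) {
      throw new IndexOutOfBoundsException("ordInBlock=" + ordInBlock);
    }
    int start = offsetOf(ordInBlock);
    int end = ordInBlock + 1 < termCount
        ? offsetOf(ordInBlock + 1)
        : block.capacity() - dataStart;
    byte[] result = new byte[end - start];
    ByteBuffer view = block.duplicate();
    view.position(dataStart + start);
    view.get(result);
    return result;
  }

  private int offsetOf(int ordInBlock) {
    return block.getInt(Integer.BYTES * (1 + ordInBlock));
  }
}
```

The cost is one extra `int` per term in each block (a real format would likely pack these more tightly), in exchange for constant-time addressing within the block.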
   
   Or perhaps we keep the scanning within a block when looking for an ord 
within that block?  The FST would still be definitive about whether a term 
exists or not, but then when it exists, we would still need to do some scanning.
   
   Or, for starters, just make all term metadata fixed width (yes, wasting 
bytes for those terms that don't need the extra stuff).  It'd be a start just 
to simplify playing with this idea, which we could then iterate on?
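
As a rough sketch of what fixed width buys: if every term's metadata occupies the same number of bytes, the entry for an ordinal sits at `ord * ENTRY_BYTES` and no scan is needed. The fields below (`docFreq`, `totalTermFreq`, a postings file pointer) and the class name are assumptions chosen for illustration, not a proposed format:

```java
import java.nio.ByteBuffer;

// Hypothetical fixed-width term metadata table, for illustration only.
// Every entry is [int docFreq][long totalTermFreq][long postingsFP] = 20 bytes,
// even for terms that would need fewer bytes under a variable-length encoding.
final class FixedWidthTermMetadata {
  static final int ENTRY_BYTES = Integer.BYTES + Long.BYTES + Long.BYTES;

  private final ByteBuffer buffer;
  private int count;

  FixedWidthTermMetadata(int maxTerms) {
    this.buffer = ByteBuffer.allocate(maxTerms * ENTRY_BYTES);
  }

  // Appends one entry and returns the ordinal assigned to it.
  long add(int docFreq, long totalTermFreq, long postingsFP) {
    buffer.putInt(docFreq).putLong(totalTermFreq).putLong(postingsFP);
    return count++;
  }

  int docFreq(long ord) {
    return buffer.getInt(base(ord));
  }

  long totalTermFreq(long ord) {
    return buffer.getLong(base(ord) + Integer.BYTES);
  }

  long postingsFP(long ord) {
    return buffer.getLong(base(ord) + Integer.BYTES + Long.BYTES);
  }

  // Random access: the entry position is a pure function of the ordinal.
  private int base(long ord) {
    if (ord < 0 || ord >= count) {
      throw new IndexOutOfBoundsException("ord=" + ord);
    }
    return Math.toIntExact(ord * ENTRY_BYTES);
  }
}
```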
   
   > I have not gone to the full details of the 
[paper](https://citeseerx.ist.psu.edu/doc/10.1.1.24.3698) that underpins 
FSTCompiler implementation but I believe mapping to 8-byte ordinals 
(monotonically increasing) are much easier than mapping to variable-length and 
unordered byte[] blobs.
   
   Yeah -- this is very true!  FST is very efficient at encoding monotonically 
increasing int/long outputs, much more so than semi-random looking `byte[]` 
blobs that don't often share prefixes.  Also, with a monotonically increasing 
output, reverse lookup (ord -> term) becomes possible!  I think 
`FSTOrdPostingsFormat` does that (implements the optional 
`TermsEnum.seekExact(long ord)` API)?
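
To illustrate why monotonically increasing outputs make the reverse lookup possible, here is a toy sketch. It is a plain trie, not Lucene's FST (no suffix sharing, no minimization), and `OrdTrieSketch` is a hypothetical name. It only borrows the idea that each arc carries a partial output and the ordinal is the sum of the outputs along the path. Because terms are added in sorted order, the outputs on a node's arcs increase with the arc label, so for a given ordinal there is exactly one arc to follow at each step:

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy trie with summed arc outputs, for illustration only (not Lucene's FST).
final class OrdTrieSketch {
  private static final class Node {
    final TreeMap<Character, Arc> arcs = new TreeMap<>();
    boolean isFinal;
  }

  private static final class Arc {
    final long output; // delta; the sum along a path is the smallest ord below that arc
    final Node target = new Node();

    Arc(long output) {
      this.output = output;
    }
  }

  private final Node root = new Node();
  private long size;

  // Terms must be unique and sorted; each term's ordinal is its position in the list.
  OrdTrieSketch(List<String> sortedUniqueTerms) {
    String prev = null;
    for (String term : sortedUniqueTerms) {
      if (prev != null && prev.compareTo(term) >= 0) {
        throw new IllegalArgumentException("terms must be sorted and unique: " + term);
      }
      add(term, size++);
      prev = term;
    }
  }

  private void add(String term, long ord) {
    Node node = root;
    long sum = 0;
    for (char c : term.toCharArray()) {
      Arc arc = node.arcs.get(c);
      if (arc == null) {
        // The first term through a new arc has the smallest ord in that subtree.
        arc = new Arc(ord - sum);
        node.arcs.put(c, arc);
      }
      sum += arc.output;
      node = arc.target;
    }
    node.isFinal = true;
  }

  // Forward lookup (term -> ord); returns -1 if the term is absent.
  long ordOf(String term) {
    Node node = root;
    long sum = 0;
    for (char c : term.toCharArray()) {
      Arc arc = node.arcs.get(c);
      if (arc == null) {
        return -1;
      }
      sum += arc.output;
      node = arc.target;
    }
    return node.isFinal ? sum : -1;
  }

  // Reverse lookup (ord -> term): follow the last arc whose summed output is <= ord.
  String termOf(long ord) {
    if (ord < 0 || ord >= size) {
      throw new IndexOutOfBoundsException("ord=" + ord);
    }
    StringBuilder term = new StringBuilder();
    Node node = root;
    long sum = 0;
    while (!(node.isFinal && sum == ord)) {
      Map.Entry<Character, Arc> best = null;
      for (Map.Entry<Character, Arc> e : node.arcs.entrySet()) {
        if (sum + e.getValue().output > ord) {
          break; // outputs only grow with the label, so later arcs overshoot too
        }
        best = e;
      }
      term.append(best.getKey());
      sum += best.getValue().output;
      node = best.getValue().target;
    }
    return term.toString();
  }
}
```

With unordered `byte[]` outputs there is no such ordering to steer by, so a reverse lookup would have to visit every path.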


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


