mikemccand commented on issue #12513: URL: https://github.com/apache/lucene/issues/12513#issuecomment-1706307210
> 2. random-addressing term information given an ordinal. again no additional scan;

Hmm, indeed this would require a fixed block size for every term's metadata. Does Tantivy do pulsing (inlining postings for singleton terms into the terms dictionary)?

Another option might be to have each term block carry some sort of header array to quickly map a term ordinal (within the block) to its corresponding file pointer location? Or perhaps we keep the scanning within a block when looking for an ord within that block? The FST would still be definitive about whether a term exists or not, but then when it exists, we would still need to do some scanning.

Or, for starters, just make all terms' metadata fixed width (yes, wasting bytes for those terms that don't need the extra stuff). It'd be a start just to simplify playing with this idea, which we could then iterate from?

> I have not gone to the full details of the [paper](https://citeseerx.ist.psu.edu/doc/10.1.1.24.3698) that underpins FSTCompiler implementation but I believe mapping to 8-byte ordinals (monotonically increasing) are much easier than mapping to variable-length and unordered byte[] blobs.

Yeah -- this is very true! FST is very efficient at encoding monotonically increasing int/long outputs, much more so than semi-random looking `byte[]` blobs that don't often share prefixes. Also, with a monotonically increasing output, reverse lookup (ord -> term) becomes possible! I think `FSTOrdPostingsFormat` does that (implements the optional `TermsEnum.seekExact(long ord)` API)?
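To make the "fixed width for starters" idea concrete, here is a minimal sketch of random-addressing term metadata by ordinal with no scanning. It is not Lucene code: the class name `FixedWidthTermMetadata`, the record layout (docFreq, totalTermFreq, postings file pointer), and the in-memory `ByteBuffer` are all assumptions made for illustration.

```java
import java.nio.ByteBuffer;

/** Hypothetical sketch: fixed-width per-term metadata, addressed directly by term ordinal. */
final class FixedWidthTermMetadata {
    // Assumed record layout: docFreq (int), totalTermFreq (long), postings file pointer (long).
    static final int RECORD_BYTES = Integer.BYTES + Long.BYTES + Long.BYTES;

    private final ByteBuffer buf;

    FixedWidthTermMetadata(int termCount) {
        // int arithmetic is fine for a sketch; a real table would need long offsets.
        this.buf = ByteBuffer.allocate(termCount * RECORD_BYTES);
    }

    /** Writes the record for the given ordinal; every term pays for all fields. */
    void put(int ord, int docFreq, long totalTermFreq, long postingsFp) {
        int base = ord * RECORD_BYTES;
        buf.putInt(base, docFreq);
        buf.putLong(base + Integer.BYTES, totalTermFreq);
        buf.putLong(base + Integer.BYTES + Long.BYTES, postingsFp);
    }

    // Each lookup is one multiplication plus one absolute read: no scan within a block.
    int docFreq(int ord) {
        return buf.getInt(ord * RECORD_BYTES);
    }

    long totalTermFreq(int ord) {
        return buf.getLong(ord * RECORD_BYTES + Integer.BYTES);
    }

    long postingsFp(int ord) {
        return buf.getLong(ord * RECORD_BYTES + Integer.BYTES + Long.BYTES);
    }
}
```

With this layout the FST only needs to map a term to its ordinal. The cost is the wasted bytes mentioned above. The per-block header array variant would replace the multiplication with a lookup in a small offsets table, which allows variable-width records at the price of one extra indirection.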
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org