Understanding FST Prefix & CheckIndex output

Manuel Le Normand Sun, 22 Sep 2013 06:36:31 -0700

Hi there,
I am trying to dive deep into the internals of LucenePostingFormat to see
what I can do to improve query performance. I'm curious about the term block
stats that I get from CheckIndex -verbose.


What do the following mean:
index FST bytes - is this the FST size, i.e. the field's portion of the .tip
file?
num of terms - it reports 2M, although the Luke interface shows me 8M; how
come?
term / index FST bytes - summing these bytes over all my fields doesn't get
me close to the size of the .tim / .tip files; how come?
blocks - these are the SUFFIX blocks (.tim files), which are implemented as
burst tries, right?
block types - where can I find information about these different types?

As background, my main performance issue is (random?) read-miss IO while
looking up terms in the BlockTreeTerm (.tim files, right?) on queries with
many terms, so my optimization goal is to avoid IOs. That said, is there any
reason why finding the right block would require more than <segment_count>
IOs (of 4 kB each)?

Should a particular distribution of prefix lengths across the block types
alarm me in some way?

field "text_txt"
  index FST:
    18300 nodes
    45779 arcs
    583438 bytes
  term:
    2053393 terms
    25597203 bytes (12.5 bytes/term)
  blocks:
    66086 blocks
    51870 terms-only blocks
    47 sub-block-only blocks
    14169 mixed blocks
    13599 floor blocks
    22862 non-floor blocks
    43224 floor sub-blocks
    18289568 term suffix bytes (276.8 suffix-bytes/block)
    4174480 term stats bytes (63.2 stats-bytes/block)
    7632796 other bytes (115.5 other-bytes/block)
    by prefix length:
      0: 1
      1: 683
      2: 10782
      3: 17133

etc...
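
As a sanity check on the numbers above, here is a minimal Java sketch that
recomputes the per-block averages and checks how the block counters relate to
each other. The class and method names are made up for illustration and are
not part of Lucene. Two things in it are assumptions read off the figures
rather than anything documented: that terms-only, sub-block-only and mixed
blocks partition the total, and that the suffix, stats and other byte counters
together approximate the size of a block in the .tim file.

```java
import java.util.Locale;

// Hypothetical helper, not a Lucene class: re-derives figures from the
// CheckIndex output for field "text_txt" shown above.
final class TermBlockStats {
    // Values copied verbatim from the output above.
    static final long BLOCKS = 66086L;
    static final long TERMS_ONLY = 51870L;
    static final long SUB_BLOCK_ONLY = 47L;
    static final long MIXED = 14169L;
    static final long NON_FLOOR = 22862L;
    static final long FLOOR_SUB_BLOCKS = 43224L;
    static final long TERMS = 2053393L;
    static final long SUFFIX_BYTES = 18289568L;
    static final long STATS_BYTES = 4174480L;
    static final long OTHER_BYTES = 7632796L;

    // Assumption: every block is exactly one of these three types.
    static long blockTypeSum() {
        return TERMS_ONLY + SUB_BLOCK_ONLY + MIXED;
    }

    // Assumption: every block is either a non-floor block or a floor sub-block.
    static long floorSplitSum() {
        return NON_FLOOR + FLOOR_SUB_BLOCKS;
    }

    // Assumption: these three counters together approximate the block data.
    static long blockBytes() {
        return SUFFIX_BYTES + STATS_BYTES + OTHER_BYTES;
    }

    // Average of any counter per block.
    static double perBlock(long value) {
        return (double) value / BLOCKS;
    }

    public static void main(String[] args) {
        System.out.println("block type sum:   " + blockTypeSum() + " of " + BLOCKS);
        System.out.println("floor split sum:  " + floorSplitSum() + " of " + BLOCKS);
        System.out.println(String.format(Locale.ROOT,
                "suffix bytes/block: %.1f", perBlock(SUFFIX_BYTES)));
        System.out.println(String.format(Locale.ROOT,
                "stats bytes/block:  %.1f", perBlock(STATS_BYTES)));
        System.out.println(String.format(Locale.ROOT,
                "other bytes/block:  %.1f", perBlock(OTHER_BYTES)));
        System.out.println(String.format(Locale.ROOT,
                "total bytes/block:  %.1f", perBlock(blockBytes())));
        System.out.println(String.format(Locale.ROOT,
                "terms/block:        %.1f", perBlock(TERMS)));
    }
}
```

With these figures both sums come out to the reported 66086 blocks, and the
three byte counters add up to roughly 455 bytes per block on average, well
under one 4 kB page. Whether that average says anything about how many IOs a
single term lookup needs is exactly the open question above.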

Thanks a lot,
Manuel
