Hi there,

I'm trying to dig into the internals of the Lucene PostingsFormat to see what I can do to improve query performance. I'm curious about the term block stats that I get from CheckIndex -verbose.
What do the following mean?

* index FST bytes - is this the FST size, i.e. the field's share of the .tip file?
* number of terms - reported as 2M, although the Luke interface shows me 8M. How come?
* term / index FST bytes - summing the bytes over all my fields doesn't get me close to the size of the .tim / .tip files. How come?
* blocks - these are the suffix blocks (.tim files), which are implemented as burst tries, right?
* block types - where can I find information about the different types?

As background, my main performance issue is (random?) read-miss I/O while looking up terms in the BlockTreeTerm (the .tim files, right?) on queries with many terms, so my optimization goal is to avoid I/Os. That said, is there any reason why getting to the right block would require more than <segment_count> I/Os (of 4 kB each)? Should a certain distribution of prefix lengths or of block types alarm me in some way?

field "text_txt"
  index FST:
    18300 nodes
    45779 arcs
    583438 bytes
  term:
    2053393 terms
    25597203 bytes (12.5 bytes/term)
  blocks:
    66086 blocks
    51870 terms-only blocks
    47 sub-block-only blocks
    14169 mixed blocks
    13599 floor blocks
    22862 non-floor blocks
    43224 floor sub-blocks
    18289568 term suffix bytes (276.8 suffix-bytes/block)
    4174480 term stats bytes (63.2 stats-bytes/block)
    7632796 other bytes (115.5 stats-bytes/block)
    by prefix length:
      0: 1
      1: 683
      2: 10782
      3: 17133
      etc...

Thanks a lot,
Manuel
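
As a quick sanity check on the numbers quoted above, the per-term and per-block averages and the block counts can be recomputed from the raw totals. This is only a sketch in Python: the variable names are made up for illustration and are not Lucene identifiers, and the values are simply copied from the dump for the field "text_txt".

```python
# Values copied from the CheckIndex -verbose output for field "text_txt".
# All variable names are illustrative; none of them is a Lucene identifier.
terms = 2_053_393
term_bytes = 25_597_203
blocks = 66_086
terms_only, sub_block_only, mixed = 51_870, 47, 14_169
non_floor, floor_sub = 22_862, 43_224
suffix_bytes = 18_289_568
stats_bytes = 4_174_480
other_bytes = 7_632_796


def per(total, count):
    """Average rounded to one decimal, matching the precision in the dump."""
    return round(total / count, 1)


bytes_per_term = per(term_bytes, terms)
suffix_per_block = per(suffix_bytes, blocks)
stats_per_block = per(stats_bytes, blocks)
other_per_block = per(other_bytes, blocks)

# Do the three block types account for every block?
types_add_up = terms_only + sub_block_only + mixed == blocks
# Do non-floor blocks plus floor sub-blocks account for every block?
floor_add_up = non_floor + floor_sub == blocks
# Sum of the three per-block byte categories.
block_bytes_total = suffix_bytes + stats_bytes + other_bytes

print(bytes_per_term, suffix_per_block, stats_per_block, other_per_block)
print(types_add_up, floor_add_up, block_bytes_total)
```

In this dump the three block types sum to the block count, and so do the non-floor blocks plus the floor sub-blocks. The suffix, stats and other bytes sum to 30096844, which is larger than the 25597203 bytes reported on the term line, so those two figures apparently do not measure the same thing.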