All,

Historically, we store most numerics as double precision inside index keys. This makes stored keys independent of the declared scale and allows both prefix and suffix compression. However, int64 keys (BIGINT and high-precision NUMERICs/DECIMALs) do not fit the double format without a risk of rounding/truncation, so they're stored differently -- normalized to the maximum value that fits into a double and then re-scaled. The problem is that this format does not compress well.
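
For reference, the precision limit is easy to demonstrate: a double has a 53-bit significand, so not every int64 value survives a round trip through it. A minimal standalone sketch (illustration only, unrelated to the Firebird code itself):

    #include <cstdint>
    #include <cstdio>

    int main()
    {
        // 2^53 + 1 is the first positive integer a double cannot represent exactly
        const int64_t big = (INT64_C(1) << 53) + 1;
        const double d = static_cast<double>(big);

        // prints 9007199254740993 vs 9007199254740992 -- one unit is lost
        std::printf("int64 : %lld\n", static_cast<long long>(big));
        std::printf("double: %.0f\n", d);
        return 0;
    }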

Below are the storage/compression statistics for identical data (10^6 values) stored as INT, BIGINT and INT128, all indexed. Column ID is unique, column COL contains some duplicates.

T_INT (130)
    Index I_INT_ID (0)
        Root page: 10201, depth: 2, leaf buckets: 726, nodes: 1000000
        Average node length: 5.85, total dup: 0, max dup: 0
        Average key length: 3.00, compression ratio: 1.32
        Average prefix length: 2.96, average data length: 1.00
    Index I_INT_COL (1)
        Root page: 23911, depth: 2, leaf buckets: 683, nodes: 1000000
        Average node length: 5.51, total dup: 367845, max dup: 8
        Average key length: 2.65, compression ratio: 1.49
        Average prefix length: 3.32, average data length: 0.64

T_BIGINT (131)
    Index I_BIGINT_ID (0)
        Root page: 25109, depth: 3, leaf buckets: 1475, nodes: 1000000
        Average node length: 11.90, total dup: 0, max dup: 0
        Average key length: 9.04, compression ratio: 1.00
        Average prefix length: 2.96, average data length: 6.04
    Index I_BIGINT_COL (1)
        Root page: 26580, depth: 3, leaf buckets: 1157, nodes: 1000000
        Average node length: 9.32, total dup: 367845, max dup: 8
        Average key length: 6.46, compression ratio: 1.39
        Average prefix length: 5.17, average data length: 3.83

T_INT128 (132)
    Index I_INT128_ID (0)
        Root page: 27957, depth: 3, leaf buckets: 739, nodes: 1000000
        Average node length: 5.96, total dup: 0, max dup: 0
        Average key length: 3.10, compression ratio: 1.59
        Average prefix length: 3.87, average data length: 1.05
    Index I_INT128_COL (1)
        Root page: 27979, depth: 2, leaf buckets: 697, nodes: 1000000
        Average node length: 5.61, total dup: 367845, max dup: 8
        Average key length: 2.76, compression ratio: 1.79
        Average prefix length: 4.23, average data length: 0.70

As we can see, both BIGINT indices take almost twice as much storage as either INT (which is stored as DOUBLE in the index) or INT128 (which is stored as DECFLOAT in the index), due to the longer keys. Note that this also forces I_BIGINT_COL to have depth 3 rather than 2.
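
To make the "average prefix length" / "average data length" split above concrete: prefix compression stores, per node, only the bytes that differ from the previous key, so longer keys with shorter shared prefixes directly inflate the index. A simplified conceptual sketch (common_prefix is just an illustration, not the actual Firebird code):

    #include <algorithm>
    #include <cstddef>
    #include <cstdio>

    // Simplified sketch of btree prefix compression (conceptual only, not
    // the actual Firebird code): a node records how many leading bytes it
    // shares with the previous key and stores only the differing suffix.
    std::size_t common_prefix(const unsigned char* a, std::size_t alen,
                              const unsigned char* b, std::size_t blen)
    {
        const std::size_t limit = std::min(alen, blen);
        std::size_t i = 0;
        while (i < limit && a[i] == b[i])
            ++i;
        return i;
    }

    int main()
    {
        const unsigned char prev[] = {0x01, 0x02, 0x03, 0x04};
        const unsigned char next[] = {0x01, 0x02, 0x03, 0x09};

        const std::size_t prefix = common_prefix(prev, sizeof prev, next, sizeof next);
        // only the 1-byte suffix is stored; shorter keys tend to share
        // longer prefixes, so less data is kept per node
        std::printf("prefix: %zu, stored suffix: %zu byte(s)\n",
                    prefix, sizeof next - prefix);
        return 0;
    }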

Maybe we couldn't do any better before FB4, but now we have INT128 and DECFLOAT, which use a more efficient index key format, so I asked myself whether it makes sense to utilize it for int64 values too.
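
To illustrate the intuition (a toy model only, not the real idx_decimal encoding, and toy_decimal_key below is just an illustrative name): a digit-oriented key grows with the number of significant digits instead of always occupying 8 bytes, which is what makes it both shorter and friendlier to the prefix compression:

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Toy model of a digit-oriented key (NOT the actual idx_decimal layout):
    // keep only the significant decimal digits, so small values yield short
    // keys and equal leading digits can prefix-compress across nodes.
    std::vector<std::uint8_t> toy_decimal_key(std::uint64_t v)
    {
        std::vector<std::uint8_t> digits;
        do {
            digits.insert(digits.begin(), static_cast<std::uint8_t>(v % 10));
            v /= 10;
        } while (v);
        return digits;  // a real format also encodes sign/exponent for ordering
    }

    int main()
    {
        // a raw int64 key always costs 8 bytes; the digit form costs one
        // byte per significant digit (left unpacked here for clarity)
        for (std::uint64_t v : {7u, 12345u, 1000000u})
            std::printf("%llu -> %zu byte(s) vs 8 fixed\n",
                        static_cast<unsigned long long>(v),
                        toy_decimal_key(v).size());
        return 0;
    }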

Below are the stats if BIGINT were stored as idx_decimal in the index:

T_BIGINT2 (134)
    Index I_BIGINT2_ID (0)
        Root page: 37193, depth: 3, leaf buckets: 739, nodes: 1000000
        Average node length: 5.96, total dup: 0, max dup: 0
        Average key length: 3.10, compression ratio: 1.59
        Average prefix length: 3.87, average data length: 1.05
    Index I_BIGINT2_COL (1)
        Root page: 37215, depth: 2, leaf buckets: 697, nodes: 1000000
        Average node length: 5.61, total dup: 367845, max dup: 8
        Average key length: 2.76, compression ratio: 1.79
        Average prefix length: 4.23, average data length: 0.70

Storage surely looks good, but the internal conversions to Decimal128 are not free. I've done a few artificial tests (an index scan in a loop with many iterations, CPU-bound as the tables fit into the page cache); here are the results:

(1) Wide range scan:
  T_INT = 3.53s
  T_BIGINT = 3.82s
  T_INT128 = 4.36s
  T_BIGINT2 = 3.72s

More or less what's expected: a different number of pages is accessed, so T_BIGINT2 wins compared to T_BIGINT. Interestingly, T_INT128 has the same index size but is noticeably slower. Due to higher computational overhead?

(2) Narrow range scan:
  T_INT = 2.74s
  T_BIGINT = 2.79s
  T_INT128 = 2.83s
  T_BIGINT2 = 2.76s

Almost the same results for both BIGINT key formats.

(3) Unique scan:
  T_INT = 1.83s
  T_BIGINT = 1.89s
  T_INT128 = 2.09s
  T_BIGINT2 = 2.05s

This is the worst case for T_BIGINT2: it loses ~10% to T_BIGINT.

In cases (2) and (3) the index size does not matter, only the index depth does. So the stats reflect the CPU overhead of btr.cpp::compress() and the data type conversions.

It's also likely that if BIGINT is used as a PK and its index is currently 4 levels deep, the new storage would shrink it down to 3 levels, and this could noticeably improve both lookup and (especially) join performance.

Does anyone think we should investigate this possibility more closely? Any other comments?


Dmitry


Firebird-Devel mailing list, web interface at 
https://lists.sourceforge.net/lists/listinfo/firebird-devel