FangYongs opened a new pull request, #4350: URL: https://github.com/apache/paimon/pull/4350
### Purpose

Linked issue: close #xxx

We use expected rows (5000000) and fpp (0.01) for the bloom filter. The writer results are as follows:

Records | Value Data Size | Without Bloom Filter (ms) | With Bloom Filter (ms) | Relative (With / Without)
-- | -- | -- | -- | --
100000 | 0B | 6 | 9 | 1.5
100000 | 64B | 11 | 13 | 1.181818182
100000 | 500B | 37 | 34 | 0.918918919
100000 | 1000B | 57 | 58 | 1.01754386
100000 | 2000B | 88 | 95 | 1.079545455
1000000 | 0B | 32 | 64 | 2
1000000 | 64B | 74 | 114 | 1.540540541
1000000 | 500B | 253 | 310 | 1.225296443
1000000 | 1000B | 384 | 440 | 1.145833333
1000000 | 2000B | 534 | 625 | 1.170411985
5000000 | 0B | 164 | 327 | 1.993902439
5000000 | 64B | 356 | 511 | 1.435393258
5000000 | 500B | 1072 | 1197 | 1.116604478
5000000 | 1000B | 1921 | 2338 | 1.21707444
5000000 | 2000B | 3163 | 3480 | 1.100221309
10000000 | 0B | 344 | 733 | 2.130813953
10000000 | 64B | 659 | 1220 | 1.851289833
10000000 | 500B | 2815 | 2991 | 1.062522202
10000000 | 1000B | 4914 | 5407 | 1.1003256
10000000 | 2000B | 8646 | 9170 | 1.060606061
15000000 | 0B | 517 | 1028 | 1.988394584
15000000 | 64B | 1169 | 1718 | 1.469632164
15000000 | 500B | 5435 | 5170 | 0.95124195
15000000 | 1000B | 9962 | 10255 | 1.029411765
15000000 | 2000B | 16203 | 18573 | 1.146269209

The reader results for queries on keys that are definitely stored are as follows. They indicate that the bloom filter causes essentially no performance degradation.

Records | Value Data Size | Without Bloom Filter (ms) | With Bloom Filter (ms) | Relative (With / Without)
-- | -- | -- | -- | --
100000 | 0B | 799 | 807 | 1.010012516
100000 | 64B | 205 | 193 | 0.941463415
100000 | 500B | 171 | 140 | 0.81871345
100000 | 1000B | 168 | 187 | 1.113095238
100000 | 2000B | 170 | 173 | 1.017647059
1000000 | 0B | 791 | 803 | 1.01517067
1000000 | 64B | 221 | 211 | 0.954751131
1000000 | 500B | 145 | 163 | 1.124137931
1000000 | 1000B | 152 | 181 | 1.190789474
1000000 | 2000B | 162 | 164 | 1.012345679
5000000 | 0B | 789 | 778 | 0.986058302
5000000 | 64B | 221 | 224 | 1.013574661
5000000 | 500B | 175 | 161 | 0.92
5000000 | 1000B | 178 | 181 | 1.016853933
5000000 | 2000B | 186 | 186 | 1
10000000 | 0B | 776 | 797 | 1.027061856
10000000 | 64B | 191 | 201 | 1.052356021
10000000 | 500B | 142 | 151 | 1.063380282
10000000 | 1000B | 142 | 159 | 1.11971831
10000000 | 2000B | 159 | 166 | 1.044025157
15000000 | 0B | 820 | 800 | 0.975609756
15000000 | 64B | 260 | 209 | 0.803846154
15000000 | 500B | 139 | 151 | 1.086330935
15000000 | 1000B | 149 | 155 | 1.040268456
15000000 | 2000B | 157 | 164 | 1.044585987

The reader results for queries on keys that are definitely not stored are as follows. They indicate that the bloom filter can greatly improve performance.

Records | Value Data Size | Without Bloom Filter (ms) | With Bloom Filter (ms) | Relative (With / Without)
-- | -- | -- | -- | --
100000 | 0B | 6 | 3 | 0.5
100000 | 64B | 3 | 2 | 0.666666667
100000 | 500B | 3 | 1 | 0.333333333
100000 | 1000B | 4 | 2 | 0.5
100000 | 2000B | 4 | 2 | 0.5
1000000 | 0B | 3 | 2 | 0.666666667
1000000 | 64B | 3 | 2 | 0.666666667
1000000 | 500B | 3 | 2 | 0.666666667
1000000 | 1000B | 4 | 2 | 0.5
1000000 | 2000B | 6 | 3 | 0.5
5000000 | 0B | 3 | 3 | 1
5000000 | 64B | 3 | 2 | 0.666666667
5000000 | 500B | 5 | 3 | 0.6
5000000 | 1000B | 6 | 3 | 0.5
5000000 | 2000B | 10 | 5 | 0.5
10000000 | 0B | 4 | 3 | 0.75
10000000 | 64B | 5 | 4 | 0.8
10000000 | 500B | 6 | 4 | 0.666666667
10000000 | 1000B | 7 | 5 | 0.714285714
10000000 | 2000B | 11 | 8 | 0.727272727
15000000 | 0B | 5 | 4 | 0.8
15000000 | 64B | 5 | 4 | 0.8
15000000 | 500B | 7 | 6 | 0.857142857
15000000 | 1000B | 8 | 8 | 1
15000000 | 2000B | 14 | 13 | 0.928571429
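
For readers who want a feel for what the benchmark parameters (5000000 expected rows, fpp 0.01) imply, here is a minimal sketch of the textbook Bloom filter sizing formulas. It is not Paimon's implementation: the function names `bloom_filter_size` and `expected_fpp` are hypothetical, and the model assumes ideal, independent hash functions.

```python
import math
from typing import Tuple


def bloom_filter_size(expected_rows: int, fpp: float) -> Tuple[int, int]:
    """Return (num_bits, num_hashes) from the standard sizing formulas.

    Hypothetical helper for illustration only, not a Paimon API.
    m = -n * ln(p) / (ln 2)^2 and k = (m / n) * ln 2.
    """
    if expected_rows <= 0 or not 0.0 < fpp < 1.0:
        raise ValueError("expected_rows must be > 0 and fpp must be in (0, 1)")
    num_bits = math.ceil(-expected_rows * math.log(fpp) / (math.log(2) ** 2))
    num_hashes = max(1, round(num_bits / expected_rows * math.log(2)))
    return num_bits, num_hashes


def expected_fpp(num_bits: int, num_hashes: int, rows: int) -> float:
    """Approximate false-positive probability after inserting `rows` keys."""
    return (1.0 - math.exp(-num_hashes * rows / num_bits)) ** num_hashes


if __name__ == "__main__":
    bits, hashes = bloom_filter_size(5_000_000, 0.01)
    print(f"bits={bits}, bytes={bits // 8}, hashes={hashes}")
    # Model the false-positive rate if the same filter held each record count
    # from the tables above (an assumption; the PR text does not say how the
    # filter is built per file).
    for rows in (100_000, 1_000_000, 5_000_000, 10_000_000, 15_000_000):
        print(rows, f"{expected_fpp(bits, hashes, rows):.4f}")
```

Under this model the filter needs roughly 9.6 bits per expected row, and the false-positive rate rises once more rows are inserted than the filter was sized for. Whether that explains the smaller gains at 10000000 and 15000000 records depends on how the filter is actually populated, which the description above does not state.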
