Hi all,

I previously wrote about how we have used the Java frequent items sketch in our application and the steps we took along the way. Below are the results of an investigation I performed into serialised sizes, and I would be interested in any feedback.
We encountered a limitation imposed by our cloud storage database on the size of the blob that we can store for a given key, in this case 5 MB. We needed a way to restrict the sketch so that we never exceed this threshold. Fortunately, the sketch exposes a parameter, the maximum map size, that bounds the number of items it retains.

However, we are tracking page titles, which account for a significant proportion of the overall bytes and are often variable in length. We analysed the distribution of our raw page titles and initially considered truncating items whose length exceeded a high percentile, to limit the serialised bytes while keeping the map relatively large. However, some titles are in right-to-left languages such as Arabic, and these cannot simply be truncated.

To better understand the parameters and their impact on byte sizes, I experimented with a characterisation test that generated random items of uniform length and measured the resulting serialised size using the ArrayOfStringsSerde.
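For concreteness, here is a stripped-down sketch of that characterisation idea, using only the standard library. It reproduces only ArrayOfStringsSerde's per-item framing (a 4-byte length prefix followed by the string's UTF-8 bytes) rather than the full sketch serialisation, so the class and numbers below are illustrative of the item payload only:

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class SerdeSizeProbe {
    // Random item of a fixed length, drawn from ASCII letters.
    static String randomItem(Random rnd, int len) {
        StringBuilder sb = new StringBuilder(len);
        for (int i = 0; i < len; i++) {
            sb.append((char) ('a' + rnd.nextInt(26)));
        }
        return sb.toString();
    }

    // Approximates ArrayOfStringsSerde's framing: each string is written as
    // a 4-byte length followed by its UTF-8 bytes.
    static long serializedBytes(List<String> items) {
        long total = 0;
        for (String s : items) {
            total += Integer.BYTES + s.getBytes(StandardCharsets.UTF_8).length;
        }
        return total;
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        int k = 13, maxLen = 500;
        // The sketch retains at most 0.75 * 2^k distinct items.
        int maxItems = (3 << k) / 4;
        List<String> items = new ArrayList<>();
        for (int i = 0; i < maxItems; i++) {
            items.add(randomItem(rnd, maxLen));
        }
        // 6144 ASCII items of 500 bytes each, 504 bytes framed:
        System.out.println(serializedBytes(items)); // → 3096576
    }
}
```

The real harness feeds the same random items into an ItemsSketch and serialises it with ArrayOfStringsSerde, which is what produces the actual lower/upper bounds in the table below.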
I have included the results below:

K    Max Len    Predicted Size (MB)    Actual LB    Actual UB
10   100        0.0947265625           43K          51K
10   200        0.16796875             83K          100K
10   300        0.2412109375           122K         141K
10   400        0.314453125            164K         187K
10   500        0.3876953125           214K         239K
11   100        0.189453125            88K          158K
11   200        0.3359375              164K         319K
11   300        0.482421875            283K         463K
11   400        0.62890625             336K         517K
11   500        0.775390625            420K         733K
12   100        0.37890625             174K         340K
12   200        0.671875               357K         605K
12   300        0.96484375             475K         854K
12   400        1.2578125              679K         1.2M
12   500        1.55078125             829K         1.4M
13   100        0.7578125              415K         656K
13   200        1.34375                665K         1.3M
13   300        1.9296875              973K         1.9M
13   400        2.515625               1.4M         2.3M
13   500        3.1015625              1.9M         2.9M
14   100        1.515625               1.0M         1.3M
14   200        2.6875                 1.8M         2.4M
14   300        3.859375               2.1M         3.3M
14   400        5.03125                3.5M         4.9M
14   500        6.203125               3.5M         5.9M
15   100        3.03125                1.6M         2.3M
15   200        5.375                  3.1M         4.3M
15   300        7.71875                4.7M         5.7M
15   400        10.0625                5.4M         8.6M
15   500        12.40625               7.7M         9.8M
16   100        6.0625                 3.8M         5.0M
16   200        10.75                  8.0M         9.1M
16   300        15.4375                11M          13M
16   400        20.125                 14M          17M
16   500        24.8125                18M          22M
17   100        12.125                 6.4M         7.6M
17   200        21.5                   12M          16M
17   300        30.875                 18M          22M
17   400        40.25                  23M          28M
17   500        49.625                 26M          34M
18   100        24.25                  11M          14M
18   200        43                     21M          27M
18   300        61.75                  33M          61M
18   400        80.5                   41M          49M
18   500        99.25                  55M          62M

The Javadoc for the frequent items sketch gives a worst-case guarantee on the size of the items map, namely 18 * 2^K bytes, and this is incorporated into my formula for calculating the predicted size. Given the results above, it appears that we can satisfy our size constraints with a maximum map size of 2^13, and possibly 2^14, for strings up to 500 characters in length.

I would be glad to provide more information if required, and would be grateful for any feedback on my approach or findings.

Thank you in advance,
David
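P.S. As a footnote on the predicted-size column: the values in the table can be reproduced by predicted bytes = 2^K * (22 + 0.75 * maxLen). A small sketch, assuming that decomposition (18 * 2^K worst-case sketch bytes, plus up to 0.75 * 2^K retained items of maxLen bytes each, with the remaining 4 * 2^K presumably covering per-item overhead):

```java
public class PredictedSize {
    // Predicted serialised size in bytes for a max map size of 2^k and
    // items of maxLen bytes: 18 * 2^k worst-case sketch bytes, plus up to
    // 0.75 * 2^k retained items of maxLen bytes each, plus an assumed
    // 4 * 2^k of per-item overhead (hence 22 = 18 + 4).
    static double predictedBytes(int k, int maxLen) {
        return (1L << k) * (22 + 0.75 * maxLen);
    }

    // Convert bytes to the MB units used in the table.
    static double toMb(double bytes) {
        return bytes / (1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        // The K = 13, Max Len = 500 row of the table:
        System.out.println(toMb(predictedBytes(13, 500))); // → 3.1015625
    }
}
```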
