Hi all,

I wrote previously about how we have used the Java frequent items
sketch in our application and the steps we took along the way.  Below are
the results of an investigation I performed into serialised sizes, and I
would be interested in any feedback.

We encountered a limitation imposed by our cloud storage database on the
size of the blob that we can store for a given key, in this case 5 MB.  We
needed a way to restrict the sketch so that we did not exceed this
threshold.  Fortunately, the sketch exposes a parameter to control the
maximum map size for the frequent items.  However, we are tracking page
titles, which account for a significant proportion of the overall bytes
and are often variable in length.  We did some analysis on the
distribution of our raw page title lengths, and initially thought of
truncating items that exceeded a high-percentile length, to limit the
serialised bytes whilst keeping the map relatively large.  However, some
titles are in languages that are read right to left, such as Arabic, and
these cannot merely be truncated.
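
As a side note on truncation: even for left-to-right titles, cutting a
Java String at an arbitrary char index can split a surrogate pair.  A
minimal sketch of a code-point-safe truncation (a hypothetical helper,
not something we shipped):

```java
public class Truncate {
    // Truncate s to at most maxCodePoints Unicode code points without
    // splitting a surrogate pair.  Note this still does not address the
    // right-to-left problem described above: the cut is encoding-safe,
    // not meaning-safe.
    static String truncate(String s, int maxCodePoints) {
        if (s.codePointCount(0, s.length()) <= maxCodePoints) {
            return s;
        }
        // Find the char index corresponding to the code point limit.
        int end = s.offsetByCodePoints(0, maxCodePoints);
        return s.substring(0, end);
    }

    public static void main(String[] args) {
        System.out.println(truncate("abcdef", 3)); // prints "abc"
    }
}
```

Also worth remembering that serialised size depends on UTF-8 bytes, not
characters, so non-ASCII titles still serialise larger than ASCII titles
of the same character length.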

To better understand the parameters and their impact on byte sizes, I
experimented with a characterisation test that generated random items of
uniform length and measured the resulting serialised size using the
ArrayOfStringsSerde.  I have included the results below:

 K  Max Len  Predicted Size (MB)  Actual LB  Actual UB
10      100         0.0947265625        43K        51K
10      200           0.16796875        83K       100K
10      300         0.2412109375       122K       141K
10      400          0.314453125       164K       187K
10      500         0.3876953125       214K       239K
11      100          0.189453125        88K       158K
11      200            0.3359375       164K       319K
11      300          0.482421875       283K       463K
11      400           0.62890625       336K       517K
11      500          0.775390625       420K       733K
12      100           0.37890625       174K       340K
12      200             0.671875       357K       605K
12      300           0.96484375       475K       854K
12      400            1.2578125       679K       1.2M
12      500           1.55078125       829K       1.4M
13      100            0.7578125       415K       656K
13      200              1.34375       665K       1.3M
13      300            1.9296875       973K       1.9M
13      400             2.515625       1.4M       2.3M
13      500            3.1015625       1.9M       2.9M
14      100             1.515625       1.0M       1.3M
14      200               2.6875       1.8M       2.4M
14      300             3.859375       2.1M       3.3M
14      400              5.03125       3.5M       4.9M
14      500             6.203125       3.5M       5.9M
15      100              3.03125       1.6M       2.3M
15      200                5.375       3.1M       4.3M
15      300              7.71875       4.7M       5.7M
15      400              10.0625       5.4M       8.6M
15      500             12.40625       7.7M       9.8M
16      100               6.0625       3.8M       5.0M
16      200                10.75       8.0M       9.1M
16      300              15.4375        11M        13M
16      400               20.125        14M        17M
16      500              24.8125        18M        22M
17      100               12.125       6.4M       7.6M
17      200                 21.5        12M        16M
17      300               30.875        18M        22M
17      400                40.25        23M        28M
17      500               49.625        26M        34M
18      100                24.25        11M        14M
18      200                   43        21M        27M
18      300                61.75        33M        61M
18      400                 80.5        41M        49M
18      500                99.25        55M        62M
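
For reproducibility, the per-item cost that drives these numbers can be
sketched without the library at all, assuming (my reading of the
ArrayOfStringsSerde format, worth double-checking against your version)
that each item is written as a 4-byte length prefix followed by its UTF-8
bytes:

```java
import java.nio.charset.StandardCharsets;
import java.util.Random;

public class SerdeSizeProbe {
    // Random item of exactly `len` ASCII characters, in the spirit of the
    // characterisation test (a hypothetical reconstruction, not the real
    // test code).
    static String randomItem(Random rnd, int len) {
        StringBuilder sb = new StringBuilder(len);
        for (int i = 0; i < len; i++) {
            sb.append((char) ('a' + rnd.nextInt(26)));
        }
        return sb.toString();
    }

    // Bytes the serde spends per item: 4-byte length prefix plus the
    // UTF-8 encoding of the string.
    static int serializedItemBytes(String s) {
        return Integer.BYTES + s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        String item = randomItem(rnd, 100);
        System.out.println(serializedItemBytes(item)); // 104 for a 100-char ASCII item
    }
}
```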
The Javadoc for the frequent items sketch gives a guarantee on the worst
case size for the items in the multiset, namely 18 * 2^K.  This is
incorporated into my formula for calculating the predicted size.
Given the results above, it appears that we can satisfy our size
constraints with a max map size of 2^13, and possibly 2^14, for strings up
to 500 characters in length.
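
For what it's worth, the predicted column can be reproduced exactly by
the following, which is my own fit to the numbers above rather than
anything documented: the 18 * 2^K worst case from the Javadoc, plus a
4-byte length prefix per slot, plus the string payload at the sketch's
0.75 load factor.

```java
public class PredictedSize {
    // Predicted serialised size in bytes for lgMaxMapSize = k and items of
    // uniform length len: 18 bytes per slot (worst-case guarantee) plus a
    // 4-byte length prefix per slot plus 0.75 * len payload bytes per slot.
    // NB: an empirical fit to my table, not an official formula.
    static double predictedBytes(int k, int len) {
        double slots = Math.pow(2, k);
        return slots * (18 + 4 + 0.75 * len);
    }

    static double predictedMb(int k, int len) {
        return predictedBytes(k, len) / (1 << 20);
    }

    public static void main(String[] args) {
        System.out.println(predictedMb(10, 100)); // matches the K=10, len=100 row
        System.out.println(predictedMb(13, 500)); // matches the K=13, len=500 row
    }
}
```

On this model the dominant term is 0.75 * 2^K * len, which is why the
prediction roughly doubles with each increment of K at a fixed length.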

I would be glad to provide more information if required, and would be
grateful for any feedback on my approach or findings.

Thank you in advance,
David
