Hi David,
Our goal for the maximum size calculation is to provide an a priori upper
bound on the size for storage planning purposes.  And from what I can tell
from David's table, it meets that goal.

To quantify this, I would add a "Margin" column and compute the
margin as (PredictedSize / ActualUB - 1).  Doing this with the data from
your table reveals that the predicted size is always larger than your
measured UB, and the margin varies from +1.23% to about +85%.  The LB
column is interesting but is not useful for quantifying how well the sketch
performs against the goal.
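As a quick illustration (a minimal sketch; the values are read straight off
David's table, with both columns expressed in MB as reported), these two rows
give the extremes quoted above:

```java
public class Margin {
    // Margin = PredictedSize / ActualUB - 1, expressed as a percentage.
    // Both arguments must be in the same units (here, MB as reported).
    static double marginPct(double predicted, double actualUb) {
        return (predicted / actualUb - 1.0) * 100.0;
    }

    public static void main(String[] args) {
        // K=10, maxLen=100: predicted 0.0947 MB vs. measured UB of 51K (0.051 MB)
        System.out.printf("K=10, len=100: %+.2f%%%n", marginPct(0.0947265625, 0.051));
        // K=18, maxLen=300: predicted 61.75 MB vs. measured UB of 61M
        System.out.printf("K=18, len=300: %+.2f%%%n", marginPct(61.75, 61.0));
    }
}
```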

Nonetheless, for readers of this thread, it is important to note that the
formula David refers to, 18 * maxMapSize (or 18 * 2^K for his table),
does not include the storage required for the items <T>, which David
included in his calculations (since he knows what his type T is).
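For what it's worth, the predicted sizes in David's table are consistent with
the following reconstruction (an assumption on my part, since his exact formula
is not shown in this thread): 18 bytes of sketch storage per hash-map slot (the
Javadoc bound), plus a 4-byte length prefix per slot and the item payload at a
0.75 load factor.

```java
public class PredictedSize {
    /** Reconstructed upper-bound estimate (an assumption; the exact formula
     *  used for the table is not shown in this thread).  Per hash-map slot:
     *  18 bytes of sketch storage, a 4-byte length prefix, and maxLen bytes
     *  of item payload at a 0.75 load factor. */
    static long predictedBytes(int k, int maxLen) {
        long slots = 1L << k;                        // maxMapSize = 2^K
        return slots * (18 + 4) + (long) (slots * 0.75 * maxLen);
    }

    /** Largest K whose predicted size fits within limitBytes. */
    static int maxK(long limitBytes, int maxLen) {
        int k = 0;
        while (predictedBytes(k + 1, maxLen) <= limitBytes) k++;
        return k;
    }

    public static void main(String[] args) {
        // Reproduces the first table row: K=10, maxLen=100 -> 0.0947265625 MB
        System.out.println(predictedBytes(10, 100) / (1024.0 * 1024.0));
        // Largest K fitting a 5 MB blob at maxLen=500 -> 13, matching the
        // quoted conclusion (2^14 fits only by the measured UB, not the bound)
        System.out.println(maxK(5L << 20, 500));
    }
}
```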

Cheers,

Lee.

On Mon, Nov 16, 2020 at 4:39 AM David Cromberge <[email protected]>
wrote:

> Hi all,
>
> I wrote previously exploring how we have used the Java frequent items
> sketch in our application, and the steps we took during the process.  There
> are some results of an investigation I performed into serialised sizes, and
> would be interested in any feedback.
>
> We encountered some limitations imposed by our cloud storage database on
> the size of the actual blob that we could store for a given key, in this
> case 5Mb.  We needed to find a way to restrict the sketch in such a way as
> to ensure that we did not exceed this threshold.  Fortunately, the sketch
> exposes a parameter to control the maximum size for the frequent items.
> However, we are tracking page titles which account for a significant
> proportion of the overall bytes, and are often variable in length.
> We did some analysis on the distribution of our raw page titles, and we
> initially thought of truncating items that exceeded a higher percentile
> length to attempt to limit the serialized bytes, whilst keeping the map
> relatively large.  However, there are some titles that are in languages
> that should be read right to left, such as Arabic, which cannot be merely
> truncated.
>
> To better understand the parameters and their impact on byte sizes, I
> experimented with a characterisation test that generated random items of
> uniform length, and measured the resulting serialised size using the
> ArrayOfStringsSerde.  I have included the results below:
>
> K    Max Len   Predicted Size (MB)   Actual LB   Actual UB
> 10   100       0.0947265625          43K         51K
> 10   200       0.16796875            83K         100K
> 10   300       0.2412109375          122K        141K
> 10   400       0.314453125           164K        187K
> 10   500       0.3876953125          214K        239K
> 11   100       0.189453125           88K         158K
> 11   200       0.3359375             164K        319K
> 11   300       0.482421875           283K        463K
> 11   400       0.62890625            336K        517K
> 11   500       0.775390625           420K        733K
> 12   100       0.37890625            174K        340K
> 12   200       0.671875              357K        605K
> 12   300       0.96484375            475K        854K
> 12   400       1.2578125             679K        1.2M
> 12   500       1.55078125            829K        1.4M
> 13   100       0.7578125             415K        656K
> 13   200       1.34375               665K        1.3M
> 13   300       1.9296875             973K        1.9M
> 13   400       2.515625              1.4M        2.3M
> 13   500       3.1015625             1.9M        2.9M
> 14   100       1.515625              1.0M        1.3M
> 14   200       2.6875                1.8M        2.4M
> 14   300       3.859375              2.1M        3.3M
> 14   400       5.03125               3.5M        4.9M
> 14   500       6.203125              3.5M        5.9M
> 15   100       3.03125               1.6M        2.3M
> 15   200       5.375                 3.1M        4.3M
> 15   300       7.71875               4.7M        5.7M
> 15   400       10.0625               5.4M        8.6M
> 15   500       12.40625              7.7M        9.8M
> 16   100       6.0625                3.8M        5.0M
> 16   200       10.75                 8.0M        9.1M
> 16   300       15.4375               11M         13M
> 16   400       20.125                14M         17M
> 16   500       24.8125               18M         22M
> 17   100       12.125                6.4M        7.6M
> 17   200       21.5                  12M         16M
> 17   300       30.875                18M         22M
> 17   400       40.25                 23M         28M
> 17   500       49.625                26M         34M
> 18   100       24.25                 11M         14M
> 18   200       43                    21M         27M
> 18   300       61.75                 33M         61M
> 18   400       80.5                  41M         49M
> 18   500       99.25                 55M         62M
> The Javadoc for the frequent items sketch gives a guarantee on the worst-case
> size for the items in the multiset, namely 18 * 2^K bytes.  This is
> incorporated into my formula for calculating the predicted size.
> Given the results above, it appears that we can satisfy our size
> constraints at a max map size of 2^13, or possibly 2^14, for strings up to
> 500 characters in length.
>
> I would be glad to provide more information if required, and also grateful
> to hear any feedback about my approach or findings.
>
> Thank you in advance,
> David
>
>
