OK, my quick answer is that I agree with Jon. I'd lean toward AoS to unify
the algorithm and avoid divergence bewteen the tuple implementation and the
theta implementation. I'm continually worried that our custom tuple sketch
implementation in java is going to diverge from the theta algorithm. I know
we found bugs/issues/differences between them, and I suspect that there
will be others. While I understand that there is wasted padding, I'd rather
live with that waste initially and determine if/when we need to optimize
that later than deal with the possible algorithm differences that come from
multiple implementations.

<http://www.verizonmedia.com>

Will Lauer

Senior Principal Architect, Audience & Advertising Reporting
Data Platforms & Systems Engineering

M 508 561 6427
1908 S. First St
Champaign, IL 61822

<http://www.facebook.com/verizonmedia>   <http://twitter.com/verizonmedia>
<https://www.linkedin.com/company/verizon-media/>
<http://www.instagram.com/verizonmedia>



On Wed, Jun 10, 2020 at 1:25 PM leerho <[email protected]> wrote:

> I agree with Jon on this.  I have also asked Will Lauer for his opinion as
> he has lots of experience as a user and developer of Tuple sketches.
>
> Lee.
>
> On Wed, Jun 10, 2020 at 11:06 AM Jon Malkin <[email protected]> wrote:
>
>> I'd actually vote for starting with the AoS model and creating an
>> alternative implementation if it's determined to be necessary. My reasoning:
>>
>> 1) Code duplication is a huge cost. Watching the commits and PRs, it
>> seems like we're finding a number of things in the java tuple
>> implementation that have diverged from the theta implementation -- both
>> design changes as well as actual bugs. This would be ever-present tech debt.
>> 2) We don't yet know what "typical" summaries will look like. There'd be
>> a larger overhead for tiny summaries like a single char, but if you have
>> any sort of struct it'll get padded anyway. Without knowing what's actually
>> used, SoA seems like premature optimization considering the concerns in (1).
>> 3) The serialized image won't change (without a version change, which
>> would also need to be done in java), so this is ultimately about the
>> internal implementation. If it turns out that we decide the ability to
>> minimize padding for primitive types is that important, we should be able
>> to rewrite the internals -- transparently! More work then, perhaps, but
>> without API changes so it could be only a minor version bump.
>>
>>   jon
>>
>> On Tue, Jun 9, 2020 at 2:07 PM Alexander Saydakov
>> <[email protected]> wrote:
>>
>>> Hi everyone,
>>> I am trying to come up with a prototype for Tuple sketch in C++. The
>>> core algorithm is the same as in Theta sketch, which is already implemented
>>> in the library. As one of the main requirements I consider reducing the
>>> code duplication. I would like to share as much as possible with Theta
>>> sketch. Ideally they should have a common base and very thin user-facing
>>> facade that would account for the differences.
>>>
>>> I am facing some painful decisions here. Hopefully someone might have
>>> some ideas, and if not, at least this might help to explain some design
>>> choices.
>>>
>>> Theta sketch internally is an array of hash values (uint64_t) with a few
>>> more attributes like theta, empty flag, and such.
>>> We have a custom open-addressing hash map on top of this array in the
>>> Update Theta sketch. The Compact Theta sketch is just an array. We apply
>>> some standard algorithms to this array such as std::copy, std::copy_if,
>>> std::sort, std::nth_elements and std::set_difference.
>>>
>>> Tuple sketch associates some payload with each hash value. Essentially
>>> there are two approaches here: Array of Structures (AoS) and Structure of
>>> Arrays (SoA). Using AoS is easy and straightforward. For the Tuple sketch
>>> this would be std::vector<std::pair<uint64_t, T>> where T is a user-defined
>>> type of the payload. I believe it should be quite easy to make the base
>>> layer unaware of the presence of the payload. The above mentioned
>>> algorithms will happily work on such a container given the right
>>> comparator. However there are significant disadvantages. Suppose we want a
>>> tuple_sketch<float>, so the underlying array will be
>>> std::vector<std::pair<uint64_t, float>>. This approach will waste 4 bytes
>>> for each entry since sizeof(pair<uint64_t, float>)=16 due to padding for
>>> alignment. And even worse if we want the payload of just one byte.
>>>
>>> The alternative approach of SoA would be some container of two separate
>>> vectors: std::vector<uint64_t> and std::vector<T>. This way the waste due
>>> to padding can be avoided, but it would be much harder to make the common
>>> base layer that would be unaware of the presence or absence of the second
>>> vector. I think I know how to iterate over such a container, and even apply
>>> transformations, which take an output iterator (std::copy), but I don't see
>>> any way of applying std::sort or std::nth_element, which need to swap
>>> elements in place. As a result, we would need to have significantly more
>>> code specific to a single vector and to two vectors. For instance, we will
>>> need our own clone of std::nth_element that would take two vectors and make
>>> the same swaps in the second one while doing the core algorithm on the
>>> first one.
>>>
>>> I am a bit torn on this. The AoS approach would be so elegant, but the
>>> padding waste does not seem to be acceptable to me. The SoA approach is
>>> going to be much uglier, and probably will cost come performance too, but
>>> it seems that we will have to go this way.
>>> I would appreciate any comments and ideas.
>>> Thank you.
>>>
>>> Alexander Saydakov.
>>>
>>>

Reply via email to