I think we can call this a consensus. Let's go with AoS for now.
Thank you very much for your opinions.

On Wed, Jun 10, 2020 at 11:30 AM Will Lauer <[email protected]>
wrote:

> OK, my quick answer is that I agree with Jon. I'd lean toward AoS to unify
> the algorithm and avoid divergence bewteen the tuple implementation and the
> theta implementation. I'm continually worried that our custom tuple sketch
> implementation in java is going to diverge from the theta algorithm. I know
> we found bugs/issues/differences between them, and I suspect that there
> will be others. While I understand that there is wasted padding, I'd rather
> live with that waste initially and determine if/when we need to optimize
> that later than deal with the possible algorithm differences that come from
> multiple implementations.
>
> <http://www.verizonmedia.com>
>
> Will Lauer
>
> Senior Principal Architect, Audience & Advertising Reporting
> Data Platforms & Systems Engineering
>
> M 508 561 6427
> 1908 S. First St
> Champaign, IL 61822
>
> <http://www.facebook.com/verizonmedia>   <http://twitter.com/verizonmedia>
>    <https://www.linkedin.com/company/verizon-media/>
> <http://www.instagram.com/verizonmedia>
>
>
>
> On Wed, Jun 10, 2020 at 1:25 PM leerho <[email protected]> wrote:
>
>> I agree with Jon on this.  I have also asked Will Lauer for his opinion
>> as he has lots of experience as a user and developer of Tuple sketches.
>>
>> Lee.
>>
>> On Wed, Jun 10, 2020 at 11:06 AM Jon Malkin <[email protected]> wrote:
>>
>>> I'd actually vote for starting with the AoS model and creating an
>>> alternative implementation if it's determined to be necessary. My reasoning:
>>>
>>> 1) Code duplication is a huge cost. Watching the commits and PRs, it
>>> seems like we're finding a number of things in the java tuple
>>> implementation that have diverged from the theta implementation -- both
>>> design changes as well as actual bugs. This would be ever-present tech debt.
>>> 2) We don't yet know what "typical" summaries will look like. There'd be
>>> a larger overhead for tiny summaries like a single char, but if you have
>>> any sort of struct it'll get padded anyway. Without knowing what's actually
>>> used, SoA seems like premature optimization considering the concerns in (1).
>>> 3) The serialized image won't change (without a version change, which
>>> would also need to be done in java), so this is ultimately about the
>>> internal implementation. If it turns out that we decide the ability to
>>> minimize padding for primitive types is that important, we should be able
>>> to rewrite the internals -- transparently! More work then, perhaps, but
>>> without API changes so it could be only a minor version bump.
>>>
>>>   jon
>>>
>>> On Tue, Jun 9, 2020 at 2:07 PM Alexander Saydakov
>>> <[email protected]> wrote:
>>>
>>>> Hi everyone,
>>>> I am trying to come up with a prototype for Tuple sketch in C++. The
>>>> core algorithm is the same as in Theta sketch, which is already implemented
>>>> in the library. As one of the main requirements I consider reducing the
>>>> code duplication. I would like to share as much as possible with Theta
>>>> sketch. Ideally they should have a common base and very thin user-facing
>>>> facade that would account for the differences.
>>>>
>>>> I am facing some painful decisions here. Hopefully someone might have
>>>> some ideas, and if not, at least this might help to explain some design
>>>> choices.
>>>>
>>>> Theta sketch internally is an array of hash values (uint64_t) with a
>>>> few more attributes like theta, empty flag, and such.
>>>> We have a custom open-addressing hash map on top of this array in the
>>>> Update Theta sketch. The Compact Theta sketch is just an array. We apply
>>>> some standard algorithms to this array such as std::copy, std::copy_if,
>>>> std::sort, std::nth_elements and std::set_difference.
>>>>
>>>> Tuple sketch associates some payload with each hash value. Essentially
>>>> there are two approaches here: Array of Structures (AoS) and Structure of
>>>> Arrays (SoA). Using AoS is easy and straightforward. For the Tuple sketch
>>>> this would be std::vector<std::pair<uint64_t, T>> where T is a user-defined
>>>> type of the payload. I believe it should be quite easy to make the base
>>>> layer unaware of the presence of the payload. The above mentioned
>>>> algorithms will happily work on such a container given the right
>>>> comparator. However there are significant disadvantages. Suppose we want a
>>>> tuple_sketch<float>, so the underlying array will be
>>>> std::vector<std::pair<uint64_t, float>>. This approach will waste 4 bytes
>>>> for each entry since sizeof(pair<uint64_t, float>)=16 due to padding for
>>>> alignment. And even worse if we want the payload of just one byte.
>>>>
>>>> The alternative approach of SoA would be some container of two separate
>>>> vectors: std::vector<uint64_t> and std::vector<T>. This way the waste due
>>>> to padding can be avoided, but it would be much harder to make the common
>>>> base layer that would be unaware of the presence or absence of the second
>>>> vector. I think I know how to iterate over such a container, and even apply
>>>> transformations, which take an output iterator (std::copy), but I don't see
>>>> any way of applying std::sort or std::nth_element, which need to swap
>>>> elements in place. As a result, we would need to have significantly more
>>>> code specific to a single vector and to two vectors. For instance, we will
>>>> need our own clone of std::nth_element that would take two vectors and make
>>>> the same swaps in the second one while doing the core algorithm on the
>>>> first one.
>>>>
>>>> I am a bit torn on this. The AoS approach would be so elegant, but the
>>>> padding waste does not seem to be acceptable to me. The SoA approach is
>>>> going to be much uglier, and probably will cost come performance too, but
>>>> it seems that we will have to go this way.
>>>> I would appreciate any comments and ideas.
>>>> Thank you.
>>>>
>>>> Alexander Saydakov.
>>>>
>>>>

Reply via email to