I'd actually vote for starting with the AoS model and creating an
alternative implementation if it's determined to be necessary. My reasoning:

1) Code duplication is a huge cost. Watching the commits and PRs, it seems
like we're finding a number of things in the java tuple implementation that
have diverged from the theta implementation -- both design changes as well
as actual bugs. This would be ever-present tech debt.
2) We don't yet know what "typical" summaries will look like. There'd be a
larger overhead for tiny summaries like a single char, but if you have any
sort of struct it'll get padded anyway. Without knowing what's actually
used, SoA seems like premature optimization considering the concerns in (1).
3) The serialized image won't change (without a version change, which would
also need to be done in java), so this is ultimately about the internal
implementation. If it turns out that we decide the ability to minimize
padding for primitive types is that important, we should be able to rewrite
the internals -- transparently! More work then, perhaps, but without API
changes so it could be only a minor version bump.

  jon

On Tue, Jun 9, 2020 at 2:07 PM Alexander Saydakov
<[email protected]> wrote:

> Hi everyone,
> I am trying to come up with a prototype for Tuple sketch in C++. The core
> algorithm is the same as in Theta sketch, which is already implemented in
> the library. As one of the main requirements I consider reducing the code
> duplication. I would like to share as much as possible with Theta sketch.
> Ideally they should have a common base and very thin user-facing facade
> that would account for the differences.
>
> I am facing some painful decisions here. Hopefully someone might have some
> ideas, and if not, at least this might help to explain some design choices.
>
> Theta sketch internally is an array of hash values (uint64_t) with a few
> more attributes like theta, empty flag, and such.
> We have a custom open-addressing hash map on top of this array in the
> Update Theta sketch. The Compact Theta sketch is just an array. We apply
> some standard algorithms to this array such as std::copy, std::copy_if,
> std::sort, std::nth_elements and std::set_difference.
>
> Tuple sketch associates some payload with each hash value. Essentially
> there are two approaches here: Array of Structures (AoS) and Structure of
> Arrays (SoA). Using AoS is easy and straightforward. For the Tuple sketch
> this would be std::vector<std::pair<uint64_t, T>> where T is a user-defined
> type of the payload. I believe it should be quite easy to make the base
> layer unaware of the presence of the payload. The above mentioned
> algorithms will happily work on such a container given the right
> comparator. However there are significant disadvantages. Suppose we want a
> tuple_sketch<float>, so the underlying array will be
> std::vector<std::pair<uint64_t, float>>. This approach will waste 4 bytes
> for each entry since sizeof(pair<uint64_t, float>)=16 due to padding for
> alignment. And even worse if we want the payload of just one byte.
>
> The alternative approach of SoA would be some container of two separate
> vectors: std::vector<uint64_t> and std::vector<T>. This way the waste due
> to padding can be avoided, but it would be much harder to make the common
> base layer that would be unaware of the presence or absence of the second
> vector. I think I know how to iterate over such a container, and even apply
> transformations, which take an output iterator (std::copy), but I don't see
> any way of applying std::sort or std::nth_element, which need to swap
> elements in place. As a result, we would need to have significantly more
> code specific to a single vector and to two vectors. For instance, we will
> need our own clone of std::nth_element that would take two vectors and make
> the same swaps in the second one while doing the core algorithm on the
> first one.
>
> I am a bit torn on this. The AoS approach would be so elegant, but the
> padding waste does not seem to be acceptable to me. The SoA approach is
> going to be much uglier, and probably will cost come performance too, but
> it seems that we will have to go this way.
> I would appreciate any comments and ideas.
> Thank you.
>
> Alexander Saydakov.
>
>

Reply via email to