I agree with Jon on this.  I have also asked Will Lauer for his opinion as
he has lots of experience as a user and developer of Tuple sketches.

Lee.

On Wed, Jun 10, 2020 at 11:06 AM Jon Malkin <[email protected]> wrote:

> I'd actually vote for starting with the AoS model and creating an
> alternative implementation if it's determined to be necessary. My reasoning:
>
> 1) Code duplication is a huge cost. Watching the commits and PRs, it seems
> like we're finding a number of things in the java tuple implementation that
> have diverged from the theta implementation -- both design changes as well
> as actual bugs. This would be ever-present tech debt.
> 2) We don't yet know what "typical" summaries will look like. There'd be a
> larger overhead for tiny summaries like a single char, but if you have any
> sort of struct it'll get padded anyway. Without knowing what's actually
> used, SoA seems like premature optimization considering the concerns in (1).
> 3) The serialized image won't change (without a version change, which
> would also need to be done in java), so this is ultimately about the
> internal implementation. If it turns out that we decide the ability to
> minimize padding for primitive types is that important, we should be able
> to rewrite the internals -- transparently! More work then, perhaps, but
> without API changes so it could be only a minor version bump.
>
>   jon
>
> On Tue, Jun 9, 2020 at 2:07 PM Alexander Saydakov
> <[email protected]> wrote:
>
>> Hi everyone,
>> I am trying to come up with a prototype for Tuple sketch in C++. The core
>> algorithm is the same as in Theta sketch, which is already implemented in
>> the library. As one of the main requirements I consider reducing the code
>> duplication. I would like to share as much as possible with Theta sketch.
>> Ideally they should have a common base and very thin user-facing facade
>> that would account for the differences.
>>
>> I am facing some painful decisions here. Hopefully someone might have
>> some ideas, and if not, at least this might help to explain some design
>> choices.
>>
>> Theta sketch internally is an array of hash values (uint64_t) with a few
>> more attributes like theta, empty flag, and such.
>> We have a custom open-addressing hash map on top of this array in the
>> Update Theta sketch. The Compact Theta sketch is just an array. We apply
>> some standard algorithms to this array such as std::copy, std::copy_if,
>> std::sort, std::nth_elements and std::set_difference.
>>
>> Tuple sketch associates some payload with each hash value. Essentially
>> there are two approaches here: Array of Structures (AoS) and Structure of
>> Arrays (SoA). Using AoS is easy and straightforward. For the Tuple sketch
>> this would be std::vector<std::pair<uint64_t, T>> where T is a user-defined
>> type of the payload. I believe it should be quite easy to make the base
>> layer unaware of the presence of the payload. The above mentioned
>> algorithms will happily work on such a container given the right
>> comparator. However there are significant disadvantages. Suppose we want a
>> tuple_sketch<float>, so the underlying array will be
>> std::vector<std::pair<uint64_t, float>>. This approach will waste 4 bytes
>> for each entry since sizeof(pair<uint64_t, float>)=16 due to padding for
>> alignment. And even worse if we want the payload of just one byte.
>>
>> The alternative approach of SoA would be some container of two separate
>> vectors: std::vector<uint64_t> and std::vector<T>. This way the waste due
>> to padding can be avoided, but it would be much harder to make the common
>> base layer that would be unaware of the presence or absence of the second
>> vector. I think I know how to iterate over such a container, and even apply
>> transformations, which take an output iterator (std::copy), but I don't see
>> any way of applying std::sort or std::nth_element, which need to swap
>> elements in place. As a result, we would need to have significantly more
>> code specific to a single vector and to two vectors. For instance, we will
>> need our own clone of std::nth_element that would take two vectors and make
>> the same swaps in the second one while doing the core algorithm on the
>> first one.
>>
>> I am a bit torn on this. The AoS approach would be so elegant, but the
>> padding waste does not seem to be acceptable to me. The SoA approach is
>> going to be much uglier, and probably will cost come performance too, but
>> it seems that we will have to go this way.
>> I would appreciate any comments and ideas.
>> Thank you.
>>
>> Alexander Saydakov.
>>
>>

Reply via email to