Hi everyone,
I am trying to come up with a prototype for Tuple sketch in C++. The core
algorithm is the same as in Theta sketch, which is already implemented in
the library. As one of the main requirements I consider reducing the code
duplication. I would like to share as much as possible with Theta sketch.
Ideally they should have a common base and very thin user-facing facade
that would account for the differences.

I am facing some painful decisions here. Hopefully someone might have some
ideas, and if not, at least this might help to explain some design choices.

Theta sketch internally is an array of hash values (uint64_t) with a few
more attributes like theta, empty flag, and such.
We have a custom open-addressing hash map on top of this array in the
Update Theta sketch. The Compact Theta sketch is just an array. We apply
some standard algorithms to this array such as std::copy, std::copy_if,
std::sort, std::nth_elements and std::set_difference.

Tuple sketch associates some payload with each hash value. Essentially
there are two approaches here: Array of Structures (AoS) and Structure of
Arrays (SoA). Using AoS is easy and straightforward. For the Tuple sketch
this would be std::vector<std::pair<uint64_t, T>> where T is a user-defined
type of the payload. I believe it should be quite easy to make the base
layer unaware of the presence of the payload. The above mentioned
algorithms will happily work on such a container given the right
comparator. However there are significant disadvantages. Suppose we want a
tuple_sketch<float>, so the underlying array will be
std::vector<std::pair<uint64_t, float>>. This approach will waste 4 bytes
for each entry since sizeof(pair<uint64_t, float>)=16 due to padding for
alignment. And even worse if we want the payload of just one byte.

The alternative approach of SoA would be some container of two separate
vectors: std::vector<uint64_t> and std::vector<T>. This way the waste due
to padding can be avoided, but it would be much harder to make the common
base layer that would be unaware of the presence or absence of the second
vector. I think I know how to iterate over such a container, and even apply
transformations, which take an output iterator (std::copy), but I don't see
any way of applying std::sort or std::nth_element, which need to swap
elements in place. As a result, we would need to have significantly more
code specific to a single vector and to two vectors. For instance, we will
need our own clone of std::nth_element that would take two vectors and make
the same swaps in the second one while doing the core algorithm on the
first one.

I am a bit torn on this. The AoS approach would be so elegant, but the
padding waste does not seem to be acceptable to me. The SoA approach is
going to be much uglier, and probably will cost come performance too, but
it seems that we will have to go this way.
I would appreciate any comments and ideas.
Thank you.

Alexander Saydakov.

Reply via email to