Hi everyone, I am trying to come up with a prototype for Tuple sketch in C++. The core algorithm is the same as in Theta sketch, which is already implemented in the library. As one of the main requirements I consider reducing the code duplication. I would like to share as much as possible with Theta sketch. Ideally they should have a common base and very thin user-facing facade that would account for the differences.
I am facing some painful decisions here. Hopefully someone might have some ideas, and if not, at least this might help to explain some design choices. Theta sketch internally is an array of hash values (uint64_t) with a few more attributes like theta, empty flag, and such. We have a custom open-addressing hash map on top of this array in the Update Theta sketch. The Compact Theta sketch is just an array. We apply some standard algorithms to this array such as std::copy, std::copy_if, std::sort, std::nth_elements and std::set_difference. Tuple sketch associates some payload with each hash value. Essentially there are two approaches here: Array of Structures (AoS) and Structure of Arrays (SoA). Using AoS is easy and straightforward. For the Tuple sketch this would be std::vector<std::pair<uint64_t, T>> where T is a user-defined type of the payload. I believe it should be quite easy to make the base layer unaware of the presence of the payload. The above mentioned algorithms will happily work on such a container given the right comparator. However there are significant disadvantages. Suppose we want a tuple_sketch<float>, so the underlying array will be std::vector<std::pair<uint64_t, float>>. This approach will waste 4 bytes for each entry since sizeof(pair<uint64_t, float>)=16 due to padding for alignment. And even worse if we want the payload of just one byte. The alternative approach of SoA would be some container of two separate vectors: std::vector<uint64_t> and std::vector<T>. This way the waste due to padding can be avoided, but it would be much harder to make the common base layer that would be unaware of the presence or absence of the second vector. I think I know how to iterate over such a container, and even apply transformations, which take an output iterator (std::copy), but I don't see any way of applying std::sort or std::nth_element, which need to swap elements in place. As a result, we would need to have significantly more code specific to a single vector and to two vectors. For instance, we will need our own clone of std::nth_element that would take two vectors and make the same swaps in the second one while doing the core algorithm on the first one. I am a bit torn on this. The AoS approach would be so elegant, but the padding waste does not seem to be acceptable to me. The SoA approach is going to be much uglier, and probably will cost come performance too, but it seems that we will have to go this way. I would appreciate any comments and ideas. Thank you. Alexander Saydakov.
