I agree with Jon on this. I have also asked Will Lauer for his opinion as he has lots of experience as a user and developer of Tuple sketches.
Lee. On Wed, Jun 10, 2020 at 11:06 AM Jon Malkin <[email protected]> wrote: > I'd actually vote for starting with the AoS model and creating an > alternative implementation if it's determined to be necessary. My reasoning: > > 1) Code duplication is a huge cost. Watching the commits and PRs, it seems > like we're finding a number of things in the java tuple implementation that > have diverged from the theta implementation -- both design changes as well > as actual bugs. This would be ever-present tech debt. > 2) We don't yet know what "typical" summaries will look like. There'd be a > larger overhead for tiny summaries like a single char, but if you have any > sort of struct it'll get padded anyway. Without knowing what's actually > used, SoA seems like premature optimization considering the concerns in (1). > 3) The serialized image won't change (without a version change, which > would also need to be done in java), so this is ultimately about the > internal implementation. If it turns out that we decide the ability to > minimize padding for primitive types is that important, we should be able > to rewrite the internals -- transparently! More work then, perhaps, but > without API changes so it could be only a minor version bump. > > jon > > On Tue, Jun 9, 2020 at 2:07 PM Alexander Saydakov > <[email protected]> wrote: > >> Hi everyone, >> I am trying to come up with a prototype for Tuple sketch in C++. The core >> algorithm is the same as in Theta sketch, which is already implemented in >> the library. As one of the main requirements I consider reducing the code >> duplication. I would like to share as much as possible with Theta sketch. >> Ideally they should have a common base and very thin user-facing facade >> that would account for the differences. >> >> I am facing some painful decisions here. Hopefully someone might have >> some ideas, and if not, at least this might help to explain some design >> choices. >> >> Theta sketch internally is an array of hash values (uint64_t) with a few >> more attributes like theta, empty flag, and such. >> We have a custom open-addressing hash map on top of this array in the >> Update Theta sketch. The Compact Theta sketch is just an array. We apply >> some standard algorithms to this array such as std::copy, std::copy_if, >> std::sort, std::nth_elements and std::set_difference. >> >> Tuple sketch associates some payload with each hash value. Essentially >> there are two approaches here: Array of Structures (AoS) and Structure of >> Arrays (SoA). Using AoS is easy and straightforward. For the Tuple sketch >> this would be std::vector<std::pair<uint64_t, T>> where T is a user-defined >> type of the payload. I believe it should be quite easy to make the base >> layer unaware of the presence of the payload. The above mentioned >> algorithms will happily work on such a container given the right >> comparator. However there are significant disadvantages. Suppose we want a >> tuple_sketch<float>, so the underlying array will be >> std::vector<std::pair<uint64_t, float>>. This approach will waste 4 bytes >> for each entry since sizeof(pair<uint64_t, float>)=16 due to padding for >> alignment. And even worse if we want the payload of just one byte. >> >> The alternative approach of SoA would be some container of two separate >> vectors: std::vector<uint64_t> and std::vector<T>. This way the waste due >> to padding can be avoided, but it would be much harder to make the common >> base layer that would be unaware of the presence or absence of the second >> vector. I think I know how to iterate over such a container, and even apply >> transformations, which take an output iterator (std::copy), but I don't see >> any way of applying std::sort or std::nth_element, which need to swap >> elements in place. As a result, we would need to have significantly more >> code specific to a single vector and to two vectors. For instance, we will >> need our own clone of std::nth_element that would take two vectors and make >> the same swaps in the second one while doing the core algorithm on the >> first one. >> >> I am a bit torn on this. The AoS approach would be so elegant, but the >> padding waste does not seem to be acceptable to me. The SoA approach is >> going to be much uglier, and probably will cost come performance too, but >> it seems that we will have to go this way. >> I would appreciate any comments and ideas. >> Thank you. >> >> Alexander Saydakov. >> >>
