I think we can call this a consensus. Let's go with AoS for now. Thank you very much for your opinions.
On Wed, Jun 10, 2020 at 11:30 AM Will Lauer <[email protected]> wrote: > OK, my quick answer is that I agree with Jon. I'd lean toward AoS to unify > the algorithm and avoid divergence bewteen the tuple implementation and the > theta implementation. I'm continually worried that our custom tuple sketch > implementation in java is going to diverge from the theta algorithm. I know > we found bugs/issues/differences between them, and I suspect that there > will be others. While I understand that there is wasted padding, I'd rather > live with that waste initially and determine if/when we need to optimize > that later than deal with the possible algorithm differences that come from > multiple implementations. > > <http://www.verizonmedia.com> > > Will Lauer > > Senior Principal Architect, Audience & Advertising Reporting > Data Platforms & Systems Engineering > > M 508 561 6427 > 1908 S. First St > Champaign, IL 61822 > > <http://www.facebook.com/verizonmedia> <http://twitter.com/verizonmedia> > <https://www.linkedin.com/company/verizon-media/> > <http://www.instagram.com/verizonmedia> > > > > On Wed, Jun 10, 2020 at 1:25 PM leerho <[email protected]> wrote: > >> I agree with Jon on this. I have also asked Will Lauer for his opinion >> as he has lots of experience as a user and developer of Tuple sketches. >> >> Lee. >> >> On Wed, Jun 10, 2020 at 11:06 AM Jon Malkin <[email protected]> wrote: >> >>> I'd actually vote for starting with the AoS model and creating an >>> alternative implementation if it's determined to be necessary. My reasoning: >>> >>> 1) Code duplication is a huge cost. Watching the commits and PRs, it >>> seems like we're finding a number of things in the java tuple >>> implementation that have diverged from the theta implementation -- both >>> design changes as well as actual bugs. This would be ever-present tech debt. >>> 2) We don't yet know what "typical" summaries will look like. There'd be >>> a larger overhead for tiny summaries like a single char, but if you have >>> any sort of struct it'll get padded anyway. Without knowing what's actually >>> used, SoA seems like premature optimization considering the concerns in (1). >>> 3) The serialized image won't change (without a version change, which >>> would also need to be done in java), so this is ultimately about the >>> internal implementation. If it turns out that we decide the ability to >>> minimize padding for primitive types is that important, we should be able >>> to rewrite the internals -- transparently! More work then, perhaps, but >>> without API changes so it could be only a minor version bump. >>> >>> jon >>> >>> On Tue, Jun 9, 2020 at 2:07 PM Alexander Saydakov >>> <[email protected]> wrote: >>> >>>> Hi everyone, >>>> I am trying to come up with a prototype for Tuple sketch in C++. The >>>> core algorithm is the same as in Theta sketch, which is already implemented >>>> in the library. As one of the main requirements I consider reducing the >>>> code duplication. I would like to share as much as possible with Theta >>>> sketch. Ideally they should have a common base and very thin user-facing >>>> facade that would account for the differences. >>>> >>>> I am facing some painful decisions here. Hopefully someone might have >>>> some ideas, and if not, at least this might help to explain some design >>>> choices. >>>> >>>> Theta sketch internally is an array of hash values (uint64_t) with a >>>> few more attributes like theta, empty flag, and such. >>>> We have a custom open-addressing hash map on top of this array in the >>>> Update Theta sketch. The Compact Theta sketch is just an array. We apply >>>> some standard algorithms to this array such as std::copy, std::copy_if, >>>> std::sort, std::nth_elements and std::set_difference. >>>> >>>> Tuple sketch associates some payload with each hash value. Essentially >>>> there are two approaches here: Array of Structures (AoS) and Structure of >>>> Arrays (SoA). Using AoS is easy and straightforward. For the Tuple sketch >>>> this would be std::vector<std::pair<uint64_t, T>> where T is a user-defined >>>> type of the payload. I believe it should be quite easy to make the base >>>> layer unaware of the presence of the payload. The above mentioned >>>> algorithms will happily work on such a container given the right >>>> comparator. However there are significant disadvantages. Suppose we want a >>>> tuple_sketch<float>, so the underlying array will be >>>> std::vector<std::pair<uint64_t, float>>. This approach will waste 4 bytes >>>> for each entry since sizeof(pair<uint64_t, float>)=16 due to padding for >>>> alignment. And even worse if we want the payload of just one byte. >>>> >>>> The alternative approach of SoA would be some container of two separate >>>> vectors: std::vector<uint64_t> and std::vector<T>. This way the waste due >>>> to padding can be avoided, but it would be much harder to make the common >>>> base layer that would be unaware of the presence or absence of the second >>>> vector. I think I know how to iterate over such a container, and even apply >>>> transformations, which take an output iterator (std::copy), but I don't see >>>> any way of applying std::sort or std::nth_element, which need to swap >>>> elements in place. As a result, we would need to have significantly more >>>> code specific to a single vector and to two vectors. For instance, we will >>>> need our own clone of std::nth_element that would take two vectors and make >>>> the same swaps in the second one while doing the core algorithm on the >>>> first one. >>>> >>>> I am a bit torn on this. The AoS approach would be so elegant, but the >>>> padding waste does not seem to be acceptable to me. The SoA approach is >>>> going to be much uglier, and probably will cost come performance too, but >>>> it seems that we will have to go this way. >>>> I would appreciate any comments and ideas. >>>> Thank you. >>>> >>>> Alexander Saydakov. >>>> >>>>
