> > I would like to ask a clarifying question about the concept of “columns” > or a columnar summary. Does this mean that many different types of > summaries (or columns) can be associated with a single key?
No. My use of the word "columns" refers to virtual columns. The data table consists of rows of elements in a specific order. Because the elements of the row can be of different types, the row is a tuple. If all the elements were the same type we could call it a vector. The ArrayOfDoubles sketch is perhaps better named as a VectorOfDoubles, but that is a matter of preference. Nonetheless, element #1 in all the rows constitute the first virtual column, etc. In theory, each "column" could have its own custom update method or mode. It all depends on how elaborate the design of the summary (row) and the update method for a summary. I'll defer the discussion of serialization tools until later, as I have another proposal to put on the table. Lee. On Fri, Jun 5, 2020 at 4:32 AM David Cromberge < [email protected]> wrote: > As a user of the library, this is an exciting proposal! > > I would like to ask a clarifying question about the concept of “columns” > or a columnar summary. Does this mean that many different types of > summaries (or columns) can be associated with a single key? > It might also be a good idea to add compatibility support to the list of > challenges in the proposal. If there are existing array of doubles > sketches already stored in druid clusters, would it be possible to convert > to a new array of bytes generic representation with a default double[] > column? > Lee has mentioned the ability to use protocol buffers or Avro to encode > and decode the bytes in another thread. This could allow the sketch > implementation to use a relatively stable preamble encoding and treat the > bytes as opaque, delegating to some externally provided codec for version > tolerant serialisation. This is an interesting idea because the summary > structure is more likely to evolve over time, as opposed to the sketch > internals, which are usually encoded in the preamble. > Finally, if multiple columns are being considered for a Tuple sketch, does > the same summary mode apply to all columns for updates? The hypothetical > use case is for a single sketch that stored three integer columns, max, > min, and sum for each key. > > I’m willing to offer my help in any way possible, > Dave. > > > On 5 Jun 2020, at 01:56, leerho <[email protected]> wrote: > > *Background* > > The base classes of the Tuple Sketch are Java Generic Implementations that > allow the user to define custom summaries and custom update types. Because > the internal structure of Tuple Sketch is like a database table, the > generic base classes enable the Tuple sketch to be extended to any number > of different column types with specific updating functions for each. > > Although this is quite flexible, Java Generics, in general, suffer from > excessive object overhead and computational costs. As a result, early on > in the development of the Tuple sketch we decided to design a dedicated, > non-generic implementation of the Tuple Sketch: The ArrayOfDoubles Sketch > (AOD). There were several reasons for this separate dedicated > implementation. First, we felt that an array of doubles would be a > relatively common use case. Second, we wanted a full concrete > implementation of the AOD sketch as an example including both on-heap and > off-heap variants. It is this ArrayOfDoubles implementation that has been > integrated into Druid, for example. > > The downside of dedicated implementations is that there is almost no code > sharing. This means most of the core Theta Sketch algorithms must be > rewritten for each different dedicated implementation. > > Nonetheless, as you can imagine, there have been requests for similar > dedicated implementations for integers, longs, floats, etc., because of > their improved performance. But these dedicated implementations are > complex, with lots of classes for on-heap, vs off-heap operation as well as > updatable and immutable forms. Combined with testing and characterization, > it is a great deal of work, and requires good understanding of the > underlying sketch algorithms. > > *Proposals for Discussion* > As a result, we have had various discussions about some alternative > approaches to this challenge of how to get more "bang for the buck" with > possibly fewer dedicated implementations and better leverage of the code > base. > > - The first idea that we have tossed around for a while is instead of > dedicated implementations for each of the primitive types, why not create > an ArrayOfBytes Sketch. It is relatively easy to repurpose a byte[] into > an int[], long[], float[], doubles[], etc. with tools such Memory or the > ByteBuffer. It might even be possible to have the byte[] be configured > like a *C struct* where different columns could be defined to be > different types. > - Another challenge is how can we make the implementations of the > Tuple sketch more tightly integrated with the Theta Sketch. There have > been quite a few improvements and innovations in the Theta Sketch that the > Tuple sketch has not benefited from because of lack of code sharing. This > would require a deep look at the two code sets to find ways to increase the > code sharing and also bring the base Tuple classes more in line with the > Theta Sketch. We are able to do this in C++ more easily than in Java. > - The next challenge is to update the current AOD sketch. There are > several enhancements needed: > - Enable custom combiners for union operations similar to what is > available for intersection and AnotB. > - Enable set operations with Theta sketch similar to what is now > being enabled with the base Tuple generic classes. > - Make sure the AOD API behaves like the Theta API, especially for > corner-case handling. > > We would like to hear from users about these proposals and from potential > developers that could help us take on some of these tasks. > > Cheers, > > Lee. > > > > > >
