Re: [DISCUSS] Enhancements to ArrayOfDoubles Sketch and Tuple Sketch Directions

David Cromberge Fri, 05 Jun 2020 04:32:13 -0700

As a user of the library, this is an exciting proposal!

I would like to ask a clarifying question about the concept of “columns” or a 
columnar summary.  Does this mean that many different types of summaries (or 
columns) can be associated with a single key?
It might also be a good idea to add compatibility support to the list of 
challenges in the proposal.  If there are existing array of doubles sketches 
already stored in druid clusters, would it be possible to convert to a new 
array of bytes generic representation with a default double[] column?
Lee has mentioned the ability to use protocol buffers or Avro to encode and 
decode the bytes in another thread.  This could allow the sketch implementation 
to use a relatively stable preamble encoding and treat the bytes as opaque, 
delegating to some externally provided codec for version tolerant 
serialisation.  This is an interesting idea because the summary structure is 
more likely to evolve over time, as opposed to the sketch internals, which are 
usually encoded in the preamble.
Finally, if multiple columns are being considered for a Tuple sketch, does the 
same summary mode apply to all columns for updates?  The hypothetical use case 
is for a single sketch that stored three integer columns, max, min, and sum for 
each key.


I’m willing to offer my help in any way possible,
Dave.


> On 5 Jun 2020, at 01:56, leerho <[email protected]> wrote:
> 
> Background
> 
> The base classes of the Tuple Sketch are Java Generic Implementations that 
> allow the user to define custom summaries and custom update types.  Because 
> the internal structure of Tuple Sketch is like a database table, the generic 
> base classes enable the Tuple sketch to be extended to any number of 
> different column types with specific updating functions for each.  
> 
> Although this is quite flexible, Java Generics, in general, suffer from 
> excessive object overhead and computational costs.  As a result, early on in 
> the development of the Tuple sketch we decided to design a dedicated, 
> non-generic implementation of the Tuple Sketch: The ArrayOfDoubles Sketch 
> (AOD).  There were several reasons for this separate dedicated 
> implementation. First, we felt that an array of doubles would be a relatively 
> common use case. Second, we wanted a full concrete implementation of the AOD 
> sketch as an example including both on-heap and off-heap variants. It is this 
> ArrayOfDoubles implementation that has been integrated into Druid, for 
> example.  
> 
> The downside of dedicated implementations is that there is almost no code 
> sharing. This means most of the core Theta Sketch algorithms must be 
> rewritten for each different dedicated implementation. 
> 
> Nonetheless,  as you can imagine, there have been requests for similar 
> dedicated implementations for integers, longs, floats, etc., because of their 
> improved performance.  But these dedicated implementations are complex, with 
> lots of classes for on-heap, vs off-heap operation as well as updatable and 
> immutable forms.  Combined with testing and characterization, it is a great 
> deal of work, and requires good understanding of the underlying sketch 
> algorithms.
> 
> Proposals for Discussion
> As a result, we have had various discussions about some alternative 
> approaches to this challenge of how to get more "bang for the buck" with 
> possibly fewer dedicated implementations and better leverage of the code base.
> The first idea that we have tossed around for a while is instead of dedicated 
> implementations for each of the primitive types, why not create an 
> ArrayOfBytes Sketch.  It is relatively easy to repurpose a byte[] into an 
> int[], long[], float[], doubles[], etc. with tools such Memory or the 
> ByteBuffer.  It might even be possible to have the byte[] be configured like 
> a C struct where different columns could be defined to be different types.  
> Another challenge is how can we make the implementations of the Tuple sketch 
> more tightly integrated with the Theta Sketch.  There have been quite a few 
> improvements and innovations in the Theta Sketch that the Tuple sketch has 
> not benefited from because of lack of code sharing.  This would require a 
> deep look at the two code sets to find ways to increase the code sharing and 
> also bring the base Tuple classes more in line with the Theta Sketch.  We are 
> able to do this in C++ more easily than in Java.  
> The next challenge is to update the current AOD sketch.  There are several 
> enhancements needed:
> Enable custom combiners for union operations similar to what is available for 
> intersection and AnotB.
> Enable set operations with Theta sketch similar to what is now being enabled 
> with the base Tuple generic classes.
> Make sure the AOD API behaves like the Theta API, especially for corner-case 
> handling.
> We would like to hear from users about these proposals and from potential 
> developers that could help us take on some of these tasks.
> 
> Cheers,
> 
> Lee.
> 
> 
> 
>

Re: [DISCUSS] Enhancements to ArrayOfDoubles Sketch and Tuple Sketch Directions

Reply via email to