*Background*
The base classes of the Tuple Sketch are Java Generic Implementations that
allow the user to define custom summaries and custom update types. Because
the internal structure of Tuple Sketch is like a database table, the
generic base classes enable the Tuple sketch to be extended to any number
of different column types with specific updating functions for each.
Although this is quite flexible, Java Generics, in general, suffer from
excessive object overhead and computational costs. As a result, early on
in the development of the Tuple sketch we decided to design a dedicated,
non-generic implementation of the Tuple Sketch: The ArrayOfDoubles Sketch
(AOD). There were several reasons for this separate dedicated
implementation. First, we felt that an array of doubles would be a
relatively common use case. Second, we wanted a full concrete
implementation of the AOD sketch as an example including both on-heap and
off-heap variants. It is this ArrayOfDoubles implementation that has been
integrated into Druid, for example.
The downside of dedicated implementations is that there is almost no code
sharing. This means most of the core Theta Sketch algorithms must be
rewritten for each different dedicated implementation.
Nonetheless, as you can imagine, there have been requests for similar
dedicated implementations for integers, longs, floats, etc., because of
their improved performance. But these dedicated implementations are
complex, with lots of classes for on-heap, vs off-heap operation as well as
updatable and immutable forms. Combined with testing and characterization,
it is a great deal of work, and requires good understanding of the
underlying sketch algorithms.
*Proposals for Discussion*
As a result, we have had various discussions about some alternative
approaches to this challenge of how to get more "bang for the buck" with
possibly fewer dedicated implementations and better leverage of the code
base.
- The first idea that we have tossed around for a while is instead of
dedicated implementations for each of the primitive types, why not create
an ArrayOfBytes Sketch. It is relatively easy to repurpose a byte[] into
an int[], long[], float[], doubles[], etc. with tools such Memory or the
ByteBuffer. It might even be possible to have the byte[] be configured
like a *C struct* where different columns could be defined to be
different types.
- Another challenge is how can we make the implementations of the Tuple
sketch more tightly integrated with the Theta Sketch. There have been
quite a few improvements and innovations in the Theta Sketch that the Tuple
sketch has not benefited from because of lack of code sharing. This would
require a deep look at the two code sets to find ways to increase the code
sharing and also bring the base Tuple classes more in line with the Theta
Sketch. We are able to do this in C++ more easily than in Java.
- The next challenge is to update the current AOD sketch. There are
several enhancements needed:
- Enable custom combiners for union operations similar to what is
available for intersection and AnotB.
- Enable set operations with Theta sketch similar to what is now
being enabled with the base Tuple generic classes.
- Make sure the AOD API behaves like the Theta API, especially for
corner-case handling.
We would like to hear from users about these proposals and from potential
developers that could help us take on some of these tasks.
Cheers,
Lee.