proost opened a new issue, #67:
URL: https://github.com/apache/datasketches-go/issues/67
# Summary
Add Go version Tuple Sketch.
# Design
## Tuple Sketch
```go
// Summary is the base interface for all summary types used in tuple
sketches.
// A summary holds aggregate data associated with each retained hash key.
type Summary interface {
// Reset clears the content of the summary, restoring it to its initial
state.
Reset()
// Clone creates and returns a deep copy of the current Summary
instance.
Clone() Summary
}
// Sketch is the base interface for tuple sketches.
// It extends Theta sketch to associate arbitrary summaries with each
retained key.
type Sketch[S Summary] interface {
// IsEmpty reports whether this sketch represents an empty set.
// Note: this is not the same as having no retained hashes.
IsEmpty() bool
// Estimate returns the estimated distinct count of the input stream.
Estimate() float64
// LowerBoundFromSubset returns the approximate lower error bound for
// the given number of standard deviations over a subset of retained
hashes.
// numStdDevs specifies the confidence level (1, 2, or 3) corresponding
to
// approximately 67%, 95%, or 99% confidence intervals.
// numSubsetEntries specifies number of items from {0, 1, ...,
get_num_retained()}
// over which to estimate the bound.
LowerBoundFromSubset(numStdDevs uint8, numSubsetEntries uint32)
(float64, error)
// LowerBound returns the approximate lower error bound for the given
// number of standard deviations. numStdDevs should be 1, 2, or 3 for
// approximately 67%, 95%, or 99% confidence intervals.
LowerBound(numStdDevs uint8) (float64, error)
// UpperBoundFromSubset returns the approximate upper error bound for
// the given number of standard deviations over a subset of retained
hashes.
// numStdDevs specifies the confidence level (1, 2, or 3) corresponding
to
// approximately 67%, 95%, or 99% confidence intervals.
// numSubsetEntries specifies number of items from {0, 1, ...,
get_num_retained()}
// over which to estimate the bound.
UpperBoundFromSubset(numStdDevs uint8, numSubsetEntries uint32)
(float64, error)
// UpperBound returns the approximate upper error bound for the given
// number of standard deviations. numStdDevs should be 1, 2, or 3 for
// approximately 67%, 95%, or 99% confidence intervals.
UpperBound(numStdDevs uint8) (float64, error)
// IsEstimationMode reports whether the sketch is in estimation mode,
// as opposed to exact mode.
IsEstimationMode() bool
// Theta returns theta as a fraction from 0 to 1, representing the
// effective sampling rate.
Theta() float64
// Theta64 returns theta as a positive integer between 0 and
math.MaxUint64.
Theta64() uint64
// NumRetained returns the number of hashes retained in the sketch.
NumRetained() uint32
// SeedHash returns the hash of the seed used to hash the input.
SeedHash() (uint16, error)
// IsOrdered reports whether retained hashes are sorted by hash value.
IsOrdered() bool
// String returns a human-readable summary of this sketch.
// If printItems is true, the output includes all retained hashes.
String(shouldPrintItems bool) string
// All returns an iterator over all hash-summary pairs in the sketch.
All() iter.Seq2[uint64, S]
}
```
`Summary` and `Sketch` is base interface all the other interfaces and
implementations.
- `Sketch` is almost same with theta sketch except that `All` method returns
hash value and Summary both.
- `Summary` needs two methods.
- `Reset` : To reduce reallocation of `Summary` implementation. When we
want to reuse `UpdateSketch` , we should reset to initial state. So Summary
needs `Reset` .
- `Clone` : When we convert `UpdateSketch` to `CompactSketch` , Summary
itself should be deep copied.
## Update Tuple Sketch And UpdatableSummary
```go
// UpdatableSummary represents a summary that can be updated with values of
type V.
type UpdatableSummary[V any] interface {
Summary
// Update incorporates a new value into the summary, modifying its
internal state based on the given input value.
Update(value V)
}
// UpdateSketch builds Tuple sketch from input data via update methods.
type UpdateSketch[S UpdatableSummary[V], V any] struct {
newSummary func() S
table *hashtable[S]
}
// IsEstimationMode reports whether the sketch is in estimation mode,
// as opposed to exact mode.
func (s *UpdateSketch[S, V]) IsEstimationMode() bool
// Theta returns theta as a fraction from 0 to 1, representing the
// effective sampling rate.
func (s *UpdateSketch[S, V]) Theta() float64
// Estimate returns the estimated distinct count of the input stream.
func (s *UpdateSketch[S, V]) Estimate() float64
// LowerBoundFromSubset returns the approximate lower error bound for
// the given number of standard deviations over a subset of retained hashes.
// numStdDevs specifies the confidence level (1, 2, or 3) corresponding to
// approximately 67%, 95%, or 99% confidence intervals.
// numSubsetEntries specifies number of items from {0, 1, ...,
get_num_retained()}
// over which to estimate the bound.
func (s *UpdateSketch[S, V]) LowerBoundFromSubset(numStdDevs uint8,
numSubsetEntries uint32) (float64, error
// LowerBound returns the approximate lower error bound for the given
// number of standard deviations. numStdDevs should be 1, 2, or 3 for
// approximately 67%, 95%, or 99% confidence intervals.
func (s *UpdateSketch[S, V]) LowerBound(numStdDevs uint8) (float64, error)
// UpperBoundFromSubset returns the approximate upper error bound for
// the given number of standard deviations over a subset of retained hashes.
// numStdDevs specifies the confidence level (1, 2, or 3) corresponding to
// approximately 67%, 95%, or 99% confidence intervals.
// numSubsetEntries specifies number of items from {0, 1, ...,
get_num_retained()}
// over which to estimate the bound.
func (s *UpdateSketch[S, V]) UpperBoundFromSubset(numStdDevs uint8,
numSubsetEntries uint32) (float64, error)
// UpperBound returns the approximate upper error bound for the given
// number of standard deviations. numStdDevs should be 1, 2, or 3 for
// approximately 67%, 95%, or 99% confidence intervals.
func (s *UpdateSketch[S, V]) UpperBound(numStdDevs uint8) (float64, error)
// String returns a human-readable summary of this sketch.
// If printItems is true, the output includes all retained hashes.
func (s *UpdateSketch[S, V]) String(shouldPrintItems bool) string
// IsEmpty reports whether this sketch represents an empty set.
func (s *UpdateSketch[S, V]) IsEmpty() bool
// IsOrdered reports whether retained hashes are sorted by hash value.
func (s *UpdateSketch[S, V]) IsOrdered() bool
// Theta64 returns theta as a positive integer between 0 and math.MaxUint64.
func (s *UpdateSketch[S, V]) Theta64() uint64
// NumRetained returns the number of hashes retained in the sketch.
func (s *UpdateSketch[S, V]) NumRetained() uint32
// SeedHash returns the hash of the seed used to hash the input.
func (s *UpdateSketch[S, V]) SeedHash() (uint16, error)
// All returns an iterator over all hash-summary pairs in the sketch.
func (s *UpdateSketch[S, V]) All() iter.Seq2[uint64, S]
// LgK returns a configured nominal number of entries in the sketch
func (s *UpdateSketch[S, V]) LgK() uint8
// ResizeFactor returns a configured resize factor of the sketch
func (s *UpdateSketch[S, V]) ResizeFactor() theta.ResizeFacto
// UpdateUint64 updates this sketch with a given unsigned 64-bit integer
func (s *UpdateSketch[S, V]) UpdateUint64(key uint64, value V) error
// UpdateInt64 updates this sketch with a given signed 64-bit integer
func (s *UpdateSketch[S, V]) UpdateInt64(key int64, value V) error
// UpdateUint32 updates this sketch with a given unsigned 32-bit integer
func (s *UpdateSketch[S, V]) UpdateUint32(key uint32, value V) error
// UpdateInt32 updates this sketch with a given signed 32-bit integer
func (s *UpdateSketch[S, V]) UpdateInt32(key int32, value V) error
// UpdateUint16 updates this sketch with a given unsigned 16-bit integer
func (s *UpdateSketch[S, V]) UpdateUint16(key uint16, value V) error
// UpdateInt16 updates this sketch with a given signed 16-bit integer
func (s *UpdateSketch[S, V]) UpdateInt16(key int16, value V) error
// UpdateUint8 updates this sketch with a given unsigned 8-bit integer
func (s *UpdateSketch[S, V]) UpdateUint8(key uint8, value V) error
// UpdateInt8 updates this sketch with a given signed 8-bit integer
func (s *UpdateSketch[S, V]) UpdateInt8(key int8, value V) error
// UpdateFloat64 updates this sketch with a given double-precision floating
point value
func (s *UpdateSketch[S, V]) UpdateFloat64(key float64, value V) error
// UpdateFloat32 updates this sketch with a given floating point value
func (s *UpdateSketch[S, V]) UpdateFloat32(key float32, value V) error
// UpdateString updates this sketch with a given string
func (s *UpdateSketch[S, V]) UpdateString(key string, value V) error
// UpdateBytes updates this sketch with given data
func (s *UpdateSketch[S, V]) UpdateBytes(data []byte, value V) error
// Trim removes retained entries in excess of the nominal size k (if any)
func (s *UpdateSketch[S, V]) Trim()
// Reset resets the sketch to the initial empty state
func (s *UpdateSketch[S, V]) Reset()
// Compact compacts this sketch to a compact sketch.
func (s *UpdateSketch[S, V]) Compact(ordered bool) (*CompactSketch[S], error)
// Filter produces a CompactSketch from this sketch by applying a given
predicate to each entry.
// The predicate should return true for entries to keep.
func (s *UpdateSketch[S, V]) Filter(predicate func(S) bool)
(*CompactSketch[S], error)
```
- `UpdatableSummary` supports update method for summary in-place way. This
interface looks like java interface did. Because Go doesn’t have operator
overloading like C++ does.
- `UpdateSketch` has two fields
- `hashtable`: It is same as theta sketch hashtable. I can reuse
hashtable in theta sketch. But i found that pairing hash and summary is more
efficient in terms of performance. Because We can reduce additional allocation
of slice, reducing additional operation in rebuild etc.
- `newSummary` : static factory method for `Summary` implementation.
When we insert to new hash - summary pair, we should create a new `Summary`
implementation. Like
[`[sync.Pool](https://pkg.go.dev/sync#Pool)`](https://pkg.go.dev/sync#Pool) ,
we should receive factory method from user.
## CompactSketch, Encoder & Decoder
### CompactSketch
```go
type entry[S Summary] struct {
Hash uint64
Summary S
}
// CompactSketch is the immutable, serializable form of a tuple sketch.
type CompactSketch[S Summary] struct {
theta uint64
entries []entry[S]
seedHash uint16
isEmpty bool
isOrdered bool
}
```
- `CompactSketch` has all methods of `Sketch` interface.
- `CompactSketch` has `Filter` method also.
### Encoder
```go
// SummaryWriter writes a summary to the writer.
// Implementations should write the summary in a format that can be read by
a corresponding SummaryReader.
type SummaryWriter[S Summary] func(w io.Writer, s S) error
// Encoder encodes a compact tuple sketch to bytes.
type Encoder[S Summary] struct {
w io.Writer
write SummaryWriter[S]
}
// Encode encodes a compact tuple sketch to bytes.
func (enc *Encoder[S]) Encode(sketch *CompactSketch[S]) error
```
- I will follow encoder / decoder pattern like go `encoding` package did.
- Tuple Sketch needs way to serialize `Summary` . So user should give own
`SummaryWriter` to marshal own `Summary` implementation.
### Decoder
```go
// SummaryReader reads and returns a summary from the reader.
// Implementations should read the format written by a corresponding
SummaryWriter.
type SummaryReader[S Summary] func(r io.Reader) (S, error)
// Decoder decodes a compact sketch from the given reader.
type Decoder[S Summary] struct {
seed uint64
read SummaryReader[S]
}
// Decode decodes a compact sketch from the given reader.
func (dec *Decoder[S]) Decode(r io.Reader) (*CompactSketch[S], error)
```
- I will follow encoder / decoder pattern like go `encoding` package did.
- Tuple Sketch needs way to deserialize `Summary` . So user should give own
`SummaryReader` to unmarshal own `Summary` implementation from bytes.
# Release Schedule
**ArrayOfDouble Sketch is non-goal for this design. I will share design for
ArrayOfDouble sketch in another issue.**
I will send PRs in the following orders:
1. `Sketch` , `Summary`, `hashtable`
2. `UpdateSketch`
3. `CompactSketch` and serialization/deserialization.
4. Set operations.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]