proost opened a new issue, #50:
URL: https://github.com/apache/datasketches-go/issues/50

   I will implement Go version Theta sketch. But code base is too large to port 
at once. So I send PRs multiple times.
   
   I refer C++ implementation mainly, but some case refer java implementation.
   
    
   
   # Summary
   
   This time I will add sketch interface and hash table. 
   
   # Design
   
   ## Sketch interface
   
   `Sketch` interface defines common methods of all theta sketch. All 
implementations of Theta sketch should implement below methods.
   
   ```go
   // Sketch is a generalization of the Kth Minimum Value (KMV) sketch.
   type Sketch interface {
        // IsEmpty returns true if this sketch represents an empty set
        // (the same as no retained entries!)
        IsEmpty() bool
   
        // Estimate returns estimate of the distinct count of the input stream
        Estimate() float64
   
        // LowerBound returns the approximate lower error bound given a number 
of standard deviations.
        // This parameter is similar to the number of standard deviations of 
the normal distribution
        // and corresponds to approximately 67%, 95% and 99% confidence 
intervals.
        // numStdDevs number of Standard Deviations (1, 2 or 3)
        LowerBound(numStdDevs uint8) (float64, error)
   
        // UpperBound returns the approximate upper error bound given a number 
of standard deviations.
        // This parameter is similar to the number of standard deviations of 
the normal distribution
        // and corresponds to approximately 67%, 95% and 99% confidence 
intervals.
        // numStdDevs number of Standard Deviations (1, 2 or 3)
        UpperBound(numStdDevs uint8) (float64, error)
   
        // IsEstimationMode returns true if the sketch is in estimation mode
        // (as opposed to exact mode)
        IsEstimationMode() bool
   
        // Theta returns theta as a fraction from 0 to 1 (effective sampling 
rate)
        Theta() float64
   
        // Theta64 returns theta as a positive integer between 0 and 
math.MaxInt64
        Theta64() uint64
   
        // NumRetained returns the number of retained entries in the sketch
        NumRetained() uint32
   
        // SeedHash returns hash of the seed that was used to hash the input
        SeedHash() (uint16, error)
   
        // IsOrdered returns true if retained entries are ordered
        IsOrdered() bool
   
        // String returns a human-readable summary of this sketch as a string
        // If shouldPrintItems is true, include the list of items retained by 
the sketch
        String(shouldPrintItems bool) string
   
        // Iter returns an iterator over hash values in the sketch.
        Iter() iter.Seq[uint64]
   }
   
   ```
   
   I will stick to go idiom naming convention. But if datasketches project has 
style guide or naming convention, then I will follow it.
   
   The interface follows `theta_sketch_alloc` in the C++ implementation. But Go 
doesn’t have exception like C++ or java, So returns error if exception is 
possible.
   
   ## Hashtable
   
   `Hashtable` is fundamental data structure for update sketch, intersection, 
union, etc.
   
   Basically `Hashtable` follows `theta_update_sketch_base` in C++ 
implementation.
   
   Below are signatures of `Hashtable` .
   
   ```go
   var (
        ErrKeyNotFound                = errors.New("key not found")
        ErrKeyNotFoundAndNoEmptySlots = errors.New("key not found and no empty 
slots")
   )
   
   type Hashtable struct
   
   // Copy creates a deep copy of the sketch
   func (t *Hashtable) Copy() *Hashtable
   
   // HashStringAndScreen computes the hash of string and checks if it passes 
theta threshold
   func (t *Hashtable) HashStringAndScreen(data string) (uint64, error)
   
   // HashInt32AndScreen computes the hash of int32 and checks if it passes 
theta threshold
   func (t *Hashtable) HashInt32AndScreen(data int32) (uint64, error)
   
   // HashInt64AndScreen computes the hash of int64 and checks if it passes 
theta threshold
   func (t *Hashtable) HashInt64AndScreen(data int64) (uint64, error)
   
   // HashBytesAndScreen computes the hash of bytes and checks if it passes 
theta threshold
   func (t *Hashtable) HashBytesAndScreen(data []byte) (uint64, error)
   
   // Find searches for a key in the hash table and returns the index if found,
   // or an error if not found.
   // If key is not found and table has empty slots, returns ErrKeyNotFound.
   // If key is not found and table has no empty slots, returns 
ErrKeyNotFoundAndNoEmptySlots.
   func (t *Hashtable) Find(key uint64) (int, error)
   
   // Insert inserts an entry at the given index
   func (t *Hashtable) Insert(index int, entry uint64)
   
   // Trim reduces the sketch to nominal size if needed
   func (t *Hashtable) Trim()
   
   // Reset clears the sketch
   func (t *Hashtable) Reset()
   ```
   
   Like `Sketch` does, I will stick to go idiom naming convention. But if 
datasketches project has style guide or naming convention, then I will follow 
it.
   
   I’ve considered `Hashtable` as private structure. But I found that 
`Hashtable` is used other sketches. So I make it as public structure.
   
   I’ve considered 4 methods `HashStringAndScreen` , `HashInt32AndScreen` , 
`HashInt64AndScreen` , `HashBytesAndScreen` to one method using generic.  
Generics don't help here since each type needs different handling. Also, no 
need to type check and branch is more clean. But It matters of tastes. So if 
you think, generic is better, I will change.
   
   I’ve considered additional 2 methods `HashInt32SliceAndScreen` and 
`HashInt64SliceAndScreen` comparing to Java implementation. Currently I don’t 
understand fully use cases for array type data. So I will not add. But 
`Hashtable` is just structure. Adding new methods are easy. Also There is long 
way to implement theta sketch. So I skip it.
   
   # Implementation Schedule
   
   I will all code using a 1 PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to