proost opened a new issue, #56:
URL: https://github.com/apache/datasketches-go/issues/56

   # Summary
   
   Add `CompactSketch`, and add the compact methods to the update sketch.
   
   # Design
   
   This phase handles only `CompactSketch`. The wrapped compact sketch will 
be handled in phase 4: this phase already adds a large amount of code, and 
including the wrapped compact sketch as well would make both the 
implementation and the review harder.
   
   ## Compact Sketch
   
   Below are the methods and signatures of `CompactSketch`. Most of them come 
from the `Sketch` interface.
   
   ```go
   // CompactSketch is an immutable form of the Theta sketch, the form that can
   // be serialized and deserialized
   type CompactSketch struct
   
   // IsEmpty returns true if this sketch represents an empty set
   func (s *CompactSketch) IsEmpty() bool
   
   // IsOrdered returns true if retained entries are ordered
   func (s *CompactSketch) IsOrdered() bool
   
   // Theta64 returns theta as a positive integer
   func (s *CompactSketch) Theta64() uint64
   
   // NumRetained returns the number of retained entries
   func (s *CompactSketch) NumRetained() uint32
   
   // SeedHash returns hash of the seed
   func (s *CompactSketch) SeedHash() (uint16, error)
   
   // Estimate returns the estimate of distinct count
   func (s *CompactSketch) Estimate() float64
   
   // LowerBound returns the approximate lower error bound
   func (s *CompactSketch) LowerBound(numStdDevs uint8) (float64, error)
   
   // UpperBound returns the approximate upper error bound
   func (s *CompactSketch) UpperBound(numStdDevs uint8) (float64, error)
   
   // IsEstimationMode returns true if the sketch is in estimation mode
   func (s *CompactSketch) IsEstimationMode() bool
   
   // Theta returns theta as a fraction from 0 to 1
   func (s *CompactSketch) Theta() float64
   
   // String provides a human-readable summary
   func (s *CompactSketch) String(shouldPrintItems bool) string
   
   // All returns hash values in this sketch
   func (s *CompactSketch) All() iter.Seq[uint64]
   
   // MaxSerializedSizeBytes computes maximum serialized size in bytes
   // lgK is the base-2 logarithm of the nominal number of entries in the sketch
   func (s *CompactSketch) MaxSerializedSizeBytes(lgK uint8) uint8
   
   // SerializedSizeBytes computes the size in bytes required to serialize the
   // current state of the sketch.
   // Computing the compressed size is expensive: it iterates over all retained
   // hashes, and the actual serialization will have to look at them again.
   // If compressed is true, the compressed size is returned (if applicable).
   func (s *CompactSketch) SerializedSizeBytes(compressed bool) int
   
   // MarshalBinary implements encoding.BinaryMarshaler (uncompressed)
   func (s *CompactSketch) MarshalBinary() ([]byte, error)
   ```
   
   `CompactSketch` implements 
[BinaryMarshaler](https://pkg.go.dev/[email protected]#BinaryMarshaler). However, 
it cannot implement `UnmarshalBinary`, because deserialization requires an 
explicit `seed`.
   
   So I follow the Encoder / Decoder pattern used by the 
[encoding/gob](https://pkg.go.dev/encoding/[email protected]), 
[encoding/json](https://pkg.go.dev/encoding/[email protected]#Encoder.Encode), and 
[encoding/xml](https://pkg.go.dev/encoding/[email protected]) packages. One 
difference from those packages is that `Decoder` and `Encoder` are used as 
values rather than pointers. Their methods do not modify the receiver's own 
state, so passing them by value lets us avoid a heap allocation.
   
   ```go
   // Decoder decodes a compact sketch from the given reader.
   type Decoder struct {
        seed uint64
   }
   
   // NewDecoder creates a new decoder.
   func NewDecoder(seed uint64) Decoder {
        return Decoder{
                seed: seed,
        }
   }
   
   // Decode decodes a compact sketch from the given reader.
   func (dec Decoder) Decode(r io.Reader) (*CompactSketch, error)
   
   // Encoder encodes a compact theta sketch to bytes.
   type Encoder struct {
        w          io.Writer
        compressed bool
   }
   
   // NewEncoder creates a new encoder.
   func NewEncoder(w io.Writer, compressed bool) Encoder {
        return Encoder{w: w, compressed: compressed}
   }
   
   // Encode encodes a compact theta sketch to bytes.
   func (enc Encoder) Encode(sketch *CompactSketch) error
   ```
   
   For convenience, I will also add `MarshalBinary` for serialization and a 
package-level `Decode` function that acts as a static factory for 
`CompactSketch`:
   
   ```go
   // Decode decodes a compact sketch from the given bytes.
   func Decode(bytes []byte, seed uint64) (*CompactSketch, error)
   ```
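
One way the convenience `Decode` could relate to `Decoder` is by wrapping the 
byte slice in a reader and delegating. The sketch below is only an 
illustration under stated assumptions: the `CompactSketch` fields and the toy 
byte format (a run of little-endian `uint64` hashes) are hypothetical 
stand-ins, not the DataSketches wire format. The parameter is named `b` here 
because a parameter called `bytes` would shadow the standard `bytes` package 
inside the function.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"errors"
	"fmt"
	"io"
)

// CompactSketch is a hypothetical stand-in; the real fields are not given
// in this issue.
type CompactSketch struct {
	seed    uint64
	entries []uint64
}

// Decoder mirrors the proposed value-type decoder that carries the seed.
type Decoder struct {
	seed uint64
}

// NewDecoder returns a Decoder value (no pointer, as in the proposal).
func NewDecoder(seed uint64) Decoder {
	return Decoder{seed: seed}
}

// Decode reads a toy format (NOT the real serialization layout): a sequence
// of little-endian uint64 hashes until EOF.
func (dec Decoder) Decode(r io.Reader) (*CompactSketch, error) {
	data, err := io.ReadAll(r)
	if err != nil {
		return nil, err
	}
	if len(data)%8 != 0 {
		return nil, errors.New("truncated input")
	}
	s := &CompactSketch{seed: dec.seed}
	for i := 0; i < len(data); i += 8 {
		s.entries = append(s.entries, binary.LittleEndian.Uint64(data[i:i+8]))
	}
	return s, nil
}

// Decode is the convenience factory: it wraps the byte slice in a reader
// and delegates to a Decoder value.
func Decode(b []byte, seed uint64) (*CompactSketch, error) {
	return NewDecoder(seed).Decode(bytes.NewReader(b))
}

func main() {
	buf := make([]byte, 16)
	binary.LittleEndian.PutUint64(buf[0:], 11)
	binary.LittleEndian.PutUint64(buf[8:], 22)
	s, err := Decode(buf, 9001) // seed value chosen for illustration
	fmt.Println(s.entries, err)
}
```

With this shape, callers that already hold a `[]byte` use `Decode`, while 
callers reading from a file or network connection use `Decoder` directly.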
   
   ## Update Sketch
   
   I will add two methods that were missed in phase 2.
   
   ```go
   func (s *QuickSelectUpdateSketch) Compact(ordered bool) *CompactSketch
   
   func (s *QuickSelectUpdateSketch) CompactOrdered() *CompactSketch
   ```
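
A minimal sketch of how these two methods might relate, assuming that 
`CompactOrdered` is simply shorthand for `Compact(true)` (the issue does not 
state this). The struct fields are hypothetical, and the update sketch is 
reduced to a plain slice of retained hashes rather than a real hash table.

```go
package main

import (
	"fmt"
	"slices"
)

// CompactSketch is a hypothetical stand-in holding an immutable copy of the
// retained hashes.
type CompactSketch struct {
	entries []uint64
	ordered bool
}

// QuickSelectUpdateSketch is reduced here to its retained hashes.
type QuickSelectUpdateSketch struct {
	entries []uint64
}

// Compact copies the retained hashes so the result does not alias the
// update sketch, and sorts the copy when ordered is true.
func (s *QuickSelectUpdateSketch) Compact(ordered bool) *CompactSketch {
	entries := slices.Clone(s.entries)
	if ordered {
		slices.Sort(entries)
	}
	return &CompactSketch{entries: entries, ordered: ordered}
}

// CompactOrdered is assumed to be a shorthand for Compact(true).
func (s *QuickSelectUpdateSketch) CompactOrdered() *CompactSketch {
	return s.Compact(true)
}

func main() {
	u := &QuickSelectUpdateSketch{entries: []uint64{30, 10, 20}}
	fmt.Println(u.CompactOrdered().entries)
	fmt.Println(u.Compact(false).entries)
}
```

Copying before sorting keeps the update sketch usable after compaction, which 
matters because the compact form is meant to be immutable.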
   
   # Implementation Schedule
   
   I will upload 3 PRs.
   
   1. A PR for the bit-packing and count-leading-zeros utilities. These 
utilities are used by the compact sketch.
   2. A PR for the compact sketch. This PR will cover everything related to 
the compact sketch, including compatibility tests between C++, Java, and 
Go.
   3. A PR for the update sketch.
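
To illustrate why a count-leading-zeros utility is useful for the compact 
sketch, here is a hedged sketch of one plausible use: choosing the bit width 
for packing the deltas between sorted hashes. It assumes the compressed form 
stores those deltas at a fixed width, which this issue does not spell out, 
and `entryBits` is a hypothetical helper name. It uses the standard library's 
`bits.LeadingZeros64` rather than a hand-written utility.

```go
package main

import (
	"fmt"
	"math/bits"
)

// entryBits returns the number of bits needed to store every delta between
// consecutive hashes. The input must be sorted in ascending order.
func entryBits(sorted []uint64) int {
	var previous, ored uint64
	for _, h := range sorted {
		ored |= h - previous // OR-ing keeps the highest set bit of any delta
		previous = h
	}
	return 64 - bits.LeadingZeros64(ored)
}

func main() {
	// Deltas are 5, 4, 3; their OR is 7, which fits in 3 bits.
	fmt.Println(entryBits([]uint64{5, 9, 12}))
}
```

A bit-packing routine would then write each delta using exactly that many 
bits instead of a full 64-bit word.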


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

