Re: [PR] Added new method (datasketches-java)

via GitHub Wed, 15 May 2024 14:14:45 -0700


leerho commented on code in PR #559:
URL: https://github.com/apache/datasketches-java/pull/559#discussion_r1602260373



##########
src/main/java/org/apache/datasketches/theta/Sketch.java:
##########
@@ -297,13 +297,27 @@ public double getLowerBound(final int numStdDev) {
    * @param numberOfEntries the actual number of entries stored with the 
CompactSketch.
    * @return the maximum number of storage bytes required for a CompactSketch 
with the given number
    * of entries.
+   * @deprecated as a public method. Use {@link #getCompactSketchMaxBytes(int) 
instead}
    */
+  @Deprecated
   public static int getMaxCompactSketchBytes(final int numberOfEntries) {
     if (numberOfEntries == 0) { return 8; }
     if (numberOfEntries == 1) { return 16; }
     return (numberOfEntries << 3) + 24;
   }
 
+  /**
+   * Returns the maximum number of storage bytes required for a CompactSketch 
given the configured
+   * number of nominal entries (power of 2).
+   * @param nomEntries <a 
href="{@docRoot}/resources/dictionary.html#nomEntries">Nominal Entries</a>
+   * @return the maximum number of storage bytes required for a CompactSketch 
with the given
+   * nomEntries.
+   */
+  public static int getCompactSketchMaxBytes(final int nomEntries) {

Review Comment:
   I started to implement this idea of trying to have one place that computes 
the 2 * K * rebuild_Threshold.  
   But it turned out to be much more complicated so I decided to leave it alone 
for now.
   - The getHashTableThreshold(lgNomLongs, lgArrLongs) in both Direct and Heap 
sketches actually does the computation from the lgArrLongs parameter, not the 
lgNomLongs parameter.  This makes sense since because this is computing the 
threshold for the hashTable.   The lgNomLongs is ONLY used to determine if the 
sketch is in a growing/resizing mode or at its full size, when a full rebuild 
is required.
   - This means that these 2 methods cannot use a common method that computes 
maximum capacity based on lgNomLongs.  
   
   If, and when, we decide to make an improvement in this area, I want to do 
one more improvement:
   
   We have 3 static final double constants: REBUILD_THRESHOLD, 
RESIZE_THRESHOLD, AND DQS_RESIZE_THRESHOLD that are used for tuning the 
performance of the hash tables in different situations. 
   
   Unfortunately, at the time, I decided to use doubles for these constants.  
This forces a computation like this:
   `return (int) Math.floor(<double_constant> * (1 << lgArrLongs));` every time 
they are used.
   
   However, the hash array sizes are always powers of 2, and never smaller than 
16, and the constants are either 15/16 or 0.5, or fractions that can exactly 
produce an integer for all possible sizes of the hash tables. 
   
   This means, instead of using double constants, we could use static final 
methods that eliminate the double arithmetic and Math.floor.  So the above 
equation becomes, for one example:
   `return ((1 << lgArrLongs) * 15) / 16; `
   or even better: 
   `return ((16 << lgArrLongs) - (1 << lgArrLongs)) >>> 4`
   which is much faster.
   
   I looked into this, but eliminating these double constants propagates 
everywhere and is a more complicated than I want to attempt right now.  We can 
put this into our backlog.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Added new method (datasketches-java)

Reply via email to