stevenzwu commented on a change in pull request #3001:
URL: https://github.com/apache/iceberg/pull/3001#discussion_r696090966



##########
File path: flink/src/main/java/org/apache/iceberg/flink/sink/FlinkSink.java
##########
@@ -249,6 +251,21 @@ public Builder uidPrefix(String newPrefix) {
       return this;
     }
 
+    /**
+     * Set the {@link SlidingWindowReservoir} size (number of measurements stored)
+     * for the two histogram metrics of data files and delete file size distribution.
+     *
+     * @param newReservoirSize the new histogram reservoir size for the file size distribution.
+     * default reservoir size is 128, which only add a small memory overhead of 1 KB (128 x 8B) per histogram.
+     * For use cases with a lot of files, a larger reservoir size can produce more accurate histogram distribution.
+     */
+    public Builder fileSizeHistogramReservoirSize(int newReservoirSize) {

Review comment:
   > Can we automatically detect an appropriate reservoir size based on the number of files in a given cycle?
   
   The metrics are registered during initialization, so this may be difficult.
   
   > Or can we just set it high enough that we don't care? 1KB of overhead is tiny so I doubt we care much.
   
   That is a fair point. We can probably hardcode the reservoir size to 1,024 for now, which translates to 8 KB (1,024 x 8 B) per histogram, still quite small. I think 1,024 should give us pretty good accuracy on the histogram distribution. We can revisit the decision if users do require customization.
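
   To make the memory arithmetic concrete, here is a minimal sketch of a sliding-window reservoir backed by a `long[]`. It is a hypothetical stand-in written for illustration, not the Dropwizard `SlidingWindowReservoir` and not the code in this PR; the class name `FileSizeReservoirSketch` and its methods are assumptions. The byte count covers only the backing array and ignores object headers.

```java
import java.util.Arrays;

// Hypothetical illustration only: a fixed-size sliding window over the most
// recent file sizes. Not the Dropwizard class and not the code in this PR.
final class FileSizeReservoirSketch {
  private final long[] measurements;
  private long count;

  FileSizeReservoirSketch(int size) {
    if (size <= 0) {
      throw new IllegalArgumentException("size must be positive");
    }
    this.measurements = new long[size];
  }

  // Record one file size; once the window is full, the oldest value is overwritten.
  synchronized void update(long fileSizeInBytes) {
    measurements[(int) (count++ % measurements.length)] = fileSizeInBytes;
  }

  // Number of values currently retained (at most the window size).
  synchronized int size() {
    return (int) Math.min(count, measurements.length);
  }

  // Bytes used by the backing array alone: 8 bytes per stored long.
  long payloadBytes() {
    return (long) measurements.length * Long.BYTES;
  }

  // Nearest-rank quantile over the retained values; returns 0 when empty.
  synchronized long quantile(double q) {
    if (q < 0.0 || q > 1.0) {
      throw new IllegalArgumentException("q must be in [0, 1]");
    }
    int n = size();
    if (n == 0) {
      return 0L;
    }
    long[] sorted = Arrays.copyOf(measurements, n);
    Arrays.sort(sorted);
    int rank = (int) Math.ceil(q * n);
    return sorted[Math.max(rank, 1) - 1];
  }
}
```

   With this layout, a window of 128 values needs 1,024 bytes for the array and a window of 1,024 values needs 8,192 bytes, matching the 1 KB and 8 KB figures discussed above.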




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


