danny0405 commented on code in PR #7255:
URL: https://github.com/apache/hudi/pull/7255#discussion_r1027545946


##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/UpsertPartitioner.java:
##########
@@ -372,7 +372,7 @@ protected static long averageBytesPerRecord(HoodieTimeline commitTimeline, Hoodi
     long avgSize = hoodieWriteConfig.getCopyOnWriteRecordSizeEstimate();
     long fileSizeThreshold = (long) (hoodieWriteConfig.getRecordSizeEstimationThreshold() * hoodieWriteConfig.getParquetSmallFileLimit());
     try {
-      if (!commitTimeline.empty()) {
+      if (hoodieWriteConfig.getRecordSizeEstimationThreshold() > 0 && !commitTimeline.empty()) {
         // Go over the reverse ordered commits to get a more recent estimate of average record size.
         Iterator<HoodieInstant> instants = commitTimeline.getReverseOrderedInstants().iterator();

Review Comment:
   > data size is fixed in some sense; moreover, the Hudi community did not encourage adding another boolean variable to control whether to use the last commit's avgSize, so we use the estimation threshold: when it is less than 0, we use the default estimated record size.

   Did you mean that the record size variance is large?
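
   A minimal, self-contained sketch of the fallback being discussed, for clarity. This is not the actual `UpsertPartitioner` code; the standalone method shape, parameter names, and sample values are illustrative assumptions only:

   ```java
   // Sketch only: simplified stand-in for the threshold-guarded record size estimate.
   public class RecordSizeEstimateSketch {

     static long averageBytesPerRecord(boolean timelineEmpty,
                                       double recordSizeEstimationThreshold,
                                       long defaultRecordSizeEstimate,
                                       long recentCommitAvgSize) {
       long avgSize = defaultRecordSizeEstimate;
       // A non-positive threshold disables the timeline-based estimate, so the
       // configured default record size is returned as-is.
       if (recordSizeEstimationThreshold > 0 && !timelineEmpty) {
         avgSize = recentCommitAvgSize;
       }
       return avgSize;
     }

     public static void main(String[] args) {
       // threshold <= 0: fall back to the configured default estimate
       System.out.println(averageBytesPerRecord(false, -1.0, 1024, 64)); // 1024
       // threshold > 0 with a non-empty timeline: use the recent-commit estimate
       System.out.println(averageBytesPerRecord(false, 1.0, 1024, 64));  // 64
     }
   }
   ```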


