Re: [PR] [HUDI-7429] Fixing average record size estimation for delta commits [hudi]

2024-05-14 Thread via GitHub


yihua merged PR #10763:
URL: https://github.com/apache/hudi/pull/10763


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7429] Fixing average record size estimation for delta commits [hudi]

2024-05-14 Thread via GitHub


hudi-bot commented on PR #10763:
URL: https://github.com/apache/hudi/pull/10763#issuecomment-2110945188

   
   ## CI report:
   
   * d908bfc944e48ffcf4691eec0a41a973e68475e9 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23925)
 
   * a74546dac32390ed02715ba1b3d02714e4689d9d Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23927)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7429] Fixing average record size estimation for delta commits [hudi]

2024-05-14 Thread via GitHub


hudi-bot commented on PR #10763:
URL: https://github.com/apache/hudi/pull/10763#issuecomment-2110923627

   
   ## CI report:
   
   * d908bfc944e48ffcf4691eec0a41a973e68475e9 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23925)
 
   * a74546dac32390ed02715ba1b3d02714e4689d9d UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7429] Fixing average record size estimation for delta commits [hudi]

2024-05-14 Thread via GitHub


hudi-bot commented on PR #10763:
URL: https://github.com/apache/hudi/pull/10763#issuecomment-2110803700

   
   ## CI report:
   
   * 233ebc68121cdeacb8e02f703e8a6724f3e8409f Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23878)
 
   * d908bfc944e48ffcf4691eec0a41a973e68475e9 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23925)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7429] Fixing average record size estimation for delta commits [hudi]

2024-05-14 Thread via GitHub


hudi-bot commented on PR #10763:
URL: https://github.com/apache/hudi/pull/10763#issuecomment-2110791396

   
   ## CI report:
   
   * 233ebc68121cdeacb8e02f703e8a6724f3e8409f Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23878)
 
   * d908bfc944e48ffcf4691eec0a41a973e68475e9 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7429] Fixing average record size estimation for delta commits [hudi]

2024-05-13 Thread via GitHub


hudi-bot commented on PR #10763:
URL: https://github.com/apache/hudi/pull/10763#issuecomment-2108773833

   
   ## CI report:
   
   * 233ebc68121cdeacb8e02f703e8a6724f3e8409f Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23878)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7429] Fixing average record size estimation for delta commits [hudi]

2024-05-13 Thread via GitHub


hudi-bot commented on PR #10763:
URL: https://github.com/apache/hudi/pull/10763#issuecomment-2108687406

   
   ## CI report:
   
   * 6a4b8370ab41ce9060dcd8c7c4ee80786cc086b0 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23840)
 
   * 233ebc68121cdeacb8e02f703e8a6724f3e8409f Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23878)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7429] Fixing average record size estimation for delta commits [hudi]

2024-05-13 Thread via GitHub


hudi-bot commented on PR #10763:
URL: https://github.com/apache/hudi/pull/10763#issuecomment-2108675075

   
   ## CI report:
   
   * 6a4b8370ab41ce9060dcd8c7c4ee80786cc086b0 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23840)
 
   * 233ebc68121cdeacb8e02f703e8a6724f3e8409f UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7429] Fixing average record size estimation for delta commits [hudi]

2024-05-10 Thread via GitHub


hudi-bot commented on PR #10763:
URL: https://github.com/apache/hudi/pull/10763#issuecomment-2105440971

   
   ## CI report:
   
   * 6a4b8370ab41ce9060dcd8c7c4ee80786cc086b0 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23840)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7429] Fixing average record size estimation for delta commits [hudi]

2024-05-10 Thread via GitHub


hudi-bot commented on PR #10763:
URL: https://github.com/apache/hudi/pull/10763#issuecomment-2105414125

   
   ## CI report:
   
   * 34ffbbc913fab393871b866160ea2a7e1b38c53f Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23804)
 
   * 6a4b8370ab41ce9060dcd8c7c4ee80786cc086b0 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23840)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7429] Fixing average record size estimation for delta commits [hudi]

2024-05-10 Thread via GitHub


hudi-bot commented on PR #10763:
URL: https://github.com/apache/hudi/pull/10763#issuecomment-2105411335

   
   ## CI report:
   
   * 34ffbbc913fab393871b866160ea2a7e1b38c53f Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23804)
 
   * 6a4b8370ab41ce9060dcd8c7c4ee80786cc086b0 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7429] Fixing average record size estimation for delta commits [hudi]

2024-05-09 Thread via GitHub


hudi-bot commented on PR #10763:
URL: https://github.com/apache/hudi/pull/10763#issuecomment-2103552339

   
   ## CI report:
   
   * 34ffbbc913fab393871b866160ea2a7e1b38c53f Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23804)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7429] Fixing average record size estimation for delta commits [hudi]

2024-05-09 Thread via GitHub


hudi-bot commented on PR #10763:
URL: https://github.com/apache/hudi/pull/10763#issuecomment-2103461501

   
   ## CI report:
   
   * 9a575fe6f6126090d0b380ec23d1a1d7fe1f4a1b Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23686)
 
   * 34ffbbc913fab393871b866160ea2a7e1b38c53f Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23804)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7429] Fixing average record size estimation for delta commits [hudi]

2024-05-09 Thread via GitHub


hudi-bot commented on PR #10763:
URL: https://github.com/apache/hudi/pull/10763#issuecomment-2103452414

   
   ## CI report:
   
   * 9a575fe6f6126090d0b380ec23d1a1d7fe1f4a1b Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23686)
 
   * 34ffbbc913fab393871b866160ea2a7e1b38c53f UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7429] Fixing average record size estimation for delta commits [hudi]

2024-05-06 Thread via GitHub


hudi-bot commented on PR #10763:
URL: https://github.com/apache/hudi/pull/10763#issuecomment-2095357191

   
   ## CI report:
   
   * 9a575fe6f6126090d0b380ec23d1a1d7fe1f4a1b Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23686)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7429] Fixing average record size estimation for delta commits [hudi]

2024-05-06 Thread via GitHub


hudi-bot commented on PR #10763:
URL: https://github.com/apache/hudi/pull/10763#issuecomment-2095345766

   
   ## CI report:
   
   * 34711f0b4fd724a5a6631b4fdacd90acebe53ca1 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23639)
 
   * 9a575fe6f6126090d0b380ec23d1a1d7fe1f4a1b UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7429] Fixing average record size estimation for delta commits [hudi]

2024-05-06 Thread via GitHub


nsivabalan commented on code in PR #10763:
URL: https://github.com/apache/hudi/pull/10763#discussion_r1590594514


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/AverageRecordSizeUtils.java:
##
@@ -0,0 +1,90 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.table.action.commit;
+
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.model.HoodieCommitMetadata;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.timeline.HoodieTimeline;
+import org.apache.hudi.config.HoodieWriteConfig;
+
+import org.apache.hadoop.fs.Path;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.util.Iterator;
+import java.util.concurrent.atomic.AtomicLong;
+
+import static 
org.apache.hudi.common.table.timeline.HoodieTimeline.COMMIT_ACTION;
+import static 
org.apache.hudi.common.table.timeline.HoodieTimeline.DELTA_COMMIT_ACTION;
+import static 
org.apache.hudi.common.table.timeline.HoodieTimeline.REPLACE_COMMIT_ACTION;
+
+/**
+ * Util class to assist with fetching average record size.
+ */
+public class AverageRecordSizeUtils {
+  private static final Logger LOG = 
LoggerFactory.getLogger(AverageRecordSizeUtils.class);
+
+  /**
+   * Obtains the average record size based on records written during previous 
commits. Used for estimating how many
+   * records pack into one file.
+   */
+  static long averageBytesPerRecord(HoodieTimeline commitTimeline, 
HoodieWriteConfig hoodieWriteConfig) {
+long avgSize = hoodieWriteConfig.getCopyOnWriteRecordSizeEstimate();
+long fileSizeThreshold = (long) 
(hoodieWriteConfig.getRecordSizeEstimationThreshold() * 
hoodieWriteConfig.getParquetSmallFileLimit());
+try {

Review Comment:
   addressed it. you can take a look



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7429] Fixing average record size estimation for delta commits [hudi]

2024-05-02 Thread via GitHub


hudi-bot commented on PR #10763:
URL: https://github.com/apache/hudi/pull/10763#issuecomment-2092056831

   
   ## CI report:
   
   * 34711f0b4fd724a5a6631b4fdacd90acebe53ca1 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23639)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7429] Fixing average record size estimation for delta commits [hudi]

2024-05-02 Thread via GitHub


hudi-bot commented on PR #10763:
URL: https://github.com/apache/hudi/pull/10763#issuecomment-2092052479

   
   ## CI report:
   
   * 95411f507afa43c6e5bf95e8bf1f87bbc03beb49 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22702)
 
   * 34711f0b4fd724a5a6631b4fdacd90acebe53ca1 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7429] Fixing average record size estimation for delta commits [hudi]

2024-05-02 Thread via GitHub


yihua commented on code in PR #10763:
URL: https://github.com/apache/hudi/pull/10763#discussion_r1588601099


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/AverageRecordSizeUtils.java:
##
@@ -0,0 +1,90 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.table.action.commit;
+
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.model.HoodieCommitMetadata;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.timeline.HoodieTimeline;
+import org.apache.hudi.config.HoodieWriteConfig;
+
+import org.apache.hadoop.fs.Path;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.util.Iterator;
+import java.util.concurrent.atomic.AtomicLong;
+
+import static 
org.apache.hudi.common.table.timeline.HoodieTimeline.COMMIT_ACTION;
+import static 
org.apache.hudi.common.table.timeline.HoodieTimeline.DELTA_COMMIT_ACTION;
+import static 
org.apache.hudi.common.table.timeline.HoodieTimeline.REPLACE_COMMIT_ACTION;
+
+/**
+ * Util class to assist with fetching average record size.
+ */
+public class AverageRecordSizeUtils {
+  private static final Logger LOG = 
LoggerFactory.getLogger(AverageRecordSizeUtils.class);
+
+  /**
+   * Obtains the average record size based on records written during previous 
commits. Used for estimating how many
+   * records pack into one file.
+   */
+  static long averageBytesPerRecord(HoodieTimeline commitTimeline, 
HoodieWriteConfig hoodieWriteConfig) {
+long avgSize = hoodieWriteConfig.getCopyOnWriteRecordSizeEstimate();
+long fileSizeThreshold = (long) 
(hoodieWriteConfig.getRecordSizeEstimationThreshold() * 
hoodieWriteConfig.getParquetSmallFileLimit());
+try {

Review Comment:
   @nsivabalan have you addressed the comment?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7429] Fixing average record size estimation for delta commits [hudi]

2024-04-18 Thread via GitHub


nsivabalan commented on code in PR #10763:
URL: https://github.com/apache/hudi/pull/10763#discussion_r1571806302


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/AverageRecordSizeUtils.java:
##
@@ -0,0 +1,90 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.table.action.commit;
+
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.model.HoodieCommitMetadata;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.timeline.HoodieTimeline;
+import org.apache.hudi.config.HoodieWriteConfig;
+
+import org.apache.hadoop.fs.Path;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.util.Iterator;
+import java.util.concurrent.atomic.AtomicLong;
+
+import static 
org.apache.hudi.common.table.timeline.HoodieTimeline.COMMIT_ACTION;
+import static 
org.apache.hudi.common.table.timeline.HoodieTimeline.DELTA_COMMIT_ACTION;
+import static 
org.apache.hudi.common.table.timeline.HoodieTimeline.REPLACE_COMMIT_ACTION;
+
+/**
+ * Util class to assist with fetching average record size.
+ */
+public class AverageRecordSizeUtils {
+  private static final Logger LOG = 
LoggerFactory.getLogger(AverageRecordSizeUtils.class);
+
+  /**
+   * Obtains the average record size based on records written during previous 
commits. Used for estimating how many
+   * records pack into one file.
+   */
+  static long averageBytesPerRecord(HoodieTimeline commitTimeline, 
HoodieWriteConfig hoodieWriteConfig) {
+long avgSize = hoodieWriteConfig.getCopyOnWriteRecordSizeEstimate();
+long fileSizeThreshold = (long) 
(hoodieWriteConfig.getRecordSizeEstimationThreshold() * 
hoodieWriteConfig.getParquetSmallFileLimit());
+try {

Review Comment:
   gotcha. makes sense. will address it



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7429] Fixing average record size estimation for delta commits [hudi]

2024-04-18 Thread via GitHub


the-other-tim-brown commented on code in PR #10763:
URL: https://github.com/apache/hudi/pull/10763#discussion_r1571784969


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/AverageRecordSizeUtils.java:
##
@@ -0,0 +1,90 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.table.action.commit;
+
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.model.HoodieCommitMetadata;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.timeline.HoodieTimeline;
+import org.apache.hudi.config.HoodieWriteConfig;
+
+import org.apache.hadoop.fs.Path;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.util.Iterator;
+import java.util.concurrent.atomic.AtomicLong;
+
+import static 
org.apache.hudi.common.table.timeline.HoodieTimeline.COMMIT_ACTION;
+import static 
org.apache.hudi.common.table.timeline.HoodieTimeline.DELTA_COMMIT_ACTION;
+import static 
org.apache.hudi.common.table.timeline.HoodieTimeline.REPLACE_COMMIT_ACTION;
+
+/**
+ * Util class to assist with fetching average record size.
+ */
+public class AverageRecordSizeUtils {
+  private static final Logger LOG = 
LoggerFactory.getLogger(AverageRecordSizeUtils.class);
+
+  /**
+   * Obtains the average record size based on records written during previous 
commits. Used for estimating how many
+   * records pack into one file.
+   */
+  static long averageBytesPerRecord(HoodieTimeline commitTimeline, 
HoodieWriteConfig hoodieWriteConfig) {
+long avgSize = hoodieWriteConfig.getCopyOnWriteRecordSizeEstimate();
+long fileSizeThreshold = (long) 
(hoodieWriteConfig.getRecordSizeEstimationThreshold() * 
hoodieWriteConfig.getParquetSmallFileLimit());
+try {

Review Comment:
   @nsivabalan I think this try/catch should be done at the instant parsing 
level (line 59). If there is only a single failure to read the commit metadata 
then we should still attempt to use the other commits' metadata



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7429] Fixing average record size estimation for delta commits [hudi]

2024-03-01 Thread via GitHub


hudi-bot commented on PR #10763:
URL: https://github.com/apache/hudi/pull/10763#issuecomment-1972864407

   
   ## CI report:
   
   * 95411f507afa43c6e5bf95e8bf1f87bbc03beb49 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22702)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7429] Fixing average record size estimation for delta commits [hudi]

2024-02-29 Thread via GitHub


hudi-bot commented on PR #10763:
URL: https://github.com/apache/hudi/pull/10763#issuecomment-1972616352

   
   ## CI report:
   
   * 367c1d63b69937d7bb37ae82ce1b73574f35a59b Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22679)
 
   * 95411f507afa43c6e5bf95e8bf1f87bbc03beb49 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22702)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7429] Fixing average record size estimation for delta commits [hudi]

2024-02-29 Thread via GitHub


hudi-bot commented on PR #10763:
URL: https://github.com/apache/hudi/pull/10763#issuecomment-1972609482

   
   ## CI report:
   
   * 367c1d63b69937d7bb37ae82ce1b73574f35a59b Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22679)
 
   * 95411f507afa43c6e5bf95e8bf1f87bbc03beb49 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7429] Fixing average record size estimation for delta commits [hudi]

2024-02-27 Thread via GitHub


hudi-bot commented on PR #10763:
URL: https://github.com/apache/hudi/pull/10763#issuecomment-1968330529

   
   ## CI report:
   
   * 367c1d63b69937d7bb37ae82ce1b73574f35a59b Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22679)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7429] Fixing average record size estimation for delta commits [hudi]

2024-02-27 Thread via GitHub


hudi-bot commented on PR #10763:
URL: https://github.com/apache/hudi/pull/10763#issuecomment-1968164943

   
   ## CI report:
   
   * d3a7378bb065eac3d517d9d6b8e7c8513a47f342 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22658)
 
   * 367c1d63b69937d7bb37ae82ce1b73574f35a59b Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22679)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7429] Fixing average record size estimation for delta commits [hudi]

2024-02-27 Thread via GitHub


hudi-bot commented on PR #10763:
URL: https://github.com/apache/hudi/pull/10763#issuecomment-1968159520

   
   ## CI report:
   
   * d3a7378bb065eac3d517d9d6b8e7c8513a47f342 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22658)
 
   * 367c1d63b69937d7bb37ae82ce1b73574f35a59b UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7429] Fixing average record size estimation for delta commits [hudi]

2024-02-26 Thread via GitHub


hudi-bot commented on PR #10763:
URL: https://github.com/apache/hudi/pull/10763#issuecomment-1965884909

   
   ## CI report:
   
   * d3a7378bb065eac3d517d9d6b8e7c8513a47f342 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22658)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7429] Fixing average record size estimation for delta commits [hudi]

2024-02-26 Thread via GitHub


nsivabalan commented on code in PR #10763:
URL: https://github.com/apache/hudi/pull/10763#discussion_r1503703204


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/AverageRecordSizeUtils.java:
##
@@ -0,0 +1,90 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.table.action.commit;
+
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.model.HoodieCommitMetadata;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.timeline.HoodieTimeline;
+import org.apache.hudi.config.HoodieWriteConfig;
+
+import org.apache.hadoop.fs.Path;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.util.Iterator;
+import java.util.concurrent.atomic.AtomicLong;
+
+import static 
org.apache.hudi.common.table.timeline.HoodieTimeline.COMMIT_ACTION;
+import static 
org.apache.hudi.common.table.timeline.HoodieTimeline.DELTA_COMMIT_ACTION;
+import static 
org.apache.hudi.common.table.timeline.HoodieTimeline.REPLACE_COMMIT_ACTION;
+
+/**
+ * Util class to assist with fetching average record size.
+ */
+public class AverageRecordSizeUtils {
+  private static final Logger LOG = 
LoggerFactory.getLogger(AverageRecordSizeUtils.class);
+
+  /**
+   * Obtains the average record size based on records written during previous 
commits. Used for estimating how many
+   * records pack into one file.
+   */
+  static long averageBytesPerRecord(HoodieTimeline commitTimeline, 
HoodieWriteConfig hoodieWriteConfig) {
+long avgSize = hoodieWriteConfig.getCopyOnWriteRecordSizeEstimate();
+long fileSizeThreshold = (long) 
(hoodieWriteConfig.getRecordSizeEstimationThreshold() * 
hoodieWriteConfig.getParquetSmallFileLimit());
+try {
+  if (!commitTimeline.empty()) {
+// Go over the reverse ordered commits to get a more recent estimate 
of average record size.
+Iterator instants = 
commitTimeline.getReverseOrderedInstants().iterator();
+while (instants.hasNext()) {
+  HoodieInstant instant = instants.next();
+  HoodieCommitMetadata commitMetadata = HoodieCommitMetadata
+  .fromBytes(commitTimeline.getInstantDetails(instant).get(), 
HoodieCommitMetadata.class);
+  if (instant.getAction().equals(COMMIT_ACTION) || 
instant.getAction().equals(REPLACE_COMMIT_ACTION)) {
+long totalBytesWritten = commitMetadata.fetchTotalBytesWritten();
+long totalRecordsWritten = 
commitMetadata.fetchTotalRecordsWritten();
+if (totalBytesWritten > fileSizeThreshold && totalRecordsWritten > 
0) {

Review Comment:
   for very smaller files, metadata size might overshadow. for eg, with 10 
records, parquet file size should be 300 to 400k or something. could be due to 
that. or just to make sure we good sample size. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7429] Fixing average record size estimation for delta commits [hudi]

2024-02-26 Thread via GitHub


vinishjail97 commented on code in PR #10763:
URL: https://github.com/apache/hudi/pull/10763#discussion_r1503693841


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/AverageRecordSizeUtils.java:
##
@@ -0,0 +1,90 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.table.action.commit;
+
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.model.HoodieCommitMetadata;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.timeline.HoodieTimeline;
+import org.apache.hudi.config.HoodieWriteConfig;
+
+import org.apache.hadoop.fs.Path;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.util.Iterator;
+import java.util.concurrent.atomic.AtomicLong;
+
+import static 
org.apache.hudi.common.table.timeline.HoodieTimeline.COMMIT_ACTION;
+import static 
org.apache.hudi.common.table.timeline.HoodieTimeline.DELTA_COMMIT_ACTION;
+import static 
org.apache.hudi.common.table.timeline.HoodieTimeline.REPLACE_COMMIT_ACTION;
+
+/**
+ * Util class to assist with fetching average record size.
+ */
+public class AverageRecordSizeUtils {
+  private static final Logger LOG = 
LoggerFactory.getLogger(AverageRecordSizeUtils.class);
+
+  /**
+   * Obtains the average record size based on records written during previous 
commits. Used for estimating how many
+   * records pack into one file.
+   */
+  static long averageBytesPerRecord(HoodieTimeline commitTimeline, 
HoodieWriteConfig hoodieWriteConfig) {
+long avgSize = hoodieWriteConfig.getCopyOnWriteRecordSizeEstimate();
+long fileSizeThreshold = (long) 
(hoodieWriteConfig.getRecordSizeEstimationThreshold() * 
hoodieWriteConfig.getParquetSmallFileLimit());
+try {
+  if (!commitTimeline.empty()) {
+// Go over the reverse ordered commits to get a more recent estimate 
of average record size.
+Iterator instants = 
commitTimeline.getReverseOrderedInstants().iterator();
+while (instants.hasNext()) {
+  HoodieInstant instant = instants.next();
+  HoodieCommitMetadata commitMetadata = HoodieCommitMetadata
+  .fromBytes(commitTimeline.getInstantDetails(instant).get(), 
HoodieCommitMetadata.class);
+  if (instant.getAction().equals(COMMIT_ACTION) || 
instant.getAction().equals(REPLACE_COMMIT_ACTION)) {
+long totalBytesWritten = commitMetadata.fetchTotalBytesWritten();
+long totalRecordsWritten = 
commitMetadata.fetchTotalRecordsWritten();
+if (totalBytesWritten > fileSizeThreshold && totalRecordsWritten > 
0) {

Review Comment:
   What was the historical reason to include only those write stats where 
`totalBytesWritten > fileSizeThreshold` in the computation of 
`averageBytesPerRecord` ? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7429] Fixing average record size estimation for delta commits [hudi]

2024-02-26 Thread via GitHub


vinishjail97 commented on code in PR #10763:
URL: https://github.com/apache/hudi/pull/10763#discussion_r1503691813


##
hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/action/commit/TestAverageRecordSizeUtils.java:
##
@@ -0,0 +1,227 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.table.action.commit;
+
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.model.HoodieCommitMetadata;
+import org.apache.hudi.common.model.HoodieWriteStat;
+import org.apache.hudi.common.table.HoodieTableConfig;
+import org.apache.hudi.common.table.timeline.HoodieDefaultTimeline;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.timeline.HoodieTimeline;
+import org.apache.hudi.common.table.timeline.TimelineMetadataUtils;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.config.HoodieWriteConfig;
+
+import org.junit.jupiter.params.ParameterizedTest;
+import org.junit.jupiter.params.provider.Arguments;
+import org.junit.jupiter.params.provider.MethodSource;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.List;
+import java.util.UUID;
+import java.util.stream.Stream;
+
+import static org.apache.hudi.common.model.HoodieFileFormat.HOODIE_LOG;
+import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.mockito.Mockito.mock;
+import static org.mockito.Mockito.when;
+
+/**
+ * Test average record size estimation.
+ */
+public class TestAverageRecordSizeUtils {
+
+  private final HoodieTimeline mockTimeline = mock(HoodieTimeline.class);
+  private static final String PARTITION1 = "partition1";
+  private static final String TEST_WRITE_TOKEN = "1-0-1";
+  private static final String BASE_FILE_EXTENSION = 
HoodieTableConfig.BASE_FILE_FORMAT.defaultValue().getFileExtension();
+
+  @ParameterizedTest
+  @MethodSource("testCasesCOW")
+  public void testAverageRecordSizeForCOW(List>> instantSizePairs, long expectedSize) {

Review Comment:
   Any reason for having separate test method for Cow and Mor ? I see duplicate 
code that can be removed if make this a single test and using.  
   ```
   @ParameterizedTest
   @MethodSource("testCasesCOW", "testCasesMOR")
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7429] Fixing average record size estimation for delta commits [hudi]

2024-02-26 Thread via GitHub


hudi-bot commented on PR #10763:
URL: https://github.com/apache/hudi/pull/10763#issuecomment-1965840657

   
   ## CI report:
   
   * d3a7378bb065eac3d517d9d6b8e7c8513a47f342 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22658)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7429] Fixing average record size estimation for delta commits [hudi]

2024-02-26 Thread via GitHub


hudi-bot commented on PR #10763:
URL: https://github.com/apache/hudi/pull/10763#issuecomment-1965833858

   
   ## CI report:
   
   * d3a7378bb065eac3d517d9d6b8e7c8513a47f342 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[PR] [HUDI-7429] Fixing average record size estimation for delta commits [hudi]

2024-02-26 Thread via GitHub


nsivabalan opened a new pull request, #10763:
URL: https://github.com/apache/hudi/pull/10763

   ### Change Logs
   
   Avg record size calculation only considers COMMIT actions for now. Fixing it 
to include replace commits and delta commits as well. 
   
   ### Impact
   
   Fixing avg record size estimation for delta commits and replace commits. 
   
   ### Risk level (write none, low medium or high below)
   
   low
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org