[GitHub] [iceberg] wypoon commented on a change in pull request #3038: Spark: Better SparkBatchScan statistics estimation

GitBox Thu, 02 Sep 2021 16:42:53 -0700


wypoon commented on a change in pull request #3038:
URL: https://github.com/apache/iceberg/pull/3038#discussion_r701491393




##########
File path: spark/src/main/java/org/apache/iceberg/spark/SparkSchemaUtil.java
##########
@@ -280,23 +280,23 @@ private static PartitionSpec identitySpec(Schema schema, List<String> partitionN
   }
 
   /**
-   * estimate approximate table size based on spark schema and total records.
+   * Estimate approximate table size based on Spark schema and total records.
    *
-   * @param tableSchema  spark schema
+   * @param tableSchema  Spark schema
    * @param totalRecords total records in the table
-   * @return approxiate size based on table schema
+   * @return approximate size based on table schema
    */
   public static long estimateSize(StructType tableSchema, long totalRecords) {
     if (totalRecords == Long.MAX_VALUE) {
       return totalRecords;
     }
 
-    long approximateSize = 0;
-    for (StructField sparkField : tableSchema.fields()) {
-      approximateSize += sparkField.dataType().defaultSize();
+    long result;
+    try {
+      result = LongMath.checkedMultiply(tableSchema.defaultSize(), totalRecords);
+    } catch (ArithmeticException e) {
+      result = Long.MAX_VALUE;

Review comment:
       `StructType` has a `defaultSize` method; there is no need to 
re-implement it. The main utility of the `estimateSize` static method is 
checking for overflow; we should just use Guava's `LongMath.checkedMultiply`.
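
A minimal, dependency-free sketch of the overflow handling described above. It is not the code in this PR: it swaps in the JDK's `Math.multiplyExact` for Guava's `LongMath.checkedMultiply` (both throw `ArithmeticException` on `long` overflow), and the class name `SizeEstimateSketch` and the `rowSize` parameter (standing in for `tableSchema.defaultSize()`) are hypothetical.

```java
// Sketch only: SizeEstimateSketch and rowSize are hypothetical names used for
// illustration; rowSize stands in for the value of tableSchema.defaultSize().
final class SizeEstimateSketch {
  private SizeEstimateSketch() {
  }

  // Returns rowSize * totalRecords, saturating at Long.MAX_VALUE on overflow.
  static long estimateSize(long rowSize, long totalRecords) {
    // Long.MAX_VALUE is passed through unchanged, as in the diff above.
    if (totalRecords == Long.MAX_VALUE) {
      return totalRecords;
    }

    try {
      // JDK stand-in for Guava's LongMath.checkedMultiply: throws
      // ArithmeticException if the product does not fit in a long.
      return Math.multiplyExact(rowSize, totalRecords);
    } catch (ArithmeticException e) {
      // Saturate instead of wrapping around to a meaningless value.
      return Long.MAX_VALUE;
    }
  }
}
```

With this shape, an unchecked `rowSize * totalRecords` that would silently wrap around is instead reported as `Long.MAX_VALUE`.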




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


