[GitHub] [iceberg] rdblue commented on a change in pull request #1221: Spark: Fix estimateStatistics when called without filters

GitBox Wed, 22 Jul 2020 18:13:13 -0700


rdblue commented on a change in pull request #1221:
URL: https://github.com/apache/iceberg/pull/1221#discussion_r459167500




##########
File path: spark2/src/test/java/org/apache/iceberg/spark/source/TestSparkSchema24.java
##########
@@ -19,5 +19,43 @@
 
 package org.apache.iceberg.spark.source;
 
+import java.io.IOException;
+import java.util.List;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.iceberg.PartitionSpec;
+import org.apache.iceberg.SnapshotSummary;
+import org.apache.iceberg.Table;
+import org.apache.iceberg.hadoop.HadoopTables;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+import org.apache.iceberg.spark.SparkSchemaUtil;
+import org.apache.iceberg.util.PropertyUtil;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.junit.Assert;
+import org.junit.Test;
+
 public class TestSparkSchema24 extends TestSparkSchema {
+
+  @Test
+  public void testReaderUtils() throws IOException {
+    String tableLocation = temp.newFolder("iceberg-table").toString();
+    HadoopTables tables = new HadoopTables(CONF);
+    PartitionSpec spec = PartitionSpec.unpartitioned();
+    tables.create(SCHEMA, spec, null, tableLocation);
+    List<SimpleRecord> expectedRecords = Lists.newArrayList(
+        new SimpleRecord(1, "a")
+    );
+    Dataset<Row> originalDf = spark.createDataFrame(expectedRecords, SimpleRecord.class);
+    originalDf.select("id", "data").write()
+        .format("iceberg")
+        .mode("append")
+        .save(tableLocation);
+
+    Table table = tables.load(tableLocation);
+    long totalRecords = PropertyUtil.propertyAsLong(table.currentSnapshot().summary(),
+              SnapshotSummary.TOTAL_RECORDS_PROP, Long.MAX_VALUE);
+    Assert.assertEquals("totalRecords match", 1, totalRecords);
+    long tableSize = ReaderUtils.approximateTableSize(SparkSchemaUtil.convert(table.schema()), totalRecords);
+    Assert.assertEquals("table size matches with expected approximation", 24, tableSize);

Review comment:
   I don't think the method being tested warrants a test that creates a table, writes data, etc. The method has 2 cases:
   
   * `numRows` is `Long.MAX_VALUE` -> return `Long.MAX_VALUE`
   * `numRows` is anything else -> multiply `numRows` by `sizeEstimate(dataType)`
   
   All you need is 2 test cases: one for each possibility for `numRows`. The second case should make sure the estimate is sane.
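
   A minimal sketch of those two cases, written without Spark or Iceberg on the classpath. The class `ApproximateTableSizeSketch` and its `approximateTableSize(long, long)` method are hypothetical stand-ins for `ReaderUtils.approximateTableSize`; the real method takes a Spark schema rather than a precomputed row size, and its exact body is not shown in this thread. The 24-byte row estimate is taken from the assertion in the diff above, not derived here.

```java
public class ApproximateTableSizeSketch {

  // Hypothetical stand-in for ReaderUtils.approximateTableSize: the per-row
  // size is passed in directly instead of being computed from a schema.
  public static long approximateTableSize(long rowSizeEstimate, long numRows) {
    if (numRows == Long.MAX_VALUE) {
      // Case 1: row count unknown, so the size is unknown as well.
      return Long.MAX_VALUE;
    }
    // Case 2: row count known, so scale the per-row estimate by it.
    return rowSizeEstimate * numRows;
  }

  // Small helper so the checks run even when JVM assertions are disabled.
  public static void check(boolean condition, String message) {
    if (!condition) {
      throw new AssertionError(message);
    }
  }

  public static void main(String[] args) {
    long rowSize = 24L; // assumed estimate, matching the value asserted in the diff

    // Test case 1: unknown row count propagates as Long.MAX_VALUE.
    check(approximateTableSize(rowSize, Long.MAX_VALUE) == Long.MAX_VALUE,
        "unknown row count should yield Long.MAX_VALUE");

    // Test case 2: a known row count gives a sane, proportional estimate.
    check(approximateTableSize(rowSize, 1L) == 24L,
        "one row should cost one row-size estimate");
    check(approximateTableSize(rowSize, 1000L) == 24000L,
        "estimate should scale linearly with the row count");

    System.out.println("both cases pass");
  }
}
```

   In the actual test, the same two assertions could call `ReaderUtils.approximateTableSize` with a schema converted via `SparkSchemaUtil.convert`, with no table creation or write needed.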





