openinx commented on a change in pull request #1774:
URL: https://github.com/apache/iceberg/pull/1774#discussion_r530731975
##########
File path: spark/src/test/java/org/apache/iceberg/spark/source/TestSparkDataWrite.java
##########
@@ -327,51 +328,12 @@ public void testUnpartitionedCreateWithTargetFileSizeViaTableProperties() throws
   @Test
   public void testPartitionedCreateWithTargetFileSizeViaOption() throws IOException {
-    File parent = temp.newFolder(format.toString());
-    File location = new File(parent, "test");
-
-    HadoopTables tables = new HadoopTables(CONF);
-    PartitionSpec spec = PartitionSpec.builderFor(SCHEMA).identity("data").build();
-    Table table = tables.create(SCHEMA, spec, location.toString());
-
-    List<SimpleRecord> expected = Lists.newArrayListWithCapacity(8000);
-    for (int i = 0; i < 2000; i++) {
-      expected.add(new SimpleRecord(i, "a"));
-      expected.add(new SimpleRecord(i, "b"));
-      expected.add(new SimpleRecord(i, "c"));
-      expected.add(new SimpleRecord(i, "d"));
-    }
-
-    Dataset<Row> df = spark.createDataFrame(expected, SimpleRecord.class);
-
-    df.select("id", "data").sort("data").write()
-        .format("iceberg")
-        .option("write-format", format.toString())
-        .mode("append")
-        .option("target-file-size-bytes", 4) // ~4 bytes; low enough to trigger
-        .save(location.toString());
-
-    table.refresh();
-
-    Dataset<Row> result = spark.read()
-        .format("iceberg")
-        .load(location.toString());
-
-    List<SimpleRecord> actual = result.orderBy("id").as(Encoders.bean(SimpleRecord.class)).collectAsList();
-    Assert.assertEquals("Number of rows should match", expected.size(), actual.size());
-    Assert.assertEquals("Result rows should match", expected, actual);
+    partitionedCreateWithTargetFileSizeViaOption(false);
+  }
-    List<DataFile> files = Lists.newArrayList();
-    for (ManifestFile manifest : table.currentSnapshot().allManifests()) {
-      for (DataFile file : ManifestFiles.read(manifest, table.io())) {
-        files.add(file);
-      }
-    }
-    // TODO: ORC file now not support target file size
-    if (!format.equals(FileFormat.ORC)) {
-      Assert.assertEquals("Should have 8 DataFiles", 8, files.size());
-      Assert.assertTrue("All DataFiles contain 1000 rows", files.stream().allMatch(d -> d.recordCount() == 1000));
-    }
+  @Test
+  public void testPartitionedFanoutCreateWithTargetFileSizeViaOption() throws IOException {
+    partitionedCreateWithTargetFileSizeViaOption(true);
Review comment:
   Do we need a unit test to cover Spark's `partitioned.fanout.enabled` write option? I saw there is a unit test that uses the table's `write.partitioned.fanout.enabled` property to define the `fanout` behavior.
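   For reference, a write-option-based variant of the test could route the fanout flag into the option map the shared helper passes to the DataFrame writer. The sketch below is a minimal, Spark-free illustration of that wiring only; the option key `fanout-enabled` and the helper shape are assumptions for illustration, not confirmed Iceberg API.

   ```java
   import java.util.LinkedHashMap;
   import java.util.Map;

   public class FanoutOptionSketch {
       // Hypothetical helper: builds the write options a shared test helper
       // might pass to df.write().options(...). The "fanout-enabled" key is
       // an assumption; the real test would use the actual Spark write option.
       static Map<String, String> writeOptions(String format, boolean fanout) {
           Map<String, String> opts = new LinkedHashMap<>();
           opts.put("write-format", format);
           opts.put("target-file-size-bytes", "4");
           if (fanout) {
               opts.put("fanout-enabled", "true");
           }
           return opts;
       }

       public static void main(String[] args) {
           // → {write-format=parquet, target-file-size-bytes=4, fanout-enabled=true}
           System.out.println(writeOptions("parquet", true));
           // → {write-format=parquet, target-file-size-bytes=4}
           System.out.println(writeOptions("parquet", false));
       }
   }
   ```

   With this shape, the two `@Test` entry points in the diff only differ in the boolean they pass, and the option-vs-table-property coverage question reduces to whether a second helper variant sets the table property instead of the write option.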
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]