rdsr commented on a change in pull request #119: Split files when planning scan tasks
URL: https://github.com/apache/incubator-iceberg/pull/119#discussion_r263603411
 
 

 ##########
 File path: 
parquet/src/test/java/com/netflix/iceberg/parquet/TestParquetSplitScan.java
 ##########
 @@ -0,0 +1,123 @@
+package com.netflix.iceberg.parquet;
+
+import com.google.common.collect.FluentIterable;
+import com.netflix.iceberg.*;
+import com.netflix.iceberg.avro.AvroSchemaUtil;
+import com.netflix.iceberg.hadoop.HadoopTables;
+import com.netflix.iceberg.io.CloseableIterable;
+import com.netflix.iceberg.io.FileAppender;
+import com.netflix.iceberg.io.InputFile;
+import com.netflix.iceberg.types.Types;
+import org.apache.avro.generic.GenericRecordBuilder;
+import org.apache.hadoop.conf.Configuration;
+import org.junit.Assert;
+import org.junit.Before;
+import org.junit.Rule;
+import org.junit.Test;
+import org.junit.rules.TemporaryFolder;
+
+import java.io.File;
+import java.io.IOException;
+
+import static com.netflix.iceberg.types.Types.NestedField.required;
+import static org.apache.avro.generic.GenericData.Record;
+
+public class TestParquetSplitScan {
+  private static final Configuration CONF = new Configuration();
+  private static final HadoopTables TABLES = new HadoopTables(CONF);
+
+  private static final long SPLIT_SIZE = 16 * 1024 * 1024;
+
+  private Schema schema = new Schema(
+      required(0, "id", Types.IntegerType.get()),
+      required(1, "data", Types.StringType.get())
+  );
+
+  private Table table;
+  private File tableLocation;
+  private int noOfRecords;
+
+  @Rule
+  public TemporaryFolder temp = new TemporaryFolder();
+
+  @Before
+  public void setup() throws IOException {
+    tableLocation = new File(temp.newFolder(), "table");
+    setupTable();
+  }
+
+  @Test
+  public void test() {
+    int nTasks = 0;
+    int nRecords = 0;
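+    // planTasks() groups file splits into combined tasks; the single ~64 MB data file
+    // should be split on the table's 16 MB split size, yielding 4 tasks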
+    CloseableIterable<CombinedScanTask> tasks = table.newScan().planTasks();
+    for (CombinedScanTask task : tasks) {
+      Iterable<Record> records = records(table, schema, task);
+      for (Record record : records) {
+        Assert.assertEquals("Record " + record.get("id") + " is not read in order",
+            nRecords, record.get("id"));
+        nRecords += 1;
+      }
+      nTasks += 1;
+    }
+
+    Assert.assertEquals("Total number of records read should match " + noOfRecords,
+        noOfRecords, nRecords);
+    Assert.assertEquals("There should be 4 tasks created since the file size is ~64 MB and the split size is ~16 MB",
+        4, nTasks);
+  }
+
+  private void setupTable() throws IOException {
+    table = TABLES.create(schema, tableLocation.toString());
+    table.updateProperties()
+        .set(TableProperties.SPLIT_SIZE, String.valueOf(SPLIT_SIZE))
+        .commit();
+
+    File file = temp.newFile();
+    Assert.assertTrue(file.delete()); // remove the empty placeholder so the writer can create the file
+    noOfRecords = addRecordsToFile(file);
+
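+    // register the file with the table; split planning uses the recorded file size
+    // together with the table's split size property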
+    DataFile dataFile = DataFiles.builder(PartitionSpec.unpartitioned())
+        .withRecordCount(noOfRecords)
+        .withFileSizeInBytes(file.length())
+        .withPath(file.toString())
+        .withFormat(FileFormat.PARQUET)
+        .build();
+
+    table.newAppend().appendFile(dataFile).commit();
+  }
+
+  private int addRecordsToFile(File file) throws IOException {
+    // With this number of records and the given schema,
+    // we can effectively write a file of size ~64 MB
+    int nRecords = 1600000;
+    try (FileAppender<Record> writer = Parquet.write(Files.localOutput(file))
 
 Review comment:
   I first wrote the tests with {{RandomDataGenerator}} and then moved to this approach, thinking that slight variations in data sizes could cause the tests to fail intermittently. That shouldn't be the case, though, if I author the tests carefully.
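
   For what it's worth, a deterministic writer along these lines would keep file sizes, and therefore split counts, stable across runs. This is only a sketch, not code from this PR: the helper name, the fixed 40-character payload, and using AvroSchemaUtil.convert to get the Avro schema are assumptions on my part, reusing classes already imported in the test.

    // Sketch only (hypothetical helper, not part of the PR): fixed-width payloads
    // make the Parquet file size scale ~linearly with the record count, so the
    // expected number of split tasks stays deterministic.
    private void writeFixedSizeRecords(FileAppender<Record> writer, int count) {
      org.apache.avro.Schema avroSchema = AvroSchemaUtil.convert(schema, "test");
      String payload = com.google.common.base.Strings.repeat("x", 40); // constant-size value
      for (int i = 0; i < count; i += 1) {
        writer.add(new GenericRecordBuilder(avroSchema)
            .set("id", i)
            .set("data", payload)
            .build());
      }
    }

   With something like that, the record count could even be derived from a target byte size instead of hard-coding 1600000.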

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
