chenjunjiedada commented on a change in pull request #830:
URL: https://github.com/apache/incubator-iceberg/pull/830#discussion_r421997668



##########
File path: 
spark/src/test/java/org/apache/iceberg/spark/source/TestSparkTableUtil.java
##########
@@ -199,4 +204,41 @@ public void testImportAsHiveTable() throws Exception {
     long count2 = spark.read().format("iceberg").load(DB_NAME + 
".test_partitioned_table").count();
     Assert.assertEquals("three values ", 3, count2);
   }
+
+  @Test
+  public void testImportWithIncompatibleSchema() throws Exception {
+    spark.table(qualifiedTableName).write().mode("overwrite").format("parquet")
+        .saveAsTable("original_table");
+
+    // The field is different so that it will project with name mapping
+    Schema filteredSchema = new Schema(
+            optional(1, "data", Types.StringType.get())
+    );
+
+    TableIdentifier source = new TableIdentifier("original_table");
+    Table table = catalog.createTable(
+        org.apache.iceberg.catalog.TableIdentifier.of(DB_NAME, "target_table"),
+        filteredSchema,
+        SparkSchemaUtil.specForTable(spark, "original_table"));
+    File stagingDir = temp.newFolder("staging-dir");
+    SparkTableUtil.importSparkTable(spark, source, table, 
stagingDir.toString());

Review comment:
       Where should the name mapping come from? I thought the name mapping should be created from the table's schema, because the typical case is that we create a table and then import the Parquet files from an existing Spark table. In that case we only care about the newly created fields and their stats; that's why the visitor returns null when it cannot find a field in the name mapping, regardless of the field's sub-types (it is impossible to define a list/map/struct in a schema without naming both it and its subfields).
   
   If the name mapping comes from outside, for example from the converted Parquet file schema, then the logic would be different, as you pointed out. We might need to 1) store a name mapping for Table, 2) add a Table API to get the name mapping, and 3) change `MessageTypeToType` to visit the schema with the name mapping. That seems to bring in too much complexity.
   
   Back to the issue itself: what we need is to create a name mapping when `hasIds` is false. We could add another property to the Parquet ReadBuilder, say `enableNameMapping`; then we could build a default name mapping from `expectedSchema` even when no name mapping is passed. The logic in ReadConf.java would be:
   
   ```
   NameMapping actualNameMapping;
   if (enableNameMapping) {
     if (nameMapping != null) {
       actualNameMapping = nameMapping;
     } else {
       actualNameMapping = MappingUtil.create(expectedSchema);
     }
   }
   ```
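   For illustration only, here's a self-contained sketch of that fallback order. The `NameMappingFallback` class and its use of plain Strings as stand-ins for Iceberg's `NameMapping`/`Schema` types are hypothetical; only the control flow matters:
   
   ```java
   // Hypothetical sketch: plain Strings stand in for Iceberg's NameMapping
   // and Schema types, since only the fallback order matters here.
   final class NameMappingFallback {
     // Returns the mapping to use: an explicit mapping wins; otherwise a
     // default is derived from the expected schema (MappingUtil.create in
     // the real code); null means name mapping is disabled entirely.
     static String resolveMapping(boolean enableNameMapping,
                                  String nameMapping,
                                  String expectedSchema) {
       if (!enableNameMapping) {
         return null;
       }
       if (nameMapping != null) {
         return nameMapping;
       }
       return "default(" + expectedSchema + ")";
     }
   }
   ```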
   
   What do you think?
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org
