[GitHub] [iceberg] kbendick commented on a change in pull request #3435: Add schema validation during parquet file Import

GitBox Tue, 02 Nov 2021 10:49:25 -0700


kbendick commented on a change in pull request #3435:
URL: https://github.com/apache/iceberg/pull/3435#discussion_r740417000




##########
File path: 
spark/v3.0/spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java
##########
@@ -292,16 +293,17 @@ private SparkTableUtil() {
    * @param conf a serializable Hadoop conf
    * @param metricsConfig a metrics conf
    * @param mapping a name mapping
+   * @param schema schema of the target table

Review comment:
       Nit: Consider adding a note that this is nullable (as it's only used in 
Parquet).
   
   Though if the schema is always available in places we use it, then maybe not 
a concern and easier to require people to pass it through in case we add 
support for other file formats.

##########
File path: data/src/main/java/org/apache/iceberg/data/TableMigrationUtil.java
##########
@@ -181,4 +195,20 @@ private TableMigrationUtil() {
       throw new RuntimeException("Unable to list files in partition: " + 
partitionUri, e);
     }
   }
+
+  /**
+   * @param importSchema the schema of file being imported
+   * @param tableSchema schema of the table to which file is to be imported
+   * @return if the files schema is compatible with table's schema
+   */
+  public static boolean canImportSchema(Schema importSchema, Schema 
tableSchema) {
+    // Assigning ids by name look up, required for checking compatibility
+    Schema schemaWithIds = TypeUtil.assignFreshIds(importSchema, tableSchema, 
new AtomicInteger(1000)::incrementAndGet);
+    List<String> errors = 
CheckCompatibility.writeCompatibilityErrors(tableSchema, schemaWithIds, false);

Review comment:
       Nit: Consider adding an in-line comment to the `false` to document what 
the parameter is with `false /* ??? */`.

##########
File path: data/src/main/java/org/apache/iceberg/data/TableMigrationUtil.java
##########
@@ -73,11 +84,11 @@ private TableMigrationUtil() {
    */
   public static List<DataFile> listPartition(Map<String, String> partition, 
String uri, String format,
                                              PartitionSpec spec, Configuration 
conf, MetricsConfig metricsConfig,
-                                             NameMapping mapping) {
+                                             NameMapping mapping, Schema 
schema) {

Review comment:
       Nit: Consider adding a comment to the javadoc saying schema is ignored 
when not using parquet.

##########
File path: data/src/main/java/org/apache/iceberg/data/TableMigrationUtil.java
##########
@@ -129,6 +140,9 @@ private TableMigrationUtil() {
             try {
               ParquetMetadata metadata = ParquetFileReader.readFooter(conf, 
stat);
               metrics = ParquetUtil.footerMetrics(metadata, Stream.empty(), 
metricsSpec, mapping);
+              if 
(!canImportSchema(ParquetSchemaUtil.convert(metadata.getFileMetaData().getSchema()),
 schema)) {
+                throw new ValidationException("Mismatch in table and imported 
file schema");

Review comment:
       Non-blocking question: Should we consider throwing the 
`ValidationException` from canImportSchema, so that the non-matching fields 
will be most visible in the initial exception message?
   
   I can picture the support requests where the answer is "scroll up in the 
logs" just flooding in 🙈. Might be wise to cut that off as much as possible 
from the start.

##########
File path: data/src/main/java/org/apache/iceberg/data/TableMigrationUtil.java
##########
@@ -129,6 +140,9 @@ private TableMigrationUtil() {
             try {
               ParquetMetadata metadata = ParquetFileReader.readFooter(conf, 
stat);
               metrics = ParquetUtil.footerMetrics(metadata, Stream.empty(), 
metricsSpec, mapping);
+              if 
(!canImportSchema(ParquetSchemaUtil.convert(metadata.getFileMetaData().getSchema()),
 schema)) {
+                throw new ValidationException("Mismatch in table and imported 
file schema");

Review comment:
       Non-blocking question: Should we consider throwing the 
`ValidationException` from `canImportSchema`, so that the non-matching fields 
will be most visible in the initial exception message?
   
   I can picture the support requests where the answer is "scroll up in the 
logs" just flooding in 🙈. Might be wise to cut that off as much as possible 
from the start.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [iceberg] kbendick commented on a change in pull request #3435: Add schema validation during parquet file Import

Reply via email to