kbendick commented on a change in pull request #3435:
URL: https://github.com/apache/iceberg/pull/3435#discussion_r740417000
##########
File path:
spark/v3.0/spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java
##########
@@ -292,16 +293,17 @@ private SparkTableUtil() {
* @param conf a serializable Hadoop conf
* @param metricsConfig a metrics conf
* @param mapping a name mapping
+ * @param schema schema of the target table
Review comment:
Nit: Consider adding a note that this is nullable (as it's only used in
Parquet).
Though if the schema is always available in places we use it, then maybe not
a concern and easier to require people to pass it through in case we add
support for other file formats.
##########
File path: data/src/main/java/org/apache/iceberg/data/TableMigrationUtil.java
##########
@@ -181,4 +195,20 @@ private TableMigrationUtil() {
throw new RuntimeException("Unable to list files in partition: " +
partitionUri, e);
}
}
+
+ /**
+ * @param importSchema the schema of file being imported
+ * @param tableSchema schema of the table to which file is to be imported
+ * @return if the files schema is compatible with table's schema
+ */
+ public static boolean canImportSchema(Schema importSchema, Schema
tableSchema) {
+ // Assigning ids by name look up, required for checking compatibility
+ Schema schemaWithIds = TypeUtil.assignFreshIds(importSchema, tableSchema,
new AtomicInteger(1000)::incrementAndGet);
+ List<String> errors =
CheckCompatibility.writeCompatibilityErrors(tableSchema, schemaWithIds, false);
Review comment:
Nit: Consider adding an in-line comment to the `false` to document what
the parameter is with `false /* ??? */`.
##########
File path: data/src/main/java/org/apache/iceberg/data/TableMigrationUtil.java
##########
@@ -73,11 +84,11 @@ private TableMigrationUtil() {
*/
public static List<DataFile> listPartition(Map<String, String> partition,
String uri, String format,
PartitionSpec spec, Configuration
conf, MetricsConfig metricsConfig,
- NameMapping mapping) {
+ NameMapping mapping, Schema
schema) {
Review comment:
Nit: Consider adding a comment to the javadoc saying schema is ignored
when not using parquet.
##########
File path: data/src/main/java/org/apache/iceberg/data/TableMigrationUtil.java
##########
@@ -129,6 +140,9 @@ private TableMigrationUtil() {
try {
ParquetMetadata metadata = ParquetFileReader.readFooter(conf,
stat);
metrics = ParquetUtil.footerMetrics(metadata, Stream.empty(),
metricsSpec, mapping);
+ if
(!canImportSchema(ParquetSchemaUtil.convert(metadata.getFileMetaData().getSchema()),
schema)) {
+ throw new ValidationException("Mismatch in table and imported
file schema");
Review comment:
Non-blocking question: Should we consider throwing the
`ValidationException` from canImportSchema, so that the non-matching fields
will be most visible in the initial exception message?
I can picture the support requests where the answer is "scroll up in the
logs" just flooding in 🙈. Might be wise to cut that off as much as possible
from the start.
##########
File path: data/src/main/java/org/apache/iceberg/data/TableMigrationUtil.java
##########
@@ -129,6 +140,9 @@ private TableMigrationUtil() {
try {
ParquetMetadata metadata = ParquetFileReader.readFooter(conf,
stat);
metrics = ParquetUtil.footerMetrics(metadata, Stream.empty(),
metricsSpec, mapping);
+ if
(!canImportSchema(ParquetSchemaUtil.convert(metadata.getFileMetaData().getSchema()),
schema)) {
+ throw new ValidationException("Mismatch in table and imported
file schema");
Review comment:
Non-blocking question: Should we consider throwing the
`ValidationException` from `canImportSchema`, so that the non-matching fields
will be most visible in the initial exception message?
I can picture the support requests where the answer is "scroll up in the
logs" just flooding in 🙈. Might be wise to cut that off as much as possible
from the start.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]