itaise opened a new issue, #4510:
URL: https://github.com/apache/iceberg/issues/4510
Hi,
We are writing to Iceberg using Spark, and when renaming a partition field we get a validation error:
```
org.apache.iceberg.exceptions.ValidationException: Cannot find source column for partition field: 1000: some_date: void(1)
```
It seems like Iceberg is referring to the existing table's partition field name, which is no longer relevant: the new write defines a new partition field, and the write mode is "overwrite", so the old spec should be replaced.
Can you please assist?
Thank you!
Here is a minimal reproducible example:
1. Create the original table, partitioned by `some_date`:
```
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

# Single-row dataset with the original partition column name.
dataDF = [('1991-04-01',)]
schema = StructType([
    StructField('some_date', StringType(), True)])

spark = SparkSession.builder.master('local[1]').appName('example') \
    .getOrCreate()

df = spark.createDataFrame(data=dataDF, schema=schema)
spark.sql("USE iprod")  # Iceberg catalog
spark.sql("CREATE SCHEMA IF NOT EXISTS iprod.test_schema")
df.write.mode("overwrite").format("parquet") \
    .partitionBy('some_date') \
    .saveAsTable("iprod.test_schema.example")
```
2. Try to overwrite the table with the same code, but with the partition field renamed to `some_date_2`:
```
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

# Same data, but the partition column is now named some_date_2.
dataDF = [('1991-04-01',)]
schema = StructType([
    StructField('some_date_2', StringType(), True)])

spark = SparkSession.builder.master('local[1]').appName('example') \
    .getOrCreate()

df = spark.createDataFrame(data=dataDF, schema=schema)
spark.sql("USE iprod")  # Iceberg catalog
spark.sql("CREATE SCHEMA IF NOT EXISTS iprod.test_schema")
df.write.mode("overwrite").format("parquet") \
    .partitionBy('some_date_2') \
    .saveAsTable("iprod.test_schema.example")  # fails with the ValidationException
```
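For context, the full trace below shows the check failing in `PartitionSpec.checkCompatibility` while `TableMetadata.buildReplacement` stages the replace-table transaction, i.e. before any data is written. Until this is addressed, here is a possible workaround, as a sketch only and not a confirmed fix (table and column names are taken from the repro above): rename the column in place so the existing partition spec can still resolve its source column, or drop the table so the overwrite creates a fresh spec.
```
# Workaround sketch (untested): pick ONE of the options below before
# running the overwrite in step 2.

# Option A: rename the existing column in place, so the table's current
# partition spec still finds its source column under the new name.
spark.sql(
    "ALTER TABLE iprod.test_schema.example RENAME COLUMN some_date TO some_date_2")

# Option B: drop the table first, so the write creates a brand-new table
# (and partition spec) instead of replacing the old one.
spark.sql("DROP TABLE IF EXISTS iprod.test_schema.example")

# Then the overwrite from step 2 should no longer reference the old spec:
df.write.mode("overwrite").format("parquet") \
    .partitionBy('some_date_2') \
    .saveAsTable("iprod.test_schema.example")
```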
Full trace:
```
: org.apache.iceberg.exceptions.ValidationException: Cannot find source column for partition field: 1000: some_date: void(1)
    at org.apache.iceberg.exceptions.ValidationException.check(ValidationException.java:46)
    at org.apache.iceberg.PartitionSpec.checkCompatibility(PartitionSpec.java:511)
    at org.apache.iceberg.PartitionSpec$Builder.build(PartitionSpec.java:503)
    at org.apache.iceberg.TableMetadata.reassignPartitionIds(TableMetadata.java:768)
    at org.apache.iceberg.TableMetadata.buildReplacement(TableMetadata.java:790)
    at org.apache.iceberg.BaseMetastoreCatalog$BaseMetastoreCatalogTableBuilder.newReplaceTableTransaction(BaseMetastoreCatalog.java:256)
    at org.apache.iceberg.BaseMetastoreCatalog$BaseMetastoreCatalogTableBuilder.createOrReplaceTransaction(BaseMetastoreCatalog.java:244)
    at org.apache.iceberg.CachingCatalog$CachingTableBuilder.createOrReplaceTransaction(CachingCatalog.java:244)
    at org.apache.iceberg.spark.SparkCatalog.stageCreateOrReplace(SparkCatalog.java:190)
    at org.apache.spark.sql.execution.datasources.v2.AtomicReplaceTableAsSelectExec.run(WriteToDataSourceV2Exec.scala:197)
    at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result$lzycompute(V2CommandExec.scala:40)
    at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result(V2CommandExec.scala:40)
    at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.doExecute(V2CommandExec.scala:55)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:194)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:232)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:229)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:190)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:134)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:133)
    at org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:989)
    at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
    at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
    at org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:110)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:135)
    at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
    at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:135)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:253)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:134)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:68)
    at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:989)
    at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:686)
    at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:619)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:750)
```