CodingCat opened a new pull request, #5083:
URL: https://github.com/apache/iceberg/pull/5083
When upgrading to Spark 3.2, I found an issue: we cannot use UPDATE/DELETE/MERGE INTO with tables we created with `HadoopTables.create()`. This issue doesn't exist in Spark 3.0 - 3.1.
Previously, we used Spark to read and merge/delete/update Hadoop tables with the following approach:
```scala
val hadoopTableDF = spark.read.format("iceberg").load(path)
hadoopTableDF.createOrReplaceTempView("target")
newDF.createOrReplaceTempView("source")
spark.sql("MERGE INTO target using source on target.id = source.id WHEN
MATCHED THEN... WHEN NOT MATCHED THEN ...")
```
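
For context, the target table in the example above was created directly at a path rather than through a catalog. A minimal sketch of what that looks like with `HadoopTables.create()`; the schema and S3 location here are only illustrative:
```scala
import org.apache.hadoop.conf.Configuration
import org.apache.iceberg.hadoop.HadoopTables
import org.apache.iceberg.types.Types
import org.apache.iceberg.{PartitionSpec, Schema}

// Illustrative schema; real tables carry their own columns.
val schema = new Schema(
  Types.NestedField.required(1, "id", Types.LongType.get()),
  Types.NestedField.optional(2, "data", Types.StringType.get())
)

// Create an unpartitioned Iceberg table directly at a path, with no catalog involved.
val tables = new HadoopTables(new Configuration())
val table = tables.create(schema, PartitionSpec.unpartitioned(), "s3://some-bucket/experiments/target")
```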
The MERGE above doesn't work anymore because the Iceberg analyzer rules for Spark 3.2 only recognize tables that are already registered in a catalog.
While I am aware that catalog usage is recommended in production, I think this compatibility is still needed:
* any breaking change from Spark 3.0/3.1 to Spark 3.2 is undesirable anyway
* strategically, many Delta Lake users (ourselves included) are used to "path tables", and this incompatibility is a barrier to moving over completely
* a pure path table is a lightweight approach for experiments: we do not need to clean up temporary table names in a catalog, only create tables in some S3 bucket with a TTL
* personally, we already have many tables created as Hadoop tables, and moving everything into a catalog (as sketched below) is unnecessary work when we upgrade to Spark 3.2
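
For reference, the catalog-based route that Spark 3.2 currently requires looks roughly like the following. This is a sketch based on Iceberg's documented Hadoop catalog configuration; the catalog name `hadoop_prod`, the warehouse path, and the `db.target` identifier are all illustrative:
```scala
import org.apache.spark.sql.SparkSession

// Register an Iceberg Hadoop catalog with Spark and enable the Iceberg SQL
// extensions that provide MERGE/UPDATE/DELETE support.
val spark = SparkSession.builder()
  .appName("iceberg-catalog-example")
  .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
  .config("spark.sql.catalog.hadoop_prod", "org.apache.iceberg.spark.SparkCatalog")
  .config("spark.sql.catalog.hadoop_prod.type", "hadoop")
  .config("spark.sql.catalog.hadoop_prod.warehouse", "s3://some-bucket/warehouse")
  .getOrCreate()

// The table now has to live under the catalog's warehouse as <db>/<table>.
spark.sql("CREATE TABLE IF NOT EXISTS hadoop_prod.db.target (id BIGINT, data STRING) USING iceberg")

// newDF is the source DataFrame from the example above.
newDF.createOrReplaceTempView("source")

// This works in 3.2 because hadoop_prod.db.target resolves through a registered catalog.
spark.sql(
  """MERGE INTO hadoop_prod.db.target t USING source s ON t.id = s.id
    |WHEN MATCHED THEN UPDATE SET *
    |WHEN NOT MATCHED THEN INSERT *""".stripMargin)
```
This resolves because the target goes through a registered catalog, but it also means existing path tables have to be laid out (or re-created) under the catalog's warehouse, which is the kind of migration overhead described above.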
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]