BramBauwens opened a new issue, #7863:
URL: https://github.com/apache/iceberg/issues/7863

   ### Apache Iceberg version
   
   1.1.0
   
   ### Query engine
   
   Spark
   
   ### Please describe the bug 🐞
   
   While using Apache Iceberg with Spark, I see unexpected behavior after 
running a MERGE statement against an Iceberg table. After the MERGE executes, 
the count of another DataFrame, which was not directly involved in the merge, 
unexpectedly changes to 0.
   
   The sequence of operations in my use case is as follows:
   
   1. Create two DataFrames, source_df and target_df, with a defined set of 
data.
   2. Create an Iceberg table from target_df.
   3. Generate a third DataFrame, df, using a SQL statement that reads from 
source_df and the Iceberg table.
   4. Execute a MERGE operation using SQL to update and insert records from 
source_df into the Iceberg table.
   5. Run assertion checks to verify the counts of source_df, df, and the 
updated Iceberg table.
   6. The assertion on the count of df fails after the MERGE operation, even 
though df is not directly involved in the merge. The count of df should remain 
constant at 2, but instead it changes to 0.
   
   The issue is not limited to this test function; it seems to affect 
DataFrame operations more broadly after a MERGE operation in Apache Iceberg.
   
   Here is a snippet of the code used for reference:
   
   ```python
   def test_merge_into_with_sql(self, spark, temp_iceberg_catalog):
       columns = ["FirstName", "LastName", "Age", "Timestamp"]
   
       target_data = [("John", "Doe", 30, "2023-01-01"),
                      ("Jane", "Doe", 25, "2023-01-01")]
       target_df = spark.createDataFrame(target_data, columns)
       target_df_name = "delta_df_view"
       target_df.createOrReplaceTempView(target_df_name)
   
   
       spark.sql(f"""create table test.{target_df_name} using iceberg as select 
* from {target_df_name}""")
   
       source_data = [("Jack", "Doe", 35, "2023-02-01"),
                      ("Jane", "Doe", 28, "2023-02-01")]
   
       source_df = spark.createDataFrame(source_data, columns)
       source_df_name = "source_df_view"
       source_df_name_clean = "source_df_view_clean"
       source_df.createOrReplaceTempView(source_df_name_clean)
   
       remove_double_sql = f"""
           select * from {source_df_name_clean} d 
           left anti join (
               select * from (
                   select * from {source_df_name_clean} 
                   UNION ALL 
                   select *
                   from test.{target_df_name} scd2 
                   left semi  join {source_df_name_clean} toclean
                   on scd2.FirstName = toclean.FirstName and scd2.LastName = 
toclean.LastName
                   )
               group by FirstName, LastName, Age, Timestamp
               having count(*) > 1
               ) doubles
           on d.FirstName = doubles.FirstName and d.LastName = doubles.LastName
           """
   
       df = spark.sql(remove_double_sql)
   
       assert source_df.count() == 2
       assert df.count() == 2
   
       merge_sql = f"""
       MERGE INTO test.{target_df_name} AS t
       USING {source_df_name} AS s
       ON t.FirstName = s.FirstName AND t.LastName = s.LastName
       WHEN MATCHED THEN UPDATE SET 
         t.Age = s.Age,
         t.Timestamp = s.Timestamp
       WHEN NOT MATCHED THEN INSERT (FirstName, LastName, Age, Timestamp) 
         VALUES (s.FirstName, s.LastName, s.Age, s.Timestamp)
       """
   
       source_df.createOrReplaceTempView(source_df_name)
       spark.sql(merge_sql)
       iceberg_df = spark.sql(f"""select * from test.{target_df_name}""")
   
       assert iceberg_df.count() == 3
       assert source_df.count() == 2
       # This assertion fails: df.count() returns 0 after the MERGE.
       assert df.count() == 2
   ```
   
   I am using Spark 3.2.1 and Iceberg version 1.1.0.
   
   This issue is causing confusion as it is not clear why the count of a 
DataFrame not directly involved in the MERGE operation is affected. I would 
appreciate any insights into why this might be happening, and whether this 
might be a bug related to Apache Iceberg's handling of MERGE operations.
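
To help narrow this down, here is a plain-Python model of the query and the 
MERGE above. It is only a sketch and does not use Spark or Iceberg: the helper 
names (name_key, remove_doubles, merge) are hypothetical, and it assumes that 
df is lazily re-evaluated against the current table contents each time count() 
is called. Under that assumption, the model yields 2 rows before the MERGE and 
0 rows after it, which matches what I observe.

```python
from collections import Counter

# Rows are (FirstName, LastName, Age, Timestamp), as in the test above.
SOURCE = [("Jack", "Doe", 35, "2023-02-01"),
          ("Jane", "Doe", 28, "2023-02-01")]
TARGET_BEFORE = [("John", "Doe", 30, "2023-01-01"),
                 ("Jane", "Doe", 25, "2023-01-01")]


def name_key(row):
    """Join key used by the query: (FirstName, LastName)."""
    return row[0], row[1]


def remove_doubles(source, target):
    """Plain-Python model of remove_double_sql (hypothetical helper)."""
    source_keys = {name_key(r) for r in source}
    # left semi join: keep target rows whose name appears in source
    semi = [r for r in target if name_key(r) in source_keys]
    # UNION ALL, then GROUP BY all columns HAVING count(*) > 1
    counts = Counter(source + semi)
    double_keys = {name_key(r) for r, n in counts.items() if n > 1}
    # left anti join: keep source rows whose name is not among the doubles
    return [r for r in source if name_key(r) not in double_keys]


def merge(target, source):
    """Model of the MERGE: update rows with matching names, insert the rest."""
    by_key = {name_key(r): r for r in target}
    for r in source:
        by_key[name_key(r)] = r
    return list(by_key.values())


target_after = merge(TARGET_BEFORE, SOURCE)
print(len(remove_doubles(SOURCE, TARGET_BEFORE)))  # query against pre-MERGE table
print(len(target_after))                           # table size after the MERGE
print(len(remove_doubles(SOURCE, target_after)))   # same query, post-MERGE table
```

In this model, the merged table contains rows identical to the source rows, so 
every source row is flagged as a double and removed by the anti join. Whether 
Spark actually re-runs the query this way in my setup is something I have not 
confirmed.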
   
   Thank you.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

