Hi Ajantha,
Thanks for replying! The example, however, is in Java, and I figure that syntax
probably only works for Java and Scala. I have tried the equivalent in PySpark
but still get `Column is not iterable` with:
df.writeTo(spark_table_path).using("iceberg").overwrite(col("time") > target_timestamp)
With this one, I get `Column object is not callable` instead:
df.writeTo(spark_table_path).using("iceberg").overwrite(col("time").less(target_timestamp))
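For completeness, here is a self-contained sketch of what I am running (the
table identifier, timestamp, and sample data are placeholders for my real
values):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.getOrCreate()

    # Placeholders for my real table and replacement rows
    spark_table_path = "my_catalog.my_db.my_table"
    target_timestamp = "2024-06-01 00:00:00"
    df = spark.createDataFrame([(1, "2024-06-02 00:00:00")], ["tid", "time"])

    # Intent: replace only the rows where time > target_timestamp,
    # not the whole partition
    df.writeTo(spark_table_path).using("iceberg").overwrite(
        col("time") > target_timestamp
    )
    # -> fails with "Column is not iterable"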
The only example I can find in the PySpark codebase is
https://github.com/apache/spark/blob/master/python/pyspark/sql/tests/test_readwriter.py#L251
but even following that pattern, it throws `Column is not iterable`. I cannot
find any other test case that exercises `overwrite()`.
Thank you!
Best,
Ha
From: Ajantha Bhat <[email protected]>
Sent: Friday, June 28, 2024 3:52 AM
To: [email protected]
Subject: Re: Iceberg - PySpark overwrite with a condition
Hi,
Please refer to this doc:
https://iceberg.apache.org/docs/nightly/spark-writes/#overwriting-data
We also have test cases covering this:
https://github.com/apache/iceberg/blob/91fbcaa62c25308aa815557dd2c0041f75530705/spark/v3.5/spark/src/test/java/org/apache/iceberg/spark/sql/PartitionedWritesTestBase.java#L153
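In PySpark, that filter-based overwrite pattern from the doc would look roughly
like this (untested sketch; the table name is made up and `df` is assumed to be
the DataFrame holding the replacement rows):

    from pyspark.sql.functions import col

    # Overwrite only the rows matching the filter with the contents of df,
    # following the filter-overwrite pattern from the doc above
    df.writeTo("prod.my_app.logs").overwrite(col("level") == "INFO")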
- Ajantha
On Fri, Jun 28, 2024 at 1:00 AM Ha Cao <[email protected]> wrote:
Hello,
I am experimenting with PySpark’s DataFrameWriterV2 overwrite()
(https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameWriterV2.overwrite.html)
to write to an Iceberg table that has existing data in a target partition. My
goal is to overwrite only the specific rows that match a condition, rather than
the entire partition. However, I can’t get it to work with any syntax, and I
keep getting “Column is not iterable”. I have tried:
df.writeTo(spark_table_path).using("iceberg").overwrite(df.tid)
df.writeTo(spark_table_path).using("iceberg").overwrite(df.tid.isin(1))
df.writeTo(spark_table_path).using("iceberg").overwrite(df.tid >= 1)
and all of these syntaxes fail with “Column is not iterable”.
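As a sanity check, the condition I pass really is a Column object, which is
what the DataFrameWriterV2.overwrite() signature documents (this sketch reuses
`df` and `spark_table_path` from above):

    from pyspark.sql import Column

    # The filter expression is a Column, as overwrite(condition) expects
    cond = df.tid >= 1
    print(isinstance(cond, Column))  # True

    df.writeTo(spark_table_path).using("iceberg").overwrite(cond)
    # -> still fails with "Column is not iterable"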
What is the correct syntax for this? It is also possible that the
Iceberg-PySpark integration doesn’t support this kind of overwrite, but I don’t
know how to confirm that.
Thank you so much!
Best,
Ha