haydenflinner commented on issue #2456:
URL: https://github.com/apache/iceberg/issues/2456#issuecomment-1416277410
This appears to be the final barrier to my actually inserting data into
Iceberg. Since I can't find any way to write to Iceberg without involving
Spark, some PySpark is below. Here is my current attempt, after turning off
the Spark checks so that only the Iceberg checks apply. The goal is to somehow
convince the DataFrame that the column is not nullable, without hand-writing
the whole schema or incurring a ridiculous speed penalty. This DataFrame is
only a few KB, yet going anywhere near createDataFrame or RDDs causes a
massive slowdown.
```python
import tempfile

# strftime seems easier than using SQL to cast datetime to the DATE logical type
if 'my_date' in df.columns:
    df.my_date = df.my_date.dt.strftime('%Y-%m-%d')

# Write a parquet file for loading into Spark, because
# spark.createDataFrame(pandas_df) is astoundingly slow
path = tempfile.mktemp(dir="/tmp/hflinner/parquet", suffix=".parquet")
df.to_parquet(path, index=False, allow_truncated_timestamps=True,
              coerce_timestamps='us')
spark = _get_spark()
sdf = spark.read.parquet(path)

# Undo the strftime hack, and also try to fix the nullability of the
# date and path columns.
from pyspark.sql.functions import to_date
if 'my_date' in df.columns:
    #sdf = sdf.select('*', (to_date(sdf.my_date)).alias('my_date'))
    # Spark expects Java-style datetime patterns here, not strftime codes
    sdf = sdf.withColumn('my_date', to_date(sdf.my_date, 'yyyy-MM-dd'))
sdf = sdf.filter(sdf.my_date.isNotNull()).filter(sdf.backed_up_path.isNotNull())
sdf.schema['my_date'].nullable = False
sdf.schema['backed_up_path'].nullable = False
sdf.writeTo(f"dev_catalog.{tablename}").append()
-->
IllegalArgumentException: Cannot write incompatible dataset to table with schema:
table {
1: server_name: optional string
2: my_date: required date
3: backed_up_path: required string
4: backed_up_filesize: optional long
5: num_lines: optional long
}
Provided schema:
table {
1: server_name: optional string
2: my_date: optional date
3: backed_up_path: optional string
4: backed_up_filesize: optional long
}
Problems:
* my_date should be required, but is optional
* backed_up_path should be required, but is optional
```
I don't have a schema object (the schema is written as SQL for Iceberg) and
I don't want to create an RDD.