haydenflinner commented on issue #2040:
URL: https://github.com/apache/iceberg/issues/2040#issuecomment-1414446899
Same thing here, happening whether I use INSERT INTO or the DataFrame API.
How annoying. Is there really no solution besides massaging the DataFrame
schema so that it has the same columns as the Iceberg table? Evolving the
table that way seems awkward.
```python
spark.sql(
    f"""CREATE OR REPLACE TEMPORARY VIEW myview USING parquet
    OPTIONS (path "{path}")"""
)
log.info("calling-insert")
spark.sql(f"INSERT INTO {tablename}({', '.join(df.columns)}) SELECT * FROM myview")

# ---> leads to:
# Table columns: 'server_name', 'backed_up_path', 'backed_up_filesize',
#                'num_lines', 'backed_up_ts', 'start_ts', 'end_ts',
#                'first_x_ts', 'last_x_ts'
# Data columns:  'server_name', 'backed_up_path', 'backed_up_filesize'
```
or
```python
df.to_parquet(path, index=False, allow_truncated_timestamps=True,
              coerce_timestamps='us')
spark = _get_spark()
spark.sql("use dev_catalog")
sdf = spark.read.parquet(path)
sdf.writeTo(f"dev_catalog.{tablename}").append()

# --> leads to:
# AnalysisException: Cannot write incompatible data to table 'dev_catalog.logfiles':
# - Cannot write nullable values to non-null column 'backed_up_path'
# - Cannot find data for output column 'num_lines'
# - Cannot find data for output column 'backed_up_ts'
# - Cannot find data for output column 'start_ts'
# - Cannot find data for output column 'end_ts'
# - Cannot find data for output column 'first_x_ts'
# - Cannot find data for output column 'last_x_ts'
```
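For what it's worth, here is a sketch of the schema-massaging workaround I was alluding to: figure out which table columns are missing from the incoming DataFrame, then pad them with typed NULLs before appending. The helper below is plain Python so it's easy to sanity-check against the column lists from the error above; the Spark portion at the bottom is an untested sketch using the standard `withColumn`/`lit`/`cast` calls, and `spark`, `sdf`, and `tablename` are assumed to exist as in my snippets.

```python
def missing_columns(table_columns, df_columns):
    """Return the table columns absent from the DataFrame, in table order."""
    present = set(df_columns)
    return [c for c in table_columns if c not in present]

# The schemas from the error message above:
table_cols = ['server_name', 'backed_up_path', 'backed_up_filesize',
              'num_lines', 'backed_up_ts', 'start_ts', 'end_ts',
              'first_x_ts', 'last_x_ts']
data_cols = ['server_name', 'backed_up_path', 'backed_up_filesize']

print(missing_columns(table_cols, data_cols))
# -> ['num_lines', 'backed_up_ts', 'start_ts', 'end_ts', 'first_x_ts', 'last_x_ts']

# With Spark, padding would then look roughly like (untested sketch):
#   from pyspark.sql import functions as F
#   schema = spark.table(f"dev_catalog.{tablename}").schema
#   for field in schema:
#       if field.name not in sdf.columns:
#           # cast a NULL literal to the table column's type
#           sdf = sdf.withColumn(field.name, F.lit(None).cast(field.dataType))
#   # select in table-column order so the columns line up
#   sdf = sdf.select([f.name for f in schema])
#   sdf.writeTo(f"dev_catalog.{tablename}").append()
```

Note this only papers over the missing-column errors; it won't fix writing nullable data into a non-null column like 'backed_up_path'.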
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]