impothnis opened a new issue, #14625:
URL: https://github.com/apache/iceberg/issues/14625
### Apache Iceberg version
1.6.1
### Query engine
Spark
### Please describe the bug 🐞
When calling DataFrame.writeTo(...).using("iceberg").createOrReplace()
against object storage (Google Cloud Storage, ADLS Gen2, Fabric OneLake), the
write fails with a FileNotFoundException while trying to read
metadata/version-hint.text. Observed behavior: the operation raises the error
and the job fails, even though data and metadata files (including a
version-hint.text file) are created at the table location. create() and
append() succeed; the problem appears specific to createOrReplace() /
createIfNotExists semantics on these object stores.
> Environment
Repository: apache/iceberg
Spark: 3.5.1
Scala: 2.12
Catalog: HadoopCatalog (spark.sql.catalog.spark_catalog.type = "hadoop")
Iceberg version: <please fill: e.g. 1.3.0>
Hadoop version: <please fill>
GCS connector version: <please fill>
Azure ADLS Gen2 / OneLake client versions: <please fill>
OS / JVM: <please fill>
> How to reproduce
```scala
// Storage auth/config (GCS / ADLS Gen2 / OneLake)
spark.conf.set("fs.gs.impl", "<...>")
spark.conf.set("fs.AbstractFileSystem.gs.impl", "<...>")
spark.conf.set("fs.gs.project.id", "<...>")
spark.conf.set("fs.gs.auth.type", "<...>")
spark.conf.set("google.cloud.auth.service.account.enable", "<...>")
spark.conf.set("google.cloud.auth.service.account.json.keyfile", "<...>")
spark.conf.set("fs.gs.path.encoding", "<...>")

// Iceberg catalog
spark.conf.set("spark.sql.catalog.spark_catalog",
  "org.apache.iceberg.spark.SparkCatalog")
spark.conf.set("spark.sql.catalog.spark_catalog.type", "hadoop")
spark.conf.set("spark.sql.catalog.spark_catalog.warehouse", "<warehouse-path>")

import spark.implicits._
val data = Seq((4, "Liam"), (5, "Noel"))
val df = data.toDF("id", "name")

// Intended: create or replace the table
df.writeTo("iceberg_standalone").using("iceberg").createOrReplace()
```
> Observed errors (examples)
**Google Cloud:**
Error reading version hint file <redacted>/.../metadata/version-hint.text
java.io.FileNotFoundException: Item not found:
'<redacted>/.../metadata/version-hint.text'. Note, it is possible that the live
version is still available but the requested generation is deleted. at
com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageExceptions.createFileNotFoundException(...)
**OneLake:**
Caused by: Operation failed: "The specified path does not exist.", 404,
HEAD,
https://onelake.dfs.fabric.microsoft.com/<redacted>/.../metadata/version-hint.text?...
**ADLS Gen2:**
WARN HadoopTableOperations: Error reading version hint file
abfss://<redacted>/.../metadata/version-hint.text
java.io.FileNotFoundException: Operation failed: "The specified path does not
exist.", 404, HEAD, ...
> Important note
Despite the exception and the failed write, the storage path ends up
containing data and metadata files, including a version-hint.text file with a
valid value. create() and append() work as expected; the issue only appears for
createOrReplace() when the table does not already exist.
> Expected behavior
createOrReplace() should:
- create the table if it does not exist, write the data, and return success; or
- if the table exists, atomically replace it as documented.
It should not fail with a FileNotFoundException when the table does not
already exist on object stores.
> Possibly related
https://github.com/apache/iceberg/issues/1496
> Additional details / guesses
- It appears createOrReplace() reads version-hint.text (or otherwise probes
existing metadata), and that probe returns a FileNotFound/404 for object-store
HEAD calls on the not-yet-existing path. The error seems to be either treated
as a fatal I/O exception or propagated up the call stack, failing the
operation even though metadata is created successfully afterwards.
- The behavior may come down to object-store connector semantics for HEAD/GET
on non-existent paths (a 404 versus a "not found" indicator) and how Iceberg's
TableOperations handles those exceptions during createOrReplace().
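To make the guess above concrete, here is a minimal sketch of the two ways a table-operations layer could react to a missing version-hint file. This is a hypothetical illustration, not Iceberg's actual code; the object, method, and file names are invented for this example.

```scala
import java.io.FileNotFoundException
import java.nio.file.{Files, Path}

// Hypothetical sketch only; not taken from Iceberg's HadoopTableOperations.
object VersionHintProbe {
  // Fragile variant: a missing hint file surfaces as a fatal error.
  // This matches the failure mode reported above, where the object
  // store answers the HEAD probe with a 404.
  def readHintOrFail(hint: Path): Int = {
    if (!Files.exists(hint))
      throw new FileNotFoundException(s"Item not found: $hint")
    new String(Files.readAllBytes(hint)).trim.toInt
  }

  // Tolerant variant: a missing hint file is interpreted as "no
  // committed version yet", so a create-or-replace can fall through
  // to a plain create instead of failing.
  def readHintOrZero(hint: Path): Int =
    try readHintOrFail(hint)
    catch { case _: FileNotFoundException => 0 }
}
```

On a fresh table location the tolerant variant returns 0 and the create path can proceed, which is the behavior the "Expected behavior" section asks for.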
> Workarounds tried
- Using create() and append(): both succeed.
- Manually checking for table existence in the integration layer and calling
create() only when the table is absent: this works, but the integration layer
currently assumes createOrReplace() handles that atomically.
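The second workaround can be sketched as follows, assuming the same `spark` session and DataFrame `df` as in the repro above. `spark.catalog.tableExists` and `DataFrameWriterV2.replace()` are standard Spark APIs; note that only create() and append() were actually verified on these object stores, so the replace() branch is an untested assumption here.

```scala
// Workaround sketch: emulate createOrReplace() by probing the catalog
// first, so the create path never has to read version-hint.text for a
// table that does not exist yet.
val table = "iceberg_standalone"
if (spark.catalog.tableExists(table)) {
  // Table already present: replace its contents.
  // (Untested on the affected stores; only create()/append() were verified.)
  df.writeTo(table).using("iceberg").replace()
} else {
  // Table absent: plain create(), which was observed to work.
  df.writeTo(table).using("iceberg").create()
}
```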
> Request
Can the maintainers investigate the createOrReplace() behavior described
above?
### Willingness to contribute
- [ ] I can contribute a fix for this bug independently
- [x] I would be willing to contribute a fix for this bug with guidance from
the Iceberg community
- [x] I cannot contribute a fix for this bug at this time
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]