malon64 opened a new issue, #16741:
URL: https://github.com/apache/iceberg/issues/16741
### Feature Request / Improvement
Iceberg Java exposes a high-level API for create-or-replace table semantics
through:
```java
catalog.buildTable(identifier, schema)
    .createOrReplaceTransaction();
```
This is very useful because the client does not need to implement the full
replacement logic manually. The Java implementation can decide whether the
operation is a create or a replace, build the correct table metadata, assign
field IDs correctly, prepare the replacement transaction, and commit it using
Iceberg’s transaction model.
However, for non-Java clients that interact with Iceberg only through the
REST Catalog API, there does not seem to be an equivalent high-level primitive.
The REST API exposes the lower-level commit mechanism through
`CommitTableRequest` / `UpdateTableRequest`, but it does not expose a staged
create-or-replace transaction workflow equivalent to Java’s
`createOrReplaceTransaction()`.
This creates a difficult situation for clients such as C++ engines:
* Implementing `CREATE OR REPLACE TABLE` as `DROP TABLE` followed by `CREATE
TABLE` is not atomic.
* If the drop succeeds and the create fails, the table is lost.
* If another writer acts between the two calls, the catalog state can become
inconsistent from the user’s point of view.
* Reimplementing Java’s replacement planning logic outside the Java library
is complex and easy to get wrong.
* A naïve REST metadata update is also not necessarily equivalent to a true
Iceberg replace transaction.
### Motivation
I am looking at this from the perspective of a non-Java REST client,
specifically a C++ engine integration.
For Java engines, this is mostly hidden behind the Iceberg Java API. For
example, Trino can rely on the Iceberg Java catalog/transaction APIs, even when
using a REST catalog.
For C++ clients, there is no equivalent API available through the REST
specification. The client either has to:
1. fall back to unsafe `DROP` + `CREATE` behavior, or
2. manually reconstruct the replacement metadata, requirements,
snapshot/reference changes, field IDs, partition spec handling, and concurrency
checks.
Neither option is ideal.
The missing piece is not necessarily the final commit endpoint. The existing
`UpdateTableRequest` / `CommitTableRequest` mechanism may still be the right
final commit primitive. What is missing is a standardized staged planning
operation that gives non-Java clients the same safe replacement semantics that
Java clients get from `createOrReplaceTransaction()`.
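To make the manual path concrete, here is a minimal sketch of the kind of body a client would have to assemble by hand for the existing commit endpoint. The `assert-table-uuid` requirement and the `add-schema` / `set-current-schema` actions come from the REST spec as I understand it; the class name, the placeholder UUID, and the choice of updates are assumptions for illustration only. A real replacement would also need partition spec, sort order, snapshot, and ref changes.

```java
import java.util.List;

final class ReplaceCommitSketch {
    private ReplaceCommitSketch() {}

    /** Builds a commit body guarded by the UUID of the table the replacement was planned against. */
    static String commitBody(String tableUuid, List<String> updates) {
        return "{\n"
            + "  \"requirements\": [\n"
            + "    {\"type\": \"assert-table-uuid\", \"uuid\": \"" + tableUuid + "\"}\n"
            + "  ],\n"
            + "  \"updates\": [\n    "
            + String.join(",\n    ", updates)
            + "\n  ]\n"
            + "}";
    }

    public static void main(String[] args) {
        // Illustrative updates only: the empty field list and the placeholder UUID are assumptions.
        List<String> updates = List.of(
            "{\"action\": \"add-schema\", \"schema\": "
                + "{\"type\": \"struct\", \"schema-id\": 1, \"fields\": []}}",
            // schema-id -1 refers to the schema added earlier in the same request.
            "{\"action\": \"set-current-schema\", \"schema-id\": -1}");
        System.out.println(commitBody("hypothetical-uuid", updates));
    }
}
```

Even this reduced form leaves the hard part to the client: deciding which updates and requirements make the result equivalent to what Java's replace transaction would produce.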
### Why a simple final `create-or-replace` endpoint may not be enough
In the Java REST implementation, the choice between create and replace
cannot simply be deferred to the final commit step: the two paths can assign
different field IDs, and those IDs may already be used in data and metadata
files written before the transaction is committed.
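A minimal sketch of why the two paths diverge, assuming flat schemas and assuming that a replace reuses the ID of any column whose name already exists in the base schema and numbers new columns above the table's last assigned column ID. `FieldIdSketch` and its methods are hypothetical names, not Iceberg APIs.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class FieldIdSketch {
    private FieldIdSketch() {}

    /** Create path: no base table, so columns are numbered from 1 in declaration order. */
    static Map<String, Integer> assignForCreate(List<String> columns) {
        Map<String, Integer> ids = new LinkedHashMap<>();
        int next = 1;
        for (String column : columns) {
            ids.put(column, next++);
        }
        return ids;
    }

    /**
     * Replace path (assumed behavior): a column present in the base schema keeps its ID;
     * any other column gets a new ID above the last assigned column ID.
     */
    static Map<String, Integer> assignForReplace(
            List<String> columns, Map<String, Integer> baseIds, int lastColumnId) {
        Map<String, Integer> ids = new LinkedHashMap<>();
        int next = lastColumnId;
        for (String column : columns) {
            Integer existing = baseIds.get(column);
            if (existing != null) {
                ids.put(column, existing);
            } else {
                ids.put(column, ++next);
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        List<String> requested = List.of("id", "amount", "region");
        Map<String, Integer> base = Map.of("id", 1, "name", 2, "region", 3);
        System.out.println(assignForCreate(requested));          // prints {id=1, amount=2, region=3}
        System.out.println(assignForReplace(requested, base, 3)); // prints {id=1, amount=4, region=3}
    }
}
```

Here `amount` is field 2 on the create path and field 4 on the replace path, so a data file written before the client knows which path applies could carry the wrong ID.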
So this probably should not be just:
```http
POST /v1/{prefix}/namespaces/{namespace}/tables/{table}/create-or-replace
```
as a final commit call.
A more useful design may be a staged operation, for example a REST-level
equivalent of:
```java
createOrReplaceTransaction()
```
that lets the client know, before writing data files, whether the operation
is being planned as a create or a replace and what metadata/IDs should be used.
### Possible direction
Would it make sense for the REST Catalog spec to expose a staged
create-or-replace / staged replace transaction workflow?
For example, something conceptually similar to:
```http
POST /v1/{prefix}/namespaces/{namespace}/tables/{table}/stage-create-or-replace
```
or an extension of the existing staged create flow with a create-or-replace
mode.
The response could provide enough information for non-Java clients to
continue safely, such as:
* whether the operation is a create or a replace
* the planned table metadata
* assigned schema/field IDs
* partition spec and sort order IDs
* table location
* optimistic concurrency requirements to assert at commit time
* metadata updates needed for the final commit
* credentials/config needed to write data files
Then the final commit could still use the existing REST commit mechanism.
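As a rough sketch of how a client might consume such a response, the hypothetical `StagedPlan` below carries only the operation kind, the base table UUID, and the location, and derives the guard requirement for the final commit. None of these class or field names come from the REST spec. The `assert-create` and `assert-table-uuid` requirement types are existing ones; using them this way for a staged plan is an assumption.

```java
final class StagedPlanSketch {
    private StagedPlanSketch() {}

    enum Operation { CREATE, REPLACE }

    /** Hypothetical staged response; field names are illustrative, not from the REST spec. */
    static final class StagedPlan {
        final Operation operation;
        final String tableUuid; // UUID of the table being replaced; null for a create
        final String location;

        StagedPlan(Operation operation, String tableUuid, String location) {
            if (operation == Operation.REPLACE && tableUuid == null) {
                throw new IllegalArgumentException("a replace plan needs the base table UUID");
            }
            this.operation = operation;
            this.tableUuid = tableUuid;
            this.location = location;
        }

        /** Requirement the client would attach to the final commit for this plan. */
        String guardRequirement() {
            if (operation == Operation.CREATE) {
                return "{\"type\": \"assert-create\"}";
            }
            return "{\"type\": \"assert-table-uuid\", \"uuid\": \"" + tableUuid + "\"}";
        }
    }

    public static void main(String[] args) {
        // Placeholder UUID and location, for illustration only.
        StagedPlan create = new StagedPlan(Operation.CREATE, null, "s3://bucket/db/t");
        StagedPlan replace = new StagedPlan(Operation.REPLACE, "hypothetical-uuid", "s3://bucket/db/t");
        System.out.println(create.guardRequirement());
        System.out.println(replace.guardRequirement());
    }
}
```

In a full design the server would presumably return the complete requirement and update lists itself; the point of the sketch is only that the client learns the operation kind before it writes any files.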
### Expected behavior
A REST-level staged create-or-replace primitive should allow a non-Java
client to implement:
```sql
CREATE OR REPLACE TABLE t AS SELECT ...
```
without using `DROP TABLE` + `CREATE TABLE`.
For an existing table, the expected behavior would be:
* preserve transactionality at the catalog metadata level
* avoid any window where the table disappears
* build replacement metadata correctly
* fail on concurrent conflicting changes instead of silently overwriting them
* allow the new snapshot to become the current table state
* optionally keep previous snapshots/history according to Iceberg semantics
For a missing table, it should behave like a staged create.
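The "fail on concurrent conflicting changes" and "no window where the table disappears" expectations can be illustrated with an in-memory compare-and-swap. This is only an analogy under stated assumptions: `OptimisticCommitSketch` and `TableState` are hypothetical, and a real REST catalog would evaluate commit requirements server-side rather than compare object references.

```java
import java.util.concurrent.atomic.AtomicReference;

final class OptimisticCommitSketch {
    /** Immutable stand-in for table metadata: a UUID and a current snapshot ID. */
    static final class TableState {
        final String uuid;
        final long currentSnapshotId;

        TableState(String uuid, long currentSnapshotId) {
            this.uuid = uuid;
            this.currentSnapshotId = currentSnapshotId;
        }
    }

    private final AtomicReference<TableState> current;

    OptimisticCommitSketch(TableState initial) {
        this.current = new AtomicReference<>(initial);
    }

    /** Returns the state currently visible to readers; it is never absent. */
    TableState load() {
        return current.get();
    }

    /** Swaps in the replacement only if the table is still the state the writer planned against. */
    boolean commit(TableState plannedAgainst, TableState replacement) {
        return current.compareAndSet(plannedAgainst, replacement);
    }

    public static void main(String[] args) {
        OptimisticCommitSketch catalog =
            new OptimisticCommitSketch(new TableState("hypothetical-uuid", 10L));
        TableState base = catalog.load();

        // A concurrent writer commits first.
        boolean first = catalog.commit(base, new TableState("hypothetical-uuid", 11L));
        // The replace was planned against the now-stale base, so it is rejected.
        boolean second = catalog.commit(base, new TableState("hypothetical-uuid", 12L));

        System.out.println(first + " " + second + " " + catalog.load().currentSnapshotId);
    }
}
```

Unlike `DROP TABLE` + `CREATE TABLE`, a reader calling `load()` at any point sees either the old state or the new one, and the losing writer gets an explicit failure it can retry against fresh metadata.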
### Relationship to existing issues
I saw issue #16232, which discusses correctness problems around `REPLACE
TABLE` transactions and concurrent committed changes.
This proposal is related, but not the same. #16232 is about making replace
transactions safe/correct. This issue is about exposing a high-level REST
primitive so non-Java clients can access equivalent create-or-replace
transaction semantics without reimplementing the Java logic or falling back to
non-atomic drop/create behavior.
### Question
Would the Iceberg community be open to adding a REST Catalog staged
create-or-replace / staged replace transaction API?
If yes, what would be the preferred design direction?
* Extend staged create?
* Add a staged replace endpoint?
* Add a staged create-or-replace endpoint?
* Or is the expectation that every non-Java REST client should reconstruct
the replacement transaction locally using `CommitTableRequest` requirements and
updates?
### Query engine
Other
### Willingness to contribute
- [x] I can contribute this improvement/feature independently
- [x] I would be willing to contribute this improvement/feature with
guidance from the Iceberg community
- [ ] I cannot contribute this improvement/feature at this time
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]