davseitsev opened a new issue #1286:
URL: https://github.com/apache/iceberg/issues/1286
I have a situation where two parallel processes modify a single table.
The first process is a Spark Structured Streaming query which reads from Kafka and
continuously appends to the table with a 10s trigger period.
The other process is continuous compaction, which works in the following way:
1. List files in the latest partition and take **N** GB of data. Small files
have higher priority.
2. Run a Spark job which reads the collected files and produces one big file
in the partition path.
3. Atomically replace the small files with the big one like this:
```java
table.newRewrite()
    .rewriteFiles(compactingFiles, resultFiles)
    .commit();
```
This approach stopped working once the partition grew and the `RewriteFiles`
operation became slow. When `SnapshotProducer` tries to commit the rewrite, it
fails with the exception:
> Base metadata location
'db_path/the_table/metadata/01334-c9c69f57-eb55-4e34-bd5e-beeab380c10c.metadata.json'
is not same as the current table metadata location
'db_path/the_table/metadata/metadata/01335-1070b870-5ea3-4dc6-9708-493b724ee8f1.metadata.json'
for db_path.the_table
It then refreshes the table state and tries the commit again. But the streaming
process has already appended to the table, so the commit fails again. After 3
failed attempts the compaction job fails.
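To illustrate the race, here is a minimal, self-contained sketch (this is NOT the Iceberg API; `attemptCompaction`, the version counter, and all timings are hypothetical). A commit is modeled as a compare-and-set on a metadata version: a fast "streaming" writer bumps the version every 20 ms, while each slow "compaction" attempt needs ~100 ms, so every one of the 3 attempts sees a stale base and loses the race:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical, simplified model (NOT the Iceberg API): a commit succeeds
// only if the metadata version has not moved since the base was read.
public class OptimisticCommitDemo {

    // A fast streaming writer bumps the version every 20 ms while a slow
    // compaction commit needs ~100 ms per attempt; with at most 3 attempts
    // the compactor always loses the race, mirroring the reported failure.
    public static boolean attemptCompaction() throws InterruptedException {
        AtomicInteger version = new AtomicInteger(0);

        Thread streamer = new Thread(() -> {
            try {
                for (int i = 0; i < 50; i++) {   // continuous small appends
                    version.incrementAndGet();
                    Thread.sleep(20);
                }
            } catch (InterruptedException ignored) { }
        });
        streamer.start();

        boolean committed = false;
        for (int attempt = 1; attempt <= 3 && !committed; attempt++) {
            int base = version.get();            // read base metadata version
            Thread.sleep(100);                   // slow rewrite planning
            committed = version.compareAndSet(base, base + 1);  // CAS commit
        }
        streamer.join();
        return committed;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(attemptCompaction()
            ? "committed" : "failed after 3 attempts");
    }
}
```

The key point is that retrying does not help: the refresh-and-retry window is longer than the interval between streaming commits, so the base is stale again by the time the next attempt lands.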
In my opinion the problem is that `HiveTableOperations` releases the table lock
between successive attempts. That allows the concurrent streaming process to
append new data between attempts, which invalidates each retry. Keeping the
table locked across commit attempts would fix this.
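Extending the sketch above to the proposed behavior (again hypothetical, not actual `HiveTableOperations` code): if the compactor keeps the lock for the whole retry loop, the streaming writer blocks instead of slipping in a commit, so the refreshed base stays valid and the compaction commits:

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch of the proposed fix (not actual HiveTableOperations
// code): the compactor holds the table lock across ALL commit attempts, so
// the streaming writer cannot append between retries.
public class LockedRetryDemo {

    public static boolean compactWithLockHeld() throws InterruptedException {
        ReentrantLock tableLock = new ReentrantLock();
        AtomicInteger version = new AtomicInteger(0);

        Thread streamer = new Thread(() -> {
            for (int i = 0; i < 10; i++) {
                tableLock.lock();                // each append takes the lock
                try {
                    version.incrementAndGet();
                } finally {
                    tableLock.unlock();
                }
                try { Thread.sleep(20); } catch (InterruptedException e) { return; }
            }
        });
        streamer.start();
        Thread.sleep(50);                        // let some appends land first

        tableLock.lock();                        // held across every attempt
        boolean committed = false;
        try {
            for (int attempt = 1; attempt <= 3 && !committed; attempt++) {
                int base = version.get();        // refresh base under the lock
                Thread.sleep(100);               // slow rewrite, still locked
                committed = version.compareAndSet(base, base + 1);
            }
        } finally {
            tableLock.unlock();
        }
        streamer.join();
        return committed;                        // first attempt now succeeds
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(compactWithLockHeld() ? "committed" : "failed");
    }
}
```

The trade-off, of course, is that the streaming writer stalls for the full duration of the slow rewrite commit instead of only for the final metadata swap.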
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]