steFaiz commented on code in PR #8191:
URL: https://github.com/apache/paimon/pull/8191#discussion_r3393044213
##########
paimon-core/src/main/java/org/apache/paimon/io/FormatTableSingleFileWriter.java:
##########
@@ -64,6 +65,10 @@ public FormatTableSingleFileWriter(
out = fileIO.newTwoPhaseOutputStream(path, false);
writer = factory.create(out, compression);
}
+ if (writer instanceof FileAwareFormatWriter) {
+ FileAwareFormatWriter fileAwareFormatWriter = (FileAwareFormatWriter) writer;
+ fileAwareFormatWriter.setFile(path);
Review Comment:
@JingsongLi Thanks for your review! This scenario is a bit tricky, though.
Currently, FormatTable on DFS uses RENAME to implement two-phase commit, so the
path set here is not real yet: it only exists after commit. In that case, if the
commit fails and is aborted, it is meaningless to retain the written files,
because they are in the temp dir and do not match the `path` stored in the
BlobDescriptors.
(In Python, however, two-phase commit is not implemented, so I still retain the
written files on abort.)
Here are my thoughts:
1. Maybe we could explicitly warn users that, in FormatTable, the returned
BlobDescriptors are only valid after commit? Or maybe introduce a
PendingBlobDescriptor for format tables, identical to BlobDescriptors except
that BlobRef could warn users that the descriptor is still pending, rather than
throwing `path not exists`.
2. I think this "visible after commit" behavior is acceptable for batch
scenarios. For example, in Spark/Ray, the FormatTable commit is part of the job,
so exported descriptors will be visible only after the job has finished
successfully.
3. Or maybe we should not use two-phase commit for BlobFormatTables at all, and
just filter out the broken files on read?
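
To make option 1 a bit more concrete, here is a rough sketch of what a "pending" descriptor could look like on top of a rename-based commit. Everything in it is hypothetical: `PendingDescriptorSketch`, `PendingDescriptor`, and `RenameCommitter` are illustrative names, not Paimon classes, and it uses plain `java.nio.file` on a local directory instead of `FileIO` and a real two-phase output stream. It only shows the intended behavior: the descriptor knows the final path up front, but resolving it before commit fails with an explicit "pending" message instead of a missing-path error.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical sketch only; none of these names are Paimon APIs. */
class PendingDescriptorSketch {

    /** Records the final path and whether the rename has happened yet. */
    static final class PendingDescriptor {
        private final Path targetPath;
        private boolean committed;

        PendingDescriptor(Path targetPath) {
            this.targetPath = targetPath;
        }

        void markCommitted() {
            committed = true;
        }

        boolean isPending() {
            return !committed;
        }

        /** Fails with an explicit "pending" message before commit. */
        Path resolve() {
            if (!committed) {
                throw new IllegalStateException(
                        "Descriptor for " + targetPath + " is pending; it is only valid after commit");
            }
            return targetPath;
        }
    }

    /** Writes to a temp file, then renames it onto the target path on commit. */
    static final class RenameCommitter {
        private final Path tempPath;
        private final Path targetPath;
        private final PendingDescriptor descriptor;

        RenameCommitter(Path tempPath, Path targetPath) {
            this.tempPath = tempPath;
            this.targetPath = targetPath;
            this.descriptor = new PendingDescriptor(targetPath);
        }

        /** Phase 1: data lands in the temp file; the target path does not exist yet. */
        PendingDescriptor write(byte[] data) throws IOException {
            Files.write(tempPath, data);
            return descriptor;
        }

        /** Phase 2: the rename makes the target path real. */
        void commit() throws IOException {
            Files.move(tempPath, targetPath);
            descriptor.markCommitted();
        }

        /** On abort the temp file is dropped; the descriptor stays pending forever. */
        void abort() throws IOException {
            Files.deleteIfExists(tempPath);
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("format-table-sketch");
        RenameCommitter committer =
                new RenameCommitter(dir.resolve(".tmp-blob"), dir.resolve("blob-0"));
        PendingDescriptor descriptor =
                committer.write("payload".getBytes(StandardCharsets.UTF_8));
        System.out.println("pending before commit: " + descriptor.isPending());
        committer.commit();
        System.out.println("readable after commit: " + Files.exists(descriptor.resolve()));
    }
}
```

With this shape, an aborted commit leaves the descriptor permanently pending, so a caller gets a clear state error rather than a confusing missing-file failure.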
Thanks again for your review! I'll close this PR and find another way if you
think this scenario is not suitable for Paimon FormatTable.