wwj6591812 commented on code in PR #8219:
URL: https://github.com/apache/paimon/pull/8219#discussion_r3407477095
##########
paimon-format/src/main/java/org/apache/paimon/format/blob/BlobFormatWriter.java:
##########
@@ -108,6 +124,17 @@ public void addElement(InternalRow element) throws
IOException {
write(tmpBuffer, bytesRead);
bytesRead = in.read(tmpBuffer);
}
+ } catch (IOException | RuntimeException e) {
+ if (writeNullOnMissingFile && isNotFoundError(e)) {
+ LOG.warn(
+ "Failed to read blob from {}, writing NULL for BLOB
field {}.",
+ blobUri(blob),
+ blobFieldName,
+ e);
+ writeNullElement();
Review Comment:
@JingsongLi
Thanks for catching this — you are absolutely right.
The write-time 404 fallback previously wrote `MAGIC_NUMBER_BYTES` to `out`
before opening the referenced blob. When the read failed, we recorded a NULL
entry but left orphan magic bytes in the file. Because NULL entries do not
advance blob offsets, a later non-null blob could get a descriptor offset of 0
while the actual payload started after those stale bytes, which would corrupt
subsequent reads.
I have fixed this by staging the blob payload (magic number + bytes) in
memory first and only appending to `out` after the referenced blob is fully
read successfully. If the read fails with a 404 and
`blob-write-null-on-missing-file` is enabled, we now write NULL without
touching `out` at all.
I also added `testMissingHttpBlobFollowedByValidBlobPreservesReadback`,
which writes a missing HTTP descriptor followed by a valid blob and verifies
that the valid blob can be read back correctly via `BlobFileMeta` /
`BlobFormatReader`.
The primary pre-check path (`UriReaderFactory.exists()` for HTTP +
`FlinkRowWrapper.isNullAt()`) remains unchanged and should prevent most missing
URLs from reaching the write path; the staged write is now a safe fallback for
the remaining cases.
Please take another look when you have a chance. Thanks again!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]