wwj6591812 commented on code in PR #8219:
URL: https://github.com/apache/paimon/pull/8219#discussion_r3407477095


##########
paimon-format/src/main/java/org/apache/paimon/format/blob/BlobFormatWriter.java:
##########
@@ -108,6 +124,17 @@ public void addElement(InternalRow element) throws 
IOException {
                 write(tmpBuffer, bytesRead);
                 bytesRead = in.read(tmpBuffer);
             }
+        } catch (IOException | RuntimeException e) {
+            if (writeNullOnMissingFile && isNotFoundError(e)) {
+                LOG.warn(
+                        "Failed to read blob from {}, writing NULL for BLOB 
field {}.",
+                        blobUri(blob),
+                        blobFieldName,
+                        e);
+                writeNullElement();

Review Comment:
   @JingsongLi 
   
   Thanks for catching this — you are absolutely right.
   
   The write-time 404 fallback previously wrote `MAGIC_NUMBER_BYTES` to `out` 
before opening the referenced blob. When the read failed, we recorded a NULL 
entry but left orphan magic bytes in the file. Because NULL entries do not 
advance blob offsets, a later non-null blob could get a descriptor offset of 0 
while the actual payload started after those stale bytes, which would corrupt 
subsequent reads.
   
   I have fixed this by staging the blob payload (magic number + bytes) in 
memory first and only appending to `out` after the referenced blob is fully 
read successfully. If the read fails with a 404 and 
`blob-write-null-on-missing-file` is enabled, we now write NULL without 
touching `out` at all.
   
   I also added `testMissingHttpBlobFollowedByValidBlobPreservesReadback`, 
which writes a missing HTTP descriptor followed by a valid blob and verifies 
that the valid blob can be read back correctly via `BlobFileMeta` / 
`BlobFormatReader`.
   
   The primary pre-check path (`UriReaderFactory.exists()` for HTTP + 
`FlinkRowWrapper.isNullAt()`) remains unchanged and should prevent most missing 
URLs from reaching the write path; the staged write is now a safe fallback for 
the remaining cases.
   
   Please take another look when you have a chance. Thanks again!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to