JingsongLi commented on code in PR #8219:
URL: https://github.com/apache/paimon/pull/8219#discussion_r3407205482


##########
paimon-format/src/main/java/org/apache/paimon/format/blob/BlobFormatWriter.java:
##########
@@ -108,6 +124,17 @@ public void addElement(InternalRow element) throws 
IOException {
                 write(tmpBuffer, bytesRead);
                 bytesRead = in.read(tmpBuffer);
             }
+        } catch (IOException | RuntimeException e) {
+            if (writeNullOnMissingFile && isNotFoundError(e)) {
+                LOG.warn(
+                        "Failed to read blob from {}, writing NULL for BLOB 
field {}.",
+                        blobUri(blob),
+                        blobFieldName,
+                        e);
+                writeNullElement();

Review Comment:
   The missing-file fallback can leave stale bytes in the blob file. 
`addElement` writes `MAGIC_NUMBER_BYTES` before opening/reading the referenced 
blob, then on a 404 it records the row as NULL and returns without 
rewinding/truncating `out`. A 404 from `newInputStream()` already leaves the 
magic bytes behind; if a later non-null blob is written to the same file, 
`BlobFileMeta` computes that later blob's offset as 0 because the NULL length 
does not advance offsets, while the actual bytes start after the orphan magic 
bytes. That will make subsequent blob descriptors/readback point at the wrong 
offset and corrupt data.
   
   Could we check existence before writing, or stage the blob into a temp 
buffer/file and only append to `out` after the read succeeds? It would also be 
good to add a test that writes a missing HTTP descriptor followed by a valid 
blob and reads the valid blob back.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to