JingsongLi commented on code in PR #8219:
URL: https://github.com/apache/paimon/pull/8219#discussion_r3407205482
##########
paimon-format/src/main/java/org/apache/paimon/format/blob/BlobFormatWriter.java:
##########
@@ -108,6 +124,17 @@ public void addElement(InternalRow element) throws
IOException {
write(tmpBuffer, bytesRead);
bytesRead = in.read(tmpBuffer);
}
+ } catch (IOException | RuntimeException e) {
+ if (writeNullOnMissingFile && isNotFoundError(e)) {
+ LOG.warn(
+ "Failed to read blob from {}, writing NULL for BLOB
field {}.",
+ blobUri(blob),
+ blobFieldName,
+ e);
+ writeNullElement();
Review Comment:
The missing-file fallback can leave stale bytes in the blob file.
`addElement` writes `MAGIC_NUMBER_BYTES` before opening/reading the referenced
blob, then on a 404 it records the row as NULL and returns without
rewinding/truncating `out`. A 404 from `newInputStream()` already leaves the
magic bytes behind; if a later non-null blob is written to the same file,
`BlobFileMeta` computes that later blob's offset as 0 because the NULL length
does not advance offsets, while the actual bytes start after the orphan magic
bytes. That will make subsequent blob descriptors/readback point at the wrong
offset and corrupt data.
Could we check existence before writing, or stage the blob into a temp
buffer/file and only append to `out` after the read succeeds? It would also be
good to add a test that writes a missing HTTP descriptor followed by a valid
blob and reads the valid blob back.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]