JingsongLi commented on code in PR #8219:
URL: https://github.com/apache/paimon/pull/8219#discussion_r3410550754
##########
paimon-format/src/main/java/org/apache/paimon/format/blob/BlobFormatWriter.java:
##########
@@ -96,22 +112,35 @@ public void addElement(InternalRow element) throws IOException {
return;
}
- long previousPos = out.getPos();
- crc32.reset();
+ SeekableInputStream in;
+ try {
+ in = blob.newInputStream();
+ } catch (IOException | RuntimeException e) {
+ if (writeNullOnMissingFile && isNotFoundError(e)) {
+ LOG.warn(
+ "Failed to open blob from {}, writing NULL for BLOB field {}.",
+ blobUri(blob),
+ blobFieldName,
+ e);
+ writeNullElement();
+ return;
+ }
+ throw e;
+ }
write(MAGIC_NUMBER_BYTES);
Review Comment:
Looks like `crc32.reset()` was dropped while removing the in-memory staging
path. The CRC field is still written per blob record below, and the old
streaming implementation reset the CRC before writing each record. Without that
reset, the second and later non-null blobs will write a checksum that also
includes bytes from previous blobs, so the blob file contains invalid
per-record CRC values. Could we reset the CRC before writing
`MAGIC_NUMBER_BYTES` for each non-null blob?
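To illustrate the concern, here is a minimal standalone sketch of how a shared `java.util.zip.CRC32` behaves with and without a per-record reset. It is not Paimon code: `CrcResetSketch`, `perRecordCrcs`, and `standaloneCrc` are hypothetical names, and the records are plain byte arrays rather than the real blob record layout.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical stand-in for per-record CRC handling; not the BlobFormatWriter implementation.
final class CrcResetSketch {

    /** Checksums each record with one shared CRC32, optionally resetting before each record. */
    static long[] perRecordCrcs(byte[][] records, boolean resetPerRecord) {
        CRC32 crc32 = new CRC32();
        long[] crcs = new long[records.length];
        for (int i = 0; i < records.length; i++) {
            if (resetPerRecord) {
                // Start every record from a clean checksum state.
                crc32.reset();
            }
            crc32.update(records[i], 0, records[i].length);
            // Without the reset, this value also covers all earlier records.
            crcs[i] = crc32.getValue();
        }
        return crcs;
    }

    /** Reference value: CRC of a single record computed in isolation. */
    static long standaloneCrc(byte[] record) {
        CRC32 crc32 = new CRC32();
        crc32.update(record, 0, record.length);
        return crc32.getValue();
    }

    public static void main(String[] args) {
        byte[][] records = {
            "first blob".getBytes(StandardCharsets.UTF_8),
            "second blob".getBytes(StandardCharsets.UTF_8)
        };
        long expectedSecond = standaloneCrc(records[1]);
        long[] withReset = perRecordCrcs(records, true);
        long[] withoutReset = perRecordCrcs(records, false);
        System.out.println("with reset, second CRC valid: " + (withReset[1] == expectedSecond));
        System.out.println("without reset, second CRC valid: " + (withoutReset[1] == expectedSecond));
    }
}
```

The first record's checksum is the same in both modes, which is why a single-blob test would not catch the missing reset. Only the second and later records diverge from their standalone values.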