JingsongLi commented on code in PR #8219:
URL: https://github.com/apache/paimon/pull/8219#discussion_r3408926391
##########
paimon-format/src/main/java/org/apache/paimon/format/blob/BlobFormatWriter.java:
##########
@@ -97,21 +114,35 @@ public void addElement(InternalRow element) throws
IOException {
}
long previousPos = out.getPos();
- crc32.reset();
-
- write(MAGIC_NUMBER_BYTES);
-
- long blobPos = out.getPos();
+ ByteArrayOutputStream stagedPayload = new ByteArrayOutputStream();
Review Comment:
Could we avoid staging the entire payload in memory here? This writer is
used for normal blob writes too, so the new `ByteArrayOutputStream` +
`toByteArray()` path can allocate roughly 2x the blob size before writing.
Since BLOBs are expected to be large and `blob.target-file-size` can be much
larger than task memory, a valid large blob can now OOM even when
`writeNullOnMissingFile` is false. One safer approach is to open the source
stream before appending anything to `out`; if opening hits 404, write NULL,
otherwise keep the previous streaming copy path.
##########
paimon-api/src/main/java/org/apache/paimon/rest/HttpClientUtils.java:
##########
@@ -86,9 +88,61 @@ private static HttpClientConnectionManager
configureConnectionManager() {
public static InputStream getAsInputStream(String uri) throws IOException {
HttpGet httpGet = new HttpGet(uri);
CloseableHttpResponse response = DEFAULT_HTTP_CLIENT.execute(httpGet);
- if (response.getCode() != 200) {
- throw new RuntimeException("HTTP error code: " +
response.getCode());
+ int statusCode = response.getCode();
+ if (statusCode != HttpStatus.SC_OK) {
+ throw httpError(statusCode);
}
return response.getEntity().getContent();
}
+
+ /**
+ * Checks whether an HTTP resource exists. HEAD is attempted first; when
the server does not
+ * support HEAD, a lightweight GET with {@code Range: bytes=0-0} is used
as fallback.
+ */
+ public static boolean exists(String uri) throws IOException {
+ int statusCode = headStatusCode(uri);
Review Comment:
Is it safe to make the HEAD result authoritative here? Some HTTP resources,
including signed or GET-only URLs, can reject HEAD or return a different status
than GET even though the payload is readable. With this change, a HEAD 404/403
would make `exists()` return false or throw before trying the range GET, so an
existing blob could be treated as missing or fail the write. Could we fall back
to `Range: bytes=0-0` GET for non-success HEAD statuses as well, at least for
403/404, and only return false when the GET also returns 404?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]