LakshSingla commented on code in PR #16481:
URL: https://github.com/apache/druid/pull/16481#discussion_r1617112881
##########
extensions-core/s3-extensions/src/main/java/org/apache/druid/storage/s3/output/RetryableS3OutputStream.java:
##########
@@ -271,50 +328,72 @@ public void close() throws IOException
// This should be emitted as a metric
LOG.info(
"Pushed total [%d] parts containing [%d] bytes in [%d]ms.",
- numChunksPushed,
- resultsSize,
+ numChunksPushed.get(),
+ resultsSize.get(),
pushStopwatch.elapsed(TimeUnit.MILLISECONDS)
);
});
-    closer.register(() -> org.apache.commons.io.FileUtils.forceDelete(chunkStorePath));
+ try (Closer ignored = closer) {
+ if (!error) {
+ pushCurrentChunk();
+ completeMultipartUpload();
+ }
+ }
+ }
- closer.register(() -> {
- try {
- if (resultsSize > 0 && isAllPushSucceeded()) {
- RetryUtils.retry(
- () -> s3.completeMultipartUpload(
-              new CompleteMultipartUploadRequest(config.getBucket(), s3Key, uploadId, pushResults)
- ),
- S3Utils.S3RETRY,
- config.getMaxRetry()
- );
- } else {
- RetryUtils.retry(
- () -> {
-              s3.cancelMultiPartUpload(new AbortMultipartUploadRequest(config.getBucket(), s3Key, uploadId));
- return null;
- },
- S3Utils.S3RETRY,
- config.getMaxRetry()
- );
+ private void completeMultipartUpload()
+ {
+ synchronized (fileLock) {
+ while (pendingFiles.get() > 0) {
+ try {
+          LOG.info("Waiting for lock for completing multipart task for uploadId [%s].", uploadId);
Review Comment:
I think this log level is too high. How is the number of parts per stage
determined? How many parts per stage would there be for 1TB of ingested data,
and how many for 10TB? Also consider situations with 4-5 layers of nested
joins and 10 stages. (Note: ignore the cumulative number of log lines across
workers; consider only the number of log lines per worker.) The number of log
lines per worker that mention "Waiting for the lock for completing multipart
task..." would run pretty large. Also, in your test case I don't think durable
storage was being used for the super sorter, which means the test drastically
understates the number of log lines.
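For a rough sense of scale, assuming a hypothetical 100 MiB chunk size (the
real value comes from the output config and is not stated here), the
per-worker part counts would be roughly:

```java
// Hypothetical back-of-envelope sketch; the actual chunk size is configurable.
public class PartCountSketch
{
  public static void main(String[] args)
  {
    long chunkBytes = 100L * 1024 * 1024; // assumed 100 MiB per part
    long oneTb = 1L << 40;
    long tenTb = 10L * (1L << 40);
    // Each part that waits on the lock could emit one log line per wakeup.
    System.out.println("1TB  -> ~" + (oneTb / chunkBytes) + " parts");
    System.out.println("10TB -> ~" + (tenTb / chunkBytes) + " parts");
  }
}
```

Even under this assumed chunk size, the per-worker line count lands in the
tens of thousands for a 1TB ingest, which is the scale the comment is
concerned about.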
Also, what conditions would these log lines actually help debug? Say a
medium-sized job produces 200 such lines (which doesn't include all the cases
mentioned above). How would those 200 lines aid in debugging the job (as
opposed to the connector)? Note: if the concern is a deadlock, jstack output
is a much better debugging tool.
If you think that a) the number of log lines will stay below (say) 100 for the
largest of jobs, and b) all of those lines will aid someone debugging the job,
we can keep it at INFO. Otherwise, I feel that logging the
acquisition/waiting/release of locks isn't going to help someone looking
through the logs; they will most likely skim past it. DEBUG is a much better
level for such logs.
If we do keep it at INFO, we should reword the message so that a reader of the
logs knows how to act on it.
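A minimal, self-contained sketch of the suggested direction (this is not the
actual Druid code; the `fileLock`/`pendingFiles` names are taken from the
diff, and the plain `System.out` "DEBUG" line stands in for a real
`LOG.debug` call):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: the guarded-wait loop from the diff, with the
// lock-wait message demoted to debug level so large jobs don't flood INFO.
public class MultipartWaitSketch
{
  private final Object fileLock = new Object();
  private final AtomicInteger pendingFiles = new AtomicInteger(3);

  void completeMultipartUpload() throws InterruptedException
  {
    synchronized (fileLock) {
      while (pendingFiles.get() > 0) {
        // Emitted once per wakeup, which can be O(parts) per worker on
        // large ingests; hence DEBUG rather than INFO.
        System.out.println("DEBUG waiting, pending=" + pendingFiles.get());
        fileLock.wait(100); // wake periodically; real code would be notified
        pendingFiles.decrementAndGet(); // simulate one upload finishing
      }
    }
    System.out.println("completed");
  }

  public static void main(String[] args) throws InterruptedException
  {
    new MultipartWaitSketch().completeMultipartUpload();
  }
}
```

The wait/notify pattern itself is unchanged from the diff; only the level and
wording of the message differ.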
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]