vatanrathi commented on issue #25991:
URL: https://github.com/apache/beam/issues/25991#issuecomment-1491229726
@iemejia You may be right that there is an underlying issue with the AWS
SDK.
This is what I have tried so far:
1. **beam-sdks-java-io-amazon-web-services** - I patched close() to remove
the "drainInputStream" call, and performance is the same across all the latest
versions. However, the earlier AWS warning about "Not all bytes read" then
comes back.
2. **beam-sdks-java-io-amazon-web-services2** - The same patch to skip
draining improved performance, but it is still a lot worse than with sdk1. I
noticed that closing the ResponseInputStream appears to wait for a long time:
in a sample test it took around 6 minutes to close. So I added an "abort()"
call before close/drain, and to my surprise it significantly improved
performance, to the level I would expect from the latest Beam + Spark 3.
_The logs below suggest that the program waited ~**6 min** for the ResponseInputStream to close:_
**21:27:23** dtime="2023-03-30 21:27:15.978",
thread="idle-connection-reaper", lvl="DEBUG",
logger="software.amazon.awssdk.http.apache.internal.net.SdkSslSocket",
ctx="debug", jobId="xxxxx", executionId="xxxxx", closing
xxxxx.s3.ap-southeast-2.amazonaws.com/52.95.131.46:443
**21:33:44** dtime="2023-03-30 21:33:33.406", thread="Executor task launch
worker for task 4.0 in stage 0.0 (TID 4)", lvl="INFO",
logger="org.apache.spark.storage.memory.MemoryStore", ctx="logInfo", jobId="",
executionId="", Block rdd_8_4 stored as values in memory (estimated size 67.4
MiB, free 15.8 GiB)
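
The gap between the two log lines above is indirect evidence. One way to confirm where the time goes would be to time the close call itself. The following is only a minimal sketch: `CloseTimer` and `timeClose` are hypothetical names, not part of Beam or the AWS SDK, and the slow resource in `main` is a stand-in for the real response stream.

```java
import java.io.Closeable;
import java.io.IOException;
import java.time.Duration;

// Hypothetical helper, not part of Beam or the AWS SDK.
class CloseTimer {
    // Closes the resource and returns how long close() took.
    static Duration timeClose(Closeable resource) throws IOException {
        long start = System.nanoTime();
        resource.close();
        return Duration.ofNanos(System.nanoTime() - start);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a slow stream: close() sleeps for 50 ms.
        Closeable slow = () -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        System.out.println("close() took " + timeClose(slow).toMillis() + " ms");
    }
}
```

Wrapping the real close call this way and logging the duration would show whether the wait really happens inside close(), rather than somewhere else between the two log entries.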
After adding an "abort" call before draining
(https://github.com/apache/beam/blob/master/sdks/java/io/amazon-web-services2/src/main/java/org/apache/beam/sdk/io/aws2/s3/S3ReadableSeekableByteChannel.java#L168)
on sdk2, I did not observe any wait.
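
To make the difference between the two close strategies concrete, here is a self-contained sketch. It does not use the AWS SDK: `FakeResponseStream` is a hypothetical stand-in that only counts bytes read, and its `abort()` merely models "drop the connection and skip the remaining bytes". In the real SDK v2 the corresponding call is `ResponseInputStream.abort()`. This is not the actual Beam patch.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;

// Hypothetical stand-in for an SDK response stream. It counts the bytes a
// caller reads, so draining and aborting can be compared.
class FakeResponseStream extends FilterInputStream {
    long bytesRead = 0;
    boolean aborted = false;
    boolean closed = false;

    FakeResponseStream(byte[] body) {
        super(new ByteArrayInputStream(body));
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) {
            bytesRead += n;
        }
        return n;
    }

    // Assumption: models abort() as giving up on the remaining bytes.
    void abort() {
        aborted = true;
    }

    @Override
    public void close() throws IOException {
        closed = true;
        super.close();
    }
}

class AbortBeforeCloseSketch {
    // Drain-then-close: reads every remaining byte before closing.
    static void drainAndClose(FakeResponseStream s) throws IOException {
        byte[] buf = new byte[8192];
        while (s.read(buf, 0, buf.length) != -1) {
            // discard the data
        }
        s.close();
    }

    // Abort-then-close: skips the remaining bytes entirely.
    static void abortAndClose(FakeResponseStream s) throws IOException {
        s.abort();
        s.close();
    }

    public static void main(String[] args) throws IOException {
        FakeResponseStream drained = new FakeResponseStream(new byte[100_000]);
        drainAndClose(drained);
        System.out.println("drain: bytes read = " + drained.bytesRead);

        FakeResponseStream aborted = new FakeResponseStream(new byte[100_000]);
        abortAndClose(aborted);
        System.out.println("abort: bytes read = " + aborted.bytesRead);
    }
}
```

The trade-off, as far as I understand the SDK documentation, is that aborting closes the underlying HTTP connection instead of returning it to the pool, so the next request has to open a new one. For large unread remainders that is likely cheaper than draining; for small ones it may not be.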
However, I am not sure whether adding an "abort" call could cause problems for
my program, or whether it is a bad choice in general.