vatanrathi commented on issue #25991:
URL: https://github.com/apache/beam/issues/25991#issuecomment-1493309066
@aromanenko-dev Sorry if I was not clear before. Let me explain.
We are currently on Beam 2.23.0, and the given job finishes in around
10 minutes. I tried upgrading to 2.45.0 and noticed performance issues with
both AWS SDK v1 and v2. So I upgraded versions step by step, and that is how I
found that performance started to degrade from version 2.31.0. That is also
where I noticed this change, which I believe is the root cause.
Below are my final findings, based on several iterations of tests.
1. With AWS SDK v1, if the drainInputStream call is removed from close(), the
execution time is the same across versions.
2. With SDK v2, however, with the drainInputStream call in close(), the
pipeline runs for hours, whereas it takes only ~10 minutes to finish on AWS
SDK v1. If the drainInputStream call is removed, performance improves, but the
job still takes ~30 minutes to finish. But if s3ResponseInputStream.abort() is
called before s3ResponseInputStream.close() in close(), performance improves
significantly and the pipeline finishes within 3 minutes.
```
@Override
public void close() throws IOException {
  if (s3ResponseInputStream != null) {
    // Added: abort the connection before draining and closing the stream.
    s3ResponseInputStream.abort();
    drainInputStream(s3ResponseInputStream);
    s3ResponseInputStream.close();
  }
  open = false;
}
```
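To illustrate why the order of abort() and the drain matters, here is a toy model in plain Java. It is only a sketch under assumptions: FakeResponseStream, drain and bytesTransferredOnClose are hypothetical names, not Beam or AWS SDK code, and the fake stream simply stops returning data after abort(), which may differ from what the real SDK stream does.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

final class AbortVsDrainDemo {

  /** Hypothetical stand-in for an S3 response stream; NOT the AWS SDK class. */
  static final class FakeResponseStream extends InputStream {
    private long remaining;
    private long bytesTransferred;
    private boolean aborted;

    FakeResponseStream(long objectSize) {
      this.remaining = objectSize;
    }

    @Override
    public int read() {
      // Assumption: after abort() the stream reports end-of-stream.
      if (aborted || remaining == 0) {
        return -1;
      }
      remaining--;
      bytesTransferred++;
      return 0;
    }

    /** Models dropping the connection without reading the remaining body. */
    void abort() {
      aborted = true;
    }

    long bytesTransferred() {
      return bytesTransferred;
    }
  }

  /** Reads and discards everything left in the stream. */
  static void drain(InputStream in) throws IOException {
    byte[] buffer = new byte[8192];
    while (in.read(buffer) != -1) {
      // Discard the data; only the side effect of reading matters here.
    }
  }

  /** Returns how many bytes are pulled from the stream during the close sequence. */
  static long bytesTransferredOnClose(long objectSize, long consumed, boolean abortFirst) {
    try {
      FakeResponseStream stream = new FakeResponseStream(objectSize);
      for (long i = 0; i < consumed; i++) {
        stream.read(); // the part of the object the pipeline actually used
      }
      long before = stream.bytesTransferred();
      if (abortFirst) {
        stream.abort();
      }
      drain(stream);
      stream.close();
      return stream.bytesTransferred() - before;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println("drain only:       " + bytesTransferredOnClose(1_000_000, 1_000, false));
    System.out.println("abort then drain: " + bytesTransferredOnClose(1_000_000, 1_000, true));
  }
}
```

In this model, draining without abort() pulls the 999,000 unread bytes before close() returns, while calling abort() first leaves nothing for the drain to read. This is in line with the timings above, but it does not prove what the SDK does internally.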
I found a bug, https://github.com/aws/aws-sdk-java-v2/issues/2117, raised
against aws-sdk-java-v2 for the close() call, which also reports that close()
unexpectedly waits.
As for your question, "I'm wondering if it's even possible that close() will be
called under normal circumstances before all data is read?": I don't know the
exact answer, but I think that because Beam reads data in bursts, S3 tries to
close the connection while the data read in the first fetch is still being
processed.
If you think we can avoid the close() call by tweaking some HTTP connection
parameter in the pipeline options, or in some other way, kindly let me know.