steveloughran commented on code in PR #7151:
URL: https://github.com/apache/hadoop/pull/7151#discussion_r1842246549
##########
hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/connecting.md:
##########
@@ -485,6 +485,90 @@ If `storediag` doesn't connect to your S3 store, *nothing else will*.

 Based on the experience of people who field support calls, here are
 some of the main connectivity issues which cause problems.

+### <a name="Not-enough-connections"></a> Connection pool overloaded
+
+If more connections are needed than the HTTP connection pool holds,
+then worker threads will block until one is freed.
+
+If the wait exceeds the time set in `fs.s3a.connection.acquisition.timeout`,
+the operation will fail with `"Timeout waiting for connection from pool"`.
+
+This may be retried, but time has been lost, which results in slower operations.
+If queries suddenly get slower as the number of active operations increases,
+then this is a possible cause.
+
+Fixes:
+
+Increase the value of `fs.s3a.connection.maximum`.
+This is the general fix for query engines such as Apache Spark and Apache Impala,
+which run many worker threads simultaneously and do not keep files open past
+the duration of a single task within a larger query.
+
+It can also surface with applications which deliberately keep files open
+for extended periods.
+These should ideally call `unbuffer()` on the input streams.
+This frees up the connection until another read operation is invoked, yet
+the stream still re-opens faster than if `open(Path)` were invoked again.
+
+Applications may also be "leaking" HTTP connections by failing to
+`close()` them. This is potentially fatal: eventually the connection pool
+can get exhausted, at which point the program will no longer work.
+
+This can only be fixed in the application code: it is _not_ a bug in
+the S3A filesystem.
+
+1. Applications MUST call `close()` on an input stream when the contents of
+   the file are no longer needed.
+2. If long-lived applications eventually fail with unrecoverable
+   `ApiCallTimeout` exceptions, they are not doing so.
+
+To aid in identifying the location of these leaks, when a JVM garbage
+collection releases an unreferenced `S3AInputStream` instance,
+it will log at `WARN` level that it has not been closed,
+listing the file URL and the name + ID of the thread
+which opened the file.
+The stack trace of the `open()` call will be logged at `INFO`

Review Comment:
I considered only collecting the stack if the specific log was at error, but decided it was just adding risk of failures and hard to test. The extra overhead is marginal given we are about to do network reads. It also gives us a full stack during any debugging.

Now, one future enhancement would be to include that in the `.toString()` of the reporter, and then include that in the `.toString()` of the stream. Just a thought; not doing it here.
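
FWIW, a minimal sketch of what tuning these options might look like from application code. The property names are the ones referenced in the documentation above; the pool size, timeout value, and bucket name are illustrative only, and duration formats may vary between Hadoop releases:

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class S3APoolTuning {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Allow more simultaneous HTTP connections than the default, so many
    // worker threads can read in parallel without blocking on the pool.
    conf.setInt("fs.s3a.connection.maximum", 200);
    // How long a thread may wait for a pooled connection before the operation
    // fails with "Timeout waiting for connection from pool".
    // (Value format is illustrative; check your release's documentation.)
    conf.set("fs.s3a.connection.acquisition.timeout", "60s");

    // Placeholder bucket name.
    try (FileSystem fs = FileSystem.get(new URI("s3a://example-bucket/"), conf)) {
      // ... run the workload against fs ...
    }
  }
}
```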
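And on the `close()`/`unbuffer()` guidance: a sketch of the two stream lifecycles described above, with the class and helper names being purely illustrative:

```java
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class S3AStreamLifecycle {

  // Short-lived use: try-with-resources guarantees close(), so the pooled
  // HTTP connection is always returned, even if the read fails.
  static byte[] readHeader(FileSystem fs, Path path) throws Exception {
    byte[] header = new byte[1024];
    try (FSDataInputStream in = fs.open(path)) {
      in.readFully(0, header);
      return header;
    }
  }

  // Long-lived use: keep the stream open across idle periods, but call
  // unbuffer() so the connection is released back to the pool; the next
  // read re-acquires one, faster than a fresh open(Path) call.
  static void readThenIdle(FSDataInputStream in, long offset, byte[] buffer)
      throws Exception {
    in.readFully(offset, buffer);
    in.unbuffer();
    // ... stream stays open but holds no pooled connection until the next read ...
  }
}
```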
