[ https://issues.apache.org/jira/browse/HDDS-4815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wei-Chiu Chuang updated HDDS-4815:
----------------------------------
Description:
HDDS-4320 implemented the unbuffer API. However, I found that it also changed
BlockInputStream.close(). The close() call is supposed to close the enclosed
ChunkInputStreams, but that logic was moved to unbuffer(), so close() never
closes the ChunkInputStreams, causing a socket leak.
When running a small 100GB SparkSQL TPC-DS workload on a 3-node cluster, the
test set couldn't complete because the processes failed with a
"java.net.BindException: Cannot assign requested address" message. Further
investigation found that the process had created tens of thousands of sockets
(lsof -p). I was finally able to pinpoint the source of the leak using btrace.
was:
HDDS-4320 implemented unbuffer API. However, I found it changed
BlockInputStream.close(). The close() call is supposed to close the enclosed
ChunkInputStreams, but that logic was moved to unbuffer() and close() never
close ChunkInputStreams, causing socket leak.
Running 100GB, SparkSQL TPC-DS workloads on a small 3-node cluster, the test
set couldn't complete because the processes failed with
"java.net.BindException: Cannot assign requested address" message. Further
investigation found the process created tens of thousands of sockets (lsof -p).
I was finally able to pinpoint the source of leak using btrace.
> unbuffer caused connection leak
> -------------------------------
>
> Key: HDDS-4815
> URL: https://issues.apache.org/jira/browse/HDDS-4815
> Project: Apache Ozone
> Issue Type: Bug
> Components: Ozone Client
> Affects Versions: 1.1.0
> Reporter: Wei-Chiu Chuang
> Assignee: Wei-Chiu Chuang
> Priority: Blocker
> Labels: pull-request-available
>
> HDDS-4320 implemented the unbuffer API. However, I found that it also changed
> BlockInputStream.close(). The close() call is supposed to close the enclosed
> ChunkInputStreams, but that logic was moved to unbuffer(), so close() never
> closes the ChunkInputStreams, causing a socket leak.
> When running a small 100GB SparkSQL TPC-DS workload on a 3-node cluster, the
> test set couldn't complete because the processes failed with a
> "java.net.BindException: Cannot assign requested address" message. Further
> investigation found that the process had created tens of thousands of sockets
> (lsof -p). I was finally able to pinpoint the source of the leak using btrace.
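
To illustrate the contract described above, here is a minimal sketch of how
close() and unbuffer() could relate. This is not the actual Ozone code:
SketchChunkStream, SketchBlockStream, openChunk(), releaseConnection() and
openConnectionCount() are hypothetical names invented for this example, and a
boolean flag stands in for a real socket. The point is only that close()
should release the enclosed streams itself rather than rely on unbuffer()
having been called first.

```java
import java.io.Closeable;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a ChunkInputStream; the flag models a live socket.
class SketchChunkStream implements Closeable {
    private boolean connectionOpen = true;

    // Release the (simulated) connection; safe to call more than once.
    void releaseConnection() {
        connectionOpen = false;
    }

    boolean isConnectionOpen() {
        return connectionOpen;
    }

    @Override
    public void close() {
        releaseConnection();
    }
}

// Hypothetical stand-in for a BlockInputStream that owns chunk streams.
class SketchBlockStream implements Closeable {
    private final List<SketchChunkStream> chunkStreams = new ArrayList<>();
    private boolean closed = false;

    SketchChunkStream openChunk() {
        if (closed) {
            throw new IllegalStateException("stream is closed");
        }
        SketchChunkStream chunk = new SketchChunkStream();
        chunkStreams.add(chunk);
        return chunk;
    }

    // unbuffer(): give up connections but leave the stream usable.
    void unbuffer() {
        for (SketchChunkStream chunk : chunkStreams) {
            chunk.releaseConnection();
        }
    }

    // close(): must close every enclosed chunk stream on its own,
    // whether or not unbuffer() was called earlier.
    @Override
    public void close() {
        if (closed) {
            return;
        }
        closed = true;
        for (SketchChunkStream chunk : chunkStreams) {
            chunk.close();
        }
    }

    // Number of chunk streams whose simulated connection is still open.
    int openConnectionCount() {
        int open = 0;
        for (SketchChunkStream chunk : chunkStreams) {
            if (chunk.isConnectionOpen()) {
                open++;
            }
        }
        return open;
    }
}
```

In this sketch, a caller that only ever calls close() still ends up with no
open connections, which is the behavior the description says was lost.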