[
https://issues.apache.org/jira/browse/ARROW-16697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17550238#comment-17550238
]
David Li commented on ARROW-16697:
----------------------------------
{quote}So let's say with 64 concurrent clients, the high watermark goes up (4GB
no problem; running with 64 clients for longer, I was able to surpass 10GB).
Perhaps some gRPC behavior and overhead combine with malloc to determine how
high the memory usage can climb?
{quote}
Interesting. I didn't dig deep enough myself but it could certainly be gRPC
behavior there. That would also explain the asymmetry between DoPut (server)
and DoGet (client). If you see unexpected behavior I can dig further into that.
I'm not sure off the top of my head how gRPC manages allocations, and I
wouldn't be surprised if it did its own allocation and/or buffering. I wonder
whether shutting down and restarting the server (and maybe calling
{{malloc_trim}} in between) would reduce the high watermark; if it does, that
would probably explain the rest of the allocated memory. For instance, I
believe there is an unbounded thread pool used by default, and I'm not sure
whether this thread pool ever shrinks.
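To try the {{malloc_trim}} idea from Python, something like the following might work. This is only a sketch: the helper name {{trim_malloc_heap}} is made up for illustration, it assumes a glibc-based Linux system, and on other platforms it simply reports that nothing was trimmed.

```python
import ctypes
import ctypes.util


def trim_malloc_heap() -> bool:
    """Ask glibc to return free heap memory to the OS.

    Hypothetical helper, not part of PyArrow. Returns True if glibc reports
    that memory was released, and False if nothing was released or
    malloc_trim is unavailable (for example on musl or macOS).
    """
    libc_name = ctypes.util.find_library("c")
    if libc_name is None:
        return False
    try:
        libc = ctypes.CDLL(libc_name)
        malloc_trim = libc.malloc_trim  # glibc-specific symbol
    except (OSError, AttributeError):
        return False
    malloc_trim.argtypes = [ctypes.c_size_t]
    malloc_trim.restype = ctypes.c_int
    # A pad of 0 asks glibc to release as much free memory as it can.
    # malloc_trim returns 1 if memory was released, 0 otherwise.
    return malloc_trim(0) == 1
```

Calling this after the server has been shut down and before a new one is started, and comparing the process RSS before and after, should show how much of the high watermark is just memory held by malloc rather than by gRPC or Arrow.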
On the client side: AIUI, gRPC maintains some global state, and I'm not sure
all of it gets freed without an explicit {{grpc_shutdown}}. (That applies
to the server too, really.)
> [FlightRPC][Python] Server seems to leak memory during DoPut
> ------------------------------------------------------------
>
> Key: ARROW-16697
> URL: https://issues.apache.org/jira/browse/ARROW-16697
> Project: Apache Arrow
> Issue Type: Bug
> Reporter: Lubo Slivka
> Assignee: David Li
> Priority: Major
> Attachments: leak_repro_client.py, leak_repro_server.py, sample.csv.gz
>
>
> Hello,
> We are stress testing our Flight RPC server (PyArrow 8.0.0) with write-heavy
> workloads and are running into what appear to be memory leaks.
> The server is under pressure by a number of separate clients doing DoPut.
> What we are seeing is that the server's memory usage only ever goes up until
> the server finally gets whacked by k8s for hitting its memory limit.
> I have spent many hours fishing through our code for memory leaks with no
> success. Even short-circuiting all our custom DoPut handling logic does not
> alleviate the situation. This led me to create a reproducer that uses nothing
> but PyArrow, and I see the server process memory only increasing, similar to
> what we see on our servers.
> The reproducer is in the attachments, and I included the test CSV file (20MB)
> that I use for my tests. A few notes:
> * The client code has multiple threads, each emulating a separate Flight
> Client
> * There are two variants where I see slightly different memory usage
> characteristics:
> ** _do_put_with_client_reuse << one client opened at start of thread, then
> hammering many puts, finally closing the client; leaks appear to happen
> faster in this variant
> ** _do_put_with_client_per_request << client opens & connects, does put,
> then disconnects; loop like this many times; leaks appear to happen slower in
> this variant if there are fewer concurrent clients; increasing number of
> threads 'helps'
> * The server code handling do_put reads batch-by-batch & does nothing with
> the chunks
> Also, one interesting (but highly likely unrelated) thing that I keep noticing
> is that _sometimes_ FlightClient takes a long time to close (like 5 seconds).
> It happens intermittently.