Shouheng Yi created ARROW-10417:
-----------------------------------
Summary: [Python][C++] Possible Memory Leak in RecordBatchStreamWriter
Key: ARROW-10417
URL: https://issues.apache.org/jira/browse/ARROW-10417
Project: Apache Arrow
Issue Type: Bug
Components: C++, Python
Affects Versions: 0.15.1
Environment: This is the config for my worker node:
{code:yaml}
resources:
  cpus: 1
  maxMemoryMb: 4096
  reservedMemoryMb: 2048
{code}
Reporter: Shouheng Yi
Attachments: Screen Shot 2020-10-28 at 9.43.32 PM.png, Screen Shot 2020-10-28 at 9.43.40 PM.png
There might be a memory leak in {{RecordBatchStreamWriter}}: the memory it uses does not appear to be released. See the screenshot below:
!Screen Shot 2020-10-28 at 9.43.32 PM.png!
This was the code:
{code:python}
import tempfile
import os
import sys

import pyarrow as pa

B = 1
KB = 1024 * B
MB = 1024 * KB

schema = pa.schema(
    [
        pa.field("a_string", pa.string()),
        pa.field("an_int", pa.int32()),
        pa.field("a_float", pa.float32()),
        pa.field("a_list_of_floats", pa.list_(pa.float32())),
    ]
)

nrows_in_a_batch = 1000
nbatches_in_a_table = 1000

column_arrays = [
    ["string"] * nrows_in_a_batch,
    [123] * nrows_in_a_batch,
    [456.789] * nrows_in_a_batch,
    [range(1000)] * nrows_in_a_batch,
]


def main(sys_args) -> None:
    batch = pa.RecordBatch.from_arrays(column_arrays, schema=schema)
    table = pa.Table.from_batches([batch] * nbatches_in_a_table, schema=schema)
    with tempfile.TemporaryDirectory() as tmpdir:
        filename_template = "file-{n}.arrow"
        i = 0
        # Write the same table to a new file on every iteration and
        # report the bytes tracked by Arrow's default memory pool.
        while True:
            path = os.path.join(tmpdir, filename_template.format(n=i))
            i += 1
            with pa.OSFile(path, "w") as sink:
                with pa.RecordBatchStreamWriter(sink, schema) as writer:
                    writer.write_table(table)
            print(f"pa.total_allocated_bytes(): {pa.total_allocated_bytes() / MB} mb")


if __name__ == "__main__":
    main(sys.argv[1:])
{code}
Strangely enough, the value reported by {{pa.total_allocated_bytes()}} looks normal and stays constant across iterations:
{code:python}
pa.total_allocated_bytes(): 3.95556640625 mb
pa.total_allocated_bytes(): 3.95556640625 mb
pa.total_allocated_bytes(): 3.95556640625 mb
pa.total_allocated_bytes(): 3.95556640625 mb
pa.total_allocated_bytes(): 3.95556640625 mb
{code}
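To compare Arrow's own accounting with what the operating system sees, the loop could also print the process's resident memory next to the allocator figure. This is only a minimal sketch, assuming a Unix system where the standard-library {{resource}} module is available. {{peak_rss_mb}} and {{format_memory_report}} are hypothetical helper names, not part of pyarrow.

```python
import resource
import sys

MB = 1024 * 1024


def peak_rss_mb() -> float:
    """Return this process's peak resident set size in MB (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    if sys.platform == "darwin":
        return peak / MB
    return peak * 1024 / MB


def format_memory_report(allocated_bytes: int) -> str:
    """Format allocator-tracked bytes next to the process's peak RSS."""
    return (
        f"allocated: {allocated_bytes / MB:.2f} mb, "
        f"peak rss: {peak_rss_mb():.2f} mb"
    )
```

In the script above, the existing {{print}} call could be replaced with {{print(format_memory_report(pa.total_allocated_bytes()))}}. Peak RSS is a high-water mark, so a value that keeps climbing while the allocated figure stays flat would suggest the growth is happening outside what the Arrow memory pool tracks.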
Am I using {{RecordBatchStreamWriter}} incorrectly? If not, how can I release
the resources?
Thank you.