[jira] [Created] (ARROW-10417) [Python][C++] Possible Memory Leak in RecordBatchStreamWriter

Shouheng Yi (Jira) Wed, 28 Oct 2020 21:57:01 -0700

Shouheng Yi created ARROW-10417:
-----------------------------------

             Summary: [Python][C++] Possible Memory Leak in RecordBatchStreamWriter
                 Key: ARROW-10417
                 URL: https://issues.apache.org/jira/browse/ARROW-10417
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++, Python
    Affects Versions: 0.15.1
         Environment: This is the config for my worker node:


{code:yaml}
  resources:
    cpus: 1
    maxMemoryMb: 4096
    reservedMemoryMb: 2048
{code}

            Reporter: Shouheng Yi
         Attachments: Screen Shot 2020-10-28 at 9.43.32 PM.png, Screen Shot 2020-10-28 at 9.43.40 PM.png

There may be a memory leak in {{RecordBatchStreamWriter}}: memory used while 
writing does not appear to be released. See the screenshot below:

!Screen Shot 2020-10-28 at 9.43.32 PM.png!

This is the code I ran:
{code:python}
import tempfile
import os
import sys

import pyarrow as pa

B = 1
KB = 1024 * B
MB = 1024 * KB

schema = pa.schema(
    [
        pa.field("a_string", pa.string()),
        pa.field("an_int", pa.int32()),
        pa.field("a_float", pa.float32()),
        pa.field("a_list_of_floats", pa.list_(pa.float32())),
    ]
)

nrows_in_a_batch = 1000
nbatches_in_a_table = 1000

column_arrays = [
    ["string"] * nrows_in_a_batch,
    [123] * nrows_in_a_batch,
    [456.789] * nrows_in_a_batch,
    [range(1000)] * nrows_in_a_batch,
]

def main(sys_args) -> None:
    batch = pa.RecordBatch.from_arrays(column_arrays, schema=schema)
    table = pa.Table.from_batches([batch] * nbatches_in_a_table, schema=schema)

    with tempfile.TemporaryDirectory() as tmpdir:
        filename_template = "file-{n}.arrow"
        i = 0

        while True:
            path = os.path.join(tmpdir, filename_template.format(n=i))
            i += 1

            with pa.OSFile(path, "w") as sink:
                with pa.RecordBatchStreamWriter(sink, schema) as writer:
                    writer.write_table(table)
                    print(f"pa.total_allocated_bytes(): {pa.total_allocated_bytes() / MB} mb")

if __name__ == "__main__":
    main(sys.argv[1:])
{code}
Strangely enough, the value reported by {{pa.total_allocated_bytes()}} stays constant across iterations, so Arrow's own accounting looks normal:
{code:python}
pa.total_allocated_bytes(): 3.95556640625 mb
pa.total_allocated_bytes(): 3.95556640625 mb
pa.total_allocated_bytes(): 3.95556640625 mb
pa.total_allocated_bytes(): 3.95556640625 mb
pa.total_allocated_bytes(): 3.95556640625 mb
{code}
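
To compare Arrow's accounting with what the operating system reports for the process, the loop could also print the peak resident set size. The following is only a minimal sketch, assuming Linux or macOS (the standard-library {{resource}} module is not available on Windows). The helper names {{peak_rss_mb}} and {{memory_report}} are hypothetical and not part of pyarrow.

```python
import resource
import sys

MB = 1024 * 1024


def peak_rss_mb() -> float:
    """Return the peak resident set size of this process in MB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    scale = 1 if sys.platform == "darwin" else 1024
    return peak * scale / MB


def memory_report(arrow_allocated_bytes: int) -> str:
    """Format Arrow's allocation counter next to the process peak RSS."""
    return (
        f"arrow pool: {arrow_allocated_bytes / MB:.2f} MB, "
        f"peak RSS: {peak_rss_mb():.2f} MB"
    )
```

Inside the write loop this would be called as {{print(memory_report(pa.total_allocated_bytes()))}}. Peak RSS is a high-water mark, so it never decreases. If it keeps climbing while the Arrow counter stays flat, the growth is happening somewhere that {{pa.total_allocated_bytes()}} does not count.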

Am I using {{RecordBatchStreamWriter}} incorrectly? If not, how can I release 
the resources?

Thank you.




--
This message was sent by Atlassian Jira
(v8.3.4#803005)
