[jira] [Updated] (ARROW-10417) [Python][C++] Possible Memory Leak in RecordBatchStreamWriter

Shouheng Yi (Jira) Wed, 28 Oct 2020 22:03:04 -0700


     [ https://issues.apache.org/jira/browse/ARROW-10417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Shouheng Yi updated ARROW-10417:
--------------------------------
    Environment: 
This is the config for my worker node:
resources:
    - cpus: 1
    - maxMemoryMb: 4096
    - reservedMemoryMb: 2048

  was:
This is the config for my worker node:

{code:yaml}
  resources:
    cpus: 1
    maxMemoryMb: 4096
    reservedMemoryMb: 2048
{code}



> [Python][C++] Possible Memory Leak in RecordBatchStreamWriter
> -------------------------------------------------------------
>
>                 Key: ARROW-10417
>                 URL: https://issues.apache.org/jira/browse/ARROW-10417
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 0.15.1
>         Environment: This is the config for my worker node:
> resources:
>     - cpus: 1
>     - maxMemoryMb: 4096
>     - reservedMemoryMb: 2048
>            Reporter: Shouheng Yi
>            Priority: Major
>         Attachments: Screen Shot 2020-10-28 at 9.43.32 PM.png, Screen Shot 
> 2020-10-28 at 9.43.40 PM.png
>
>
> There might be a memory leak in {{RecordBatchStreamWriter}}: memory was not 
> released after each write. The process always hit the memory limit and started 
> swapping to virtual memory. See the screenshot below:
> !Screen Shot 2020-10-28 at 9.43.32 PM.png!
> This was the code:
> {code:python}
> import tempfile
> import os
> import sys
> import pyarrow as pa
> B = 1
> KB = 1024 * B
> MB = 1024 * KB
> schema = pa.schema(
>     [
>         pa.field("a_string", pa.string()),
>         pa.field("an_int", pa.int32()),
>         pa.field("a_float", pa.float32()),
>         pa.field("a_list_of_floats", pa.list_(pa.float32())),
>     ]
> )
> nrows_in_a_batch = 1000
> nbatches_in_a_table = 1000
> column_arrays = [
>     ["string"] * nrows_in_a_batch,
>     [123] * nrows_in_a_batch,
>     [456.789] * nrows_in_a_batch,
>     [range(1000)] * nrows_in_a_batch,
> ]
> def main(sys_args) -> None:
>     batch = pa.RecordBatch.from_arrays(column_arrays, schema=schema)
>     table = pa.Table.from_batches([batch] * nbatches_in_a_table, schema=schema)
>     with tempfile.TemporaryDirectory() as tmpdir:
>         filename_template = "file-{n}.arrow"
>         i = 0
>         while True:
>             path = os.path.join(tmpdir, filename_template.format(n=i))
>             i += 1
>             with pa.OSFile(path, "w") as sink:
>                 with pa.RecordBatchStreamWriter(sink, schema) as writer:
>                     writer.write_table(table)
>                     print(f"pa.total_allocated_bytes(): {pa.total_allocated_bytes() / MB} mb")
> if __name__ == "__main__":
>     main(sys.argv[1:])
> {code}
> Strangely enough, the value reported by {{total_allocated_bytes}} stayed 
> constant and looked normal:
> {code}
> pa.total_allocated_bytes(): 3.95556640625 mb
> pa.total_allocated_bytes(): 3.95556640625 mb
> pa.total_allocated_bytes(): 3.95556640625 mb
> pa.total_allocated_bytes(): 3.95556640625 mb
> pa.total_allocated_bytes(): 3.95556640625 mb
> {code}
> Am I using {{RecordBatchStreamWriter}} incorrectly? If not, how can I release 
> the resources?
> Thank you.
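
One way to narrow down the report above: {{pa.total_allocated_bytes()}} only counts memory allocated through Arrow's default memory pool, so it can stay flat while the footprint grows somewhere else. The sketch below is not part of the original report. It is a minimal, Unix-only helper that prints the pool figure next to the process's peak resident set size, read with the standard library's resource module. The helper names peak_rss_mb and report are hypothetical.

```python
import resource
import sys


def peak_rss_mb() -> float:
    """Return the peak resident set size of this process in MiB (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def report(pool_bytes: int) -> str:
    """Format Arrow pool usage next to the process-level peak RSS.

    pool_bytes is expected to be the value of pa.total_allocated_bytes(),
    passed in by the caller so that this helper has no pyarrow dependency.
    """
    pool_mb = pool_bytes / (1024 * 1024)
    return f"arrow pool: {pool_mb:.2f} MiB, peak RSS: {peak_rss_mb():.2f} MiB"
```

Inside the write loop, the existing print call could be replaced with print(report(pa.total_allocated_bytes())). If peak RSS climbs while the pool figure stays constant, the growth is inside the process but outside Arrow's pool accounting. If both stay flat, the memory shown in the screenshot is probably being consumed elsewhere, which would need to be checked separately.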



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
