[ https://issues.apache.org/jira/browse/ARROW-10417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Shouheng Yi updated ARROW-10417:
--------------------------------
    Environment:
This is the config for my worker node:

resources:
 - cpus: 1
 - maxMemoryMb: 4096
 - reservedMemoryMb: 2048

  was:
This is the config for my worker node:

{code:yaml}
resources:
  cpus: 1
  maxMemoryMb: 4096
  reservedMemoryMb: 2048
{code}

> [Python][C++] Possible Memory Leak in RecordBatchStreamWriter
> -------------------------------------------------------------
>
>                 Key: ARROW-10417
>                 URL: https://issues.apache.org/jira/browse/ARROW-10417
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 0.15.1
>         Environment: This is the config for my worker node:
> resources:
>  - cpus: 1
>  - maxMemoryMb: 4096
>  - reservedMemoryMb: 2048
>            Reporter: Shouheng Yi
>            Priority: Major
>         Attachments: Screen Shot 2020-10-28 at 9.43.32 PM.png, Screen Shot 2020-10-28 at 9.43.40 PM.png
>
>
> There might be a memory leak in the {{RecordBatchStreamWriter}}. The memory
> resources were not released. It always hit the memory limit and started doing
> virtual memory swapping. See the picture below:
> !Screen Shot 2020-10-28 at 9.43.32 PM.png!
> This was the code:
> {code:python}
> import tempfile
> import os
> import sys
> import pyarrow as pa
>
> B = 1
> KB = 1024 * B
> MB = 1024 * KB
>
> schema = pa.schema(
>     [
>         pa.field("a_string", pa.string()),
>         pa.field("an_int", pa.int32()),
>         pa.field("a_float", pa.float32()),
>         pa.field("a_list_of_floats", pa.list_(pa.float32())),
>     ]
> )
>
> nrows_in_a_batch = 1000
> nbatches_in_a_table = 1000
>
> column_arrays = [
>     ["string"] * nrows_in_a_batch,
>     [123] * nrows_in_a_batch,
>     [456.789] * nrows_in_a_batch,
>     [range(1000)] * nrows_in_a_batch,
> ]
>
> def main(sys_args) -> None:
>     batch = pa.RecordBatch.from_arrays(column_arrays, schema=schema)
>     table = pa.Table.from_batches([batch] * nbatches_in_a_table,
>                                   schema=schema)
>     with tempfile.TemporaryDirectory() as tmpdir:
>         filename_template = "file-{n}.arror"
>         i = 0
>         while True:
>             path = os.path.join(tmpdir, filename_template.format(n=i))
>             i += 1
>             with pa.OSFile(path, "w") as sink:
>                 with pa.RecordBatchStreamWriter(sink, schema) as writer:
>                     writer.write_table(table)
>             print(f"pa.total_allocated_bytes(): {pa.total_allocated_bytes() / MB} mb")
>
> if __name__ == "__main__":
>     main(sys.argv[1:])
> {code}
> Strangely enough, using {{total_allocated_bytes}}, it seemed normal.
> {code:python}
> pa.total_allocated_bytes(): 3.95556640625 mb
> pa.total_allocated_bytes(): 3.95556640625 mb
> pa.total_allocated_bytes(): 3.95556640625 mb
> pa.total_allocated_bytes(): 3.95556640625 mb
> pa.total_allocated_bytes(): 3.95556640625 mb
> {code}
> Am I using {{RecordBatchStreamWriter}} incorrectly? If not, how can I release
> the resources?
> Thank you.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
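
The report shows {{pa.total_allocated_bytes()}} staying flat while the node's memory graph keeps climbing. One way to narrow that down might be to print a process-level figure next to the Arrow figure on each iteration. The sketch below is only an illustration: {{peak_rss_mb}} and {{memory_report}} are hypothetical helper names, not part of pyarrow, and the sketch assumes a Unix-like worker where the standard-library {{resource}} module is available. Note that it reports the peak resident set size, not the current one.

```python
import resource
import sys

MB = 1024 * 1024


def peak_rss_mb() -> float:
    """Peak resident set size of this process, in MB (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    scale = 1 if sys.platform == "darwin" else 1024
    return peak * scale / MB


def memory_report(arrow_allocated_bytes: int) -> str:
    """Format Arrow-tracked bytes next to the process-level peak RSS."""
    return (
        f"arrow pool: {arrow_allocated_bytes / MB:.2f} MB, "
        f"peak RSS: {peak_rss_mb():.2f} MB"
    )


# Hypothetical use inside the reporter's while-loop, after the writer closes:
#     print(memory_report(pa.total_allocated_bytes()))
```

If the peak RSS also stays flat across iterations, the growth seen on the graph is probably not held by the Python process itself. If it climbs while the Arrow figure does not, the memory is being held outside Arrow's tracked pool. Either result is only a hint about where to look next, not a diagnosis.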