RizzoV opened a new issue, #37989: URL: https://github.com/apache/arrow/issues/37989
### Describe the bug, including details regarding any error messages, version, and platform.

### Issue Description

(continuing from https://github.com/pandas-dev/pandas/issues/55296)

`pyarrow.Table.from_pandas()` causes a memory leak on DataFrames containing nested structs. A sample problematic data schema and a compliant data generator are included in the Reproducible Example below.

From the Reproducible Example:

- 1st `pd.DataFrame.to_parquet()` call:
```
Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    74     91.9 MiB     91.9 MiB           1   @profile
    75                                         def convert_df_to_table(df: pd.DataFrame):
    76     91.9 MiB      0.0 MiB           1       table = pa.Table.from_pandas(df, schema=pa.schema(sample_schema))
```

- 2000th call:
```
Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    74    140.1 MiB    140.1 MiB           1   @profile
    75                                         def convert_df_to_table(df: pd.DataFrame):
    76    140.1 MiB      0.0 MiB           1       table = pa.Table.from_pandas(df, schema=pa.schema(sample_schema))
```

- 10000th call:
```
Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    74    329.4 MiB    329.4 MiB           1   @profile
    75                                         def convert_df_to_table(df: pd.DataFrame):
    76    329.5 MiB      0.0 MiB           1       table = pa.Table.from_pandas(df, schema=pa.schema(sample_schema))
```

### Reproducible Example

```python
import os
import string
import sys
from random import choice, randint
from uuid import uuid4

import pandas as pd
import pyarrow as pa
from memory_profiler import profile

# Nested struct schema that triggers the memory growth.
sample_schema = pa.struct(
    [
        ("a", pa.string()),
        (
            "b",
            pa.struct(
                [
                    ("ba", pa.list_(pa.string())),
                    ("bc", pa.string()),
                    ("bd", pa.string()),
                    ("be", pa.list_(pa.string())),
                    (
                        "bf",
                        pa.list_(
                            pa.struct(
                                [
                                    (
                                        "bfa",
                                        pa.struct(
                                            [
                                                ("bfaa", pa.string()),
                                                ("bfab", pa.string()),
                                                ("bfac", pa.string()),
                                                ("bfad", pa.float64()),
                                                ("bfae", pa.string()),
                                            ]
                                        ),
                                    )
                                ]
                            )
                        ),
                    ),
                ]
            ),
        ),
        ("c", pa.int64()),
        ("d", pa.int64()),
        ("e", pa.string()),
        (
            "f",
            pa.struct(
                [
                    ("fa", pa.string()),
                    ("fb", pa.string()),
                    ("fc", pa.string()),
                    ("fd", pa.string()),
                    ("fe", pa.string()),
                    ("ff", pa.string()),
                    ("fg", pa.string()),
                ]
            ),
        ),
        ("g", pa.int64()),
    ]
)


def generate_random_string(str_length: int) -> str:
    return "".join(
        [choice(string.ascii_lowercase + string.digits) for n in range(str_length)]
    )


@profile
def convert_df_to_table(df: pd.DataFrame) -> None:
    table = pa.Table.from_pandas(df, schema=pa.schema(sample_schema))


def generate_random_data():
    # One row of random data matching sample_schema.
    return {
        "a": [generate_random_string(128)],
        "b": [
            {
                "ba": [generate_random_string(128) for i in range(50)],
                "bc": generate_random_string(128),
                "bd": generate_random_string(128),
                "be": [generate_random_string(128) for i in range(50)],
                "bf": [
                    {
                        "bfa": {
                            "bfaa": generate_random_string(128),
                            "bfab": generate_random_string(128),
                            "bfac": generate_random_string(128),
                            "bfad": randint(0, 2**32),
                            "bfae": generate_random_string(128),
                        }
                    }
                ],
            }
        ],
        "c": [randint(0, 2**32)],
        "d": [randint(0, 2**32)],
        "e": [generate_random_string(128)],
        "f": [
            {
                "fa": generate_random_string(128),
                "fb": generate_random_string(128),
                "fc": generate_random_string(128),
                "fd": generate_random_string(128),
                "fe": generate_random_string(128),
                "ff": generate_random_string(128),
                "fg": generate_random_string(128),
            }
        ],
        "g": [randint(0, 2**32)],
    }


def main():
    for i in range(10000):
        df = pd.DataFrame.from_dict(generate_random_data())
        # pa.jemalloc_set_decay_ms(0)
        convert_df_to_table(df)  # memory leak


if __name__ == "__main__":
    main()
```

### Installed Versions

<details>

```
INSTALLED VERSIONS
------------------
python      : 3.10.9.final.0
python-bits : 64
OS          : Darwin
OS-release  : 22.6.0
Version     : Darwin Kernel Version 22.6.0: Fri Sep 15 13:39:52 PDT 2023; root:xnu-8796.141.3.700.8~1/RELEASE_X86_64
machine     : x86_64
processor   : i386
byteorder   : little
LC_ALL      : None
LANG        : it_IT.UTF-8
LOCALE      : it_IT.UTF-8

pyarrow     : 13.0.0
pandas      : 2.1.1
numpy       : 1.26.0
```

</details>

### Component(s)

Python

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
