RizzoV opened a new issue, #37989:
URL: https://github.com/apache/arrow/issues/37989

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   ### Issue Description
   
   (continuing from https://github.com/pandas-dev/pandas/issues/55296)
   
   `pyarrow.Table.from_pandas()` leaks memory when converting DataFrames that contain nested structs. A sample schema that triggers the problem and a data generator that conforms to it are included in the Reproducible Example below.
   
   From the Reproducible Example:
   
   - 1st `convert_df_to_table()` call:
   ```
   Line #    Mem usage    Increment  Occurrences   Line Contents
   =============================================================
       74     91.9 MiB     91.9 MiB           1   @profile
       75                                         def convert_df_to_table(df: pd.DataFrame):
       76     91.9 MiB      0.0 MiB           1       table = pa.Table.from_pandas(df, schema=pa.schema(sample_schema))
   ```
   
   - 2000th call:
   ```
   Line #    Mem usage    Increment  Occurrences   Line Contents
   =============================================================
       74    140.1 MiB    140.1 MiB           1   @profile
       75                                         def convert_df_to_table(df: pd.DataFrame):
       76    140.1 MiB      0.0 MiB           1       table = pa.Table.from_pandas(df, schema=pa.schema(sample_schema))
   ```
   
   - 10000th call:
   ```
   Line #    Mem usage    Increment  Occurrences   Line Contents
   =============================================================
       74    329.4 MiB    329.4 MiB           1   @profile
       75                                         def convert_df_to_table(df: pd.DataFrame):
       76    329.5 MiB      0.0 MiB           1       table = pa.Table.from_pandas(df, schema=pa.schema(sample_schema))
   ```
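
As a rough check on the three measurements above, the average growth per call can be computed between consecutive samples. This is only arithmetic on the figures already reported; the helper name `growth_per_call_kib` is made up for illustration.

```python
# Figures copied from the memory_profiler output above: (call number, MiB).
samples = [(1, 91.9), (2000, 140.1), (10000, 329.4)]


def growth_per_call_kib(samples):
    """Return the average growth in KiB per call between consecutive samples."""
    rates = []
    for (n0, m0), (n1, m1) in zip(samples, samples[1:]):
        rates.append((m1 - m0) * 1024 / (n1 - n0))
    return rates


rates = growth_per_call_kib(samples)
for (n0, _), (n1, _), rate in zip(samples, samples[1:], rates):
    print(f"calls {n0}-{n1}: {rate:.1f} KiB per call")
```

Both intervals come out at roughly 24 to 25 KiB per call, so the growth looks approximately linear in the number of calls rather than a one-time warm-up cost.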
   
   ### Reproducible Example
   
   ```python
   import os
   import string
   import sys
   from random import choice, randint
   from uuid import uuid4
   
   import pandas as pd
   import pyarrow as pa
   from memory_profiler import profile
   
   sample_schema = pa.struct(
       [
           ("a", pa.string()),
           (
               "b",
               pa.struct(
                   [
                       ("ba", pa.list_(pa.string())),
                       ("bc", pa.string()),
                       ("bd", pa.string()),
                       ("be", pa.list_(pa.string())),
                       (
                           "bf",
                           pa.list_(
                               pa.struct(
                                   [
                                       (
                                           "bfa",
                                           pa.struct(
                                               [
                                                   ("bfaa", pa.string()),
                                                   ("bfab", pa.string()),
                                                   ("bfac", pa.string()),
                                                   ("bfad", pa.float64()),
                                                   ("bfae", pa.string()),
                                               ]
                                           ),
                                       )
                                   ]
                               )
                           ),
                       ),
                   ]
               ),
           ),
           ("c", pa.int64()),
           ("d", pa.int64()),
           ("e", pa.string()),
           (
               "f",
               pa.struct(
                   [
                       ("fa", pa.string()),
                       ("fb", pa.string()),
                       ("fc", pa.string()),
                       ("fd", pa.string()),
                       ("fe", pa.string()),
                       ("ff", pa.string()),
                       ("fg", pa.string()),
                   ]
               ),
           ),
           ("g", pa.int64()),
       ]
   )
   
   
   def generate_random_string(str_length: int) -> str:
       return "".join(
           [choice(string.ascii_lowercase + string.digits) for n in range(str_length)]
       )
   
   
   @profile
   def convert_df_to_table(df: pd.DataFrame) -> None:
       table = pa.Table.from_pandas(df, schema=pa.schema(sample_schema))
   
   
   def generate_random_data():
       return {
           "a": [generate_random_string(128)],
           "b": [
               {
                   "ba": [generate_random_string(128) for i in range(50)],
                   "bc": generate_random_string(128),
                   "bd": generate_random_string(128),
                   "be": [generate_random_string(128) for i in range(50)],
                   "bf": [
                       {
                           "bfa": {
                               "bfaa": generate_random_string(128),
                               "bfab": generate_random_string(128),
                               "bfac": generate_random_string(128),
                               "bfad": randint(0, 2**32),
                               "bfae": generate_random_string(128),
                           }
                       }
                   ],
               }
           ],
           "c": [randint(0, 2**32)],
           "d": [randint(0, 2**32)],
           "e": [generate_random_string(128)],
           "f": [
               {
                   "fa": generate_random_string(128),
                   "fb": generate_random_string(128),
                   "fc": generate_random_string(128),
                   "fd": generate_random_string(128),
                   "fe": generate_random_string(128),
                   "ff": generate_random_string(128),
                   "fg": generate_random_string(128),
               }
           ],
           "g": [randint(0, 2**32)],
       }
   
   
   def main():
       for i in range(10000):
           df = pd.DataFrame.from_dict(generate_random_data())
           # pa.jemalloc_set_decay_ms(0)
           convert_df_to_table(df)  # memory leak
   
   
   if __name__ == "__main__":
       main()
   ```
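
One possible way to narrow this down, sketched with the standard library only: `tracemalloc` sees only memory allocated through Python's own allocators, so if its traced total stays flat while the process footprint keeps growing, the growth is probably in native allocations instead. The helper `traced_growth` below is hypothetical and is not part of the reproduction above.

```python
import gc
import tracemalloc


def traced_growth(fn, iterations=1000):
    """Return the growth in Python-tracked bytes across repeated calls of fn."""
    gc.collect()
    tracemalloc.start()
    try:
        fn()  # warm-up call so one-time caches are not counted as growth
        gc.collect()
        before, _ = tracemalloc.get_traced_memory()
        for _ in range(iterations):
            fn()
        gc.collect()
        after, _ = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return after - before
```

In the reproduction, this could be called as `traced_growth(lambda: convert_df_to_table(df))` with the `@profile` decorator removed, and the result compared against the increase reported by `memory_profiler`.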
   
   ### Installed Versions
   
   <details>
   
   
   ```
   INSTALLED VERSIONS
   ------------------
   python              : 3.10.9.final.0
   python-bits         : 64
   OS                  : Darwin
   OS-release          : 22.6.0
   Version             : Darwin Kernel Version 22.6.0: Fri Sep 15 13:39:52 PDT 2023; root:xnu-8796.141.3.700.8~1/RELEASE_X86_64
   machine             : x86_64
   processor           : i386
   byteorder           : little
   LC_ALL              : None
   LANG                : it_IT.UTF-8
   LOCALE              : it_IT.UTF-8
   
   pyarrow             : 13.0.0
   pandas              : 2.1.1
   numpy               : 1.26.0
   ```
   
   </details>
   
   
   ### Component(s)
   
   Python

