jmahlik commented on issue #36980:
URL: https://github.com/apache/arrow/issues/36980#issuecomment-1892559630

   Still reproduces on pyarrow==14.0.2 and Python 3.12.1.
   
   It seems like a fix similar to https://github.com/apache/arrow/pull/38637
could be applied to the [numpy
buffer](https://github.com/apache/arrow/blob/7acbaf45ce2d5be31e70b552d1a24476c67383e6/python/pyarrow/src/arrow/python/numpy_convert.h#L42)
so that its destructor is only called while Python is still initialized.
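
As a rough Python-level analogy for that kind of guard (the real change would
live in the C++ buffer destructor, which is not shown here), the sketch below
skips cleanup once the interpreter has started shutting down. `GuardedBuffer`
and its `log` argument are hypothetical names used only for illustration;
`sys.is_finalizing()` is the standard-library check.

```python
import sys


class GuardedBuffer:
    """Hypothetical stand-in for an object that owns a foreign resource."""

    def __init__(self, name, log):
        self.name = name
        self.log = log
        self.released = False

    def release(self):
        # Guard: do nothing once interpreter shutdown has begun, because
        # the objects the cleanup relies on may already be torn down.
        if sys.is_finalizing():
            return
        if not self.released:
            self.released = True
            self.log.append(self.name)

    def __del__(self):
        self.release()


released = []
buf = GuardedBuffer("numpy-data", released)
del buf  # on CPython the destructor runs here, well before shutdown
print(released)
```

While the interpreter is running normally the guard is a no-op and the
resource is released as usual; it only changes behavior during finalization.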
   
   When I alter the example to delete the table and run a full `gc.collect()`,
so that the numpy data object is cleaned up before shutdown, it no longer
segfaults.
   
   ```python
   
   import tempfile
   import gc
   import string
   
   import numpy as np
   import pandas as pd
   import pyarrow as pa
   import pyarrow.dataset as ds
   
   
   NUM_ROWS = 10_000
   
   
   def file_visitor(written_file):
       print(f"path={written_file.path}")
       print(f"size={written_file.size} bytes")
       print(f"metadata={written_file.metadata}")
   
   
   def main() -> int:
       df = pd.DataFrame(
           {
               "float": np.random.rand(NUM_ROWS),
               "int": np.random.randint(0, 10000, size=NUM_ROWS),
               "string": np.random.choice(
                   list(string.ascii_letters), size=NUM_ROWS
               ),
           }
       )
       data = pa.Table.from_pandas(
           df,
       )
   
       # show_info() prints its report itself and returns None
       pa.show_info()
       with tempfile.TemporaryDirectory() as t:
           ds.write_dataset(
               data=data,
               format="parquet",
               base_dir=t,
               max_rows_per_file=1000,
               max_rows_per_group=1000,
               # So we can see all the files being written
               file_visitor=file_visitor,
               use_threads=True,
           )
           # This is critical to clean up the numpy buffers
           # and ensure their destructor is called before interpreter shutdown
           del data
           gc.collect()
   
           print(f"Wrote data to {t}")
       return 0
   
   
   if __name__ == "__main__":
       raise SystemExit(main())
   ```
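
To illustrate why an explicit `gc.collect()` can change when destructors run,
here is a minimal pure-Python sketch. It assumes the object is kept alive by a
reference cycle; whether that is what actually happens to the numpy buffer in
the reproducer is not established here. `Owner` and `make_and_drop` are
hypothetical names.

```python
import gc
import weakref


class Owner:
    """Hypothetical object that is part of a reference cycle."""

    def __init__(self):
        self.me = self  # self-reference: refcount alone never reaches zero


finalized = []


def make_and_drop():
    obj = Owner()
    # Record when the object is really destroyed.
    weakref.finalize(obj, finalized.append, "owner")
    # obj goes out of scope on return, but the cycle keeps it alive.


was_enabled = gc.isenabled()
gc.disable()  # keep the demo deterministic: no automatic collection
try:
    make_and_drop()
    before = list(finalized)  # nothing finalized yet
    gc.collect()              # the cycle is found and the finalizer runs
    after = list(finalized)
finally:
    if was_enabled:
        gc.enable()

print(before, after)
```

Without the explicit collection, an object like this may only be destroyed
during interpreter shutdown, which is the window the workaround above avoids.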


