jmahlik commented on issue #36980: URL: https://github.com/apache/arrow/issues/36980#issuecomment-1892559630
Still reproduces on pyarrow==14.0.2 and Python 3.12.1. It seems like a fix similar to https://github.com/apache/arrow/pull/38637 could be applied to the [NumPy buffer](https://github.com/apache/arrow/blob/7acbaf45ce2d5be31e70b552d1a24476c67383e6/python/pyarrow/src/arrow/python/numpy_convert.h#L42) so that its destructor only calls back into Python if the interpreter is still initialized. When I alter the example to run a full `gc.collect()` to clean up the NumPy-backed data object before interpreter shutdown, it no longer segfaults:

```python
import gc
import string
import tempfile

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.dataset as ds

NUM_ROWS = 10_000


def file_visitor(written_file):
    print(f"path={written_file.path}")
    print(f"size={written_file.size} bytes")
    print(f"metadata={written_file.metadata}")


def main() -> int:
    df = pd.DataFrame(
        {
            "float": np.random.rand(NUM_ROWS),
            "int": np.random.randint(0, 10000, size=NUM_ROWS),
            "string": np.random.choice(list(string.ascii_letters), size=NUM_ROWS),
        }
    )
    data = pa.Table.from_pandas(df)
    pa.show_info()  # prints build/version info and returns None

    with tempfile.TemporaryDirectory() as t:
        ds.write_dataset(
            data=data,
            format="parquet",
            base_dir=t,
            max_rows_per_file=1000,
            max_rows_per_group=1000,
            # So we can see all the files being written
            file_visitor=file_visitor,
            use_threads=True,
        )
        # This is critical to clean up the NumPy buffers and ensure their
        # destructors are called before interpreter shutdown
        del data
        gc.collect()
        print(f"Wrote data to {t}")
    return 0


if __name__ == "__main__":
    raise SystemExit(main())
```
