austin3dickey commented on issue #38438: URL: https://github.com/apache/arrow/issues/38438#issuecomment-1778015806
Here is a more stripped-down example:

```python
import pathlib
import shutil
import tempfile
import uuid

import pyarrow.dataset

# First, from https://ursa-labs-taxi-data.s3.us-east-2.amazonaws.com/
# download the '2009/01/data.parquet' through '2009/04/data.parquet' files.
source_dir = pathlib.Path("/Users/austin/data/ursa-labs-taxi-data/")  # <- change this dir
source_paths = list(source_dir.glob("**/*.parquet"))
tempdir = tempfile.TemporaryDirectory()

source_ds = pyarrow.dataset.dataset(
    source_paths,
    format="parquet",
    schema=pyarrow.dataset.dataset(source_paths[0], format="parquet").schema,
)

for n_rows in [561000, 5610000]:
    for serialization_format in ["parquet", "arrow", "feather", "csv"]:
        data = source_ds.head(
            n_rows,
            # # Uncomment this and the segfault does not happen!
            # fragment_scan_options=pyarrow.dataset.ParquetFragmentScanOptions(
            #     pre_buffer=False
            # ),
        )
        out_dir = pathlib.Path(tempdir.name) / str(uuid.uuid4())

        # This is where the segfault happens
        print(f"Writing to {serialization_format}")
        pyarrow.dataset.write_dataset(
            data=data,
            format=serialization_format,
            base_dir=out_dir,
            existing_data_behavior="overwrite_or_ignore",
        )
        print("Done")
        shutil.rmtree(out_dir)
```

I ran this a few times, and there is a mix of symptoms. See:

```
> python ~/Desktop/test.py
Writing to parquet
[1] 47390 segmentation fault python ~/Desktop/test.py

> python ~/Desktop/test.py
Writing to parquet
Done
Writing to arrow
[1] 47400 bus error python ~/Desktop/test.py

> python ~/Desktop/test.py
Writing to parquet
[1] 47413 bus error python ~/Desktop/test.py

> python ~/Desktop/test.py
Writing to parquet
Done
Writing to arrow
[1] 47431 segmentation fault python ~/Desktop/test.py
```
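
Since the symptom varies from run to run, it may help to tally the outcomes over many runs rather than eyeballing the shell output. Below is a minimal sketch of one way to do that with only the standard library, assuming a POSIX system where a child killed by a signal reports a negative return code. The helper names `describe_exit` and `tally_runs` are hypothetical and not part of pyarrow.

```python
import collections
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Map a subprocess return code to a short label."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        # On POSIX, a negative return code means the child was killed
        # by the signal with that number (e.g. SIGSEGV or SIGBUS).
        try:
            return signal.Signals(-returncode).name
        except ValueError:
            return f"signal {-returncode}"
    return f"exit {returncode}"


def tally_runs(command, n_runs=10):
    """Run `command` n_runs times and count how each run ended."""
    counts = collections.Counter()
    for _ in range(n_runs):
        result = subprocess.run(
            command,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        counts[describe_exit(result.returncode)] += 1
    return counts


# Hypothetical usage, with the path to the reproduction script filled in:
#   print(tally_runs([sys.executable, "/path/to/test.py"], n_runs=20))
```

Running each attempt in a fresh interpreter keeps one crash from taking down the tally itself, and the resulting counts give a rough sense of how often each symptom shows up.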
