austin3dickey commented on issue #38438:
URL: https://github.com/apache/arrow/issues/38438#issuecomment-1778015806

   Here is a more stripped-down example:
   ```python
   import pathlib
   import shutil
   import tempfile
   import uuid
   
   import pyarrow.dataset
   
   
   # First, from https://ursa-labs-taxi-data.s3.us-east-2.amazonaws.com/
   # download the '2009/01/data.parquet' through '2009/04/data.parquet' files.
   source_dir = pathlib.Path("/Users/austin/data/ursa-labs-taxi-data/")  # <- change this dir
   source_paths = list(source_dir.glob("**/*.parquet"))
   tempdir = tempfile.TemporaryDirectory()
   
   source_ds = pyarrow.dataset.dataset(
       source_paths,
       format="parquet",
       schema=pyarrow.dataset.dataset(source_paths[0], format="parquet").schema,
   )
   
   for n_rows in [561000, 5610000]:
       for serialization_format in ["parquet", "arrow", "feather", "csv"]:
           data = source_ds.head(
               n_rows,
               # # Uncomment this and the segfault does not happen!
               # fragment_scan_options=pyarrow.dataset.ParquetFragmentScanOptions(
               #     pre_buffer=False
               # ),
           )
           out_dir = pathlib.Path(tempdir.name) / str(uuid.uuid4())
   
           # This is where the segfault happens
           print(f"Writing to {serialization_format}")
           pyarrow.dataset.write_dataset(
               data=data,
               format=serialization_format,
               base_dir=out_dir,
               existing_data_behavior="overwrite_or_ignore",
           )
           print("Done")
   
           shutil.rmtree(out_dir)
   ```
   
   I ran this a few times and saw a mix of symptoms:
   ```
   > python ~/Desktop/test.py
   Writing to parquet
   [1]    47390 segmentation fault  python ~/Desktop/test.py
   
   > python ~/Desktop/test.py
   Writing to parquet
   Done
   Writing to arrow
   [1]    47400 bus error  python ~/Desktop/test.py
   
   > python ~/Desktop/test.py
   Writing to parquet
   [1]    47413 bus error  python ~/Desktop/test.py
   
   > python ~/Desktop/test.py
   Writing to parquet
   Done
   Writing to arrow
   [1]    47431 segmentation fault  python ~/Desktop/test.py
   ```
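
Since the failure mode differs from run to run, tallying the outcomes over repeated runs may make the pattern easier to see. The following is only a minimal sketch and was not part of the runs above: `describe_exit` and `tally_runs` are hypothetical helper names, and it assumes a POSIX system, where `subprocess` reports a child killed by a signal as a negative return code.

```python
import collections
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Turn a subprocess return code into a short label."""
    if returncode < 0:
        # On POSIX, a negative return code means the child was killed
        # by the signal with that number (e.g. SIGSEGV or SIGBUS).
        try:
            return signal.Signals(-returncode).name
        except ValueError:
            return f"signal {-returncode}"
    return f"exit {returncode}"


def tally_runs(command, n_runs=10):
    """Run `command` n_runs times and count how each run ended."""
    outcomes = collections.Counter()
    for _ in range(n_runs):
        result = subprocess.run(
            command,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        outcomes[describe_exit(result.returncode)] += 1
    return outcomes


# Hypothetical usage, with the path adjusted to wherever the script lives:
# print(tally_runs([sys.executable, "/path/to/test.py"], n_runs=10))
```

The resulting counter separates clean exits from signal deaths, so a mix of segmentation faults and bus errors would appear as separate `SIGSEGV` and `SIGBUS` counts.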

