xquyvu opened a new issue, #39030:
URL: https://github.com/apache/arrow/issues/39030
### Describe the bug, including details regarding any error messages, version, and platform.
As described in the title, the row order of the input is not preserved when writing with `pyarrow.dataset.write_dataset`. I have tested this with both the `parquet` and `csv` file formats (a `csv` variant is sketched after the repro below).
```python
import numpy as np
import pandas as pd
import pyarrow.dataset
import pyarrow.parquet as pq

data_load_path = './data.parquet'
pyarrow_dataset_write_path = './pyarrow_saved_data.parquet'

data = pd.DataFrame({'col': np.arange(1e7)})
data.to_parquet(data_load_path)

# Check that the data loaded with pyarrow matches the original frame
pyarrow_dataset = pyarrow.dataset.dataset(data_load_path, format='parquet')
pyarrow_dataset_df = pyarrow_dataset.to_table().to_pandas()
print((pyarrow_dataset_df['col'] == data['col']).all())  # True

# Write with pyarrow.dataset.write_dataset: row order is not preserved
pyarrow.dataset.write_dataset(
    pyarrow_dataset,
    pyarrow_dataset_write_path,
    format='parquet',
)
loaded_pyarrow_dataset = pyarrow.dataset.dataset(
    pyarrow_dataset_write_path, format='parquet'
)
loaded_pyarrow_dataset_df = loaded_pyarrow_dataset.to_table().to_pandas()
print((loaded_pyarrow_dataset_df['col'] == data['col']).all())   # False
print((loaded_pyarrow_dataset_df['col'] == data['col']).mean())  # 0.29

# Write with pq.write_to_dataset: row order is preserved
pq.write_to_dataset(
    pyarrow_dataset,
    'x.parquet',
    max_rows_per_group=1024 * 1024,
    max_rows_per_file=10 * 1024 * 1024,  # 10 row groups of default size
    existing_data_behavior='delete_matching',
)
print((pyarrow.dataset.dataset('x.parquet').to_table().to_pandas()['col'] == data['col']).all())  # True
```
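For reference, here is a minimal sketch of the same check with the `csv` format; the `csv_write_path` directory name is just illustrative, and it reuses `data` and `pyarrow_dataset` from the snippet above:

```python
# Same repro with the csv format (output directory name is illustrative)
csv_write_path = './pyarrow_saved_data_csv'

pyarrow.dataset.write_dataset(
    pyarrow_dataset,
    csv_write_path,
    format='csv',
)
loaded_csv_df = pyarrow.dataset.dataset(csv_write_path, format='csv').to_table().to_pandas()
print((loaded_csv_df['col'] == data['col']).all())  # order is not preserved here either
```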
### Component(s)
Python