xquyvu opened a new issue, #39030:
URL: https://github.com/apache/arrow/issues/39030
### Describe the bug, including details regarding any error messages, version, and platform.
As described in the title, the row order of the input is not preserved when writing with `pyarrow.dataset.write_dataset`. I have tested this with both the `parquet` and `csv` file formats (a `csv` variant is sketched after the repro below).
```python
import numpy as np
import pandas as pd
import pyarrow.dataset
import pyarrow.parquet as pq

data_load_path = './data.parquet'
pyarrow_dataset_write_path = './pyarrow_saved_data.parquet'

data = pd.DataFrame({'col': np.arange(1e7)})
data.to_parquet(data_load_path)

# Check that the data loaded with pyarrow matches the original frame
pyarrow_dataset = pyarrow.dataset.dataset(data_load_path, format='parquet')
pyarrow_dataset_df = pyarrow_dataset.to_table().to_pandas()
print((pyarrow_dataset_df['col'] == data['col']).all())  # True

# Write with pyarrow.dataset.write_dataset: row order is not preserved
pyarrow.dataset.write_dataset(
    pyarrow_dataset,
    pyarrow_dataset_write_path,
    format='parquet',
)
loaded_pyarrow_dataset = pyarrow.dataset.dataset(
    pyarrow_dataset_write_path, format='parquet'
)
loaded_pyarrow_dataset_df = loaded_pyarrow_dataset.to_table().to_pandas()
print((loaded_pyarrow_dataset_df['col'] == data['col']).all())   # False
print((loaded_pyarrow_dataset_df['col'] == data['col']).mean())  # 0.29

# Write with pq.write_to_dataset: row order is preserved
pq.write_to_dataset(
    pyarrow_dataset,
    'x.parquet',
    max_rows_per_group=1024 * 1024,
    max_rows_per_file=10 * 1024 * 1024,  # 10 row groups of default size
    existing_data_behavior='delete_matching',
)
print((pyarrow.dataset.dataset('x.parquet').to_table().to_pandas()['col'] == data['col']).all())  # True
```
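For reference, here is a minimal sketch of the same check with the `csv` format; the `csv_write_path` directory name is just illustrative, and it reuses `data` and `pyarrow_dataset` from the snippet above:

```python
# Same repro with the csv format (output directory name is illustrative)
csv_write_path = './pyarrow_saved_data_csv'

pyarrow.dataset.write_dataset(
    pyarrow_dataset,
    csv_write_path,
    format='csv',
)
loaded_csv_df = pyarrow.dataset.dataset(csv_write_path, format='csv').to_table().to_pandas()
print((loaded_csv_df['col'] == data['col']).all())  # order is not preserved here either
```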
### Component(s)
Python