wjones127 commented on issue #33605:
URL: https://github.com/apache/arrow/issues/33605#issuecomment-1379284376
Hi @droher. I was unable to reproduce this with PyArrow 10.0.1. The data
read back from Parquet is always equal to the data originally read from the
CSV file.
Below is what I tried. Could you clarify how you checked the values in the
Parquet file?
```python
In [1]: from pyarrow import csv, parquet
   ...: import pyarrow.dataset as ds

In [2]: table = csv.read_csv(
   ...:     "~/Downloads/event_baserunning_advance_attempt.csv.gz",
   ...:     convert_options=csv.ConvertOptions(
   ...:         strings_can_be_null=True,
   ...:     )
   ...: )
   ...: table.validate(full=True)

In [3]: parquet.write_table(table, "test.parquet")
   ...: table_pq = parquet.read_table("test.parquet")
   ...: table_pq.validate(full=True)

In [4]: table.equals(table_pq)
Out[4]: True

In [5]: table.filter((ds.field("game_id") == "ANA201405230") &
   ...:              (ds.field("event_id") == 50)).to_pandas()
Out[5]:
        game_id  event_id  sequence_id  baserunner  attempted_advance_to  \
0  ANA201405230        50            1       First                Second

   is_successful  advanced_on_error_flag  safe_on_error_flag  rbi_flag  \
0           True                   False               False     False

   team_unearned_flag
0               False

In [6]: table_pq.filter((ds.field("game_id") == "ANA201405230") &
   ...:                 (ds.field("event_id") == 50)).to_pandas()
Out[6]:
        game_id  event_id  sequence_id  baserunner  attempted_advance_to  \
0  ANA201405230        50            1       First                Second

   is_successful  advanced_on_error_flag  safe_on_error_flag  rbi_flag  \
0           True                   False               False     False

   team_unearned_flag
0               False

In [7]: parquet.write_table(table, "test.parquet", write_batch_size=10000)
   ...: table_pq = parquet.read_table("test.parquet")
   ...: table_pq.validate(full=True)

In [8]: table.equals(table_pq)
Out[8]: True
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.