[GitHub] [arrow] wjones127 commented on issue #33605: [Python] Parquet file writes incorrect booleans on large file with default write batch size

GitBox Wed, 11 Jan 2023 10:05:52 -0800


wjones127 commented on issue #33605:
URL: https://github.com/apache/arrow/issues/33605#issuecomment-1379284376


   Hi @droher. I was unable to reproduce this with PyArrow 10.0.1. With both 
the default write batch size and `write_batch_size=10000`, the data read back 
from Parquet is equal to the data originally read from the CSV file.
   
   Below is what I tried. Could you clarify how you checked the values in the 
Parquet file?
   
   ```python
   In [1]: from pyarrow import csv, parquet
      ...: import pyarrow.dataset as ds
   
   In [2]: table = csv.read_csv(
      ...:     "~/Downloads/event_baserunning_advance_attempt.csv.gz",
      ...:     convert_options=csv.ConvertOptions(
      ...:         strings_can_be_null=True,
      ...:     )
      ...: )
      ...: table.validate(full=True)
   
   In [3]: parquet.write_table(table, "test.parquet")
      ...: table_pq = parquet.read_table("test.parquet")
      ...: table_pq.validate(full=True)
   
   In [4]: table.equals(table_pq)
   Out[4]: True
   
   In [5]: table.filter((ds.field("game_id") == "ANA201405230") &
      ...:              (ds.field("event_id") == 50)).to_pandas()
   Out[5]:
           game_id  event_id  sequence_id baserunner attempted_advance_to  is_successful  advanced_on_error_flag  safe_on_error_flag  rbi_flag  team_unearned_flag
   0  ANA201405230        50            1      First               Second           True                   False               False     False               False
   
   In [6]: table_pq.filter((ds.field("game_id") == "ANA201405230") &
      ...:                 (ds.field("event_id") == 50)).to_pandas()
   Out[6]:
           game_id  event_id  sequence_id baserunner attempted_advance_to  is_successful  advanced_on_error_flag  safe_on_error_flag  rbi_flag  team_unearned_flag
   0  ANA201405230        50            1      First               Second           True                   False               False     False               False
   
   In [7]: parquet.write_table(table, "test.parquet", write_batch_size=10000)
      ...: table_pq = parquet.read_table("test.parquet")
      ...: table_pq.validate(full=True)
   
   In [8]: table.equals(table_pq)
   Out[8]: True
   ```


