droher opened a new issue, #33605:
URL: https://github.com/apache/arrow/issues/33605

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   I found a case in which PyArrow writes a Parquet file with incorrect boolean 
values.
   
   Using this CSV (9 million rows): 
https://pub-4550af7adec143a391557444b6e41067.r2.dev/event_baserunning_advance_attempt.csv.gz
   
   I can find at least a handful of boolean values that are flipped when 
writing to Parquet, but not when writing to CSV.
   ```
   from pyarrow import csv, parquet
   
   # Read the CSV, allowing null values in string columns.
   table = csv.read_csv(
       "event_baserunning_advance_attempt.csv",
       convert_options=csv.ConvertOptions(
           strings_can_be_null=True,
       ),
   )
   # Write the same table to both formats, using the default write_batch_size.
   parquet.write_table(table, "test.parquet", compression="zstd")
   csv.write_csv(table, "test.csv")
   ```
   For a specific example, line 8432295 in the original file reads:
   `ANA201405230,50,1,First,Second,true,false,false,false,false`
   The new CSV file translates it correctly:
   `"ANA201405230",50,1,"First","Second",true,false,false,false,false`
   But the Parquet file flips the first boolean value:
   `ANA201405230,50,1,First,Second,false,false,false,false,false`
   
    It looks like the issue can be resolved with a much larger 
`write_batch_size` than the default setting:
   ```
   parquet.write_table(table, "test.parquet", write_batch_size=1000000)
   ```
   The above change will output the correct row.
   
   The error appears to be deterministic, as the rows I've spot-checked are 
wrong each time, even with a variety of changes to settings and environments.
   This specific row is incorrect each time in each of the following 
environments/settings:
   - PyArrow versions: 8.0.0, 10.0.1
   - OS: macOS 13.1 on a 2020 MacBook (M1), Ubuntu 22.04
   - Other `write_table` options tried: version, data_page_version, flavor, 
use_dictionary, compression
   
   ### Component(s)
   
   Parquet, Python

