droher opened a new issue, #33605: URL: https://github.com/apache/arrow/issues/33605
### Describe the bug, including details regarding any error messages, version, and platform.

I found a case in which PyArrow writes a Parquet file with incorrect boolean values. Using this CSV (9 million rows): https://pub-4550af7adec143a391557444b6e41067.r2.dev/event_baserunning_advance_attempt.csv.gz I can find at least a handful of boolean values that flip incorrectly when writing to Parquet, but not to CSV.

```
from pyarrow import csv, parquet

table = csv.read_csv(
    "event_baserunning_advance_attempt.csv",
    convert_options=csv.ConvertOptions(
        strings_can_be_null=True,
    ),
)
parquet.write_table(table, "test.parquet", compression='zstd')
csv.write_csv(table, "test.csv")
```

For a specific example, line 8432295 in the original file reads:

`ANA201405230,50,1,First,Second,true,false,false,false,false`

The new CSV file translates it correctly:

`"ANA201405230",50,1,"First","Second",true,false,false,false,false`

But the Parquet file flips the first boolean value:

`ANA201405230,50,1,First,Second,false,false,false,false,false`

It looks like the issue can be resolved by using a much larger `write_batch_size` than the default:

```
parquet.write_table(table, "test.parquet", write_batch_size=1000000)
```

With that change, the row is written correctly. The error appears to be deterministic: the rows I've spot-checked are wrong every time, across a variety of settings and environments. This specific row is incorrect in each of the following environments/settings:

- PyArrow versions: 8.0.0, 10.0.1
- OS: 2020 MacBook M1 on macOS 13.1, Ubuntu 22.04
- Other `write_table` options tried: `version`, `data_page_version`, `flavor`, `use_dictionary`, `compression`

### Component(s)

Parquet, Python

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
