adamreeve opened a new issue, #47027: URL: https://github.com/apache/arrow/issues/47027
### Describe the bug, including details regarding any error messages, version, and platform.

I noticed a regression when upgrading from Arrow 19.0.1 to 20.0.0 and writing Parquet files with a repeated column. It appears that the change to enable writing the page index by default (#45249) has caused the logic for starting new data pages to change, and so page sizes can become very large, and can overflow int32.

Repro code, from commit https://github.com/adamreeve/arrow/commit/8444bf63a596815f7af94d2631fa094efbe0cad7:

```C++
TEST(TestColumnWriter, WriteLargeLists) {
  auto sink = CreateOutputStream();
  auto schema = std::static_pointer_cast<GroupNode>(GroupNode::Make(
      "schema", Repetition::REQUIRED,
      {
          GroupNode::Make(
              "x", Repetition::OPTIONAL,
              {
                  GroupNode::Make("list", Repetition::REPEATED,
                                  {
                                      schema::Float("element", Repetition::REQUIRED),
                                  },
                                  nullptr),
              },
              LogicalType::List()),
      }));
  auto properties = WriterProperties::Builder()
                        .disable_dictionary()
                        //->disable_write_page_index()
                        ->build();
  auto file_writer = ParquetFileWriter::Open(sink, schema, properties);
  auto rg_writer = file_writer->AppendRowGroup();

  constexpr int64_t num_rows = 1000 * 1000;
  constexpr int64_t num_list_elements = 1000;

  std::vector<int16_t> def_levels(num_list_elements);
  std::vector<int16_t> rep_levels(num_list_elements);
  std::vector<float> values(num_list_elements);
  for (int32_t i = 0; i < num_list_elements; ++i) {
    def_levels[i] = 2;
    rep_levels[i] = i == 0 ? 0 : 1;
  }

  auto col_writer = dynamic_cast<FloatWriter*>(rg_writer->NextColumn());

  for (int32_t i = 0; i < num_rows; i++) {
    random_numbers(num_list_elements, i, -100.0f, 100.0f, values.data());
    col_writer->WriteBatch(num_list_elements, def_levels.data(), rep_levels.data(),
                           values.data());
  }

  col_writer->Close();
  rg_writer->Close();
  file_writer->Close();
  ASSERT_OK_AND_ASSIGN(auto buffer, sink->Finish());
}
```

This runs with `disable_write_page_index()`, but otherwise crashes with:

```
C++ exception with description "Uncompressed data page size overflows INT32_MAX. Size:4005000014" thrown in the test body.
```

### Component(s)

C++, Parquet