adamreeve opened a new issue, #47027:
URL: https://github.com/apache/arrow/issues/47027

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   I noticed a regression when upgrading from Arrow 19.0.1 to 20.0.0 and
writing Parquet files with a repeated column. It appears that enabling the
page index by default (#45249) changed the logic that decides when to start a
new data page, so a single data page can grow very large and its size can
overflow int32.
   
   Repro code, from commit 
https://github.com/adamreeve/arrow/commit/8444bf63a596815f7af94d2631fa094efbe0cad7:
   ```C++
   TEST(TestColumnWriter, WriteLargeLists) {
     auto sink = CreateOutputStream();
     auto schema = std::static_pointer_cast<GroupNode>(GroupNode::Make(
         "schema", Repetition::REQUIRED,
         {
             GroupNode::Make(
                 "x", Repetition::OPTIONAL,
                 {
                     GroupNode::Make("list", Repetition::REPEATED,
                                     {
                                         schema::Float("element", Repetition::REQUIRED),
                                     },
                                     nullptr),
                 },
                 LogicalType::List()),
         }));
     auto properties = WriterProperties::Builder()
                           .disable_dictionary()
                           //->disable_write_page_index()
                           ->build();
     auto file_writer = ParquetFileWriter::Open(sink, schema, properties);
     auto rg_writer = file_writer->AppendRowGroup();
   
     constexpr int64_t num_rows = 1000 * 1000;
     constexpr int64_t num_list_elements = 1000;
   
     std::vector<int16_t> def_levels(num_list_elements);
     std::vector<int16_t> rep_levels(num_list_elements);
     std::vector<float> values(num_list_elements);
   
     for (int32_t i = 0; i < num_list_elements; ++i) {
       def_levels[i] = 2;
       rep_levels[i] = i == 0 ? 0 : 1;
     }
   
     auto col_writer = dynamic_cast<FloatWriter*>(rg_writer->NextColumn());
     for (int32_t i = 0; i < num_rows; i++) {
       random_numbers(num_list_elements, i, -100.0f, 100.0f, values.data());
       col_writer->WriteBatch(num_list_elements, def_levels.data(), rep_levels.data(),
                              values.data());
     }
     col_writer->Close();
   
     rg_writer->Close();
     file_writer->Close();
     ASSERT_OK_AND_ASSIGN(auto buffer, sink->Finish());
   }
   ```
   
   This runs successfully when the `disable_write_page_index()` line is uncommented, but otherwise fails with:
   ```
   C++ exception with description "Uncompressed data page size overflows INT32_MAX. Size:4005000014" thrown in the test body.
   ```
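
As a rough sanity check on the reported size, the arithmetic below uses only the constants from the repro and the number in the exception message. It assumes the float values are PLAIN-encoded at 4 bytes each (dictionary encoding is disabled in the repro); the constant and function names are made up for this sketch and are not part of the Parquet API. The raw values of the whole column chunk come to 4,000,000,000 bytes, which is within about 5 MB of the reported page size, so it looks as though essentially every value written ended up in one data page.

```cpp
#include <cstdint>
#include <limits>

// Figures copied from the repro above.
constexpr int64_t kNumRows = 1000 * 1000;
constexpr int64_t kNumListElements = 1000;

// Raw bytes needed for the float values alone, assuming PLAIN encoding
// (4 bytes per value). Hypothetical helper, not a Parquet function.
constexpr int64_t PlainValueBytes(int64_t rows, int64_t elements_per_row) {
  return rows * elements_per_row * static_cast<int64_t>(sizeof(float));
}

constexpr int64_t kValueBytes = PlainValueBytes(kNumRows, kNumListElements);

// Size quoted in the exception message.
constexpr int64_t kReportedPageSize = 4005000014LL;

constexpr int64_t kInt32Max = std::numeric_limits<int32_t>::max();

static_assert(kValueBytes == 4000000000LL,
              "1e6 rows * 1e3 elements * 4 bytes");
static_assert(kValueBytes > kInt32Max,
              "the values alone exceed what an int32 page size can hold");
static_assert(kReportedPageSize - kValueBytes == 5000014LL,
              "remaining bytes of the page, not broken down in this sketch");
```

The snippet has no `main`; the `static_assert`s are checked at compile time, so compiling it as a translation unit is enough to confirm the numbers.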
   
   ### Component(s)
   
   C++, Parquet

