HuaHuaY opened a new issue, #47664:
URL: https://github.com/apache/arrow/issues/47664

   ### Describe the bug, including details regarding any error messages, 
version, and platform.
   
   `nums_rows()` method is inherited from `ParquetFileWriter`. It seems that 
the method is designed to count the number of rows of RowGroups which are 
closed.
   
   ```cpp
   /// Number of rows in the yet started RowGroups.
   ///
   /// Changes on the addition of a new RowGroup.
   int64_t num_rows() const;
   ```
   
   But in the implementation of `class FileSerializer : public 
ParquetFileWriter::Contents`, the `num_rows_` variable is only modified in 
`Close` method. It means that `num_rows()` method always returns `0` before 
`Close` is called.
   And after `Close` of `ParquetFileWriter` is called we can no longer call 
`nums_rows()` method, because `contents_` is reset in `Close` and we will get a 
null pointer exception if we call `nums_rows()`.
   
   So, in the current implementation of `FileSerializer`, `num_rows()` will 
only return `0` or throw exception.
   
   ```cpp
   int ParquetFileWriter::num_columns() const { return 
contents_->num_columns(); }
   
   void ParquetFileWriter::Close() {
     if (contents_) {
       contents_->Close();
       file_metadata_ = contents_->metadata();
       contents_.reset();
     }
   }
   
   // The following is the code of `FileSerializer`.
   
   int64_t num_rows() const override { return num_rows_; }
   
   void Close() override {
       if (is_open_) {
         if (row_group_writer_) {
           num_rows_ += row_group_writer_->num_rows();    // <- num_rows is 
only modified here and can never be read.
           ...
         }
       ...
   }
   
   RowGroupWriter* AppendRowGroup(bool buffered_row_group) {
     if (row_group_writer_) {
       // <- A possible missing `num_rows_ += row_group_writer_->num_rows()` 
here.
       row_group_writer_->Close();
     }
     ...
   }
   ```
   
   Should we mark the virtual method in `ParquetFileWriter` deprecated and 
remove the method in next release? I guess the method which is used to count 
the row number during writing row groups is useless and our default 
implementation always returns 0 now. Otherwise we can fix the issue by 
following the comments above.
   
   
   ### Component(s)
   
   C++


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to