mapleFU opened a new issue, #39210:
URL: https://github.com/apache/arrow/issues/39210
### Describe the enhancement requested
The Arrow Parquet writer uses `AppendRowGroup` to create row groups while
writing a Parquet file. The relevant code path is in
`FileWriterImpl::WriteRecordBatch`, shown below:
```c++
while (offset < batch.num_rows()) {
  const int64_t batch_size =
      std::min(max_row_group_length - row_group_writer_->num_rows(),
               batch.num_rows() - offset);
  RETURN_NOT_OK(WriteBatch(offset, batch_size));
  offset += batch_size;
  // Flush current row group if it is full.
  if (row_group_writer_->num_rows() >= max_row_group_length) {
    RETURN_NOT_OK(NewBufferedRowGroup());
  }
}
```
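Assume `max_row_group_length == k`. If the input satisfies
`recordBatch.num_rows() % k == 0`, the final `WriteBatch` call fills the
current row group exactly, so the loop calls `NewBufferedRowGroup()` once more
and appends an empty row group.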
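To make the arithmetic concrete, here is a minimal standalone sketch that mirrors the loop above without using Arrow. `SimulateEagerRowGroups` is a hypothetical helper, not part of the Arrow API. It assumes one row group is already open before the first write, and it replaces `WriteBatch` and `NewBufferedRowGroup` with plain counter updates.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the writer: returns the row count of every row
// group that would be opened when writing `batches` with the loop above.
std::vector<int64_t> SimulateEagerRowGroups(const std::vector<int64_t>& batches,
                                            int64_t max_row_group_length) {
  assert(max_row_group_length > 0);
  // Assumption: one row group is already open when writing starts.
  std::vector<int64_t> row_groups{0};
  for (int64_t num_rows : batches) {
    int64_t offset = 0;
    while (offset < num_rows) {
      const int64_t batch_size =
          std::min(max_row_group_length - row_groups.back(), num_rows - offset);
      row_groups.back() += batch_size;  // stands in for WriteBatch(...)
      offset += batch_size;
      // Flush current row group if it is full; this eagerly opens the next one.
      if (row_groups.back() >= max_row_group_length) {
        row_groups.push_back(0);  // stands in for NewBufferedRowGroup()
      }
    }
  }
  return row_groups;
}
```

With `k = 100`, a single 200-row batch yields row groups of 100, 100 and 0 rows, while a 150-row batch yields 100 and 50 with no empty trailer.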
An empty row group is a bit tricky to handle: it contains only metadata and
no data. It looks like this:
```
{
"Id": "2", "TotalBytes": "0", "TotalCompressedBytes": "0", "Rows": "0",
"ColumnChunks": [
{"Id": "0", "Values": "0", "StatsSet": "False",
"Compression": "UNCOMPRESSED", "Encodings": "",
"UncompressedSize": "0", "CompressedSize": "0" },
{"Id": "1", "Values": "0", "StatsSet": "False",
"Compression": "UNCOMPRESSED", "Encodings": "",
"UncompressedSize": "0", "CompressedSize": "0" },
{"Id": "2", "Values": "0", "StatsSet": "False",
"Compression": "UNCOMPRESSED", "Encodings": "",
"UncompressedSize": "0", "CompressedSize": "0" },
{"Id": "3", "Values": "0", "StatsSet": "False",
"Compression": "UNCOMPRESSED", "Encodings": "",
"UncompressedSize": "0", "CompressedSize": "0" },
{"Id": "4", "Values": "0", "StatsSet": "False",
"Compression": "UNCOMPRESSED", "Encodings": "",
"UncompressedSize": "0", "CompressedSize": "0" },
{"Id": "5", "Values": "0", "StatsSet": "False",
"Compression": "UNCOMPRESSED", "Encodings": "",
"UncompressedSize": "0", "CompressedSize": "0" },
{"Id": "6", "Values": "0", "StatsSet": "False",
"Compression": "UNCOMPRESSED", "Encodings": "",
"UncompressedSize": "0", "CompressedSize": "0" },
{"Id": "7", "Values": "0", "StatsSet": "False",
"Compression": "UNCOMPRESSED", "Encodings": "",
"UncompressedSize": "0", "CompressedSize": "0" },
{"Id": "8", "Values": "0", "StatsSet": "False",
"Compression": "UNCOMPRESSED", "Encodings": "",
"UncompressedSize": "0", "CompressedSize": "0" },
{"Id": "9", "Values": "0", "StatsSet": "False",
"Compression": "UNCOMPRESSED", "Encodings": "",
"UncompressedSize": "0", "CompressedSize": "0" },
{"Id": "10", "Values": "0", "StatsSet": "False",
"Compression": "UNCOMPRESSED", "Encodings": "",
"UncompressedSize": "0", "CompressedSize": "0" },
{"Id": "11", "Values": "0", "StatsSet": "False",
"Compression": "UNCOMPRESSED", "Encodings": "",
"UncompressedSize": "0", "CompressedSize": "0" }
]
}
```
Although it is hard to prevent every case of an empty row group, maybe we can
at least prevent it for the last write to the file.
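One possible direction, shown here only as a sketch and not as the actual Arrow change, is to open a new row group lazily, right before rows are written, instead of eagerly after the previous one fills up. `SimulateLazyRowGroups` is a hypothetical helper that models row groups as plain counters. Unlike the real writer, it opens no row group at all when there is nothing to write.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a writer that opens row groups lazily: returns the
// row count of every row group opened while writing `batches`.
std::vector<int64_t> SimulateLazyRowGroups(const std::vector<int64_t>& batches,
                                           int64_t max_row_group_length) {
  assert(max_row_group_length > 0);
  std::vector<int64_t> row_groups;
  for (int64_t num_rows : batches) {
    int64_t offset = 0;
    while (offset < num_rows) {
      // Open a row group only when there are rows left to write and the
      // current one (if any) is already full.
      if (row_groups.empty() || row_groups.back() >= max_row_group_length) {
        row_groups.push_back(0);  // stands in for NewBufferedRowGroup()
      }
      const int64_t batch_size =
          std::min(max_row_group_length - row_groups.back(), num_rows - offset);
      row_groups.back() += batch_size;  // stands in for WriteBatch(...)
      offset += batch_size;
    }
  }
  return row_groups;
}
```

With `k = 100`, a 200-row batch now yields exactly two row groups of 100 rows each, and two consecutive 100-row batches do the same, so no empty row group trails the file.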
### Component(s)
C++, Parquet