mapleFU commented on PR #33897:
URL: https://github.com/apache/arrow/pull/33897#issuecomment-1440131022
@westonpace Let me explain it in detail here:
A Parquet file may contain multiple row groups, each row group may contain
multiple columns, and each column may contain multiple pages.
For a `ParquetFileWriter`, the user can call `AppendBufferedRowGroup` or
`AppendRowGroup` to acquire a buffered or an unbuffered row group writer.
A buffered `RowGroupWriter` can be used like this:
```
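# pseudocode: every column stays open, so batches can be interleaved
row_group = file_writer->AppendBufferedRowGroup()
for batch in value-batches:
    for i in range(num_columns):
        column_writer = row_group->column(i)
        typed_column_writer = cast(column_writer)
        typed_column_writer->WriteBatch(batch[i])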
```
An unbuffered `RowGroupWriter` can be used like this:
```
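# pseudocode: columns are written one after another
row_group = file_writer->AppendRowGroup()
for i in range(num_columns):
    column_writer = row_group->NextColumn()
    writeAllValues(column_writer)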
```
The above are the user-facing APIs, which you may already be familiar with.
Now let me explain the interface of this patch, and why the previous patch was
tricky.
An active column uses a `(Typed)ColumnWriter` to write values, and the
`ColumnWriter` holds a `PageWriter` that writes pages to the sink. If
dictionary encoding is not enabled, it looks like this:
## Type1: non-buffered-non-dict:
```
Only one active column writer
Column2: | DataPage1 | DataPage2 | Buffered values that have not become a page yet |
```
In this case, the byte counters are:
1. `RowGroupWriter.total_compressed_bytes`: 0
2. `RowGroupWriter.total_bytes_written`: It **doesn't** mean
`Column1-Data-Bytes + Column2-DataPage1 + Column2-DataPage2`. In fact, it means
the **uncompressed** `Column1-IO-Bytes + Column2-DataPage1 + Column2-DataPage2`,
even if a compression codec such as zstd is enabled.
3. (New in this patch) `RowGroupWriter.total_compressed_bytes_written`: It
means the **compressed** `Column1-Data-Bytes + Column2-DataPage1 +
Column2-DataPage2`. It will be slightly less than the real I/O bytes, because
pages and columns may also write some metadata, but the two will be close.
4. `ColumnWriter.total_compressed_bytes`: 0
5. `ColumnWriter.total_bytes_written`: **uncompressed** `Column2-DataPage1 +
Column2-DataPage2`
6. (New in this patch) `ColumnWriter.total_compressed_bytes_written`: It
means the **compressed** `Column2-DataPage1 + Column2-DataPage2`.
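To make the Type1 counters concrete, here is a toy model of the column-level bookkeeping. It is only a sketch: `ToyPage` and `ToyDirectColumnWriter` are hypothetical names, not parquet-cpp classes, and the model ignores page headers and other metadata that the real writer also emits.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for one finished data page (not a parquet-cpp type).
struct ToyPage {
  int64_t uncompressed_size;
  int64_t compressed_size;
};

// Toy model of Type1: every finished page goes straight to the sink,
// so the writer never holds compressed pages in memory.
class ToyDirectColumnWriter {
 public:
  void WritePage(const ToyPage& page) {
    assert(page.uncompressed_size >= 0 && page.compressed_size >= 0);
    total_bytes_written_ += page.uncompressed_size;
    total_compressed_bytes_written_ += page.compressed_size;
  }

  // Compressed bytes of pages still buffered in memory: always 0 here.
  int64_t total_compressed_bytes() const { return 0; }
  // Uncompressed size of the pages already sent to the sink.
  int64_t total_bytes_written() const { return total_bytes_written_; }
  // Compressed size of the pages already sent to the sink.
  int64_t total_compressed_bytes_written() const {
    return total_compressed_bytes_written_;
  }

 private:
  int64_t total_bytes_written_ = 0;
  int64_t total_compressed_bytes_written_ = 0;
};
```

With two pages of made-up sizes (1000/400 and 800/300 bytes uncompressed/compressed), the model reports 0, 1800 and 700 for the three counters, which matches items 4 to 6 above.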
## Type2: non-buffered-dict:
Now things are a little different. With dictionary encoding enabled, the
`ColumnWriter` keeps finished pages in a `data_pages_` vector instead of
writing them to the sink right away.
```
Only one active column writer
Column2: | Buffered values that have not become a page yet |
* Column2: buffers DataPage1 and DataPage2
* Column2: is building the dictionary page
```
Now, let's revisit the `ColumnWriter` counters:
1. `ColumnWriter.total_compressed_bytes`: **compressed** `Column2-DataPage1 +
Column2-DataPage2`
2. `ColumnWriter.total_bytes_written`: 0
3. (New in this patch) `ColumnWriter.total_compressed_bytes_written`: 0
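The same idea as a toy model for Type2. Again this is just a sketch under my own assumptions: `ToyBufferedPage` and `ToyDictColumnWriter` are hypothetical names, and `FlushBufferedPages` stands for whatever step eventually sends the buffered pages to the sink. I assume that step moves the sizes into the "written" counters and resets the in-memory counter.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for one finished data page (not a parquet-cpp type).
struct ToyBufferedPage {
  int64_t uncompressed_size;
  int64_t compressed_size;
};

// Toy model of Type2: finished pages are parked in data_pages_ while the
// dictionary is still being built, so nothing reaches the sink yet.
class ToyDictColumnWriter {
 public:
  void AddPage(const ToyBufferedPage& page) {
    assert(page.uncompressed_size >= 0 && page.compressed_size >= 0);
    data_pages_.push_back(page);
    total_compressed_bytes_ += page.compressed_size;
  }

  // Assumed behavior: send every buffered page to the sink, then reset
  // the in-memory counter.
  void FlushBufferedPages() {
    for (const ToyBufferedPage& page : data_pages_) {
      total_bytes_written_ += page.uncompressed_size;
      total_compressed_bytes_written_ += page.compressed_size;
    }
    data_pages_.clear();
    total_compressed_bytes_ = 0;
  }

  int64_t total_compressed_bytes() const { return total_compressed_bytes_; }
  int64_t total_bytes_written() const { return total_bytes_written_; }
  int64_t total_compressed_bytes_written() const {
    return total_compressed_bytes_written_;
  }

 private:
  std::vector<ToyBufferedPage> data_pages_;
  int64_t total_compressed_bytes_ = 0;
  int64_t total_bytes_written_ = 0;
  int64_t total_compressed_bytes_written_ = 0;
};
```

Before the flush, only `total_compressed_bytes` is non-zero, which is the state listed above. After the flush, the model looks like Type1.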
## Type3: buffered-non-dict:
```
Two active column writers
Column1: | DataPage1 | Buffered values that have not become a page yet |
Column2: | DataPage1 | DataPage2 | Buffered values that have not become a page yet |
```
1. `ColumnWriter.total_compressed_bytes`: **compressed** `c1.DataPage1 +
c2.DataPage1 + c2.DataPage2`
2. `ColumnWriter.total_bytes_written`: 0
3. (New in this patch) `ColumnWriter.total_compressed_bytes_written`: 0
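For Type3, the sum in item 1 runs over every active column. A toy sketch of that aggregation follows, reading the sum as a total over the whole row group (that reading is my interpretation). `ToyBufferedColumn` and `RowGroupTotalCompressedBytes` are hypothetical names, not parquet-cpp APIs.

```cpp
#include <cassert>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical per-column state in a buffered row group: the compressed
// sizes of the pages this column currently holds in memory.
struct ToyBufferedColumn {
  std::vector<int64_t> compressed_page_sizes;

  int64_t total_compressed_bytes() const {
    return std::accumulate(compressed_page_sizes.begin(),
                           compressed_page_sizes.end(), int64_t{0});
  }
};

// Sums the in-memory compressed bytes over all active columns.
int64_t RowGroupTotalCompressedBytes(
    const std::vector<ToyBufferedColumn>& columns) {
  int64_t total = 0;
  for (const ToyBufferedColumn& column : columns) {
    total += column.total_compressed_bytes();
  }
  assert(total >= 0);
  return total;
}
```

With one page of 400 bytes in the first column and pages of 300 and 350 bytes in the second (made-up sizes), the total is 1050.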