wgtmac commented on PR #33897:
URL: https://github.com/apache/arrow/pull/33897#issuecomment-1445757905

   > @westonpace Let me explain it in detail here:
   > 
   > A Parquet file may contain multiple row groups, each row group may contain multiple columns, and each column may contain multiple pages.
   > 
   > With a `ParquetFileWriter`, users can call `AppendBufferedRowGroup` or `AppendRowGroup` to acquire a buffered or an unbuffered row group writer, respectively.
   > 
   > A buffered `RowGroupWriter` can be used like this:
   > 
   > ```
   > row_group_writer = file_writer->AppendBufferedRowGroup()
   > 
   > for batch in value_batches:
   >   for i in 0..num_columns:
   >     column_writer = row_group_writer->column(i)
   >     typed_column_writer(column_writer)->WriteBatch(batch[i])
   > ```
   > 
   > An unbuffered `RowGroupWriter` can be used like this:
   > 
   > ```
   > row_group_writer = file_writer->AppendRowGroup()
   > for i in 0..num_columns:
   >   column_writer = row_group_writer->NextColumn()
   >   writeAllValues(column_writer)
   > ```
   > 
   > The above are the user-facing APIs, which you may already be familiar with. Now let me explain the interface of this patch, and why the previous patch was tricky.
   > 
   > An active column uses a `(Typed)ColumnWriter` to write values, and the `ColumnWriter` holds a `PageWriter` that writes pages to the sink. If dictionary encoding is not enabled, it looks like this:
   > 
   > ## Type1: non-buffered-non-dict:
   > ```
   > Only one active column writer
   > Column2: |  DataPage1 |   DataPage2 | Buffered values that have not yet become a page |
   > ```
   > 
   > In this case, the byte counters are:
   > 
   > 1. `RowGroupWriter.total_compressed_bytes`: 0
   > 2. `RowGroupWriter.total_bytes_written`: It does **not** mean `Column1-Data-Bytes + Column2-DataPage1 + Column2-DataPage2`. It is the **uncompressed** size of `Column1-IO-Bytes + Column2-DataPage1 + Column2-DataPage2`, which differs when compression such as zstd is enabled.
   > 3. (New in this patch) `RowGroupWriter.total_compressed_bytes_written`: `Column1-Data-Bytes + Column2-DataPage1 + Column2-DataPage2`. It is slightly less than the real I/O bytes, because pages and columns may also write some metadata, but the two values are close.
   > 4. `ColumnWriter.total_compressed_bytes`: 0
   > 5. `ColumnWriter.total_bytes_written`: **uncompressed** `Column2-DataPage1 + Column2-DataPage2`
   > 6. (New in this patch) `ColumnWriter.total_compressed_bytes_written`: **compressed** `Column2-DataPage1 + Column2-DataPage2`
   > 
   > ## Type2: non-buffered-dict:
   > Now things are a little different: `ColumnWriter` has a `data_pages_` vector that holds the pages.
   > 
   > ```
   > Only one active column writer
   > Column2:  | Buffered values that have not yet become a page |
   > * Column2: buffers DataPage1 and DataPage2
   > * Column2: building the dictionary page
   > ```
   > 
   > Now, let's revisit the ColumnWriter's size:
   > 
   > 1. `ColumnWriter.total_compressed_bytes`: **compressed** 
`Column2-DataPage1 + Column2-DataPage2`
   > 2. `ColumnWriter.total_bytes_written`: 0
   > 3. (New in this patch) `ColumnWriter.total_compressed_bytes_written`: 0
   > 
   > ## Type3: buffered-non-dict:
   > ```
   > Two active column writers
   > 
   > (All values are buffered)
   > 
   > Column1: |  DataPage1 |  Buffered values that have not yet become a page |
   > Column2: |  DataPage1 |   DataPage2 | Buffered values that have not yet become a page |
   > ```
   > 
   > 1. `ColumnWriter.total_compressed_bytes`: **compressed** `c1.DataPage1 + c2.DataPage1 + c2.DataPage2`
   > 2. `ColumnWriter.total_bytes_written`: 0
   > 3. (New in this patch) `ColumnWriter.total_compressed_bytes_written`: 0
   
   It seems that every reviewer or reader may ask similar questions. It would be worthwhile to add this explanation somewhere in the code.
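
The bookkeeping described in the quoted explanation can be sketched as a small toy model. This is only an illustration under stated assumptions, not the actual parquet-cpp implementation: `Page`, `ToyColumnWriter`, and the `buffers_pages` flag are hypothetical names, and the model ignores page and column metadata. It only shows which counter a finished data page lands in, depending on whether pages go straight to the sink (Type1) or are held in memory first (Type2 and Type3).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical toy model of the counters discussed above.
// It is NOT the real parquet-cpp code, only the accounting idea.
struct Page {
  int64_t uncompressed_size;
  int64_t compressed_size;
};

class ToyColumnWriter {
 public:
  // `buffers_pages` models the cases where finished pages are kept in memory
  // (dictionary encoding, or a buffered row group) instead of being written
  // to the sink right away.
  explicit ToyColumnWriter(bool buffers_pages) : buffers_pages_(buffers_pages) {}

  // Called when buffered values have become a complete data page.
  void AddDataPage(const Page& page) {
    assert(page.uncompressed_size >= 0 && page.compressed_size >= 0);
    if (buffers_pages_) {
      // Compressed, but still held in memory and not yet written.
      total_compressed_bytes_ += page.compressed_size;
    } else {
      // Written to the sink: track both the uncompressed and compressed size.
      total_bytes_written_ += page.uncompressed_size;
      total_compressed_bytes_written_ += page.compressed_size;
    }
  }

  int64_t total_bytes_written() const { return total_bytes_written_; }
  int64_t total_compressed_bytes() const { return total_compressed_bytes_; }
  int64_t total_compressed_bytes_written() const {
    return total_compressed_bytes_written_;
  }

 private:
  bool buffers_pages_;
  int64_t total_bytes_written_ = 0;
  int64_t total_compressed_bytes_ = 0;
  int64_t total_compressed_bytes_written_ = 0;
};
```

In this model, a Type1 writer (`ToyColumnWriter(false)`) only ever grows `total_bytes_written` and `total_compressed_bytes_written`, while a Type2 or Type3 writer (`ToyColumnWriter(true)`) only grows `total_compressed_bytes`, which matches the three lists in the quoted comment.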

