[GitHub] [arrow] mapleFU commented on pull request #33897: GH-33652: [C++][Parquet] Add interface total_compressed_bytes_written

via GitHub Wed, 22 Feb 2023 06:32:07 -0800


mapleFU commented on PR #33897:
URL: https://github.com/apache/arrow/pull/33897#issuecomment-1440131022


   @westonpace Let me explain it in detail here:
   
   A Parquet file may contain multiple row groups, each row group may contain multiple columns, and each column may contain multiple pages.
   
   For a `ParquetFileWriter`, the user can call `AppendBufferedRowGroup` or `AppendRowGroup` to acquire a buffered or an unbuffered row group writer, respectively.
   
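   With a buffered `RowGroupWriter`, usage looks like:
   
   ```
   row_group_writer = file_writer->AppendBufferedRowGroup()
   
   for batch in value_batches:
     for i in 0..num_columns:
       # every column writer stays open, so columns can be written in any order
       typed_column_writer = row_group_writer->column(i)
       typed_column_writer->WriteBatch(batch[i])
   ```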
   
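   With an unbuffered `RowGroupWriter`, usage looks like:
   
   ```
   row_group_writer = file_writer->AppendRowGroup()
   
   for i in 0..num_columns:
     # NextColumn() closes the previous column, so only one writer is active
     column_writer = row_group_writer->NextColumn()
     writeAllValues(column_writer)
   ```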
   
   The above are the user-facing APIs that you may be familiar with. Now let me explain the interface in this patch, and why the previous patch was tricky.
   
   An active column uses a `(Typed)ColumnWriter` to write values, and the `ColumnWriter` holds a `PageWriter` that writes pages to the sink. If dictionary encoding is not enabled, it looks like this:
   
   ## Type1: non-buffered-non-dict:
   
   ```
   Only one active column writer
   Column2: |  DataPage1 |   DataPage2 | Buffered values that have not become a page yet |
   ```
   
   In this case, the byte counters are:
   1. `RowGroupWriter.total_compressed_bytes`: 0
   2. `RowGroupWriter.total_bytes_written`: It **doesn't** mean `Column1-Data-Bytes + Column2-DataPage1 + Column2-DataPage2`. In fact, if compression such as zstd is enabled, it means the **uncompressed** `Column1-IO-Bytes + Column2-DataPage1 + Column2-DataPage2`.
   3. (New in this patch) `RowGroupWriter.total_compressed_bytes_written`: It means `Column1-Data-Bytes + Column2-DataPage1 + Column2-DataPage2`. It will be a little less than the real IO bytes, because pages and columns may also write some metadata, but the two will be close.
   4. `ColumnWriter.total_compressed_bytes`: 0
   5. `ColumnWriter.total_bytes_written`: **uncompressed** `Column2-DataPage1 + Column2-DataPage2`
   6. (New in this patch) `ColumnWriter.total_compressed_bytes_written`: It means `Column2-DataPage1 + Column2-DataPage2`.
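
To make the Type1 counters concrete, here is a minimal toy model. It is only a sketch: `Page` and `DirectColumnWriterModel` are hypothetical names, not Arrow classes, and the page sizes are made up. It assumes every finished page goes straight to the sink, so nothing is held back as compressed but unwritten.

```cpp
#include <cassert>
#include <cstdint>

// Toy model only: hypothetical types, not Arrow's classes.
struct Page {
  int64_t uncompressed_size;
  int64_t compressed_size;
};

// Non-dictionary column in an unbuffered row group: each finished page is
// written to the sink immediately.
class DirectColumnWriterModel {
 public:
  void WritePage(const Page& page) {
    assert(page.uncompressed_size >= 0 && page.compressed_size >= 0);
    // The existing counter tracks uncompressed page bytes.
    total_bytes_written_ += page.uncompressed_size;
    // The counter added by the patch tracks compressed page bytes.
    total_compressed_bytes_written_ += page.compressed_size;
  }

  // No page is buffered in memory, so this stays 0.
  int64_t total_compressed_bytes() const { return 0; }
  int64_t total_bytes_written() const { return total_bytes_written_; }
  int64_t total_compressed_bytes_written() const {
    return total_compressed_bytes_written_;
  }

 private:
  int64_t total_bytes_written_ = 0;
  int64_t total_compressed_bytes_written_ = 0;
};
```

Writing two pages of (1000, 400) and (800, 300) uncompressed/compressed bytes into this model gives `total_bytes_written() == 1800` and `total_compressed_bytes_written() == 700`, while `total_compressed_bytes()` stays 0.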
   
   ## Type2: non-buffered-dict:
   
   Now things are a little different. With dictionary encoding enabled, `ColumnWriter` keeps its finished data pages in a `data_pages_` vector while the dictionary page is still being built.
   
   ```
   Only one active column writer
   Column2:  | Buffered values that have not become a page yet |
   * Column2: buffers DataPage1 and DataPage2
   * Column2: is still building the dictionary page
   ```
   
   Now, let's revisit the ColumnWriter's size:
   1. `ColumnWriter.total_compressed_bytes`: **compressed** `Column2-DataPage1 
+ Column2-DataPage2`
   2. `ColumnWriter.total_bytes_written`: 0
   3. (New in this patch) `ColumnWriter.total_compressed_bytes_written`: 0
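
The Type2 behaviour can be sketched with a similar toy model. Again, `Page` and `DictColumnWriterModel` are hypothetical names rather than Arrow classes. The `Flush` step is my assumption about what happens once the dictionary page has been written and the buffered data pages finally reach the sink; the comment above only describes the state before that point.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy model only: hypothetical types, not Arrow's classes.
struct Page {
  int64_t uncompressed_size;
  int64_t compressed_size;
};

class DictColumnWriterModel {
 public:
  // A finished (already compressed) data page is kept in memory while the
  // dictionary is still being built.
  void BufferPage(const Page& page) {
    assert(page.uncompressed_size >= 0 && page.compressed_size >= 0);
    data_pages_.push_back(page);
    total_compressed_bytes_ += page.compressed_size;
  }

  // Assumed behaviour: flushing moves every buffered page to the sink and
  // transfers its size from the "buffered" counter to the "written" ones.
  void Flush() {
    for (const Page& page : data_pages_) {
      total_bytes_written_ += page.uncompressed_size;
      total_compressed_bytes_written_ += page.compressed_size;
    }
    data_pages_.clear();
    total_compressed_bytes_ = 0;
  }

  int64_t total_compressed_bytes() const { return total_compressed_bytes_; }
  int64_t total_bytes_written() const { return total_bytes_written_; }
  int64_t total_compressed_bytes_written() const {
    return total_compressed_bytes_written_;
  }

 private:
  std::vector<Page> data_pages_;
  int64_t total_compressed_bytes_ = 0;
  int64_t total_bytes_written_ = 0;
  int64_t total_compressed_bytes_written_ = 0;
};
```

After buffering two pages of (1000, 400) and (800, 300) bytes, the model reports `total_compressed_bytes() == 700` and both written counters at 0, which matches the list above. After `Flush()`, the 700 compressed bytes show up in `total_compressed_bytes_written()` instead.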
   
   ## Type3: buffered-non-dict:
   
   ```
   Two active column writers
   Column1: |  DataPage1 |  Buffered values that have not become a page yet |
   Column2: |  DataPage1 |   DataPage2 | Buffered values that have not become a page yet |
   ```
   
   1. `ColumnWriter.total_compressed_bytes`: **compressed** `c1.DataPage1 + c2.DataPage1 + c2.DataPage2`
   2. `ColumnWriter.total_bytes_written`: 0
   3. (New in this patch) `ColumnWriter.total_compressed_bytes_written`: 0
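
Across the three cases, compressed page bytes show up either in `total_compressed_bytes` (still buffered) or in `total_compressed_bytes_written` (already in the sink). One possible way to use the two counters together, sketched here purely as an assumption, is to add them up per column to get a rough compressed size for the row group so far. `ColumnCounters` and `EstimateCompressedSize` are hypothetical names, and the estimate ignores values that have not become a page yet as well as page and column metadata.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical snapshot of the two counters for one column writer.
struct ColumnCounters {
  int64_t total_compressed_bytes;          // compressed pages still buffered
  int64_t total_compressed_bytes_written;  // compressed pages already written
};

// Rough compressed size of the row group so far: buffered plus written
// compressed page bytes, summed over all columns.
int64_t EstimateCompressedSize(const std::vector<ColumnCounters>& columns) {
  int64_t total = 0;
  for (const ColumnCounters& c : columns) {
    assert(c.total_compressed_bytes >= 0);
    assert(c.total_compressed_bytes_written >= 0);
    total += c.total_compressed_bytes + c.total_compressed_bytes_written;
  }
  return total;
}
```

For the Type3 picture above, with made-up compressed sizes of 400 bytes buffered in the first column and 700 bytes in the second, `EstimateCompressedSize({{400, 0}, {700, 0}})` returns 1100.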
   


