Re: [PR] GH-39676: [C++][Parquet] Fast Random Rowgroup Reads [arrow]

via GitHub Fri, 19 Jan 2024 17:06:00 -0800


corwinjoy commented on code in PR #39677:
URL: https://github.com/apache/arrow/pull/39677#discussion_r1460107417



##########
cpp/src/parquet/column_writer.cc:
##########
@@ -323,6 +323,19 @@ class SerializedPageWriter : public PageWriter {
 
     PARQUET_THROW_NOT_OK(sink_->Write(output_data_buffer, output_data_len));
 
+    // Add the dictionary page to the offsets
+    if (offset_index_builder_ != nullptr) {
+      const int64_t compressed_size = output_data_len + header_size;
+      if (compressed_size > std::numeric_limits<int32_t>::max()) {
+        throw ParquetException("Compressed dictionary page size overflows to 
INT32_MAX.");
+      }
+
+      /// start_pos is a relative offset in the buffered mode. It should be
+      /// adjusted via OffsetIndexBuilder::Finish() after BufferedPageWriter
+      /// has flushed all data pages.
+      offset_index_builder_->AddPage(start_pos, 
static_cast<int32_t>(compressed_size), -1);
+    }

Review Comment:
   @emkornfield To be specific I do actually mean dictionary page in a much 
more specific way. As seen in the
   
[spec:](https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L559)
   ```
   enum PageType {
     DATA_PAGE = 0;
     INDEX_PAGE = 1;
     DICTIONARY_PAGE = 2;
     DATA_PAGE_V2 = 3;
   }
   ```
   
   And this is reflected in the 
[PageHeader](https://github.com/apache/parquet-format/blob/eb4b31c1d64a01088d02a2f9aefc6c17c54cc6fc/src/main/thrift/parquet.thrift#L693)
   ```
   struct PageHeader {
     /** the type of the page: indicates which of the *_header fields is set **/
     1: required PageType type
   
     /** Uncompressed page size in bytes (not including this header) **/
     2: required i32 uncompressed_page_size
   
     /** Compressed (and potentially encrypted) page size in bytes, not 
including this header **/
     3: required i32 compressed_page_size
   
     /** The 32-bit CRC checksum for the page, to be be calculated as follows:
      ...
      */
     4: optional i32 crc
   
     // Headers for page specific data.  One only will be set.
     5: optional DataPageHeader data_page_header;
     6: optional IndexPageHeader index_page_header;
     7: optional DictionaryPageHeader dictionary_page_header;
     8: optional DataPageHeaderV2 data_page_header_v2;
   }
   ```
   So that the dictionary page is just one kind of data page. This is reflected 
in the reader class as in column_page.h:150
   ```
   class DictionaryPage : public Page {
   ...
   }
   ```
   
   
   So, when dictionary encoded data is written to the file, first a dictionary 
page is written, then data pages are written. The parquet reader enforces and 
leverages this fact. It actually reads from a single collection of pages where 
it requires that the first page be a dictionary page for dictionary encoded 
pages. So, including the encoding page in the OffsetIndex seems pretty natural 
to me.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] GH-39676: [C++][Parquet] Fast Random Rowgroup Reads [arrow]

Reply via email to