corwinjoy commented on code in PR #39677:
URL: https://github.com/apache/arrow/pull/39677#discussion_r1458153531


##########
cpp/src/parquet/column_writer.cc:
##########
@@ -323,6 +323,19 @@ class SerializedPageWriter : public PageWriter {
 
     PARQUET_THROW_NOT_OK(sink_->Write(output_data_buffer, output_data_len));
 
+    // Add the dictionary page to the offsets
+    if (offset_index_builder_ != nullptr) {
+      const int64_t compressed_size = output_data_len + header_size;
+      if (compressed_size > std::numeric_limits<int32_t>::max()) {
+        throw ParquetException("Compressed dictionary page size overflows to 
INT32_MAX.");
+      }
+
+      /// start_pos is a relative offset in the buffered mode. It should be
+      /// adjusted via OffsetIndexBuilder::Finish() after BufferedPageWriter
+      /// has flushed all data pages.
+      offset_index_builder_->AddPage(start_pos, 
static_cast<int32_t>(compressed_size), -1);
+    }

Review Comment:
   @emkornfield @wgtmac 
   So, what you are saying is correct, but it seems I have not clearly 
explained my reasoning here.
   The idea behind 
[PageIndex](https://github.com/apache/parquet-format/blob/master/PageIndex.md) 
is to make "point lookups I/O efficient". So, you can look at a ColumnIndex to 
find a range of values for a particular column, then use the OffsetIndex to 
efficiently load that data. For directly encoded data, the OffsetIndex is 
great, you can directly load and examine column data. For dictionary encoded 
data there is a problem. You can't read the data without having the dictionary. 
So, I believe that the OffsetIndex should contain the dictionary page for two 
reasons:
   1. Without the dictionary page you cannot decode the data in subsequent data 
pages. Instead, you now have to go retrieve the information from the metadata. 
This violates the principle that the PageIndex should be a way to directly 
access data.
   2. The dictionary page, is in fact, a type of data page. So including it 
should be backwards compatible since readers should recognize its type in the 
list of pages provided by an OffsetIndex.
   
   Sadly, the spec seems to be silent on what should happen in the case of 
dictionary encodings. It seems that most writers do not provide this 
information. I wonder if @mkornacker or @lekv considered dictionary encodings 
or would be willing to comment?
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to