Re: [PR] PARQUET-2411: [C++] Allow reading dictionary without reading data via ByteArrayDictionaryRecordReader [arrow]

via GitHub Fri, 08 Dec 2023 19:37:23 -0800


wgtmac commented on code in PR #39153:
URL: https://github.com/apache/arrow/pull/39153#discussion_r1421241611



##########
cpp/src/parquet/column_reader.cc:
##########
@@ -1369,6 +1369,24 @@ class TypedRecordReader : public 
TypedColumnReaderImpl<DType>,
     return bytes_for_values;
   }
 
+  const uint8_t* ReadDictionary(int32_t* dictionary_length) override {
+    if (this->current_decoder_ == nullptr && !this->HasNextInternal()) {
+      dictionary_length = 0;
+      return nullptr;
+    }
+    // Verify the current data page is dictionary encoded.
+    if (this->current_encoding_ != Encoding::RLE_DICTIONARY) {
+      std::stringstream ss;
+      ss << "Data page is not dictionary encoded. Encoding: "
+         << EncodingToString(this->current_encoding_);
+      throw ParquetException(ss.str());
+    }
+    auto decoder = dynamic_cast<DictDecoder<DType>*>(this->current_decoder_);

Review Comment:
   nit: use `arrow::internal::checked_cast`



##########
cpp/src/parquet/file_reader.h:
##########
@@ -80,6 +81,18 @@ class PARQUET_EXPORT RowGroupReader {
   std::shared_ptr<ColumnReader> ColumnWithExposeEncoding(
       int i, ExposedEncoding encoding_to_expose);
 
+  // Construct a RecordReader, trying to enable exposed encoding.
+  //
+  // For dictionary encoding, currently we only support column chunks that are
+  // fully dictionary encoded byte arrays. The caller can verify if the reader 
can read
+  // and expose the dictionary by checking the reader's read_dictionary(). If 
a column
+  // chunk uses dictionary encoding but then falls back to plain encoding, the 
returned
+  // reader will read decoded data without exposing the dictionary.

Review Comment:
   Should we throw if the column chunk fallbacks half way to avoid misuse?



##########
cpp/src/parquet/column_reader.cc:
##########
@@ -1369,6 +1369,24 @@ class TypedRecordReader : public 
TypedColumnReaderImpl<DType>,
     return bytes_for_values;
   }
 
+  const uint8_t* ReadDictionary(int32_t* dictionary_length) override {
+    if (this->current_decoder_ == nullptr && !this->HasNextInternal()) {
+      dictionary_length = 0;
+      return nullptr;
+    }
+    // Verify the current data page is dictionary encoded.
+    if (this->current_encoding_ != Encoding::RLE_DICTIONARY) {

Review Comment:
   Perhaps adding a comment to clear the confusion?



##########
cpp/src/parquet/file_reader.cc:
##########
@@ -61,6 +61,34 @@ static constexpr uint32_t kFooterSize = 8;
 // For PARQUET-816
 static constexpr int64_t kMaxDictHeaderSize = 100;
 
+bool IsColumnChunkFullyDictionaryEncoded(const ColumnChunkMetaData& col) {

Review Comment:
   Move it into anonymous namespace?



##########
cpp/src/parquet/column_reader.h:
##########
@@ -368,6 +368,11 @@ class PARQUET_EXPORT RecordReader {
 
   virtual void DebugPrintState() = 0;
 
+  /// \brief Returns the dictionary owned by the current decoder. Throws an
+  /// exception if the current decoder is not for dictionary encoding.
+  /// \param[out] dictionary_length The number of dictionary entries.
+  virtual const uint8_t* ReadDictionary(int32_t* dictionary_length) = 0;

Review Comment:
   It seems that comment of `values()` says `FLBA and ByteArray types do not 
use this array`. Should we change this into a template function or simply 
return `const void*` so downstream is required to be aware of the type to avoid 
misuse the `uint8_t*`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] PARQUET-2411: [C++] Allow reading dictionary without reading data via ByteArrayDictionaryRecordReader [arrow]

Reply via email to