Re: [PR] PARQUET-2411: [C++][Parquet] Allow reading dictionary without reading data via ByteArrayDictionaryRecordReader [arrow]

via GitHub Sun, 10 Dec 2023 13:31:07 -0800


jp0317 commented on code in PR #39153:
URL: https://github.com/apache/arrow/pull/39153#discussion_r1421819720



##########
cpp/src/parquet/file_reader.cc:
##########
@@ -61,6 +61,34 @@ static constexpr uint32_t kFooterSize = 8;
 // For PARQUET-816
 static constexpr int64_t kMaxDictHeaderSize = 100;
 
+bool IsColumnChunkFullyDictionaryEncoded(const ColumnChunkMetaData& col) {

Review Comment:
   done, thanks!



##########
cpp/src/parquet/file_reader.h:
##########
@@ -80,6 +81,18 @@ class PARQUET_EXPORT RowGroupReader {
   std::shared_ptr<ColumnReader> ColumnWithExposeEncoding(
       int i, ExposedEncoding encoding_to_expose);
 
+  // Construct a RecordReader, trying to enable exposed encoding.
+  //
+  // For dictionary encoding, currently we only support column chunks that are
+  // fully dictionary encoded byte arrays. The caller can verify if the reader can read
+  // and expose the dictionary by checking the reader's read_dictionary(). If a column
+  // chunk uses dictionary encoding but then falls back to plain encoding, the returned
+  // reader will read decoded data without exposing the dictionary.

Review Comment:
   If the column chunk falls back to plain encoding, a normal reader is returned
   that does not read the dictionary. I reworded the comment to state that the
   caller should verify the reader using read_dictionary().
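
   A minimal sketch of the caller-side check, assuming stand-in types:
   StubRecordReader, MakeStubReader, DescribeReader and Example are
   hypothetical names for illustration and are not part of Parquet C++. Only
   read_dictionary() is taken from the doc comment under review.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical stand-in for the reader returned by the new method.
// It is NOT the real Parquet C++ RecordReader.
struct StubRecordReader {
  bool exposes_dictionary;
  // Mirrors the read_dictionary() check named in the doc comment.
  bool read_dictionary() const { return exposes_dictionary; }
};

// Hypothetical factory: the reader exposes the dictionary only when the
// column chunk is fully dictionary encoded; after a fallback to plain
// encoding it reads decoded values instead.
std::shared_ptr<StubRecordReader> MakeStubReader(bool fully_dictionary_encoded) {
  return std::make_shared<StubRecordReader>(
      StubRecordReader{fully_dictionary_encoded});
}

// Caller-side pattern: verify the reader before relying on the dictionary.
std::string DescribeReader(const StubRecordReader& reader) {
  return reader.read_dictionary() ? "dictionary exposed" : "decoded values only";
}

// Example use: branch on read_dictionary() instead of assuming the encoding.
inline void Example() {
  auto dict_reader = MakeStubReader(true);
  auto plain_reader = MakeStubReader(false);
  assert(dict_reader->read_dictionary());
  assert(!plain_reader->read_dictionary());
}
```

   The point of the pattern is that the caller never infers dictionary
   support from the requested encoding; it asks the returned reader.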



