emkornfield commented on a change in pull request #10537:
URL: https://github.com/apache/arrow/pull/10537#discussion_r652313111
##########
File path: cpp/src/parquet/column_reader.h
##########
@@ -201,6 +214,36 @@ class TypedColumnReader : public ColumnReader {
// Skip reading levels
// Returns the number of levels skipped
virtual int64_t Skip(int64_t num_rows_to_skip) = 0;
+
+ // Read a batch of repetition levels, definition levels, and indices from the
+ // column. And read the dictionary if a dictionary page is encountered during
+ // reading pages. This API is similar to ReadBatch(), with ability to read
+ // dictionary and indices. It's only valid when the reader can expose
+ // dictionary encoding. (i.e., the reader's GetExposedEncoding() returns
+ // DICTIONARY).
+ //
+ // Since a column chunk can only have one dictionary page followed by all data
+ // pages, the dictionary page is read only once for each column chunk (upon
+ // reading the 1st batch).
+ //
+ // @param batch_size The batch size to read
+ // @param[out] def_levels The Parquet definition levels.
+ // @param[out] rep_levels The Parquet repetition levels.
+ // @param[out] indices The dictionary indices.
+ // @param[out] indices_read The number of indices read.
+ // @param[out] dict The pointer to dictionary values. It's set only if the
Review comment:
You might want to note in the comment that this API will not expose the
dictionary if the column chunk has no data pages (I'm not sure whether this
can happen in theory or in practice).
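To illustrate the edge case, here is a minimal sketch using a mock reader rather than the real `TypedColumnReader`. `MockChunk` and `MockDictReader` are hypothetical names, and the simplified `ReadBatchWithDictionary` signature (no levels, `int32_t` values) is an assumption made for illustration only. It assumes the dictionary is handed out while serving the first data page, so a chunk with no data pages never sets `dict`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a column chunk: one dictionary page followed by
// zero or more data pages that hold dictionary indices.
struct MockChunk {
  std::vector<int32_t> dictionary;
  std::vector<std::vector<int32_t>> data_pages;
};

// Hypothetical reader (not the Arrow class). It mimics the documented
// contract: the dictionary is exposed only once, while the first data page
// is being read.
class MockDictReader {
 public:
  explicit MockDictReader(const MockChunk& chunk) : chunk_(chunk) {}

  // Returns the number of indices read. *dict and *dict_len are written only
  // by the call that exposes the dictionary; otherwise they are untouched.
  int64_t ReadBatchWithDictionary(int64_t batch_size, int32_t* indices,
                                  const int32_t** dict, int32_t* dict_len) {
    int64_t read = 0;
    while (read < batch_size && page_ < chunk_.data_pages.size()) {
      const std::vector<int32_t>& page = chunk_.data_pages[page_];
      if (!dict_exposed_) {
        // First data page reached: hand out the dictionary exactly once.
        *dict = chunk_.dictionary.data();
        *dict_len = static_cast<int32_t>(chunk_.dictionary.size());
        dict_exposed_ = true;
      }
      while (read < batch_size && offset_ < page.size()) {
        indices[read++] = page[offset_++];
      }
      if (offset_ == page.size()) {
        // Current page exhausted: move on to the next one.
        ++page_;
        offset_ = 0;
      }
    }
    return read;
  }

 private:
  const MockChunk& chunk_;
  std::size_t page_ = 0;
  std::size_t offset_ = 0;
  bool dict_exposed_ = false;
};
```

With a chunk that has a dictionary but no data pages, the call returns 0 and the caller's `dict` pointer keeps whatever value it had before, which is the behavior worth documenting.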