[
https://issues.apache.org/jira/browse/PARQUET-555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15178385#comment-15178385
]
Wes McKinney commented on PARQUET-555:
--------------------------------------
Sorry, I misread the code, but we still have some inconsistencies.
Currently, when any dictionary page is encountered, the encoding is forced to
RLE_DICTIONARY and the dictionary page is assumed to be in plain encoding:
{code}
void TypedColumnReader<TYPE>::ConfigureDictionary(const DictionaryPage* page) {
  // Dictionary decoders are always registered under RLE_DICTIONARY.
  int encoding = static_cast<int>(Encoding::RLE_DICTIONARY);
  auto it = decoders_.find(encoding);
  if (it != decoders_.end()) {
    throw ParquetException("Column cannot have more than one dictionary.");
  }
  // The dictionary page contents are assumed to be PLAIN-encoded.
  PlainDecoder<TYPE> dictionary(descr_);
  dictionary.SetData(page->num_values(), page->data(), page->size());
{code}
In fact, the dictionary page need not be PLAIN-encoded: we only support PLAIN
at the moment, but the Parquet 2.0 byte array encodings could be used. Later
on, when a data page encoded as either PLAIN_DICTIONARY or RLE_DICTIONARY is
encountered, the encoding is forced to RLE_DICTIONARY:
{code}
if (IsDictionaryIndexEncoding(encoding)) {
  encoding = Encoding::RLE_DICTIONARY;
}
auto it = decoders_.find(static_cast<int>(encoding));
if (it != decoders_.end()) {
  current_decoder_ = it->second.get();
} else {
  switch (encoding) {
    case Encoding::PLAIN: {
      std::shared_ptr<DecoderType> decoder(new PlainDecoder<TYPE>(descr_));
      decoders_[static_cast<int>(encoding)] = decoder;
      current_decoder_ = decoder.get();
      break;
    }
    case Encoding::RLE_DICTIONARY:
      throw ParquetException("Dictionary page must be before data page.");
{code}
This logic is pretty loose. If the dictionary page's encoding is not PLAIN or
PLAIN_DICTIONARY, we should throw an exception.
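A minimal sketch of such a check, assuming simplified stand-ins for the
parquet-cpp types. The Encoding enum below is abbreviated, and
IsSupportedDictionaryPageEncoding and CheckDictionaryPageEncoding are
hypothetical helpers for illustration, not existing functions:

```cpp
#include <stdexcept>

// Abbreviated stand-in for the parquet-cpp Encoding enum (assumption:
// only the members needed for this illustration are listed).
enum class Encoding { PLAIN, PLAIN_DICTIONARY, RLE_DICTIONARY, DELTA_BYTE_ARRAY };

// Stand-in for parquet-cpp's exception type.
struct ParquetException : std::runtime_error {
  using std::runtime_error::runtime_error;
};

// Hypothetical helper: true for the encodings a dictionary page may declare
// while only the plain decoder is supported for dictionary values.
inline bool IsSupportedDictionaryPageEncoding(Encoding encoding) {
  return encoding == Encoding::PLAIN || encoding == Encoding::PLAIN_DICTIONARY;
}

// Hypothetical helper: reject a dictionary page up front instead of
// silently decoding it as PLAIN.
inline void CheckDictionaryPageEncoding(Encoding encoding) {
  if (!IsSupportedDictionaryPageEncoding(encoding)) {
    throw ParquetException(
        "Dictionary page must use PLAIN or PLAIN_DICTIONARY encoding.");
  }
}
```

Something like this could run at the top of ConfigureDictionary, before the
PlainDecoder is constructed, provided the page exposes its declared encoding.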
> Encoding::PLAIN_DICTIONARY is unhandled in ColumnReader
> -------------------------------------------------------
>
> Key: PARQUET-555
> URL: https://issues.apache.org/jira/browse/PARQUET-555
> Project: Parquet
> Issue Type: Bug
> Components: parquet-cpp
> Reporter: Wes McKinney
> Assignee: Deepak Majeti
>
> See parquet-format re: PLAIN_DICTIONARY / RLE_DICTIONARY distinction — our
> handling of the page metadata is not consistent with the format.