[ 
https://issues.apache.org/jira/browse/PARQUET-555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15178385#comment-15178385
 ] 

Wes McKinney commented on PARQUET-555:
--------------------------------------

Sorry, I misread the code, but we still have some inconsistencies. 

Currently, when any dictionary page is encountered, the encoding is forced to 
RLE_DICTIONARY and the dictionary page is assumed to be in plain encoding:

{code}
void TypedColumnReader<TYPE>::ConfigureDictionary(const DictionaryPage* page) {
  int encoding = static_cast<int>(Encoding::RLE_DICTIONARY);

  auto it = decoders_.find(encoding);
  if (it != decoders_.end()) {
    throw ParquetException("Column cannot have more than one dictionary.");
  }

  PlainDecoder<TYPE> dictionary(descr_);
  dictionary.SetData(page->num_values(), page->data(), page->size());
{code}

In fact, the dictionary page need not be PLAIN-encoded: PLAIN is all we 
support at the moment, but the Parquet 2.0 byte array encodings could be used 
as well. Later on, if a data page encoded as either PLAIN_DICTIONARY or 
RLE_DICTIONARY is encountered, the encoding is forced to RLE_DICTIONARY. 

{code}
      if (IsDictionaryIndexEncoding(encoding)) {
        encoding = Encoding::RLE_DICTIONARY;
      }

      auto it = decoders_.find(static_cast<int>(encoding));
      if (it != decoders_.end()) {
        current_decoder_ = it->second.get();
      } else {
        switch (encoding) {
          case Encoding::PLAIN: {
            std::shared_ptr<DecoderType> decoder(new PlainDecoder<TYPE>(descr_));
            decoders_[static_cast<int>(encoding)] = decoder;
            current_decoder_ = decoder.get();
            break;
          }
          case Encoding::RLE_DICTIONARY:
            throw ParquetException("Dictionary page must be before data page.");
{code}

This logic is pretty loose. If the dictionary page's encoding is neither PLAIN 
nor PLAIN_DICTIONARY, we should throw an exception. 
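
A rough sketch of the check I have in mind, outside the reader so it stands on 
its own. The Encoding enum and ParquetException below are stand-ins for the 
real parquet-cpp types (the enum values follow the parquet-format Thrift 
definition). CheckDictionaryPageEncoding and NormalizeDataPageEncoding are 
hypothetical helper names, and the body of IsDictionaryIndexEncoding is my 
assumption about what the existing helper does:

```cpp
#include <cassert>
#include <stdexcept>

// Stand-in for the parquet-format Encoding enum (values as in parquet.thrift).
enum class Encoding : int {
  PLAIN = 0,
  PLAIN_DICTIONARY = 2,
  RLE = 3,
  BIT_PACKED = 4,
  DELTA_BINARY_PACKED = 5,
  DELTA_LENGTH_BYTE_ARRAY = 6,
  DELTA_BYTE_ARRAY = 7,
  RLE_DICTIONARY = 8
};

// Stand-in for parquet-cpp's exception type.
struct ParquetException : std::runtime_error {
  using std::runtime_error::runtime_error;
};

// Assumed behavior: true for the two encodings that mean "data page holds
// dictionary indices".
inline bool IsDictionaryIndexEncoding(Encoding e) {
  return e == Encoding::PLAIN_DICTIONARY || e == Encoding::RLE_DICTIONARY;
}

// Hypothetical helper: reject dictionary pages whose values are not
// plain-encoded, instead of silently decoding them with PlainDecoder.
inline void CheckDictionaryPageEncoding(Encoding e) {
  if (e != Encoding::PLAIN && e != Encoding::PLAIN_DICTIONARY) {
    throw ParquetException(
        "Dictionary page must use PLAIN or PLAIN_DICTIONARY encoding.");
  }
}

// Hypothetical helper: map both dictionary index encodings onto the single
// key used to look up the decoder, leaving other encodings untouched.
inline Encoding NormalizeDataPageEncoding(Encoding e) {
  return IsDictionaryIndexEncoding(e) ? Encoding::RLE_DICTIONARY : e;
}
```

ConfigureDictionary would call the check on the dictionary page's encoding 
before constructing the PlainDecoder, so an unsupported dictionary encoding 
fails loudly rather than producing garbage values.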

> Encoding::PLAIN_DICTIONARY is unhandled in ColumnReader
> -------------------------------------------------------
>
>                 Key: PARQUET-555
>                 URL: https://issues.apache.org/jira/browse/PARQUET-555
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-cpp
>            Reporter: Wes McKinney
>            Assignee: Deepak Majeti
>
> See parquet-format re: PLAIN_DICTIONARY / RLE_DICTIONARY distinction — our 
> handling of the page metadata is not consistent with the format.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
