Henry Robinson created SPARK-22736:
--------------------------------------

             Summary: Consider caching decoded dictionaries in VectorizedColumnReader
                 Key: SPARK-22736
                 URL: https://issues.apache.org/jira/browse/SPARK-22736
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.2.1
            Reporter: Henry Robinson


{{VectorizedColumnReader.decodeDictionaryIds()}} calls {{dictionary.decodeToX}}
for every dictionary ID encountered in a dictionary-encoded Parquet page, so a
value that repeats within the page is decoded again each time it occurs.

The whole idea of dictionary encoding is that (a) values are repeated in a page
and (b) the dictionary contains only values that are in a page. So we should be
able to save some decoding cost by decoding the entire dictionary page once and
using the decoded dictionary to populate rows. The price is some extra memory,
although in theory we could discard the encoded dictionary afterwards, I think.
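
To make the idea concrete, here is a minimal, self-contained Java sketch. It is
not Spark's code: {{CostlyDictionary}}, {{decodePerId}}, {{decodeWithCache}} and
{{countingDictionary}} are hypothetical names invented for illustration, and the
fixed offset added in {{decodeToLong}} merely stands in for an expensive
per-value step. The sketch compares one decode call per dictionary ID with
decoding each dictionary entry once and then populating rows by array lookup.

```java
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicInteger;

public class DecodedDictionarySketch {

    /** Hypothetical stand-in for a dictionary whose per-ID decode is costly. */
    interface CostlyDictionary {
        int size();
        long decodeToLong(int id);
    }

    /** Behaviour described in this issue: one decode call per dictionary ID in the page. */
    static long[] decodePerId(CostlyDictionary dict, int[] ids) {
        long[] out = new long[ids.length];
        for (int i = 0; i < ids.length; i++) {
            out[i] = dict.decodeToLong(ids[i]);
        }
        return out;
    }

    /** Proposed behaviour: decode every entry once, then fill rows by array lookup. */
    static long[] decodeWithCache(CostlyDictionary dict, int[] ids) {
        long[] decoded = new long[dict.size()];
        for (int id = 0; id < decoded.length; id++) {
            decoded[id] = dict.decodeToLong(id);
        }
        long[] out = new long[ids.length];
        for (int i = 0; i < ids.length; i++) {
            out[i] = decoded[ids[i]];
        }
        return out;
    }

    /** Toy dictionary that counts decode calls; the offset is an assumed stand-in for real work. */
    static CostlyDictionary countingDictionary(long[] raw, long offset, AtomicInteger calls) {
        return new CostlyDictionary() {
            @Override public int size() { return raw.length; }
            @Override public long decodeToLong(int id) {
                calls.incrementAndGet();
                return raw[id] + offset;
            }
        };
    }

    public static void main(String[] args) {
        long[] raw = {1_000L, 2_000L, 3_000L};      // 3 dictionary entries
        int[] ids = {0, 1, 1, 2, 0, 0, 1, 2, 2, 1}; // 10 dictionary IDs in the page
        AtomicInteger perIdCalls = new AtomicInteger();
        AtomicInteger cachedCalls = new AtomicInteger();

        long[] a = decodePerId(countingDictionary(raw, 500L, perIdCalls), ids);
        long[] b = decodeWithCache(countingDictionary(raw, 500L, cachedCalls), ids);

        System.out.println(Arrays.equals(a, b));                          // true
        System.out.println(perIdCalls.get() + " vs " + cachedCalls.get()); // 10 vs 3
    }
}
```

In this toy run both paths produce the same rows, with 10 decode calls in the
per-ID path and 3 in the cached path. The trade-off is the extra {{decoded}}
array, and eager decoding only pays off when the reader consumes more IDs than
the dictionary has entries; a lazily filled cache would be one way to avoid
that worst case.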

The saving should be largest for TIMESTAMP data, which, after SPARK-12297, might
have a timezone conversion as part of its decoding step.



