[
https://issues.apache.org/jira/browse/IMPALA-12393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Michael Smith resolved IMPALA-12393.
------------------------------------
Fix Version/s: Impala 4.3.0
Resolution: Fixed
> DictEncoder uses inconsistent hash function for TimestampValue
> --------------------------------------------------------------
>
> Key: IMPALA-12393
> URL: https://issues.apache.org/jira/browse/IMPALA-12393
> Project: IMPALA
> Issue Type: Bug
> Components: Backend
> Affects Versions: Impala 4.3.0
> Reporter: Joe McDonnell
> Assignee: Joe McDonnell
> Priority: Major
> Fix For: Impala 4.3.0
>
>
> DictEncoder currently uses this hash function for TimestampValue:
> {noformat}
> template<typename T>
> inline uint32_t DictEncoder<T>::Hash(const T& value) const {
> return HashUtil::Hash(&value, sizeof(value), 0);
> }{noformat}
> TimestampValue has some padding, and nothing ensures that the padding is
> cleared. This means that identical TimestampValue objects can hash to
> different values.
> This came up when fixing a Clang-Tidy performance check. This line in
> dict-test.cc changed from iterating over values to iterating over const
> references.
> {noformat}
> DictEncoder<InternalType> encoder(&pool, fixed_buffer_byte_size,
> &track_encoder);
> encoder.UsedbyTest();
> <<<<<<
> for (InternalType i: values) encoder.Put(i);
> =====
> for (const InternalType& i: values) encoder.Put(i);
> >>>>>
> bytes_alloc = encoder.DictByteSize();
> EXPECT_EQ(track_encoder.consumption(), bytes_alloc);
> EXPECT_EQ(encoder.num_entries(), values_set.size()); <--------{noformat}
> The test became flaky, with the encoder.num_entries() being larger than the
> values_set.size() for TimestampValue. This happened because the hash values
> didn't match even for identical entries and the dictionary would have
> multiple copies of the same value. When iterating over a plain non-reference
> TimestampValue, each TimestampValue is being copied to a temporary value.
> Maybe in this circumstance the padding stays the same between iterations.
> It's possible this would come up when writing Parquet data files.
> One fix would be to use TimestampValue's Hash function, which ignores the
> padding:
> {noformat}
> template<>
> inline uint32_t DictEncoder<TimestampValue>::Hash(const TimestampValue&
> value) const {
> return value.Hash();
> }{noformat}
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)