Joe McDonnell created IMPALA-12393:
--------------------------------------
Summary: DictEncoder uses inconsistent hash function for
TimestampValue
Key: IMPALA-12393
URL: https://issues.apache.org/jira/browse/IMPALA-12393
Project: IMPALA
Issue Type: Bug
Components: Backend
Affects Versions: Impala 4.3.0
Reporter: Joe McDonnell
DictEncoder currently uses this hash function for TimestampValue:
{noformat}
template<typename T>
inline uint32_t DictEncoder<T>::Hash(const T& value) const {
return HashUtil::Hash(&value, sizeof(value), 0);
}{noformat}
TimestampValue has some padding, and nothing ensures that the padding is
cleared. This means that identical TimestampValue objects can hash to different
values.
This came up when fixing a Clang-Tidy performance check. This line in
dict-test.cc changed from iterating over values to iterating over const
references.
{noformat}
DictEncoder<InternalType> encoder(&pool, fixed_buffer_byte_size,
&track_encoder);
encoder.UsedbyTest();
<<<<<<
for (InternalType i: values) encoder.Put(i);
=====
for (const InternalType& i: values) encoder.Put(i);
>>>>>
bytes_alloc = encoder.DictByteSize();
EXPECT_EQ(track_encoder.consumption(), bytes_alloc);
EXPECT_EQ(encoder.num_entries(), values_set.size()); <--------{noformat}
The test became flaky, with the encoder.num_entries() being larger than the
values_set.size() for TimestampValue. This happened because the hash values
didn't match even for identical entries and the dictionary would have multiple
copies of the same value. When iterating over a plain non-reference
TimestampValue, each TimestampValue is being copied to a temporary value. Maybe
in this circumstance the padding stays the same between iterations.
It's possible this would come up when writing Parquet data files.
One fix would be to use TimestampValue's Hash function, which ignores the
padding:
{noformat}
template<>
inline uint32_t DictEncoder<TimestampValue>::Hash(const TimestampValue& value)
const {
return value.Hash();
}{noformat}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)