[ 
https://issues.apache.org/jira/browse/IMPALA-12393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17758656#comment-17758656
 ] 

ASF subversion and git services commented on IMPALA-12393:
----------------------------------------------------------

Commit d96341ed537a3e321d5fa6a0235ab06b5d9169a2 in impala's branch 
refs/heads/master from Joe McDonnell
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=d96341ed5 ]

IMPALA-12393: Fix inconsistent hash for TimestampValue in DictEncoder

Currently, DictEncoder uses the default hash function for
TimestampValue, which means it is hashing the entire
TimestampValue struct. This can be inconsistent, because
TimestampValue contains some padding that may not be zero
in some cases. For TimestampValues that are part of a Tuple,
the padding is zero, so this is mainly present in test cases.

This was discovered when fixing a Clang Tidy performance-for-range-copy
warning by iterating with a const reference rather than
making a copy of the value. DictTest.TestTimestamps became
flaky with that change, because the hash was no longer
consistent. The copy must have had consistent content for
the padding through the iteration, but the const reference
did not.

This adds a template specialization of the Hash function
for TimestampValue. The specialization uses TimestampValue::Hash(),
which hashes only the non-padding pieces of the struct. This
also includes the change to dict-test.cc that uncovered the
issue. This fix is mostly to unblock IMPALA-12390.

Testing:
 - Ran dict-test in a loop for a few hundred iterations
 - Hand tested inserting many timestamps into a Parquet table
   with dictionary encoding and verified that the performance didn't
   change.

Change-Id: Iad86e9b0f645311c3389cf2804dcc1a346ff10a9
Reviewed-on: http://gerrit.cloudera.org:8080/20396
Tested-by: Impala Public Jenkins <[email protected]>
Reviewed-by: Daniel Becker <[email protected]>
Reviewed-by: Michael Smith <[email protected]>


> DictEncoder uses inconsistent hash function for TimestampValue
> --------------------------------------------------------------
>
>                 Key: IMPALA-12393
>                 URL: https://issues.apache.org/jira/browse/IMPALA-12393
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>    Affects Versions: Impala 4.3.0
>            Reporter: Joe McDonnell
>            Assignee: Joe McDonnell
>            Priority: Major
>             Fix For: Impala 4.3.0
>
>
> DictEncoder currently uses this hash function for TimestampValue:
> {noformat}
> template<typename T>
> inline uint32_t DictEncoder<T>::Hash(const T& value) const {
>   return HashUtil::Hash(&value, sizeof(value), 0);
> }{noformat}
> TimestampValue has some padding, and nothing ensures that the padding is 
> cleared. This means that identical TimestampValue objects can hash to 
> different values.
> This came up when fixing a Clang-Tidy performance check. This line in 
> dict-test.cc changed from iterating over values to iterating over const 
> references.
> {noformat}
>   DictEncoder<InternalType> encoder(&pool, fixed_buffer_byte_size, 
> &track_encoder);
>   encoder.UsedbyTest();
> <<<<<<
>   for (InternalType i: values) encoder.Put(i);
> =====
>   for (const InternalType& i: values) encoder.Put(i);
> >>>>>
>   bytes_alloc = encoder.DictByteSize();
>   EXPECT_EQ(track_encoder.consumption(), bytes_alloc);
>   EXPECT_EQ(encoder.num_entries(), values_set.size()); <--------{noformat}
> The test became flaky, with the encoder.num_entries() being larger than the 
> values_set.size() for TimestampValue. This happened because the hash values 
> didn't match even for identical entries and the dictionary would have 
> multiple copies of the same value. When iterating over a plain non-reference 
> TimestampValue, each TimestampValue is being copied to a temporary value. 
> Maybe in this circumstance the padding stays the same between iterations.
> It's possible this would come up when writing Parquet data files.
> One fix would be to use TimestampValue's Hash function, which ignores the 
> padding:
> {noformat}
> template<>
> inline uint32_t DictEncoder<TimestampValue>::Hash(const TimestampValue& 
> value) const {
>   return value.Hash();
> }{noformat}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to