[ https://issues.apache.org/jira/browse/IMPALA-12393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17757624#comment-17757624 ]

Joe McDonnell commented on IMPALA-12393:
----------------------------------------

I see this as a bug. I'm not sure if it is actually user-visible.

The bug would come up when inserting timestamps into a Parquet table. If the 
hash is inconsistent, the dictionary encoding sometimes treats two equivalent 
values as distinct. This prevents the dictionary encoding from working 
properly, so you can end up with a dictionary containing multiple copies of 
the same value, each under a different integer code. That would result in 
larger Parquet files. I haven't confirmed this Parquet encoding case, but it 
seems possible. It would happen unless we always clear the memory so the 
padding is zero, and I don't think we guarantee that.

> DictEncoder uses inconsistent hash function for TimestampValue
> --------------------------------------------------------------
>
>                 Key: IMPALA-12393
>                 URL: https://issues.apache.org/jira/browse/IMPALA-12393
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>    Affects Versions: Impala 4.3.0
>            Reporter: Joe McDonnell
>            Assignee: Joe McDonnell
>            Priority: Blocker
>
> DictEncoder currently uses this hash function for TimestampValue:
> {noformat}
> template<typename T>
> inline uint32_t DictEncoder<T>::Hash(const T& value) const {
>   return HashUtil::Hash(&value, sizeof(value), 0);
> }{noformat}
> TimestampValue has some padding, and nothing ensures that the padding is 
> cleared. This means that identical TimestampValue objects can hash to 
> different values.
> This came up when fixing a Clang-Tidy performance check. This line in 
> dict-test.cc changed from iterating over values to iterating over const 
> references.
> {noformat}
>   DictEncoder<InternalType> encoder(&pool, fixed_buffer_byte_size, &track_encoder);
>   encoder.UsedbyTest();
> <<<<<<
>   for (InternalType i: values) encoder.Put(i);
> =====
>   for (const InternalType& i: values) encoder.Put(i);
> >>>>>
>   bytes_alloc = encoder.DictByteSize();
>   EXPECT_EQ(track_encoder.consumption(), bytes_alloc);
>   EXPECT_EQ(encoder.num_entries(), values_set.size()); <--------{noformat}
> The test became flaky, with encoder.num_entries() being larger than 
> values_set.size() for TimestampValue. This happened because the hash values 
> didn't match even for identical entries, so the dictionary ended up with 
> multiple copies of the same value. When iterating over a plain non-reference 
> TimestampValue, each TimestampValue is copied into a loop-local temporary, 
> which likely reuses the same stack slot, so maybe in that circumstance the 
> padding stays the same between iterations.
> It's possible this would come up when writing Parquet data files.
> One fix would be to use TimestampValue's Hash function, which ignores the 
> padding:
> {noformat}
> template<>
> inline uint32_t DictEncoder<TimestampValue>::Hash(
>     const TimestampValue& value) const {
>   return value.Hash();
> }{noformat}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
