blackfox1983 opened a new issue #4718:
URL: https://github.com/apache/incubator-doris/issues/4718


   **Is your feature request related to a problem? Please describe.**
   At present, Doris uses a 32-bit integer signature for string values stored 
in a bitmap, e.g. count(distinct v1), where v1 is a bitmap column whose values 
are hashed with bitmap_hash.
   
   Although a 32-bit integer signature performs better than a 64-bit one, its 
precision is lower because of hash collisions.
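
   To make the collision effect concrete, here is a minimal sketch that 
compares an exact distinct count with the distinct count of truncated 
signatures. It is only an illustration: `signature` is a hypothetical helper 
that truncates a BLAKE2b digest from Python's standard library, used as a 
stand-in because Murmur is not in the standard library. It is not Doris's 
bitmap_hash, and the key names are made up.

```python
import hashlib


def signature(value: str, bits: int) -> int:
    """Stand-in signature: the top `bits` bits of a 64-bit BLAKE2b digest.

    Assumption: this is NOT Murmur and NOT bitmap_hash, only a uniform hash
    used to illustrate how signature width affects distinct counts.
    """
    digest = hashlib.blake2b(value.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big") >> (64 - bits)


def distinct_counts(values, bits: int):
    """Return (exact distinct count, distinct count of the signatures)."""
    exact = set(values)
    hashed = {signature(v, bits) for v in exact}
    return len(exact), len(hashed)


if __name__ == "__main__":
    keys = [f"user-{i}" for i in range(200_000)]  # hypothetical input
    for bits in (16, 32, 64):
        exact, hashed = distinct_counts(keys, bits)
        print(f"{bits}-bit: exact={exact} hashed={hashed} lost={exact - hashed}")
```

   Whenever two different keys share a signature, the hashed count falls below 
the exact count, and it can never exceed it. The narrower the signature, the 
more often that happens.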
   
   As a result, the value returned by Doris is inconsistent with the value 
calculated offline, and we have to explain to users whether the difference 
comes from the hash error or from a bug in the SQL. Over time, users stop 
trusting the results returned by Doris.
   
   An incorrect result is less acceptable than a slow query.
   
   **Describe the solution you'd like**
   The result value returned by Doris should be exact (100% consistent with 
the result value calculated offline, e.g. sort -u | wc -l).
   
   **Describe alternatives you've considered**
   The current bitmap_hash computes a 32-bit integer signature. We can add a 
64-bit signature function. It would be better to name the function after the 
signature algorithm, e.g. murmur3_hash64, so that other signature algorithms 
can be added later in the same way, e.g. xxx_hash64.
   
   **Additional context**
   At present, the Doris code already contains murmur2 / murmur3 signature 
algorithms:
   
   - murmur_hash3.h: murmur_hash3_x64_64
   - hash_util.hpp: murmur_hash3_32 / murmur_hash2_64 / murmur_hash64A (the 
latter two produce identical results)
   - seed in the hash function: we use 104729
   
   With a murmur32 signature on nearly 100 million values, the error is 2‰ to 
8‰. 
   With a murmur64 signature, the error is zero.
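
   For a rough sense of why signature width matters so much, the following 
sketch estimates the expected undercount under an idealized model in which n 
distinct keys are hashed uniformly into 2**bits slots. This model is an 
assumption and is not a measurement of Doris or of any Murmur variant; the 
function names are hypothetical. While n is much smaller than 2**bits, the 
relative undercount is approximately n / 2**(bits + 1).

```python
import math


def expected_distinct(n: int, bits: int) -> float:
    """Expected number of distinct signatures for n distinct keys hashed
    uniformly into m = 2**bits slots: m * (1 - (1 - 1/m)**n)."""
    m = 2.0 ** bits
    # -expm1(n * log1p(-1/m)) evaluates 1 - (1 - 1/m)**n without the
    # precision loss of computing (1 - 1/m)**n directly for huge m.
    return m * -math.expm1(n * math.log1p(-1.0 / m))


def relative_undercount(n: int, bits: int) -> float:
    """Expected fraction of distinct keys lost to collisions."""
    return (n - expected_distinct(n, bits)) / n


if __name__ == "__main__":
    for n in (10_000_000, 50_000_000, 100_000_000):
        for bits in (32, 64):
            loss = relative_undercount(n, bits)
            print(f"n={n:>11,} bits={bits}: undercount ~ {loss * 1000:.6f} per mille")
```

   Under this model the 32-bit loss grows roughly linearly with the number of 
distinct values, so the observed error depends on the actual cardinality, 
while the 64-bit loss stays many orders of magnitude below one key at these 
sizes.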
   
   In addition, given how the bitmap is implemented, we should prefer a 
signature algorithm that produces smaller high 32-bit values.
   
   We find that the high 32-bit values of the signatures produced by the 
64-bit signature functions are distributed consistently (< 5%). None is 
significantly larger than the others.
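
   A check of this kind could be sketched as follows. The sketch splits each 
64-bit signature into its high and low 32-bit halves and reports how evenly 
the high halves spread over equal-sized ranges. Everything here is an 
assumption for illustration: `signature64` is a BLAKE2b stand-in rather than 
one of the Murmur functions listed above, the 16-range bucketing is arbitrary, 
and the metric is not necessarily the one behind the figure quoted above.

```python
import hashlib
from collections import Counter


def signature64(value: str) -> int:
    """Stand-in 64-bit signature (BLAKE2b, NOT Murmur); illustration only."""
    digest = hashlib.blake2b(value.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def high32_bucket_shares(values, bucket_bits: int = 4):
    """Return the share of values whose high 32 bits fall into each of
    2**bucket_bits equal-sized ranges of the 32-bit space."""
    counts = Counter()
    total = 0
    for v in values:
        high = signature64(v) >> 32                 # upper 32 bits
        counts[high >> (32 - bucket_bits)] += 1     # range index of `high`
        total += 1
    return [counts[b] / total for b in range(1 << bucket_bits)]


def max_relative_deviation(shares):
    """Largest relative distance of any range's share from the uniform share."""
    expected = 1.0 / len(shares)
    return max(abs(s - expected) / expected for s in shares)


if __name__ == "__main__":
    keys = (f"user-{i}" for i in range(100_000))    # hypothetical input
    shares = high32_bucket_shares(keys)
    print(f"max relative deviation: {max_relative_deviation(shares):.2%}")
```

   Swapping `signature64` for each candidate function and comparing the 
reported deviations would show whether any of them skews the high 32 bits.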
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


