This is an automated email from the ASF dual-hosted git repository.

yiguolei pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris.git


The following commit(s) were added to refs/heads/master by this push:
     new 883ae8a86d [typo](docs) Add some content for bitmap_hash.md. (#17747)
883ae8a86d is described below

commit 883ae8a86d042233cb3bdde977284f5552db3f8b
Author: yagagagaga <[email protected]>
AuthorDate: Tue Mar 14 08:27:07 2023 +0800

    [typo](docs) Add some content for bitmap_hash.md. (#17747)
---
 .../sql-functions/bitmap-functions/bitmap_hash.md  | 86 +++++++++++++++++++---
 .../sql-functions/bitmap-functions/bitmap_hash.md  | 86 +++++++++++++++++++---
 2 files changed, 148 insertions(+), 24 deletions(-)

diff --git a/docs/en/docs/sql-manual/sql-functions/bitmap-functions/bitmap_hash.md b/docs/en/docs/sql-manual/sql-functions/bitmap-functions/bitmap_hash.md
index 1b19f5a07c..20a7324778 100644
--- a/docs/en/docs/sql-manual/sql-functions/bitmap-functions/bitmap_hash.md
+++ b/docs/en/docs/sql-manual/sql-functions/bitmap-functions/bitmap_hash.md
@@ -25,28 +25,90 @@ under the License.
 -->
 
 ## bitmap_hash
-### description
+
+### Name
+
+BITMAP_HASH
+
+### Description
+
+Computes the 32-bit hash value of the input and returns a BITMAP containing that hash value. The function uses MurMur3 because it is fast and has a low collision rate. More importantly, the MurMur3 output distribution is close to random, and it passes the Chi-Square distribution test. Note that different hardware platforms and different seeds may change the MurMur3 result. For more information about its performance, see [Smhasher](http://rurban.github.io/smhasher/).
+
 #### Syntax
 
-`BITMAP BITMAP_HASH(expr)`
+```
+BITMAP BITMAP_HASH(<any_value>)
+```
+
+#### Arguments
+
+`<any_value>`
+Any value or expression.
+
+#### Return Type
+
+BITMAP
 
-Compute the 32-bits hash value of a expr of any type, then return a bitmap containing that hash value. Mainly be used to load non-integer value into bitmap column, e.g.,
+#### Remarks
 
+In general, the 32-bit MurMur hash works well for random, short strings, with a collision rate of about one in a billion. Longer strings, such as file system paths, collide more often. If you hash the paths on your system, you will find many collisions.
+
+For example, the following two strings have the same hash value.
+
+```sql
+SELECT bitmap_to_string(bitmap_hash('/System/Volumes/Data/Library/Developer/CommandLineTools/SDKs/MacOSX12.3.sdk/System/Library/Frameworks/KernelManagement.framework/KernelManagement.tbd')) AS a,
+       bitmap_to_string(bitmap_hash('/System/Library/PrivateFrameworks/Install.framework/Versions/Current/Resources/es_419.lproj/Architectures.strings')) AS b;
 ```
-cat data | curl --location-trusted -u user:passwd -T - -H "columns: dt,page,device_id, device_id=bitmap_hash(device_id)"   http://host:8410/api/test/testDb/_stream_load
+
+Here is the result.
+
+```text
++-----------+-----------+
+| a         | b         |
++-----------+-----------+
+| 282251871 | 282251871 |
++-----------+-----------+
 ```
 
-### example
+### Example
+
+To compute the MurMur3 hash of a value, run:
 
 ```
-mysql> select bitmap_count(bitmap_hash('hello'));
-+------------------------------------+
-| bitmap_count(bitmap_hash('hello')) |
-+------------------------------------+
-|                                  1 |
-+------------------------------------+
+select bitmap_to_array(bitmap_hash('hello'))[1];
 ```
 
-### keywords
+Here is the result.
+
+```text
++-------------------------------------------------------------+
+| %element_extract%(bitmap_to_array(bitmap_hash('hello')), 1) |
++-------------------------------------------------------------+
+|                                                  1321743225 |
++-------------------------------------------------------------+
+```
+
+To count the distinct values of a column, a bitmap can perform better than `count distinct` in some scenarios.
+
+```sql
+select bitmap_count(bitmap_union(bitmap_hash(`word`))) from `words`;
+```
+
+Here is the result.
+
+```text
++-------------------------------------------------+
+| bitmap_count(bitmap_union(bitmap_hash(`word`))) |
++-------------------------------------------------+
+|                                        33263478 |
++-------------------------------------------------+
+```
+
+### Keywords
 
     BITMAP_HASH,BITMAP
+
+### Best Practice
+
+For more information, see also:
+- [BITMAP_HASH64](./bitmap_hash64.md)
diff --git a/docs/zh-CN/docs/sql-manual/sql-functions/bitmap-functions/bitmap_hash.md b/docs/zh-CN/docs/sql-manual/sql-functions/bitmap-functions/bitmap_hash.md
index 37496a45ed..c9d64f7ca3 100644
--- a/docs/zh-CN/docs/sql-manual/sql-functions/bitmap-functions/bitmap_hash.md
+++ b/docs/zh-CN/docs/sql-manual/sql-functions/bitmap-functions/bitmap_hash.md
@@ -25,28 +25,90 @@ under the License.
 -->
 
 ## bitmap_hash
-### description
+
+### Name
+
+BITMAP_HASH
+
+### Description
+
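+Computes a 32-bit hash value for an input of any type and returns a bitmap containing that hash value. The function uses the MurMur3 hash algorithm. MurMur3 is a high-performance hash algorithm with a low collision rate; its output is close to a random distribution and passes the Chi-Square distribution test. Note that different hardware platforms and different seed values may produce different hash values. For the performance of this algorithm, see the [Smhasher](http://rurban.github.io/smhasher/) ranking.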
+
 #### Syntax
 
-`BITMAP BITMAP_HASH(expr)`
+```
+BITMAP BITMAP_HASH(<any_value>)
+```
+
+#### Arguments
+
+`<any_value>`
+Any value or field expression.
+
+#### Return Type
+
+BITMAP
 
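-Computes a 32-bit hash value for an input of any type and returns a bitmap containing that hash value. Mainly used in stream load jobs to import non-integer fields into bitmap columns of Doris tables. For example: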
+#### Remarks
 
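+In general, the 32-bit MurMur algorithm hashes fully random, short strings well, with a collision rate as low as one in several billion. For longer strings, such as operating system paths, the collision rate is noticeably higher. If you scan the paths on your system, you will find that the collision rate only reaches one in a million, or even one in a hundred thousand.
+
+The following two strings have the same MurMur3 hash value: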
+
+```sql
+SELECT bitmap_to_string(bitmap_hash('/System/Volumes/Data/Library/Developer/CommandLineTools/SDKs/MacOSX12.3.sdk/System/Library/Frameworks/KernelManagement.framework/KernelManagement.tbd')) AS a,
+       bitmap_to_string(bitmap_hash('/System/Library/PrivateFrameworks/Install.framework/Versions/Current/Resources/es_419.lproj/Architectures.strings')) AS b;
 ```
-cat data | curl --location-trusted -u user:passwd -T - -H "columns: dt,page,device_id, device_id=bitmap_hash(device_id)"   http://host:8410/api/test/testDb/_stream_load
+
+Here is the result:
+
+```text
++-----------+-----------+
+| a         | b         |
++-----------+-----------+
+| 282251871 | 282251871 |
++-----------+-----------+
 ```
 
-### example
+### Example
+
+To compute the MurMur3 hash of a value, run:
 
 ```
-mysql> select bitmap_count(bitmap_hash('hello'));
-+------------------------------------+
-| bitmap_count(bitmap_hash('hello')) |
-+------------------------------------+
-|                                  1 |
-+------------------------------------+
+select bitmap_to_array(bitmap_hash('hello'))[1];
 ```
 
-### keywords
+Here is the result:
+
+```text
++-------------------------------------------------------------+
+| %element_extract%(bitmap_to_array(bitmap_hash('hello')), 1) |
++-------------------------------------------------------------+
+|                                                  1321743225 |
++-------------------------------------------------------------+
+```
+
+To count the distinct values of a column, you can use a bitmap, which in some scenarios performs much better than `count distinct`:
+
+```sql
+select bitmap_count(bitmap_union(bitmap_hash(`word`))) from `words`;
+```
+
+Here is the result:
+
+```text
++-------------------------------------------------+
+| bitmap_count(bitmap_union(bitmap_hash(`word`))) |
++-------------------------------------------------+
+|                                        33263478 |
++-------------------------------------------------+
+```
+
+### Keywords
 
     BITMAP_HASH,BITMAP
+
+### Best Practice
+
+See also:
+- [BITMAP_HASH64](./bitmap_hash64.md)
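
As a rough check on the collision remarks in the patch above, the sketch below estimates how many collisions to expect when n distinct values are hashed to 32 bits. It assumes an ideal, uniformly distributed hash, which the remarks suggest does not hold for long path-like strings, so the numbers are an optimistic estimate. The function names and the sample row counts are illustrative only and are not part of Doris.

```python
import math

HASH_BITS = 32  # the docs describe bitmap_hash as producing a 32-bit hash value


def expected_colliding_pairs(n: int, bits: int = HASH_BITS) -> float:
    """Expected number of colliding pairs among n distinct inputs.

    Assumes every input maps independently and uniformly to one of
    2**bits values (an idealised hash, not a measured property of Doris).
    """
    return n * (n - 1) / 2 / 2**bits


def probability_of_any_collision(n: int, bits: int = HASH_BITS) -> float:
    """Birthday-bound approximation: 1 - exp(-n(n-1) / 2**(bits+1))."""
    return -math.expm1(-expected_colliding_pairs(n, bits))


if __name__ == "__main__":
    # Hypothetical row counts, chosen only to show how quickly collisions grow.
    for n in (1_000, 100_000, 10_000_000):
        pairs = expected_colliding_pairs(n)
        prob = probability_of_any_collision(n)
        print(f"n={n:>10}  expected pairs={pairs:.3f}  P(any collision)={prob:.6f}")
```

Under this model, when the expected number of colliding pairs is small relative to n, `bitmap_count(bitmap_union(bitmap_hash(...)))` would undercount the true distinct count by roughly that many values. Passing `bits=64` models a 64-bit hash, which is presumably what BITMAP_HASH64 provides.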

