zhbinbin commented on a change in pull request #4198:
URL: https://github.com/apache/incubator-doris/pull/4198#discussion_r463459320
##########
File path:
docs/zh-CN/extending-doris/udf/contrib/udaf-orthogonal-bitmap-manual.md
##########
@@ -0,0 +1,209 @@
+---
+{
+ "title": "Orthogonal BITMAP Calculation UDAF",
+ "language": "zh-CN"
+}
+---
+
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# Orthogonal BITMAP Calculation UDAF
+
+## Background
+
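+Doris's original Bitmap aggregate functions are designed to be general-purpose, but they perform poorly when computing intersections and unions over high-cardinality bitmaps (hundreds of millions of values or more). Inspecting the bitmap aggregate function logic in the backend (BE) revealed two main causes. First, when the bitmap cardinality is large, for example when the data size exceeds 1 GB, network and disk IO take a long time. Second, after scanning, the BE instances send all data to the top-level node for the intersection and union operations, which puts pressure on that single node and makes it the processing bottleneck.
+
+The solution is to add an hid column when creating the table. At load time, the hid column is assigned by range of the bitmap column's values, and the table is bucketed evenly by hid. The aggregated bitmap data, partitioned by range, is then spread evenly across all BE instances. On top of this table schema, the UDAF is optimized so that all scan nodes take part in the distributed orthogonal computation and the top-level node only merges the results, which greatly improves computation efficiency.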
+
+## Create table
+
+The table must use the aggregate model, with column type bitmap and aggregate function bitmap_union.
+
+```
+CREATE TABLE `user_tag_bitmap` (
+ `tag` bigint(20) NULL COMMENT "user tag",
+ `hid` smallint(6) NULL COMMENT "bucket id",
+ `user_id` bitmap BITMAP_UNION NULL COMMENT ""
+) ENGINE=OLAP
+AGGREGATE KEY(`tag`, `hid`)
+COMMENT "OLAP"
+DISTRIBUTED BY HASH(`hid`) BUCKETS 3
+```
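+The table schema adds an hid column, which represents the id range and serves as the hash bucketing column.
+
+Note: choose the number of hid values and BUCKETS sensibly. The number of distinct hid values should be at least 5 times BUCKETS so that the data is hashed into buckets as evenly as possible.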
+
+## Data Load
+
+```
+LOAD LABEL user_tag_bitmap_test
+(
+DATA INFILE('hdfs://abc')
+INTO TABLE user_tag_bitmap
+COLUMNS TERMINATED BY ','
+(tmp_tag, tmp_user_id)
+SET (
+tag = tmp_tag,
+hid = ceil(tmp_user_id/5000000),
+user_id = to_bitmap(tmp_user_id)
+)
+)
+...
+```
+Data format:
+```
+11111111,1
+11111112,2
+11111113,3
+11111114,4
+...
+```
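+Note: the first column is the user tag, such as 'male', 'born in the 1990s', or '100-200 thousand', already converted from Chinese text to a number.
+
+When loading data, the user bitmap is split vertically. For example, user ids in the range 1-5000000 share the same hid, and rows with the same hid are hash-distributed evenly to the BE instances for union aggregation. The bitmap UDAF implementation can exploit the fact that tablets are spread evenly across the BEs: each local node computes the intersection or union directly after scanning its data, and the top node combines the partial results in the merge phase. This design makes full use of concurrent computation on all BEs.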
+
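The range split described above can be illustrated with a minimal Python sketch. It is not Doris code: `hid_for` and `split_by_hid` are hypothetical helper names, and the only value taken from the load statement is the range width 5000000. The point it shows is that partitions with different hid values never share a user id, so a union over the whole bitmap can be assembled from independent per-hid pieces.

```python
HID_RANGE = 5_000_000  # range width used in the load statement: ceil(tmp_user_id/5000000)


def hid_for(user_id: int) -> int:
    """Integer equivalent of ceil(user_id / HID_RANGE) for positive ids."""
    return -(-user_id // HID_RANGE)


def split_by_hid(user_ids):
    """Group user ids into disjoint per-hid partitions (hypothetical helper)."""
    parts = {}
    for uid in user_ids:
        parts.setdefault(hid_for(uid), set()).add(uid)
    return parts


ids = [1, 4_999_999, 5_000_000, 5_000_001, 12_345_678]
parts = split_by_hid(ids)
# Because the partitions are disjoint, their sizes add up to the total cardinality.
total = sum(len(p) for p in parts.values())
```

Ids 1 through 5000000 land in hid 1 and 5000001 starts hid 2, matching the example in the text.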
+## Custom UDAF
+Set this parameter before querying Doris:
+```
+set parallel_fragment_exec_instance_num=5
+```
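+Note: set the concurrency parameter according to your cluster to improve parallel computation performance.
+
+The new UDAF must register its function symbols when the aggregate function is defined in Doris. The symbols are loaded from a dynamic library (.so).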
+
+### bitmap_orthogonal_intersect
+
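+Intersection function
+ bitmap_orthogonal_intersect(bitmap_column, column_to_filter, filter_values)
+
+Parameters:
+ The first argument is the Bitmap column, the second is the dimension column used for filtering, and the third onward are variable-length arguments giving the values of the filter dimension column to match.
+
+Description:
+
+Building on this table schema, the query plan for this UDAF aggregates in two levels. On the first-level BE nodes (update, serialize), rows are first hash-aggregated with filter_values as the key, then the bitmaps of all keys are intersected, and the result is serialized and sent to the second-level BE node (merge, finalize). The second-level BE node loops over the bitmap values received from all first-level nodes and unions them.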
+
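The two-level plan described above can be sketched in plain Python, using sets in place of bitmaps. This is only an illustration under stated assumptions, not the BE implementation: `local_intersect` and `merge` are hypothetical names standing in for the update/serialize and merge/finalize phases, each node is assumed to hold a disjoint range of user ids (as the hid bucketing arranges), and at least one filter value is assumed.

```python
from functools import reduce


def local_intersect(rows, filter_values):
    """First level: hash-aggregate by filter value, then intersect across all keys.

    rows is an iterable of (tag, set_of_user_ids) pairs scanned by one node.
    """
    per_tag = {value: set() for value in filter_values}
    for tag, users in rows:
        if tag in per_tag:
            per_tag[tag] |= users  # union rows that share the same key
    return reduce(set.intersection, per_tag.values())


def merge(partials):
    """Second level: union the partial intersections from every first-level node."""
    result = set()
    for partial in partials:
        result |= partial
    return result


# Two nodes holding disjoint user-id ranges (hypothetical sample data).
node_a = [(1, {1, 2, 3}), (2, {2, 3, 4})]
node_b = [(1, {5_000_001, 5_000_002}), (2, {5_000_002})]
answer = merge(local_intersect(rows, [1, 2]) for rows in (node_a, node_b))
```

Unioning the per-node intersections gives the same set as intersecting the global bitmaps only because no user id appears on more than one node.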
+
+Definition:
+```
+drop FUNCTION bitmap_orthogonal_intersect(BITMAP,BIGINT,BIGINT, ...);
+CREATE AGGREGATE FUNCTION bitmap_orthogonal_intersect(BITMAP,BIGINT,BIGINT, ...) RETURNS BITMAP INTERMEDIATE varchar(1)
Review comment:
Which name does this refer to?