JackDrogon opened a new pull request, #15250:
URL: https://github.com/apache/doris/pull/15250

   # Proposed changes
   
   ## Problem summary
   
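   Users often set an inappropriate bucket count, which causes a variety of problems. This PR provides a way to set the number of buckets automatically.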
   
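   ## Implementation approach
   Compute the number of buckets from the data volume.
   For a partitioned table, the bucket count can be determined from the data volume of historical partitions, the number of machines, and the number of disks.
   The main difficulty is choosing the initial bucket count.
   Two approaches are provided:
   1. Determine a bucket count from the number of machines and disks.
   2. The user supplies an empirical estimate of the data volume, and the bucket count is derived from that value.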
   
   ### Detailed design
   1. Table creation
   ```
   create table tbl1
   (...)
   [PARTITION BY RANGE(...)]
   DISTRIBUTED BY HASH(k1) BUCKETS 0
   properties(
       ["estimate_partition_size" = "100G"]
   )
   ```
   
   - `BUCKETS 0` means the bucket count is set automatically.
   - `estimate_partition_size`: optional parameter giving the initial data volume of a single partition.
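To make the `estimate_partition_size` value concrete, here is a minimal Python sketch that turns a string such as `"100G"` into a byte count. The helper name `parse_size`, the accepted suffixes (K, M, G, T, with an optional trailing B), and the use of binary units are all assumptions made for illustration; the PR description only shows the value `"100G"`.

```python
import re

# Binary multipliers; whether the real implementation uses binary or
# decimal units is not stated in the PR, so this is an assumption.
_UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3, "T": 1024 ** 4}


def parse_size(text):
    """Parse a size string like '100G' or '512 MB' into bytes (hypothetical helper)."""
    match = re.fullmatch(r"\s*(\d+)\s*([KMGT])B?\s*", text, flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    number, unit = match.groups()
    return int(number) * _UNITS[unit.upper()]
```

For example, `parse_size("100G")` gives `100 * 1024 ** 3` bytes under these assumptions.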
   
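   2. Bucket calculation logic
   Initial bucket calculation
   - `estimate_partition_size` is not given
   This case is not very reliable. Simply picking a fixed value, for example 11, should be enough.
   - `estimate_partition_size` is given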
   
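   For now, assume the value given is the data volume of a single replica in text format.
   1. First derive a bucket count N from the data volume:
       First divide the data volume by 5 (assuming a 5:1 compression ratio)
       < 100MB: 1
       < 1G: 2
       > 1G: one bucket per 1G.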
   
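   2. Derive a bucket count M as the product of the machine count and the disk count:
       Each BE node counts as 1
       Disk capacity: every 50G counts as 1
       
   3. Take min(M, N, 128). If this value is less than N and also less than the number of machines, use the number of machines.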
   
   Examples:
   ```
   1. 100MB, 10 machines, 2T * 3 -> 1
   2. 1G, 3 machines, 500GB * 2 -> 2
   3. 100G, 3 machines, 500GB * 2 -> 60 (this case is based on TPC-H 100, where we use 48 buckets)
   4. 500G, 3 machines, 1T * 1 -> 60
   5. 500G, 10 machines, 2T * 3 -> 128
   6. 1T, 10 machines, 2T * 3 -> 128
   7. 500G, 1 machine, 100TB * 1 -> 128
   8. 1TB, 200 machines, 4T * 7 -> 200
   ```
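
As a rough illustration of the initial calculation, the following Python sketch follows the three steps literally. All function and parameter names are hypothetical and not taken from the patch. The rounding choices (ceiling for data size, floor for disk capacity), the minimum of 2 buckets at and above 1G, binary units, and reading the disk figure as per-node capacity are assumptions. With the 5:1 compression step applied, the 100G case comes out as 20 rather than 60; passing `compression_ratio=1` gives 60, so the listed examples may treat the size as already compressed.

```python
MB = 1024 ** 2
GB = 1024 ** 3
TB = 1024 ** 4

DEFAULT_BUCKETS = 11  # fixed fallback mentioned for the "no estimate" case
MAX_BUCKETS = 128     # upper bound used in min(M, N, 128)


def buckets_from_data_size(estimate_bytes, compression_ratio=5):
    """Step 1: bucket count N derived from the estimated partition size."""
    compressed = estimate_bytes // compression_ratio
    if compressed < 100 * MB:
        return 1
    if compressed < 1 * GB:
        return 2
    # One bucket per 1G, rounded up; max(2, ...) keeps the result from
    # dropping back to 1 at exactly 1G (an assumption, not from the PR).
    return max(2, -(-compressed // GB))


def buckets_from_capacity(be_nodes, disk_bytes_per_node):
    """Step 2: bucket count M = BE nodes * (per-node disk capacity / 50G)."""
    return be_nodes * (disk_bytes_per_node // (50 * GB))


def initial_buckets(estimate_bytes, be_nodes, disk_bytes_per_node,
                    compression_ratio=5):
    """Step 3: min(M, N, 128), raised to the node count when the cap bites."""
    if estimate_bytes is None:
        return DEFAULT_BUCKETS
    n = buckets_from_data_size(estimate_bytes, compression_ratio)
    m = buckets_from_capacity(be_nodes, disk_bytes_per_node)
    result = min(m, n, MAX_BUCKETS)
    if result < n and result < be_nodes:
        result = be_nodes
    return result
```

Under these assumptions, `initial_buckets(1 * TB, 200, 28 * TB)` returns 200: the cap of 128 is below both N and the machine count, so the machine count is used.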
   
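   Calculating buckets for future partitions
   This applies to partitioned tables only.
   Use the exponential average of the data volume of up to the 7 most recent partitions as `estimate_partition_size` for the evaluation.
   The trend across historical partitions must be checked:
       For example, if each of the previous five partitions is larger than the one before it, the data is growing. In that case the average should not be used, and the trend value should be used instead.
       Only the increasing and decreasing cases are considered. In all other cases, use the average.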
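
The description does not define the exponential average or the "trend value" precisely, so the Python sketch below is only one possible reading. The smoothing factor `alpha`, the minimum number of points needed to call something a trend, and the use of linear extrapolation (last size plus the mean step) as the trend value are all assumptions, and the function name is hypothetical.

```python
def estimate_next_partition_size(sizes, window=7, alpha=0.5, min_trend_points=3):
    """Estimate the next partition's size from historical sizes (oldest first)."""
    recent = list(sizes)[-window:]  # at most the last `window` partitions
    if not recent:
        raise ValueError("at least one historical partition is required")

    deltas = [b - a for a, b in zip(recent, recent[1:])]
    monotonic = deltas and (all(d > 0 for d in deltas) or all(d < 0 for d in deltas))
    if monotonic and len(recent) >= min_trend_points:
        # Consistently growing or shrinking: extrapolate one step ahead
        # by the mean change, never going below zero.
        return max(0, recent[-1] + sum(deltas) / len(deltas))

    # Otherwise fall back to an exponential moving average,
    # which weights newer partitions more heavily.
    ema = recent[0]
    for size in recent[1:]:
        ema = alpha * size + (1 - alpha) * ema
    return ema
```

The result would then be fed into the initial bucket calculation in place of a user-supplied `estimate_partition_size`.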
   
   ## Checklist(Required)
   
   1. Does it affect the original behavior: 
       - [ ] Yes
       - [ ] No
       - [ ] I don't know
   2. Has unit tests been added:
       - [ ] Yes
       - [ ] No
       - [ ] No Need
   3. Has document been added or modified:
       - [ ] Yes
       - [ ] No
       - [ ] No Need
   4. Does it need to update dependencies:
       - [ ] Yes
       - [ ] No
   5. Are there any changes that cannot be rolled back:
       - [ ] Yes (If Yes, please explain WHY)
       - [ ] No
   
   ## Further comments
   
   If this is a relatively large or complex change, kick off the discussion at 
[[email protected]](mailto:[email protected]) by explaining why you 
chose the solution you did and what alternatives you considered, etc...
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

