feiniaofeiafei commented on code in PR #55472:
URL: https://github.com/apache/doris/pull/55472#discussion_r2315529033


##########
fe/fe-core/src/main/java/org/apache/doris/qe/SessionVariable.java:
##########
@@ -769,14 +769,40 @@ public class SessionVariable implements Serializable, Writable {
 
     public static final String SKEW_REWRITE_AGG_BUCKET_NUM = "skew_rewrite_agg_bucket_num";
 
+    public static final String HOT_VALUE_COLLECT_COUNT = "hot_value_collect_count";
+    @VariableMgr.VarAttr(name = HOT_VALUE_COLLECT_COUNT, needForward = true,
+                description = {"列统计信息收集时,收集占比排名前 HOT_VALUE_COLLECT_COUNT 的值作为hot value",
+                        "When collecting column statistics, collect the top values ranked by their "
+                                + "proportion as hot values, up to HOT_VALUE_COLLECT_COUNT."})
+    public int hotValueCollectCount = 10; // Select the values that account for at least 10% of the column
+
+    public void setHotValueCollectCount(int count) {
+        this.hotValueCollectCount = count;
+    }
+
+    public static int getHotValueCollectCount() {
+        if (ConnectContext.get() != null) {
+            if (ConnectContext.get().getState().isInternal()) {
+                return 0;
+            } else {
+                return ConnectContext.get().getSessionVariable().hotValueCollectCount;
+            }
+        } else {
+            return Integer.parseInt(VariableMgr.getDefaultValue(HOT_VALUE_COLLECT_COUNT));
+        }
+    }
+
     public static final String HOT_VALUE_THRESHOLD = "hot_value_threshold";
 
     @VariableMgr.VarAttr(name = HOT_VALUE_THRESHOLD, needForward = true,
-                description = {"value 在每百行中的最低出现次数",
-                        "The minimum number of occurrences of 'value' per hundred lines"})
-    private double hotValueThreshold = 33; // by percentage
-
-    public void setHotValueThreshold(double threshold) {
+                description = {"当列中某个特定值的出现次数大于等于(rowCount/ndv)× hotValueThreshold 时,该值即被视为热点值",
+                        "When the occurrence of a value in a column is greater than "
+                                + "hotValueThreshold tmies of average occurences "
+                                + "(occurrences >= hotValueThreshold * rowCount / ndv), "
+                                + "the value is regarded as hot value"})
+    private double hotValueThreshold = 10;
+

Review Comment:
   In this PR, suppose a table has 10 million rows, and column A, which is a
join key, contains only two distinct values, 1 and 2. In that case neither 1
nor 2 will be recognized as a hot value, yet this distribution still seems
likely to cause join skew.
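
The scenario above can be checked with a little arithmetic. The following is a minimal sketch, not code from the PR: the class name `HotValueThresholdSketch` and the helper `isHotValue` are hypothetical, and the only thing taken from the diff is the rule quoted in the new description string (`occurrences >= hotValueThreshold * rowCount / ndv`) together with the new default of 10. It also assumes the two values split the rows evenly.

```java
// Hypothetical illustration of the rule quoted in the VarAttr description;
// this is not the statistics code from the PR.
final class HotValueThresholdSketch {

    // Rule from the description: occurrences >= hotValueThreshold * rowCount / ndv.
    static boolean isHotValue(long occurrences, long rowCount, long ndv, double hotValueThreshold) {
        double averageOccurrences = (double) rowCount / ndv;
        return occurrences >= hotValueThreshold * averageOccurrences;
    }

    public static void main(String[] args) {
        long rowCount = 10_000_000L;    // table size from the review comment
        long ndv = 2;                   // column A holds only the values 1 and 2
        double hotValueThreshold = 10;  // default value in this diff
        long occurrences = rowCount / ndv; // assumption: an even 50/50 split

        double cutoff = hotValueThreshold * rowCount / ndv;
        System.out.printf("cutoff=%.0f occurrences=%d hot=%b%n",
                cutoff, occurrences, isHotValue(occurrences, rowCount, ndv, hotValueThreshold));
    }
}
```

Under these assumptions the cutoff is 10 * 10,000,000 / 2 = 50,000,000 occurrences, which is more than the table has rows, so neither value can qualify no matter how the rows are split. More generally, if the rule is applied exactly as written, a value's share of the rows must be at least `hotValueThreshold / ndv`, which cannot happen when `ndv` is smaller than `hotValueThreshold`.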



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

