(doris-website) branch master updated: [doc] add string function overview (#2985)

yiguolei Fri, 17 Oct 2025 19:18:05 -0700

This is an automated email from the ASF dual-hosted git repository.

yiguolei pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris-website.git



The following commit(s) were added to refs/heads/master by this push:
     new f964125e287 [doc] add string function overview (#2985)
f964125e287 is described below

commit f964125e287ac5ef62a445f461e35eceaa9d5513
Author: Mryange <[email protected]>
AuthorDate: Sat Oct 18 10:17:46 2025 +0800

    [doc] add string function overview (#2985)
    
    ## Versions
    
    - [x] dev
    - [ ] 3.x
    - [ ] 2.1
    - [ ] 2.0
    
    ## Languages
    
    - [x] Chinese
    - [x] English
    
    ## Docs Checklist
    
    - [ ] Checked by AI
    - [ ] Test Cases Built
---
 .../scalar-functions/string-functions/overview.md  | 125 +++++++++++++++++++++
 .../scalar-functions/string-functions/overview.md  | 125 +++++++++++++++++++++
 sidebars.json                                      |   1 +
 3 files changed, 251 insertions(+)

diff --git a/docs/sql-manual/sql-functions/scalar-functions/string-functions/overview.md b/docs/sql-manual/sql-functions/scalar-functions/string-functions/overview.md
new file mode 100644
index 00000000000..07cdf9c5d77
--- /dev/null
+++ b/docs/sql-manual/sql-functions/scalar-functions/string-functions/overview.md
@@ -0,0 +1,125 @@
+---
+{
+    "title": "String Functions Overview",
+    "language": "en"
+}
+---
+
+# String Functions Overview
+
+String functions are built-in functions for processing and manipulating string data. They cover operations such as concatenation, splitting, replacement, and searching.
+
+## UTF-8 Encoding Support
+
+UTF-8 is a variable-length character encoding that can represent almost all characters in the world, including Cyrillic, Greek, Chinese characters, and emojis.
+
+Doris string functions support UTF-8 encoded strings unless specifically noted otherwise.
+
+For example, the `substring` function can correctly handle UTF-8 encoded strings:
+
+### ASCII Characters
+
+```sql
+mysql> SELECT substring('abc1', 2);
++----------------------+
+| substring('abc1', 2) |
++----------------------+
+| bc1                  |
++----------------------+
+```
+
+### Greek Letters
+
+```sql
+mysql> SELECT substring('αλφαβητον', 2, 4);
++---------------------------------------+
+| substring('αλφαβητον', 2, 4)          |
++---------------------------------------+
+| λφαβ                                  |
++---------------------------------------+
+1 row in set (0.01 sec)
+```
+
+### Chinese Characters
+
+```sql
+mysql> SELECT substring('你好，世界', 2, 2);
++------------------------------------+
+| substring('你好，世界', 2, 2)      |
++------------------------------------+
+| 好，                               |
++------------------------------------+
+```
+
+### Emojis
+
+```sql
+mysql> SELECT substring('😊😊a😊 World!', 2, 3);
++-----------------------------------------+
+| substring('😊😊a😊 World!', 2, 3)     |
++-----------------------------------------+
+| 😊a😊                                  |
++-----------------------------------------+
+```
+
+## Performance Considerations
+
+Because UTF-8 encoded characters vary in length, processing them can affect performance. Some functions therefore come in both an ASCII version and a UTF-8 version to choose from.
+
+For example:
+- The `length` function returns the byte length of a string
+- The `char_length` function returns the character count of a string
+
+```sql
+mysql> select length('你好');
++------------------+
+| length('你好')   |
++------------------+
+|                6 |
++------------------+
+
+mysql> select length('αλφαβητον');
++------------------------------+
+| length('αλφαβητον')          |
++------------------------------+
+|                           18 |
++------------------------------+
+
+mysql> select char_length('你好');
++-----------------------+
+| char_length('你好')   |
++-----------------------+
+|                     2 |
++-----------------------+
+
+mysql> select char_length('αλφαβητον');
++-----------------------------------+
+| char_length('αλφαβητον')          |
++-----------------------------------+
+|                                 9 |
++-----------------------------------+
+```
+
+## Special Notes
+
+String functions that do not support UTF-8 encoding are called out explicitly in the documentation. For example, the `NGRAM_SEARCH` function supports only ASCII-encoded strings.
+
+```sql
+mysql> select ngram_search('abcab' , 'ab' , 2);
++----------------------------------+
+| ngram_search('abcab' , 'ab' , 2) |
++----------------------------------+
+|                              0.5 |
++----------------------------------+
+```
+
+With non-ASCII characters, `NGRAM_SEARCH` still executes, but the results may not be as expected.
+
+```sql
+mysql> select ngram_search('αβγαβ' , 'αβ' , 2);
++-----------------------------------------+
+| ngram_search('αβγαβ' , 'αβ' , 2)        |
++-----------------------------------------+
+|                      0.6666666666666666 |
++-----------------------------------------+
+```
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-functions/scalar-functions/string-functions/overview.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-functions/scalar-functions/string-functions/overview.md
new file mode 100644
index 00000000000..101a5936629
--- /dev/null
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-functions/scalar-functions/string-functions/overview.md
@@ -0,0 +1,125 @@
+---
+{
+    "title": "字符串函数概述",
+    "language": "zh-CN"
+}
+---
+
+# 字符串函数概述
+
+字符串函数是用于处理和操作字符串数据的内置函数。它们可以帮助我们执行各种字符串操作，如连接、分割、替换、查找等。
+
+## UTF-8 编码支持
+
+UTF-8编码是一种变长的字符编码方式，可以表示世界上几乎所有的字符，包括西里尔字母、希腊字母、汉字、表情等。
+
+在Doris的字符串函数中，如果没有特殊说明，字符串都是支持UTF-8编码的。
+
+例如 `substring` 函数可以正确处理UTF-8编码的字符串：
+
+### ASCII 字符
+
+```sql
+mysql> SELECT substring('abc1', 2);
++----------------------+
+| substring('abc1', 2) |
++----------------------+
+| bc1                  |
++----------------------+
+```
+
+### 希腊字母
+
+```sql
+mysql> SELECT substring('αλφαβητον', 2, 4);
++---------------------------------------+
+| substring('αλφαβητον', 2, 4)          |
++---------------------------------------+
+| λφαβ                                  |
++---------------------------------------+
+1 row in set (0.01 sec)
+```
+
+### 汉字
+
+```sql
+mysql> SELECT substring('你好，世界', 2, 2);
++------------------------------------+
+| substring('你好，世界', 2, 2)      |
++------------------------------------+
+| 好，                               |
++------------------------------------+
+```
+
+### 表情
+
+```sql
+mysql> SELECT substring('😊😊a😊 World!', 2, 3);
++-----------------------------------------+
+| substring('😊😊a😊 World!', 2, 3)     |
++-----------------------------------------+
+| 😊a😊                                  |
++-----------------------------------------+
+```
+
+## 性能考虑
+
+因为UTF-8编码的字符长度不固定，所以在性能上会有一定的影响。一些函数提供了ASCII版本和UTF-8版本以供选择。
+
+例如：
+- `length` 函数返回字符串的字节长度
+- `char_length` 函数返回字符串的字符长度
+
+```sql
+mysql> select length('你好');
++------------------+
+| length('你好')   |
++------------------+
+|                6 |
++------------------+
+
+mysql> select length('αλφαβητον');
++------------------------------+
+| length('αλφαβητον')          |
++------------------------------+
+|                           18 |
++------------------------------+
+
+mysql> select char_length('你好');
++-----------------------+
+| char_length('你好')   |
++-----------------------+
+|                     2 |
++-----------------------+
+
+mysql> select char_length('αλφαβητον');
++-----------------------------------+
+| char_length('αλφαβητον')          |
++-----------------------------------+
+|                                 9 |
++-----------------------------------+
+```
+
+## 特殊说明
+
+一些不支持UTF-8编码的字符串函数，会在文档中进行特别说明，例如`NGRAM_SEARCH`函数只支持ASCII编码的字符串。
+
+```sql
+mysql> select ngram_search('abcab' , 'ab' , 2);
++----------------------------------+
+| ngram_search('abcab' , 'ab' , 2) |
++----------------------------------+
+|                              0.5 |
++----------------------------------+
+```
+
+对于非ASCII字符，`NGRAM_SEARCH`也会执行，但是结果会不符合预期。
+
+```sql
+mysql> select ngram_search('αβγαβ' , 'αβ' , 2);
++-----------------------------------------+
+| ngram_search('αβγαβ' , 'αβ' , 2)        |
++-----------------------------------------+
+|                      0.6666666666666666 |
++-----------------------------------------+
+```
diff --git a/sidebars.json b/sidebars.json
index d9131b4a08c..7195921ae7e 100644
--- a/sidebars.json
+++ b/sidebars.json
@@ -1247,6 +1247,7 @@
                                     "type": "category",
                                     "label": "String Functions",
                                     "items": [
+                                        "sql-manual/sql-functions/scalar-functions/string-functions/overview",
                                         "sql-manual/sql-functions/scalar-functions/string-functions/append-trailing-char-if-absent",
                                         "sql-manual/sql-functions/scalar-functions/string-functions/ascii",
                                         "sql-manual/sql-functions/scalar-functions/string-functions/auto-partition-name",


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
