This is an automated email from the ASF dual-hosted git repository.
kassiez pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris-website.git
The following commit(s) were added to refs/heads/master by this push:
new dd2d9862be4 update java-udf for 2.1 & 3.0 (#1787)
dd2d9862be4 is described below
commit dd2d9862be44d977245e0e69d2620a1a4dc2ddf6
Author: wangtianyi2004 <[email protected]>
AuthorDate: Thu Jan 16 10:17:26 2025 +0800
update java-udf for 2.1 & 3.0 (#1787)
## Versions
- [ ] dev
- [x] 3.0
- [x] 2.1
- [ ] 2.0
## Languages
- [x] Chinese
- [x] English
## Docs Checklist
- [ ] Checked by AI
- [ ] Test Cases Built
---
.../query-data/udf/java-user-defined-function.md | 629 +++++++++++---------
.../query-data/udf/java-user-defined-function.md | 638 ++++++++++++---------
.../query-data/udf/java-user-defined-function.md | 505 +++++++++-------
.../query-data/udf/java-user-defined-function.md | 522 ++++++++++-------
4 files changed, 1360 insertions(+), 934 deletions(-)
diff --git
a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/query-data/udf/java-user-defined-function.md
b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/query-data/udf/java-user-defined-function.md
index 995493da487..611d4367754 100644
---
a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/query-data/udf/java-user-defined-function.md
+++
b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/query-data/udf/java-user-defined-function.md
@@ -1,6 +1,6 @@
---
{
-"title": "Java UDF",
+"title": "Java UDF, UDAF, UDTF",
"language": "zh-CN"
}
---
@@ -24,64 +24,15 @@ specific language governing permissions and limitations
under the License.
-->
-
-## Java UDF 介绍
-
+## 概述
Java UDF 为用户提供 UDF 编写的 Java 接口,以方便用户使用 Java 语言进行自定义函数的执行。
-
Doris 支持使用 JAVA 编写 UDF、UDAF 和 UDTF。下文如无特殊说明,使用 UDF 统称所有用户自定义函数。
-
-## 创建 UDF
-
-实现的 jar 包可以放在本地也可以存放在远程服务端通过 HTTP 下载,但必须让每个 FE 和 BE 节点都能获取到 jar 包。
-
-否则将会返回错误状态信息 `Couldn't open file ......`。
-
-更多语法帮助可参阅 [CREATE
FUNCTION](../../sql-manual/sql-statements/function/CREATE-FUNCTION).
-
-### UDF
-
-```sql
-CREATE FUNCTION java_udf_add_one(int) RETURNS int PROPERTIES (
- "file"="file:///path/to/java-udf-demo-jar-with-dependencies.jar",
- "symbol"="org.apache.doris.udf.AddOne",
- "always_nullable"="true",
- "type"="JAVA_UDF"
-);
-```
-
-### UDAF
-
-```sql
-CREATE AGGREGATE FUNCTION middle_quantiles(DOUBLE,INT) RETURNS DOUBLE
PROPERTIES (
- "file"="file:///pathTo/java-udaf.jar",
- "symbol"="org.apache.doris.udf.demo.MiddleNumberUDAF",
- "always_nullable"="true",
- "type"="JAVA_UDF"
-);
-```
-
-### UDTF
-
-:::tip
-UDTF 自 Doris 3.0 版本开始支持
-:::
-
-## 使用 UDF
-
-用户使用 UDF 必须拥有对应数据库的 `SELECT` 权限。
-
-UDF 的使用与普通的函数方式一致,唯一的区别在于,内置函数的作用域是全局的,而 UDF 的作用域是 DB 内部。
-
-当链接 Session 位于数据内部时,直接使用 UDF 名字会在当前 DB 内部查找对应的 UDF。否则用户需要显示的指定 UDF 的数据库名字,例如
`dbName.funcName`。
-
-## 删除 UDF
-
-当你不再需要 UDF 函数时,你可以通过下述命令来删除一个 UDF 函数,可以参考 [DROP
FUNCTION](../../sql-manual/sql-statements/function/DROP-FUNCTION)
+1. Java UDF 是较为常见的自定义标量函数 (Scalar Function),即每输入一行数据,就会有一行对应的结果输出,常见的有 ABS、LENGTH 等。值得一提的是,对于用户来讲,Hive UDF 是可以直接迁移至 Doris 的。
+2. Java UDAF 即为自定义的聚合函数 (Aggregate Function),即在输入多行数据进行聚合后,仅输出一行对应的结果,常见的有 MIN、MAX、COUNT 等。
+3. Java UDTF 即为自定义的表函数 (Table Function),即每输入一行数据,可以产生一行或多行的结果,在 Doris 中需要结合 Lateral View 使用以达到行转列的效果,常见的有 EXPLODE、EXPLODE_SPLIT 等。
## 类型对应关系
-
| Doris 数据类型 | Java UDF 参数类型 |
| ---------------- | ------------------------------------------ |
| Bool | Boolean |
@@ -110,37 +61,89 @@ UDF 的使用与普通的函数方式一致,唯一的区别在于,内置函
在创建函数时,请务必使用 `string` 类型而不是 `varchar`,否则可能会导致函数执行失败。
:::
-## UDF 的编写
+## 使用限制
+
+1. 不支持复杂数据类型(HLL,Bitmap)。
+2. 当前允许用户自己指定 JVM 最大堆大小,配置项是 be.conf 中 `JAVA_OPTS` 的 -Xmx 部分。默认 1024m,如果需要聚合数据,建议调大一些,以提升性能并减少内存溢出风险。
+3. 由于 JVM 加载同名类的问题,不要同时使用多个同名类作为 UDF 实现,如果想更新某个同名类的 UDF,需要重启 BE 重新加载 classpath。
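例如,若需把 JVM 最大堆调整为 2 GB,可以在 be.conf 中这样设置(2048m 仅为示意值,需按实际负载评估;`JAVA_OPTS` 中的其余 JVM 选项应按原有配置保留):

```shell
# be.conf 片段:仅调整 -Xmx 部分(示意值,其余选项保持原样)
JAVA_OPTS="-Xmx2048m"
```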
-本小节主要介绍如何开发一个 Java UDF。在 `samples/doris-demo/java-udf-demo/`
下提供示例,可点击查看[参考示例](https://github.com/apache/doris/tree/master/samples/doris-demo/java-udf-demo)
-使用 Java 代码编写 UDF,UDF 的主入口必须为 `evaluate` 函数。这一点与 Hive 等其他引擎保持一致。在本示例中,我们编写了
`AddOne` UDF 来完成对整型输入进行加一的操作。
+## 快速上手
+本小节主要介绍如何开发一个 Java UDF。在 `samples/doris-demo/java-udf-demo/`
下提供了示例可供参考,点击[这里](https://github.com/apache/doris/tree/master/samples/doris-demo/java-udf-demo)查看。
+
+UDF 的使用与普通的函数方式一致,唯一的区别在于,内置函数的作用域是全局的,而 UDF 的作用域是 DB 内部。
+所以如果当前连接 session 位于数据库 DB 内部时,直接使用 UDF 名字会在当前 DB 内部查找对应的 UDF。否则用户需要显式地指定 UDF 的数据库名字,例如 `dbName.funcName`。
-值得一提的是,本例不只是 Doris 支持的 Java UDF,同时还是 Hive 支持的 UDF,也就是说,对于用户来讲,Hive UDF
是可以直接迁移至 Doris 的。
+接下来章节中介绍的实例,均会在 `test_table` 上做测试,对应建表语句如下:
-另外,如果定义的 UDF 中需要加载很大的资源文件,或者希望可以定义全局的 Static 变量,可以参照文档下方的 Static 变量加载方式。
+```sql
+CREATE TABLE `test_table` (
+ id int NULL,
+ d1 double NULL,
+ str string NULL
+) ENGINE=OLAP
+DUPLICATE KEY(`id`)
+DISTRIBUTED BY HASH(`id`) BUCKETS 1
+PROPERTIES (
+"replication_num" = "1");
+
+insert into test_table values (1, 111.11, "a,b,c");
+insert into test_table values (6, 666.66, "d,e");
+```
+### Java-UDF 实例介绍
-### UDF
+使用 Java 代码编写 UDF,UDF 的主入口必须为 `evaluate` 函数。这一点与 Hive 等其他引擎保持一致。在本示例中,我们编写了
`AddOne` UDF 来完成对整型输入进行加一的操作。
-```java
-public class AddOne extends UDF {
- public Integer evaluate(Integer value) {
- return value == null ? null : value + 1;
+1. 首先编写对应的 Java 代码,打包生成 JAR 包。
+
+ ```java
+ public class AddOne extends UDF {
+ public Integer evaluate(Integer value) {
+ return value == null ? null : value + 1;
+ }
}
-}
-```
+ ```
-### UDAF
+2. 在 Doris 中注册创建 Java-UDF 函数。更多语法帮助可参阅 [CREATE
FUNCTION](../../sql-manual/sql-statements/Data-Definition-Statements/Create/CREATE-FUNCTION.md).
-在使用 Java 代码编写 UDAF 时,有一些必须实现的函数 (标记 `required`) 和一个内部类 State,下面将以一个具体的实例来说明。
+ ```sql
+ CREATE FUNCTION java_udf_add_one(int) RETURNS int PROPERTIES (
+ "file"="file:///path/to/java-udf-demo-jar-with-dependencies.jar",
+ "symbol"="org.apache.doris.udf.AddOne",
+ "always_nullable"="true",
+ "type"="JAVA_UDF"
+ );
+ ```
+
+3. 用户使用 UDF 必须拥有对应数据库的 `SELECT` 权限。
+ 如果想查看注册成功的对应 UDF 函数,可以使用[SHOW
FUNCTIONS](../../sql-manual/sql-statements/Show-Statements/SHOW-FUNCTIONS.md)
命令。
+
+ ``` sql
+ select id,java_udf_add_one(id) from test_table;
+ +------+----------------------+
+ | id | java_udf_add_one(id) |
+ +------+----------------------+
+ | 1 | 2 |
+ | 6 | 7 |
+ +------+----------------------+
+ ```
+
+4. 当不再需要 UDF 函数时,可以通过下述命令来删除一个 UDF 函数,可以参考 [DROP
FUNCTION](../../sql-manual/sql-statements/Data-Definition-Statements/Drop/DROP-FUNCTION.md)
+
+另外,如果定义的 UDF 中需要加载很大的资源文件,或者希望可以定义全局的 static 变量,可以参照文档下方的 static 变量加载方式。
-**示例 1**
+### Java-UDAF 实例介绍
-下面的 SimpleDemo 将实现一个类似的 SUM 的简单函数,输入参数 INT,输出参数是 INT。
+在使用 Java 代码编写 UDAF 时,有一些必须实现的函数 (标记 required) 和一个内部类 State,下面将以具体的实例来说明。
+
+1. 首先编写对应的 Java UDAF 代码,打包生成 JAR 包。
+
+<details>
+<summary> 示例 1: SimpleDemo 将实现一个类似 sum 的简单函数,输入参数 INT,输出参数是 INT</summary>
```java
-package org.apache.doris.udf.demo;
+package org.apache.doris.udf;
import java.io.DataInputStream;
import java.io.DataOutputStream;
@@ -149,88 +152,80 @@ import java.util.logging.Logger;
public class SimpleDemo {
- Logger log = Logger.getLogger("SimpleDemo");
+Logger log = Logger.getLogger("SimpleDemo");
- //Need an inner class to store data
- /*required*/
- public static class State {
- /*some variables if you need */
- public int sum = 0;
- }
+//Need an inner class to store data
+/*required*/
+public static class State {
+ /*some variables if you need */
+ public int sum = 0;
+}
- /*required*/
- public State create() {
- /* here could do some init work if needed */
- return new State();
- }
+/*required*/
+public State create() {
+ /* here could do some init work if needed */
+ return new State();
+}
- /*required*/
- public void destroy(State state) {
- /* here could do some destroy work if needed */
- }
+/*required*/
+public void destroy(State state) {
+ /* here could do some destroy work if needed */
+}
- /*Not Required*/
- public void reset(State state) {
- /*if you want this udaf function can work with window function.*/
- /*Must impl this, it will be reset to init state after calculate every
window frame*/
- state.sum = 0;
- }
+/*Not Required*/
+public void reset(State state) {
+ /*if you want this udaf function can work with window function.*/
+ /*Must impl this, it will be reset to init state after calculate every
window frame*/
+ state.sum = 0;
+}
- /*required*/
- //first argument is State, then other types your input
- public void add(State state, Integer val) throws Exception {
- /* here doing update work when input data*/
- if (val != null) {
- state.sum += val;
- }
+/*required*/
+//first argument is State, then other types your input
+public void add(State state, Integer val) throws Exception {
+ /* here doing update work when input data*/
+ if (val != null) {
+ state.sum += val;
}
+}
- /*required*/
- public void serialize(State state, DataOutputStream out) {
- /* serialize some data into buffer */
- try {
- out.writeInt(state.sum);
- } catch (Exception e) {
- /* Do not throw exceptions */
- log.info(e.getMessage());
- }
- }
+/*required*/
+public void serialize(State state, DataOutputStream out) throws IOException {
+ /* serialize some data into buffer */
+ out.writeInt(state.sum);
+}
- /*required*/
- public void deserialize(State state, DataInputStream in) {
- /* deserialize get data from buffer before you put */
- int val = 0;
- try {
- val = in.readInt();
- } catch (Exception e) {
- /* Do not throw exceptions */
- log.info(e.getMessage());
- }
- state.sum = val;
- }
+/*required*/
+public void deserialize(State state, DataInputStream in) throws IOException {
+ /* deserialize get data from buffer before you put */
+ int val = in.readInt();
+ state.sum = val;
+}
- /*required*/
- public void merge(State state, State rhs) throws Exception {
- /* merge data from state */
- state.sum += rhs.sum;
- }
+/*required*/
+public void merge(State state, State rhs) throws Exception {
+ /* merge data from state */
+ state.sum += rhs.sum;
+}
- /*required*/
- //return Type you defined
- public Integer getValue(State state) throws Exception {
- /* return finally result */
- return state.sum;
- }
+/*required*/
+//return Type you defined
+public Integer getValue(State state) throws Exception {
+ /* return finally result */
+ return state.sum;
+}
}
```
-**示例 2**
+</details>
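为帮助理解上面这些 required 方法在引擎中的调用顺序(create → add → serialize → deserialize → merge → getValue),下面给出一个脱离 Doris 环境的本地模拟,类名与驱动流程均为示意,并非真实的执行框架:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// 本地模拟 UDAF 生命周期:实例先 add 累加,序列化传输后 merge,最后 getValue(示意)
public class SimpleDemoLocal {
    public static class State {
        public int sum = 0;
    }

    public static State create() {
        return new State();
    }

    public static void add(State state, Integer val) {
        if (val != null) {
            state.sum += val;
        }
    }

    public static void serialize(State state, DataOutputStream out) throws IOException {
        out.writeInt(state.sum);
    }

    public static void deserialize(State state, DataInputStream in) throws IOException {
        state.sum = in.readInt();
    }

    public static void merge(State state, State rhs) {
        state.sum += rhs.sum;
    }

    public static Integer getValue(State state) {
        return state.sum;
    }

    public static void main(String[] args) throws IOException {
        // 实例 1 处理部分数据(对应 test_table 中的 id 值 1 和 6)
        State s1 = create();
        add(s1, 1);
        add(s1, 6);

        // serialize -> deserialize,模拟聚合状态在节点间的传输
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        serialize(s1, new DataOutputStream(buf));
        State s2 = create();
        deserialize(s2, new DataInputStream(new ByteArrayInputStream(buf.toByteArray())));

        // merge 到最终 State 并输出结果
        State fin = create();
        merge(fin, s2);
        System.out.println(getValue(fin));
    }
}
```

其中 serialize/deserialize 这一步模拟的是聚合中间状态在各执行单位之间的传输过程。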
+
+
+<details>
+<summary> 示例 2: MedianUDAF 是一个计算中位数的功能,输入类型为 (DOUBLE, INT), 输出为 DOUBLE
</summary>
```java
package org.apache.doris.udf.demo;
-
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.math.BigDecimal;
@@ -239,189 +234,299 @@ import java.util.logging.Logger;
/*UDAF 计算中位数*/
public class MedianUDAF {
- Logger log = Logger.getLogger("MedianUDAF");
-
- //状态存储
- public static class State {
- //返回结果的精度
- int scale = 0;
- //是否是某一个 tablet 下的某个聚合条件下的数据第一次执行 add 方法
- boolean isFirst = true;
- //数据存储
- public StringBuilder stringBuilder;
- }
-
- //状态初始化
- public State create() {
- State state = new State();
- //根据每个 tablet 下的聚合条件需要聚合的数据量大小,预先初始化,增加性能
- state.stringBuilder = new StringBuilder(1000);
- return state;
- }
+Logger log = Logger.getLogger("MedianUDAF");
+
+//状态存储
+public static class State {
+ //返回结果的精度
+ int scale = 0;
+ //是否是某一个 tablet 下的某个聚合条件下的数据第一次执行 add 方法
+ boolean isFirst = true;
+ //数据存储
+ public StringBuilder stringBuilder;
+}
+//状态初始化
+public State create() {
+ State state = new State();
+ //根据每个 tablet 下的聚合条件需要聚合的数据量大小,预先初始化,增加性能
+ state.stringBuilder = new StringBuilder(1000);
+ return state;
+}
- //处理执行单位处理各自 tablet 下的各自聚合条件下的每个数据
- public void add(State state, Double val, int scale) {
- try {
- if (val != null && state.isFirst) {
-
state.stringBuilder.append(scale).append(",").append(val).append(",");
- state.isFirst = false;
- } else if (val != null) {
- state.stringBuilder.append(val).append(",");
- }
- } catch (Exception e) {
- //如果不能保证一定不会异常,建议每个方法都最大化捕获异常,因为目前不支持处理 java 抛出的异常
- log.info("获取数据异常:" + e.getMessage());
- }
- }
- //处理数据完需要输出等待聚合
- public void serialize(State state, DataOutputStream out) {
- try {
- //目前暂时只提供 DataOutputStream,如果需要序列化对象可以考虑拼接字符串,转换 json,序列化成字节数组等方式
- //如果要序列化 State 对象,可能需要自己将 State 内部类实现序列化接口
- //最终都是要通过 DataOutputStream 传输
- out.writeUTF(state.stringBuilder.toString());
- } catch (Exception e) {
- log.info("序列化异常:" + e.getMessage());
- }
+//处理执行单位处理各自 tablet 下的各自聚合条件下的每个数据
+public void add(State state, Double val, int scale) throws IOException {
+ if (val != null && state.isFirst) {
+ state.stringBuilder.append(scale).append(",").append(val).append(",");
+ state.isFirst = false;
+ } else if (val != null) {
+ state.stringBuilder.append(val).append(",");
}
+}
- //获取处理数据执行单位输出的数据
- public void deserialize(State state, DataInputStream in) {
- try {
- String string = in.readUTF();
- state.scale = Integer.parseInt(String.valueOf(string.charAt(0)));
- StringBuilder stringBuilder = new
StringBuilder(string.substring(2));
- state.stringBuilder = stringBuilder;
- } catch (Exception e) {
- log.info("反序列化异常:" + e.getMessage());
- }
- }
+//处理数据完需要输出等待聚合
+public void serialize(State state, DataOutputStream out) throws IOException {
+ //目前暂时只提供 DataOutputStream,如果需要序列化对象可以考虑拼接字符串,转换 json,序列化成字节数组等方式
+ //如果要序列化 State 对象,可能需要自己将 State 内部类实现序列化接口
+ //最终都是要通过 DataOutputStream 传输
+ out.writeUTF(state.stringBuilder.toString());
+}
- //聚合执行单位按照聚合条件合并某一个键下数据的处理结果 ,每个键第一次合并时,state1 参数是初始化的实例
- public void merge(State state1, State state2) {
- try {
- state1.scale = state2.scale;
- state1.stringBuilder.append(state2.stringBuilder.toString());
- } catch (Exception e) {
- log.info("合并结果异常:" + e.getMessage());
- }
- }
+//获取处理数据执行单位输出的数据
+public void deserialize(State state, DataInputStream in) throws IOException {
+ String string = in.readUTF();
+ state.scale = Integer.parseInt(String.valueOf(string.charAt(0)));
+ StringBuilder stringBuilder = new StringBuilder(string.substring(2));
+ state.stringBuilder = stringBuilder;
+}
- //对每个键合并后的数据进行并输出最终结果
- public Double getValue(State state) {
- try {
- String[] strings = state.stringBuilder.toString().split(",");
- double[] doubles = new double[strings.length + 1];
- doubles =
Arrays.stream(strings).mapToDouble(Double::parseDouble).toArray();
+//聚合执行单位按照聚合条件合并某一个键下数据的处理结果 ,每个键第一次合并时,state1 参数是初始化的实例
+public void merge(State state1, State state2) throws IOException {
+ state1.scale = state2.scale;
+ state1.stringBuilder.append(state2.stringBuilder.toString());
+}
- Arrays.sort(doubles);
- double n = doubles.length - 1;
- double index = n * 0.5;
+//对每个键合并后的数据进行并输出最终结果
+public Double getValue(State state) throws IOException {
+ String[] strings = state.stringBuilder.toString().split(",");
+    double[] doubles = Arrays.stream(strings).mapToDouble(Double::parseDouble).toArray();
- int low = (int) Math.floor(index);
- int high = (int) Math.ceil(index);
+ Arrays.sort(doubles);
+ double n = doubles.length - 1;
+ double index = n * 0.5;
- double value = low == high ? (doubles[low] + doubles[high]) * 0.5
: doubles[high];
+ int low = (int) Math.floor(index);
+ int high = (int) Math.ceil(index);
- BigDecimal decimal = new BigDecimal(value);
- return decimal.setScale(state.scale,
BigDecimal.ROUND_HALF_UP).doubleValue();
- } catch (Exception e) {
- log.info("计算异常:" + e.getMessage());
- }
- return 0.0;
- }
+ double value = low == high ? (doubles[low] + doubles[high]) * 0.5 :
doubles[high];
- //每个执行单位执行完都会执行
- public void destroy(State state) {
- }
+ BigDecimal decimal = new BigDecimal(value);
+ return decimal.setScale(state.scale,
BigDecimal.ROUND_HALF_UP).doubleValue();
+}
+//每个执行单位执行完都会执行
+public void destroy(State state) {
}
+}
```
-## 最佳实践
-
-### Static 变量加载
+</details>
-当前在 Doris 中,执行一个 UDF 函数,例如 `select udf(col) from table`, 每一个并发 Instance 会加载一次
udf.jar 包,在该 instance 结束时卸载掉 udf.jar 包。
-所以当 udf.jar 文件中需要加载一个几百 MB 的文件时,会因为并发的原因,使得占据的内存急剧增大,容易 OOM。
+2. 在 Doris 中注册创建 Java-UDAF 函数。更多语法帮助可参阅 [CREATE
FUNCTION](../../sql-manual/sql-statements/Data-Definition-Statements/Create/CREATE-FUNCTION.md).
-解决方法是可以将资源加载代码拆分开,单独生成一个 jar 包文件,其他包直接引用该资源 jar 包。
+ ```sql
+ CREATE AGGREGATE FUNCTION simple_demo(INT) RETURNS INT PROPERTIES (
+ "file"="file:///pathTo/java-udaf.jar",
+ "symbol"="org.apache.doris.udf.SimpleDemo",
+ "always_nullable"="true",
+ "type"="JAVA_UDF"
+ );
+ ```
-假设已经拆分为了 DictLibrary 和 FunctionUdf 两个文件。
+3. 使用 Java-UDAF, 可以分组聚合或者聚合全部结果:
-1. 单独编译 DictLibrary 文件,使其生成一个独立的 jar 包,这样可以得到一个资源文件 DictLibrary.jar:
+ ```sql
+ select simple_demo(id) from test_table group by id;
+ +-----------------+
+ | simple_demo(id) |
+ +-----------------+
+ | 1 |
+ | 6 |
+ +-----------------+
+ ```
- ```shell
- javac ./DictLibrary.java
- jar -cf ./DictLibrary.jar ./DictLibrary.class
+ ```sql
+ select simple_demo(id) from test_table;
+ +-----------------+
+ | simple_demo(id) |
+ +-----------------+
+ | 7 |
+ +-----------------+
```
- ```java
- public class DictLibrary {
- private static HashMap<String, String> res = new HashMap<>();
-
- static {
- // suppose we built this dictionary from a certain local file.
- res.put("key1", "value1");
- res.put("key2", "value2");
- res.put("key3", "value3");
- res.put("0", "value4");
- res.put("1", "value5");
- res.put("2", "value6");
- }
+### Java-UDTF 实例介绍
+:::tip
+UDTF 自 Doris 3.0 版本开始支持
+:::
+
+1. 首先编写对应的 Java UDTF 代码,打包生成 JAR 包。
+UDTF 和 UDF 函数一样,需要用户自主实现一个 `evaluate` 方法,但是 UDTF 函数的返回值必须是 Array 类型。
- public static String evaluate(String key) {
- if (key == null) {
+ ```JAVA
+ public class UDTFStringTest {
+ public ArrayList<String> evaluate(String value, String separator) {
+ if (value == null || separator == null) {
return null;
+ } else {
+ return new ArrayList<>(Arrays.asList(value.split(separator)));
}
- return res.get(key);
}
}
```
- ```java
- public class FunctionUdf {
- public String evaluate(String key) {
- String value = DictLibrary.evaluate(key);
- return value;
+2. 在 Doris 中注册创建 Java-UDTF 函数。此时会注册两个 UDTF 函数,另一个是在函数名后面加上 `_outer` 后缀,带 `_outer` 后缀的版本是针对结果为 0 行时的特殊处理,具体可查看[OUTER
组合器](../../sql-manual/sql-functions/table-functions/explode-numbers-outer.md)。
+更多语法帮助可参阅 [CREATE
FUNCTION](../../sql-manual/sql-statements/Data-Definition-Statements/Create/CREATE-FUNCTION.md).
+
+ ```sql
+    CREATE TABLES FUNCTION java_utdf(string, string) RETURNS array<string> PROPERTIES (
+ "file"="file:///pathTo/java-udtf.jar",
+ "symbol"="org.apache.doris.udf.demo.UDTFStringTest",
+ "always_nullable"="true",
+ "type"="JAVA_UDF"
+ );
+ ```
+
+3. 使用 Java-UDTF。在 Doris 中使用 UDTF 需要结合 [Lateral View](../lateral-view.md),实现行转列的效果:
+
+ ```sql
+ select id, str, e1 from test_table lateral view java_utdf(str,',') tmp as
e1;
+ +------+-------+------+
+ | id | str | e1 |
+ +------+-------+------+
+ | 1 | a,b,c | a |
+ | 1 | a,b,c | b |
+ | 1 | a,b,c | c |
+ | 6 | d,e | d |
+ | 6 | d,e | e |
+ +------+-------+------+
+ ```
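UDTF 中 `evaluate` 的切分逻辑本身可以脱离 Doris 在本地先行验证,下面是一个最小示意(类名为示例假设):

```java
import java.util.ArrayList;
import java.util.Arrays;

// 本地验证 UDTF 的切分逻辑(示意,非 Doris 执行环境)
public class UDTFStringTestLocal {
    // 与文中 UDTF 相同的 evaluate 逻辑:空值返回 null,否则按分隔符切分
    public static ArrayList<String> evaluate(String value, String separator) {
        if (value == null || separator == null) {
            return null;
        }
        return new ArrayList<>(Arrays.asList(value.split(separator)));
    }

    public static void main(String[] args) {
        System.out.println(evaluate("a,b,c", ","));
        System.out.println(evaluate("d,e", ","));
    }
}
```

确认切分结果符合预期后,再打包注册到 Doris 中结合 Lateral View 使用。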
+
+## 最佳实践
+
+*Static 变量加载*
+
+当前在 Doris 中,执行一个 UDF 函数,例如 `select udf(col) from table`, 每一个并发 Instance 会加载一次
udf.jar 包,在该 Instance 结束时卸载掉 udf.jar 包。
+
+所以当 udf.jar 文件中需要加载一个几百 MB 的文件时,会因为并发的原因,使得占据的内存急剧增大,容易 OOM。
+或者想使用一个连接池时,这样无法做到仅在 static 区域初始化一次。
+
+这里提供两个解决方案,其中方案二需要 Doris 版本在 branch-3.0 及以上。
+
+*解决方案 1:*
+
+可以将资源加载代码拆分开,单独生成一个 JAR 包文件,然后其他包直接引用该资源 JAR 包。
+
+假设已经将代码拆分为了 DictLibrary 和 FunctionUdf 两个文件。
+
+```java
+public class DictLibrary {
+ private static HashMap<String, String> res = new HashMap<>();
+
+ static {
+ // suppose we built this dictionary from a certain local file.
+ res.put("key1", "value1");
+ res.put("key2", "value2");
+ res.put("key3", "value3");
+ res.put("0", "value4");
+ res.put("1", "value5");
+ res.put("2", "value6");
+ }
+
+ public static String evaluate(String key) {
+ if (key == null) {
+ return null;
}
+ return res.get(key);
+ }
+}
+```
+
+```java
+public class FunctionUdf {
+ public String evaluate(String key) {
+ String value = DictLibrary.evaluate(key);
+ return value;
}
+}
+```
+
+1. 单独编译 DictLibrary 文件,使其生成一个独立的 JAR 包,这样可以得到一个资源文件包 DictLibrary.jar:
+
+ ```shell
+ javac ./DictLibrary.java
+ jar -cf ./DictLibrary.jar ./DictLibrary.class
```
-2. 编译 FunctionUdf 文件,可以直接引用上一步的到的资源包,这样可以得到 UDF 的 FunctionUdf.jar 包。
+2. 编译 FunctionUdf 文件,需要引用上一步得到的资源包作为库使用,这样打包后可以得到 UDF 的 FunctionUdf.jar 包。
```shell
javac -cp ./DictLibrary.jar ./FunctionUdf.java
jar -cvf ./FunctionUdf.jar ./FunctionUdf.class
```
-3. 经过上面两步之后,会得到两个 jar 包,由于想让资源 jar 包被所有的并发引用,所以需要将它放到指定路径 `fe/custom_lib` 和
`be/custom_lib` 下面,服务重启之后就可以随着 JVM 的启动加载进来。
+3. 由于想让资源 JAR 包被所有并发共享引用,可以让它被 JVM 直接加载:将其放到指定路径 `be/custom_lib` 下面,BE 服务重启之后就会随着 JVM 的启动加载进来,并随服务停止而释放。
-4. 最后利用 `CREATE FUNCTION` 语句创建一个 UDF 函数
+4. 最后利用 `CREATE FUNCTION` 语句创建一个 UDF 函数,这样每次卸载仅是 FunctionUdf.jar。
```sql
CREATE FUNCTION java_udf_dict(string) RETURNS string PROPERTIES (
+ "file"="file:///pathTo/FunctionUdf.jar",
"symbol"="org.apache.doris.udf.FunctionUdf",
"always_nullable"="true",
"type"="JAVA_UDF"
);
```
-使用该加载方式时,FunctionUdf.jar 和 DictLibrary.jar 都在 FE 和 BE 的 custom_lib
路径下,因此都会随着服务启动而加载,停止而释放,不再需要指定 File 的路径。
+*解决方案 2:*
-也可以使用 file:/// 方式自定义 FunctionUdf.jar 的路径,但是 DictLibrary.jar 只能放在 custom_lib 下。
+BE 全局缓存 JAR 包,并支持自定义过期淘汰时间。在 CREATE FUNCTION 时增加两个属性字段,其中
+`static_load`: 用于定义是否使用静态 cache 加载的方式。
-## 使用须知
+`expiration_time`: 用于定义 JAR 包的过期时间,单位为分钟。
-1. 不支持复杂数据类型(HLL,Bitmap)。
+若使用静态 cache 加载方式,则在第一次调用该 UDF 函数时,在初始化之后会将该 UDF 的实例缓存起来,在下次调用该 UDF 时,首先会在
cache 中进行查找,如果没有找到,则会进行相关初始化操作。
-2. 当前允许用户自己指定 JVM 最大堆大小,配置项是 be.conf 中的 `JAVA_OPTS` 的 -Xmx 部分。默认
1024m,如果需要聚合数据,建议调大一些,增加性能,减少内存溢出风险。
+并且后台有线程定期检查:如果在配置的过期淘汰时间内一直没有被调用过,则会从 cache 中清理掉;如果被调用,则会自动更新缓存时间点。
-3. Char 类型的 UDF 在 `CREATE FUNCTION` 时需要使用 String 类型。
+```java
+public class Print extends UDF {
+ static Integer val = 0;
+ public Integer evaluate() {
+ val = val + 1;
+ return val;
+ }
+}
+```
-4. 由于 JVM 加载同名类的问题,不要同时使用多个同名类作为 UDF 实现,如果想更新某个同名类的 UDF,需要重启 BE 重新加载 Classpath。
+```sql
+CREATE FUNCTION print_12() RETURNS int
+PROPERTIES (
+ "file" = "file:///path/to/java-udf-demo-jar-with-dependencies.jar",
+ "symbol" = "org.apache.doris.udf.Print",
+ "always_nullable"="true",
+ "type" = "JAVA_UDF",
+    "static_load" = "true",  -- default value is false
+    "expiration_time" = "60"  -- default value is 360 minutes
+);
+```
+可以看到结果是一直在递增的,证明加载的 JAR 包没有被卸载后重新加载;否则变量会被重新初始化为 0,结果将始终是 1。
+```sql
+mysql [test_query_qa]>select print_12();
++------------+
+| print_12() |
++------------+
+| 1 |
++------------+
+1 row in set (0.40 sec)
+
+mysql [test_query_qa]>select print_12();
++------------+
+| print_12() |
++------------+
+| 2 |
++------------+
+1 row in set (0.03 sec)
+
+mysql [test_query_qa]>select print_12();
++------------+
+| print_12() |
++------------+
+| 3 |
++------------+
+1 row in set (0.04 sec)
+```
diff --git
a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/query-data/udf/java-user-defined-function.md
b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/query-data/udf/java-user-defined-function.md
index 01940c1b083..611d4367754 100644
---
a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/query-data/udf/java-user-defined-function.md
+++
b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/query-data/udf/java-user-defined-function.md
@@ -1,6 +1,6 @@
---
{
-"title": "Java UDF",
+"title": "Java UDF, UDAF, UDTF",
"language": "zh-CN"
}
---
@@ -24,73 +24,15 @@ specific language governing permissions and limitations
under the License.
-->
-
-## Java UDF 介绍
-
+## 概述
Java UDF 为用户提供 UDF 编写的 Java 接口,以方便用户使用 Java 语言进行自定义函数的执行。
-
Doris 支持使用 JAVA 编写 UDF、UDAF 和 UDTF。下文如无特殊说明,使用 UDF 统称所有用户自定义函数。
-
-## 创建 UDF
-
-实现的 jar 包可以放在本地也可以存放在远程服务端通过 HTTP 下载,但必须让每个 FE 和 BE 节点都能获取到 jar 包。
-
-否则将会返回错误状态信息 `Couldn't open file ......`。
-
-更多语法帮助可参阅 [CREATE
FUNCTION](../../sql-manual/sql-statements/function/CREATE-FUNCTION).
-
-### UDF
-
-```sql
-CREATE FUNCTION java_udf_add_one(int) RETURNS int PROPERTIES (
- "file"="file:///path/to/java-udf-demo-jar-with-dependencies.jar",
- "symbol"="org.apache.doris.udf.AddOne",
- "always_nullable"="true",
- "type"="JAVA_UDF"
-);
-```
-
-### UDAF
-
-```sql
-CREATE AGGREGATE FUNCTION middle_quantiles(DOUBLE,INT) RETURNS DOUBLE
PROPERTIES (
- "file"="file:///pathTo/java-udaf.jar",
- "symbol"="org.apache.doris.udf.demo.MiddleNumberUDAF",
- "always_nullable"="true",
- "type"="JAVA_UDF"
-);
-```
-
-### UDTF
-
-:::tip
-UDTF 自 Doris 3.0 版本开始支持
-:::
-
-```sql
-CREATE TABLES FUNCTION java-utdf(string, string) RETURNS array<string>
PROPERTIES (
- "file"="file:///pathTo/java-udtf.jar",
- "symbol"="org.apache.doris.udf.demo.UDTFStringTest",
- "always_nullable"="true",
- "type"="JAVA_UDF"
-);
-```
-
-## 使用 UDF
-
-用户使用 UDF 必须拥有对应数据库的 `SELECT` 权限。
-
-UDF 的使用与普通的函数方式一致,唯一的区别在于,内置函数的作用域是全局的,而 UDF 的作用域是 DB 内部。
-
-当链接 Session 位于数据内部时,直接使用 UDF 名字会在当前 DB 内部查找对应的 UDF。否则用户需要显示的指定 UDF 的数据库名字,例如
`dbName.funcName`。
-
-## 删除 UDF
-
-当你不再需要 UDF 函数时,你可以通过下述命令来删除一个 UDF 函数,可以参考 [DROP
FUNCTION](../../sql-manual/sql-statements/function/DROP-FUNCTION)
+1. Java UDF 是较为常见的自定义标量函数 (Scalar Function),即每输入一行数据,就会有一行对应的结果输出,常见的有 ABS、LENGTH 等。值得一提的是,对于用户来讲,Hive UDF 是可以直接迁移至 Doris 的。
+2. Java UDAF 即为自定义的聚合函数 (Aggregate Function),即在输入多行数据进行聚合后,仅输出一行对应的结果,常见的有 MIN、MAX、COUNT 等。
+3. Java UDTF 即为自定义的表函数 (Table Function),即每输入一行数据,可以产生一行或多行的结果,在 Doris 中需要结合 Lateral View 使用以达到行转列的效果,常见的有 EXPLODE、EXPLODE_SPLIT 等。
## 类型对应关系
-
| Doris 数据类型 | Java UDF 参数类型 |
| ---------------- | ------------------------------------------ |
| Bool | Boolean |
@@ -119,37 +61,89 @@ UDF 的使用与普通的函数方式一致,唯一的区别在于,内置函
在创建函数时,请务必使用 `string` 类型而不是 `varchar`,否则可能会导致函数执行失败。
:::
-## UDF 的编写
+## 使用限制
-本小节主要介绍如何开发一个 Java UDF。在 `samples/doris-demo/java-udf-demo/`
下提供示例,可点击查看[参考示例](https://github.com/apache/doris/tree/master/samples/doris-demo/java-udf-demo)
+1. 不支持复杂数据类型(HLL,Bitmap)。
+2. 当前允许用户自己指定 JVM 最大堆大小,配置项是 be.conf 中 `JAVA_OPTS` 的 -Xmx 部分。默认 1024m,如果需要聚合数据,建议调大一些,以提升性能并减少内存溢出风险。
+3. 由于 JVM 加载同名类的问题,不要同时使用多个同名类作为 UDF 实现,如果想更新某个同名类的 UDF,需要重启 BE 重新加载 classpath。
-使用 Java 代码编写 UDF,UDF 的主入口必须为 `evaluate` 函数。这一点与 Hive 等其他引擎保持一致。在本示例中,我们编写了
`AddOne` UDF 来完成对整型输入进行加一的操作。
-值得一提的是,本例不只是 Doris 支持的 Java UDF,同时还是 Hive 支持的 UDF,也就是说,对于用户来讲,Hive UDF
是可以直接迁移至 Doris 的。
+## 快速上手
+本小节主要介绍如何开发一个 Java UDF。在 `samples/doris-demo/java-udf-demo/`
下提供了示例可供参考,点击[这里](https://github.com/apache/doris/tree/master/samples/doris-demo/java-udf-demo)查看。
+
+UDF 的使用与普通的函数方式一致,唯一的区别在于,内置函数的作用域是全局的,而 UDF 的作用域是 DB 内部。
+所以如果当前连接 session 位于数据库 DB 内部时,直接使用 UDF 名字会在当前 DB 内部查找对应的 UDF。否则用户需要显式地指定 UDF 的数据库名字,例如 `dbName.funcName`。
-另外,如果定义的 UDF 中需要加载很大的资源文件,或者希望可以定义全局的 Static 变量,可以参照文档下方的 Static 变量加载方式。
+接下来章节中介绍的实例,均会在 `test_table` 上做测试,对应建表语句如下:
+```sql
+CREATE TABLE `test_table` (
+ id int NULL,
+ d1 double NULL,
+ str string NULL
+) ENGINE=OLAP
+DUPLICATE KEY(`id`)
+DISTRIBUTED BY HASH(`id`) BUCKETS 1
+PROPERTIES (
+"replication_num" = "1");
+
+insert into test_table values (1, 111.11, "a,b,c");
+insert into test_table values (6, 666.66, "d,e");
+```
-### UDF
+### Java-UDF 实例介绍
-```java
-public class AddOne extends UDF {
- public Integer evaluate(Integer value) {
- return value == null ? null : value + 1;
+使用 Java 代码编写 UDF,UDF 的主入口必须为 `evaluate` 函数。这一点与 Hive 等其他引擎保持一致。在本示例中,我们编写了
`AddOne` UDF 来完成对整型输入进行加一的操作。
+
+1. 首先编写对应的 Java 代码,打包生成 JAR 包。
+
+ ```java
+ public class AddOne extends UDF {
+ public Integer evaluate(Integer value) {
+ return value == null ? null : value + 1;
+ }
}
-}
-```
+ ```
+
+2. 在 Doris 中注册创建 Java-UDF 函数。更多语法帮助可参阅 [CREATE
FUNCTION](../../sql-manual/sql-statements/Data-Definition-Statements/Create/CREATE-FUNCTION.md).
+
+ ```sql
+ CREATE FUNCTION java_udf_add_one(int) RETURNS int PROPERTIES (
+ "file"="file:///path/to/java-udf-demo-jar-with-dependencies.jar",
+ "symbol"="org.apache.doris.udf.AddOne",
+ "always_nullable"="true",
+ "type"="JAVA_UDF"
+ );
+ ```
+
+3. 用户使用 UDF 必须拥有对应数据库的 `SELECT` 权限。
+ 如果想查看注册成功的对应 UDF 函数,可以使用[SHOW
FUNCTIONS](../../sql-manual/sql-statements/Show-Statements/SHOW-FUNCTIONS.md)
命令。
+
+ ``` sql
+ select id,java_udf_add_one(id) from test_table;
+ +------+----------------------+
+ | id | java_udf_add_one(id) |
+ +------+----------------------+
+ | 1 | 2 |
+ | 6 | 7 |
+ +------+----------------------+
+ ```
+
+4. 当不再需要 UDF 函数时,可以通过下述命令来删除一个 UDF 函数,可以参考 [DROP
FUNCTION](../../sql-manual/sql-statements/Data-Definition-Statements/Drop/DROP-FUNCTION.md)
+
+另外,如果定义的 UDF 中需要加载很大的资源文件,或者希望可以定义全局的 static 变量,可以参照文档下方的 static 变量加载方式。
-### UDAF
+### Java-UDAF 实例介绍
-在使用 Java 代码编写 UDAF 时,有一些必须实现的函数 (标记 `required`) 和一个内部类 State,下面将以一个具体的实例来说明。
+在使用 Java 代码编写 UDAF 时,有一些必须实现的函数 (标记 required) 和一个内部类 State,下面将以具体的实例来说明。
-**示例 1**
+1. 首先编写对应的 Java UDAF 代码,打包生成 JAR 包。
-下面的 SimpleDemo 将实现一个类似的 SUM 的简单函数,输入参数 INT,输出参数是 INT。
+<details>
+<summary> 示例 1: SimpleDemo 将实现一个类似 sum 的简单函数,输入参数 INT,输出参数是 INT</summary>
```java
-package org.apache.doris.udf.demo;
+package org.apache.doris.udf;
import java.io.DataInputStream;
import java.io.DataOutputStream;
@@ -158,88 +152,80 @@ import java.util.logging.Logger;
public class SimpleDemo {
- Logger log = Logger.getLogger("SimpleDemo");
+Logger log = Logger.getLogger("SimpleDemo");
- //Need an inner class to store data
- /*required*/
- public static class State {
- /*some variables if you need */
- public int sum = 0;
- }
+//Need an inner class to store data
+/*required*/
+public static class State {
+ /*some variables if you need */
+ public int sum = 0;
+}
- /*required*/
- public State create() {
- /* here could do some init work if needed */
- return new State();
- }
+/*required*/
+public State create() {
+ /* here could do some init work if needed */
+ return new State();
+}
- /*required*/
- public void destroy(State state) {
- /* here could do some destroy work if needed */
- }
+/*required*/
+public void destroy(State state) {
+ /* here could do some destroy work if needed */
+}
- /*Not Required*/
- public void reset(State state) {
- /*if you want this udaf function can work with window function.*/
- /*Must impl this, it will be reset to init state after calculate every
window frame*/
- state.sum = 0;
- }
+/*Not Required*/
+public void reset(State state) {
+ /*if you want this udaf function can work with window function.*/
+ /*Must impl this, it will be reset to init state after calculate every
window frame*/
+ state.sum = 0;
+}
- /*required*/
- //first argument is State, then other types your input
- public void add(State state, Integer val) throws Exception {
- /* here doing update work when input data*/
- if (val != null) {
- state.sum += val;
- }
+/*required*/
+//first argument is State, then other types your input
+public void add(State state, Integer val) throws Exception {
+ /* here doing update work when input data*/
+ if (val != null) {
+ state.sum += val;
}
+}
- /*required*/
- public void serialize(State state, DataOutputStream out) {
- /* serialize some data into buffer */
- try {
- out.writeInt(state.sum);
- } catch (Exception e) {
- /* Do not throw exceptions */
- log.info(e.getMessage());
- }
- }
+/*required*/
+public void serialize(State state, DataOutputStream out) throws IOException {
+ /* serialize some data into buffer */
+ out.writeInt(state.sum);
+}
- /*required*/
- public void deserialize(State state, DataInputStream in) {
- /* deserialize get data from buffer before you put */
- int val = 0;
- try {
- val = in.readInt();
- } catch (Exception e) {
- /* Do not throw exceptions */
- log.info(e.getMessage());
- }
- state.sum = val;
- }
+/*required*/
+public void deserialize(State state, DataInputStream in) throws IOException {
+ /* deserialize get data from buffer before you put */
+ int val = in.readInt();
+ state.sum = val;
+}
- /*required*/
- public void merge(State state, State rhs) throws Exception {
- /* merge data from state */
- state.sum += rhs.sum;
- }
+/*required*/
+public void merge(State state, State rhs) throws Exception {
+ /* merge data from state */
+ state.sum += rhs.sum;
+}
- /*required*/
- //return Type you defined
- public Integer getValue(State state) throws Exception {
- /* return finally result */
- return state.sum;
- }
+/*required*/
+//return Type you defined
+public Integer getValue(State state) throws Exception {
+ /* return finally result */
+ return state.sum;
+}
}
```
-**示例 2**
+</details>
+
+
+<details>
+<summary> 示例 2: MedianUDAF 是一个计算中位数的功能,输入类型为 (DOUBLE, INT), 输出为 DOUBLE
</summary>
```java
package org.apache.doris.udf.demo;
-
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.math.BigDecimal;
@@ -248,189 +234,299 @@ import java.util.logging.Logger;
/*UDAF 计算中位数*/
public class MedianUDAF {
- Logger log = Logger.getLogger("MedianUDAF");
-
- //状态存储
- public static class State {
- //返回结果的精度
- int scale = 0;
- //是否是某一个 tablet 下的某个聚合条件下的数据第一次执行 add 方法
- boolean isFirst = true;
- //数据存储
- public StringBuilder stringBuilder;
- }
+Logger log = Logger.getLogger("MedianUDAF");
+
+//状态存储
+public static class State {
+ //返回结果的精度
+ int scale = 0;
+ //是否是某一个 tablet 下的某个聚合条件下的数据第一次执行 add 方法
+ boolean isFirst = true;
+ //数据存储
+ public StringBuilder stringBuilder;
+}
- //状态初始化
- public State create() {
- State state = new State();
- //根据每个 tablet 下的聚合条件需要聚合的数据量大小,预先初始化,增加性能
- state.stringBuilder = new StringBuilder(1000);
- return state;
- }
+//状态初始化
+public State create() {
+ State state = new State();
+ //根据每个 tablet 下的聚合条件需要聚合的数据量大小,预先初始化,增加性能
+ state.stringBuilder = new StringBuilder(1000);
+ return state;
+}
- //处理执行单位处理各自 tablet 下的各自聚合条件下的每个数据
- public void add(State state, Double val, int scale) {
- try {
- if (val != null && state.isFirst) {
-
state.stringBuilder.append(scale).append(",").append(val).append(",");
- state.isFirst = false;
- } else if (val != null) {
- state.stringBuilder.append(val).append(",");
- }
- } catch (Exception e) {
- //如果不能保证一定不会异常,建议每个方法都最大化捕获异常,因为目前不支持处理 java 抛出的异常
- log.info("获取数据异常:" + e.getMessage());
- }
- }
-
- //处理数据完需要输出等待聚合
- public void serialize(State state, DataOutputStream out) {
- try {
- //目前暂时只提供 DataOutputStream,如果需要序列化对象可以考虑拼接字符串,转换 json,序列化成字节数组等方式
- //如果要序列化 State 对象,可能需要自己将 State 内部类实现序列化接口
- //最终都是要通过 DataOutputStream 传输
- out.writeUTF(state.stringBuilder.toString());
- } catch (Exception e) {
- log.info("序列化异常:" + e.getMessage());
- }
+//处理执行单位处理各自 tablet 下的各自聚合条件下的每个数据
+public void add(State state, Double val, int scale) throws IOException {
+ if (val != null && state.isFirst) {
+ state.stringBuilder.append(scale).append(",").append(val).append(",");
+ state.isFirst = false;
+ } else if (val != null) {
+ state.stringBuilder.append(val).append(",");
}
+}
- //获取处理数据执行单位输出的数据
- public void deserialize(State state, DataInputStream in) {
- try {
- String string = in.readUTF();
- state.scale = Integer.parseInt(String.valueOf(string.charAt(0)));
- StringBuilder stringBuilder = new
StringBuilder(string.substring(2));
- state.stringBuilder = stringBuilder;
- } catch (Exception e) {
- log.info("反序列化异常:" + e.getMessage());
- }
- }
+//处理数据完需要输出等待聚合
+public void serialize(State state, DataOutputStream out) throws IOException {
+ //目前暂时只提供 DataOutputStream,如果需要序列化对象可以考虑拼接字符串,转换 json,序列化成字节数组等方式
+ //如果要序列化 State 对象,可能需要自己将 State 内部类实现序列化接口
+ //最终都是要通过 DataOutputStream 传输
+ out.writeUTF(state.stringBuilder.toString());
+}
- //聚合执行单位按照聚合条件合并某一个键下数据的处理结果 ,每个键第一次合并时,state1 参数是初始化的实例
- public void merge(State state1, State state2) {
- try {
- state1.scale = state2.scale;
- state1.stringBuilder.append(state2.stringBuilder.toString());
- } catch (Exception e) {
- log.info("合并结果异常:" + e.getMessage());
- }
- }
+//获取处理数据执行单位输出的数据
+public void deserialize(State state, DataInputStream in) throws IOException {
+ String string = in.readUTF();
+ state.scale = Integer.parseInt(String.valueOf(string.charAt(0)));
+ StringBuilder stringBuilder = new StringBuilder(string.substring(2));
+ state.stringBuilder = stringBuilder;
+}
- //对每个键合并后的数据进行并输出最终结果
- public Double getValue(State state) {
- try {
- String[] strings = state.stringBuilder.toString().split(",");
- double[] doubles = new double[strings.length + 1];
- doubles =
Arrays.stream(strings).mapToDouble(Double::parseDouble).toArray();
+//聚合执行单位按照聚合条件合并某一个键下数据的处理结果 ,每个键第一次合并时,state1 参数是初始化的实例
+public void merge(State state1, State state2) throws IOException {
+ state1.scale = state2.scale;
+ state1.stringBuilder.append(state2.stringBuilder.toString());
+}
- Arrays.sort(doubles);
- double n = doubles.length - 1;
- double index = n * 0.5;
+//对每个键合并后的数据进行并输出最终结果
+public Double getValue(State state) throws IOException {
+ String[] strings = state.stringBuilder.toString().split(",");
+    double[] doubles = Arrays.stream(strings).mapToDouble(Double::parseDouble).toArray();
- int low = (int) Math.floor(index);
- int high = (int) Math.ceil(index);
+ Arrays.sort(doubles);
+ double n = doubles.length - 1;
+ double index = n * 0.5;
- double value = low == high ? (doubles[low] + doubles[high]) * 0.5
: doubles[high];
+ int low = (int) Math.floor(index);
+ int high = (int) Math.ceil(index);
- BigDecimal decimal = new BigDecimal(value);
- return decimal.setScale(state.scale,
BigDecimal.ROUND_HALF_UP).doubleValue();
- } catch (Exception e) {
- log.info("计算异常:" + e.getMessage());
- }
- return 0.0;
- }
+ double value = low == high ? (doubles[low] + doubles[high]) * 0.5 :
doubles[high];
- //每个执行单位执行完都会执行
- public void destroy(State state) {
- }
+ BigDecimal decimal = new BigDecimal(value);
+ return decimal.setScale(state.scale,
BigDecimal.ROUND_HALF_UP).doubleValue();
+}
+//每个执行单位执行完都会执行
+public void destroy(State state) {
}
+}
```
-## 最佳实践
+</details>
-### Static 变量加载
-当前在 Doris 中,执行一个 UDF 函数,例如 `select udf(col) from table`, 每一个并发 Instance 会加载一次
udf.jar 包,在该 instance 结束时卸载掉 udf.jar 包。
+2. 在 Doris 中注册创建 Java-UDAF 函数。更多语法帮助可参阅 [CREATE FUNCTION](../../sql-manual/sql-statements/Data-Definition-Statements/Create/CREATE-FUNCTION.md)。
-所以当 udf.jar 文件中需要加载一个几百 MB 的文件时,会因为并发的原因,使得占据的内存急剧增大,容易 OOM。
-
-解决方法是可以将资源加载代码拆分开,单独生成一个 jar 包文件,其他包直接引用该资源 jar 包。
+ ```sql
+ CREATE AGGREGATE FUNCTION simple_demo(INT) RETURNS INT PROPERTIES (
+ "file"="file:///pathTo/java-udaf.jar",
+ "symbol"="org.apache.doris.udf.SimpleDemo",
+ "always_nullable"="true",
+ "type"="JAVA_UDF"
+ );
+ ```
-假设已经拆分为了 DictLibrary 和 FunctionUdf 两个文件。
+3. 使用 Java-UDAF 时,可以分组聚合,也可以聚合全部结果:
-1. 单独编译 DictLibrary 文件,使其生成一个独立的 jar 包,这样可以得到一个资源文件 DictLibrary.jar:
+ ```sql
+ select simple_demo(id) from test_table group by id;
+ +-----------------+
+ | simple_demo(id) |
+ +-----------------+
+ | 1 |
+ | 6 |
+ +-----------------+
+ ```
- ```shell
- javac ./DictLibrary.java
- jar -cf ./DictLibrary.jar ./DictLibrary.class
+ ```sql
+ select simple_demo(id) from test_table;
+ +-----------------+
+ | simple_demo(id) |
+ +-----------------+
+ | 7 |
+ +-----------------+
```
- ```java
- public class DictLibrary {
- private static HashMap<String, String> res = new HashMap<>();
-
- static {
- // suppose we built this dictionary from a certain local file.
- res.put("key1", "value1");
- res.put("key2", "value2");
- res.put("key3", "value3");
- res.put("0", "value4");
- res.put("1", "value5");
- res.put("2", "value6");
- }
+### Java-UDTF 实例介绍
+:::tip
+UDTF 自 Doris 3.0 版本开始支持。
+:::
+
+1. 首先编写对应的 Java UDTF 代码,打包生成 JAR 包。
+UDTF 和 UDF 函数一样,需要用户自主实现一个 `evaluate` 方法,但是 UDTF 函数的返回值必须是 Array 类型。
- public static String evaluate(String key) {
- if (key == null) {
+    ```java
+ public class UDTFStringTest {
+ public ArrayList<String> evaluate(String value, String separator) {
+ if (value == null || separator == null) {
return null;
+ } else {
+ return new ArrayList<>(Arrays.asList(value.split(separator)));
}
- return res.get(key);
}
}
```
- ```java
- public class FunctionUdf {
- public String evaluate(String key) {
- String value = DictLibrary.evaluate(key);
- return value;
+2. 在 Doris 中注册创建 Java-UDTF 函数。此时会注册两个 UDTF 函数,另一个是在函数名后面加上 `_outer` 后缀的版本,用于对结果为 0 行时做特殊处理,具体可查看 [OUTER 组合器](../../sql-manual/sql-functions/table-functions/explode-numbers-outer.md)。
+更多语法帮助可参阅 [CREATE FUNCTION](../../sql-manual/sql-statements/Data-Definition-Statements/Create/CREATE-FUNCTION.md)。
+
+ ```sql
+    CREATE TABLES FUNCTION java_udtf(string, string) RETURNS array<string> PROPERTIES (
+ "file"="file:///pathTo/java-udtf.jar",
+ "symbol"="org.apache.doris.udf.demo.UDTFStringTest",
+ "always_nullable"="true",
+ "type"="JAVA_UDF"
+ );
+ ```
+
+3. 使用 Java-UDTF 时,需要结合 [Lateral View](../lateral-view.md) 实现行转列的效果:
+
+ ```sql
+    select id, str, e1 from test_table lateral view java_udtf(str, ',') tmp as e1;
+ +------+-------+------+
+ | id | str | e1 |
+ +------+-------+------+
+ | 1 | a,b,c | a |
+ | 1 | a,b,c | b |
+ | 1 | a,b,c | c |
+ | 6 | d,e | d |
+ | 6 | d,e | e |
+ +------+-------+------+
+ ```
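下面用一个假设性的查询示意 `_outer` 后缀的作用(假设上文的 UDTF 已注册为 `java_udtf`,并向 `test_table` 插入了一行 `str` 为 NULL 的数据):

```sql
insert into test_table values (9, 999.99, null);

-- evaluate 对 NULL 输入返回 null(即 0 行结果),普通 lateral view 会丢弃该行
select id, e1 from test_table lateral view java_udtf(str, ',') tmp as e1 where id = 9;

-- 带 _outer 后缀的版本会保留该行,并将 e1 补为 NULL
select id, e1 from test_table lateral view java_udtf_outer(str, ',') tmp as e1 where id = 9;
```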
+
+## 最佳实践
+
+*Static 变量加载*
+
+当前在 Doris 中,执行一个 UDF 函数,例如 `select udf(col) from table`, 每一个并发 Instance 会加载一次
udf.jar 包,在该 Instance 结束时卸载掉 udf.jar 包。
+
+所以当 udf.jar 文件中需要加载一个几百 MB 的文件时,会因为并发的原因,使得占据的内存急剧增大,容易 OOM。
+同样地,如果想使用一个连接池,这种按并发加载的方式也无法做到仅在 static 区域初始化一次。
+
+这里提供两个解决方案,其中方案二需要 Doris 3.0 及以上版本。
+
+*解决方案 1:*
+
+可以将资源加载代码拆分开,单独生成一个 JAR 包文件,然后其他包直接引用该资源 JAR 包。
+
+假设已经将代码拆分为了 DictLibrary 和 FunctionUdf 两个文件。
+
+```java
+public class DictLibrary {
+ private static HashMap<String, String> res = new HashMap<>();
+
+ static {
+ // suppose we built this dictionary from a certain local file.
+ res.put("key1", "value1");
+ res.put("key2", "value2");
+ res.put("key3", "value3");
+ res.put("0", "value4");
+ res.put("1", "value5");
+ res.put("2", "value6");
+ }
+
+ public static String evaluate(String key) {
+ if (key == null) {
+ return null;
}
+ return res.get(key);
}
+}
+```
+
+```java
+public class FunctionUdf {
+ public String evaluate(String key) {
+ String value = DictLibrary.evaluate(key);
+ return value;
+ }
+}
+```
+
+1. 单独编译 DictLibrary 文件,使其生成一个独立的 JAR 包,这样可以得到一个资源文件包 DictLibrary.jar:
+
+ ```shell
+ javac ./DictLibrary.java
+ jar -cf ./DictLibrary.jar ./DictLibrary.class
```
-2. 编译 FunctionUdf 文件,可以直接引用上一步的到的资源包,这样可以得到 UDF 的 FunctionUdf.jar 包。
+2. 编译 FunctionUdf 文件,引用上一步得到的资源包作为库使用,这样打包后可以得到 UDF 的 FunctionUdf.jar 包。
```shell
javac -cp ./DictLibrary.jar ./FunctionUdf.java
jar -cvf ./FunctionUdf.jar ./FunctionUdf.class
```
-3. 经过上面两步之后,会得到两个 jar 包,由于想让资源 jar 包被所有的并发引用,所以需要将它放到指定路径 `fe/custom_lib` 和
`be/custom_lib` 下面,服务重启之后就可以随着 JVM 的启动加载进来。
+3. 由于想让资源 JAR 包被所有并发共享引用,需要让它被 JVM 直接加载:将它放到指定路径 `be/custom_lib` 下面,BE 服务重启之后它就会随着 JVM 的启动而加载,并随服务停止而释放。
-4. 最后利用 `CREATE FUNCTION` 语句创建一个 UDF 函数
+4. 最后利用 `CREATE FUNCTION` 语句创建一个 UDF 函数,这样每次卸载的仅是 FunctionUdf.jar。
```sql
CREATE FUNCTION java_udf_dict(string) RETURNS string PROPERTIES (
+ "file"="file:///pathTo/FunctionUdf.jar",
"symbol"="org.apache.doris.udf.FunctionUdf",
"always_nullable"="true",
"type"="JAVA_UDF"
);
```
-使用该加载方式时,FunctionUdf.jar 和 DictLibrary.jar 都在 FE 和 BE 的 custom_lib
路径下,因此都会随着服务启动而加载,停止而释放,不再需要指定 File 的路径。
+*解决方案 2:*
-也可以使用 file:/// 方式自定义 FunctionUdf.jar 的路径,但是 DictLibrary.jar 只能放在 custom_lib 下。
+BE 全局缓存 JAR 包,并可自定义过期淘汰时间。为此,在 CREATE FUNCTION 时增加了两个属性字段:
+static_load:用于定义是否使用静态 cache 加载的方式。
-## 使用须知
+expiration_time: 用于定义 JAR 包的过期时间,单位为分钟。
-1. 不支持复杂数据类型(HLL,Bitmap)。
+若使用静态 cache 加载方式,则在第一次调用该 UDF 函数时,在初始化之后会将该 UDF 的实例缓存起来,在下次调用该 UDF 时,首先会在
cache 中进行查找,如果没有找到,则会进行相关初始化操作。
-2. 当前允许用户自己指定 JVM 最大堆大小,配置项是 be.conf 中的 `JAVA_OPTS` 的 -Xmx 部分。默认
1024m,如果需要聚合数据,建议调大一些,增加性能,减少内存溢出风险。
+并且后台有线程定期检查:如果在配置的过期淘汰时间内一直没有被调用,则会从 cache 中清理掉;如果被调用,则会自动更新缓存的时间点。
-3. Char 类型的 UDF 在 `CREATE FUNCTION` 时需要使用 String 类型。
+```java
+public class Print extends UDF {
+ static Integer val = 0;
+ public Integer evaluate() {
+ val = val + 1;
+ return val;
+ }
+}
+```
-4. 由于 JVM 加载同名类的问题,不要同时使用多个同名类作为 UDF 实现,如果想更新某个同名类的 UDF,需要重启 BE 重新加载 Classpath。
+```sql
+CREATE FUNCTION print_12() RETURNS int
+PROPERTIES (
+ "file" = "file:///path/to/java-udf-demo-jar-with-dependencies.jar",
+ "symbol" = "org.apache.doris.udf.Print",
+ "always_nullable"="true",
+ "type" = "JAVA_UDF",
+    "static_load" = "true", -- default value is false
+    "expiration_time" = "60" -- default value is 360 minutes
+);
+```
+可以看到查询结果一直在递增,证明缓存的 JAR 包没有经历卸载后重新加载,否则静态变量会被重新初始化为 0。
+```sql
+mysql [test_query_qa]>select print_12();
++------------+
+| print_12() |
++------------+
+| 1 |
++------------+
+1 row in set (0.40 sec)
+
+mysql [test_query_qa]>select print_12();
++------------+
+| print_12() |
++------------+
+| 2 |
++------------+
+1 row in set (0.03 sec)
+
+mysql [test_query_qa]>select print_12();
++------------+
+| print_12() |
++------------+
+| 3 |
++------------+
+1 row in set (0.04 sec)
+```
diff --git
a/versioned_docs/version-2.1/query-data/udf/java-user-defined-function.md
b/versioned_docs/version-2.1/query-data/udf/java-user-defined-function.md
index df5cf6c63a4..3c0d91b2c6b 100644
--- a/versioned_docs/version-2.1/query-data/udf/java-user-defined-function.md
+++ b/versioned_docs/version-2.1/query-data/udf/java-user-defined-function.md
@@ -1,6 +1,6 @@
---
{
-"title": "Java UDF",
+"title": "Java UDF, UDAF, UDTF",
"language": "en"
}
---
@@ -25,59 +25,15 @@ under the License.
-->
-## Introduction to Java UDF
+## Overview
+Java UDF provides a Java interface for users to implement user-defined
functions (UDFs) conveniently using the Java programming language.
+Doris supports the use of Java to develop UDFs, UDAFs, and UDTFs. Unless
otherwise specified, "UDF" in the following text refers to all types of
user-defined functions.
-Java UDF provides users with a Java interface for writing UDFs, making it
convenient for users to execute custom functions using the Java language.
+1. Java UDF: A Java UDF is a commonly used scalar function, where each input
row produces a corresponding output row. Common examples include ABS and
LENGTH. Notably, Hive UDFs can be directly migrated to Doris, which is
convenient for users.
-Doris supports writing UDFs, UDAFs, and UDTFs using JAVA. Unless otherwise
specified, UDF is used as a general term for all user-defined functions in the
following text.
+2. Java UDAF: A Java UDAF is a user-defined aggregate function that aggregates
multiple input rows into a single output row. Common examples include MIN, MAX,
and COUNT.
-## Creating UDF
-
-The implemented jar package can be placed locally or stored on a remote server
for download via HTTP, but each FE and BE node must be able to access the jar
package.
-
-Otherwise, an error message `Couldn't open file ......` will be returned.
-
-For more syntax help, refer to [CREATE
FUNCTION](../../sql-manual/sql-statements/function/CREATE-FUNCTION.md).
-
-### UDF
-
-```sql
-CREATE FUNCTION java_udf_add_one(int) RETURNS int PROPERTIES (
- "file"="file:///path/to/java-udf-demo-jar-with-dependencies.jar",
- "symbol"="org.apache.doris.udf.AddOne",
- "always_nullable"="true",
- "type"="JAVA_UDF"
-);
-```
-
-### UDAF
-
-```sql
-CREATE AGGREGATE FUNCTION middle_quantiles(DOUBLE,INT) RETURNS DOUBLE
PROPERTIES (
- "file"="file:///pathTo/java-udaf.jar",
- "symbol"="org.apache.doris.udf.demo.MiddleNumberUDAF",
- "always_nullable"="true",
- "type"="JAVA_UDF"
-);
-```
-
-### UDTF
-
-:::tip
-UDTF is supported starting from Doris version 3.0.
-:::
-
-## Using UDF
-
-To utilize UDFs, users must possess the `SELECT` privilege for the
corresponding database.
-
-The usage of UDFs is identical to standard functions, with the primary
distinction being that built-in functions have a global scope, while UDFs are
scoped within the DB.
-
-When the session is linked within the database, directly using the UDF name
will search for the corresponding UDF within the current DB. Otherwise, users
must explicitly specify the UDF's database name, for example, `dbName.funcName`.
-
-## Dropping UDF
-
-If a UDF is no longer needed, it can be dropped using the following command,
as detailed in [DROP
FUNCTION](../../sql-manual/sql-statements/function/DROP-FUNCTION).
+3. Java UDTF: A Java UDTF is a user-defined table function, where a single
input row can generate one or multiple output rows. In Doris, UDTFs must be
used with Lateral View to achieve row-to-column transformations. Common
examples include EXPLODE and EXPLODE_SPLIT.
## Type Correspondence
@@ -108,36 +64,92 @@ If a UDF is no longer needed, it can be dropped using the
following command, as
When creating functions, avoid using `varchar` in place of `string`, as this
may cause the function to fail.
:::
-## Writing UDF
+## Usage Notes
+
+1. Complex data types (HLL, Bitmap) are not supported.
+
+2. Users are currently allowed to specify the maximum JVM heap size. The
configuration item is the `-Xmx` part of `JAVA_OPTS` in `be.conf`. The default
is 1024m. If you need to aggregate data, it is recommended to increase this
value to enhance performance and reduce the risk of memory overflow.
+
+3. Due to issues with JVM loading classes with the same name, do not use
multiple classes with the same name as UDF implementations simultaneously. If
you want to update a UDF with a class of the same name, you need to restart BE
to reload the classpath.
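For note 2 above, the heap size is raised by editing the `-Xmx` flag inside `JAVA_OPTS` in `be.conf`. A minimal sketch (illustrative only; your version's actual `JAVA_OPTS` line carries additional flags that should be kept):

```properties
# be.conf (illustrative): raise the UDF JVM max heap from the 1024m default
JAVA_OPTS="-Xmx2048m"
```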
+
+
+## Getting Started
This section mainly introduces how to develop a Java UDF. Examples are
provided in `samples/doris-demo/java-udf-demo/` for reference. Click
[here](https://github.com/apache/doris/tree/master/samples/doris-demo/java-udf-demo)
to view details.
-When writing a UDF in Java, the main entry point must be the `evaluate`
function. This is consistent with other engines like Hive. In this example, we
write an `AddOne` UDF to perform an increment operation on integer inputs.
+The usage of UDFs is identical to standard functions, with the primary
distinction being that built-in functions have a global scope, while UDFs are
scoped within the DB.
+
+When the session is linked within the database, directly using the UDF name
will search for the corresponding UDF within the current DB. Otherwise, users
must explicitly specify the UDF's database name, for example, `dbName.funcName`.
+
+In the following sections, examples will use the table `test_table`. The
corresponding table creation script is as follows:
-It is worth mentioning that this example not only supports Java UDFs in Doris
but is also a UDF supported by Hive. This means that Hive UDFs can be directly
migrated to Doris.
+```sql
+CREATE TABLE `test_table` (
+ id int NULL,
+ d1 double NULL,
+ str string NULL
+) ENGINE=OLAP
+DUPLICATE KEY(`id`)
+DISTRIBUTED BY HASH(`id`) BUCKETS 1
+PROPERTIES (
+"replication_num" = "1");
+
+insert into test_table values (1, 111.11, "a,b,c");
+insert into test_table values (6, 666.66, "d,e");
+```
-Additionally, if the UDF being defined needs to load large resource files or
if you want to define global static variables, you can refer to the static
variable loading method described at the bottom of the document.
-### UDF
+### Introduction to Java-UDF Example
+When writing a UDF in Java, the main entry point must be the `evaluate`
function. This is consistent with other engines like Hive. In this example, we
write an `AddOne` UDF to perform an increment operation on integer inputs.
-```java
-public class AddOne extends UDF {
- public Integer evaluate(Integer value) {
- return value == null ? null : value + 1;
+1. Write the corresponding Java code and package it into a JAR file.
+
+ ```java
+ public class AddOne extends UDF {
+ public Integer evaluate(Integer value) {
+ return value == null ? null : value + 1;
+ }
}
-}
-```
+ ```
-### UDAF
+2. Register and create the Java-UDF function in Doris. For more details on the
syntax, refer to [CREATE
FUNCTION](../../sql-manual/sql-statements/Data-Definition-Statements/Create/CREATE-FUNCTION.md).
-When writing a UDAF using Java code, there are some required functions (marked
as required) and an inner class State that must be implemented. Below is a
specific example to illustrate.
+ ```sql
+ CREATE FUNCTION java_udf_add_one(int) RETURNS int PROPERTIES (
+ "file"="file:///path/to/java-udf-demo-jar-with-dependencies.jar",
+ "symbol"="org.apache.doris.udf.AddOne",
+ "always_nullable"="true",
+ "type"="JAVA_UDF"
+ );
+ ```
+
+3. To use a UDF, users must have the `SELECT` privilege on the corresponding database. To verify that the UDF was registered successfully, you can use the [SHOW FUNCTIONS](../../sql-manual/sql-statements/Show-Statements/SHOW-FUNCTIONS.md) command.
+
+ ``` sql
+ select id,java_udf_add_one(id) from test_table;
+ +------+----------------------+
+ | id | java_udf_add_one(id) |
+ +------+----------------------+
+ | 1 | 2 |
+ | 6 | 7 |
+ +------+----------------------+
+ ```
-**Example 1**
+4. If a UDF is no longer needed, it can be dropped as described in [DROP FUNCTION](../../sql-manual/sql-statements/Data-Definition-Statements/Drop/DROP-FUNCTION).
-The following SimpleDemo will implement a simple function similar to sum, with
the input parameter being INT and the output parameter being INT.
+Additionally, if your UDF requires loading large resource files or defining
global static variables, you can refer to the method for loading static
variables described later in this document.
+
+### Introduction to Java-UDAF Example
+
+When writing a `UDAF` using Java, there are some functions that must be
implemented (marked as required) along with an internal class State. The
following example will illustrate how to implement them.
+
+1. Write the corresponding Java UDAF code and package it into a JAR file.
+
+<details>
+<summary> Example 1: SimpleDemo will implement a simple function similar to
sum, where the input parameter is INT and the output parameter is
INT.</summary>
```java
-package org.apache.doris.udf.demo;
+package org.apache.doris.udf;
import java.io.DataInputStream;
import java.io.DataOutputStream;
@@ -183,26 +195,15 @@ public class SimpleDemo {
}
/*required*/
- public void serialize(State state, DataOutputStream out) {
+ public void serialize(State state, DataOutputStream out) throws Exception {
/* serialize some data into buffer */
- try {
- out.writeInt(state.sum);
- } catch (Exception e) {
- /* Do not throw exceptions */
- log.info(e.getMessage());
- }
+ out.writeInt(state.sum);
}
/*required*/
- public void deserialize(State state, DataInputStream in) {
+ public void deserialize(State state, DataInputStream in) throws Exception {
/* deserialize get data from buffer before you put */
- int val = 0;
- try {
- val = in.readInt();
- } catch (Exception e) {
- /* Do not throw exceptions */
- log.info(e.getMessage());
- }
+ int val = in.readInt();
state.sum = val;
}
@@ -219,24 +220,28 @@ public class SimpleDemo {
return state.sum;
}
}
-
```
-**Example 2**
+</details>
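To make the call sequence of these methods concrete, here is a hypothetical local driver you can run (the class name `SimpleDemoDriver` is illustrative and the method signatures mirror the SimpleDemo above; in production, Doris's execution engine invokes these methods):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;

public class SimpleDemoDriver {
    public static class State {
        public int sum = 0;
    }

    public State create() {
        return new State();
    }

    public void add(State state, Integer val) {
        // as in the example above: NULL inputs are ignored
        if (val != null) {
            state.sum += val;
        }
    }

    public void serialize(State state, DataOutputStream out) throws Exception {
        out.writeInt(state.sum);
    }

    public void deserialize(State state, DataInputStream in) throws Exception {
        state.sum = in.readInt();
    }

    public void merge(State state, State rhs) {
        state.sum += rhs.sum;
    }

    public Integer getValue(State state) {
        return state.sum;
    }

    public static void main(String[] args) throws Exception {
        SimpleDemoDriver udaf = new SimpleDemoDriver();

        // simulate two tablets, each aggregating part of the data
        State s1 = udaf.create();
        udaf.add(s1, 1);
        udaf.add(s1, null); // NULL is ignored

        State s2 = udaf.create();
        udaf.add(s2, 6);

        // s2's intermediate state travels through serialize/deserialize to the merging node
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        udaf.serialize(s2, new DataOutputStream(buf));
        State received = udaf.create();
        udaf.deserialize(received,
                new DataInputStream(new ByteArrayInputStream(buf.toByteArray())));

        udaf.merge(s1, received);
        System.out.println(udaf.getValue(s1));
    }
}
```

It prints 7, the aggregate of the inputs 1 and 6.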
+
+
+<details>
+<summary> Example 2: MedianUDAF is a function that calculates the median. The
input types are (DOUBLE, INT), and the output type is DOUBLE. </summary>
```java
package org.apache.doris.udf.demo;
-
+
import java.io.DataInputStream;
-import java.io.DataOutputStream;
+import java.io.DataOutputStream;
+import java.io.IOException;
import java.math.BigDecimal;
import java.util.Arrays;
import java.util.logging.Logger;
-
+
/* UDAF to calculate the median */
public class MedianUDAF {
Logger log = Logger.getLogger("MedianUDAF");
-
+
// State storage
public static class State {
// Precision of the return result
@@ -246,7 +251,7 @@ public class MedianUDAF {
// Data storage
public StringBuilder stringBuilder;
}
-
+
// Initialize the state
public State create() {
State state = new State();
@@ -254,138 +259,206 @@ public class MedianUDAF {
state.stringBuilder = new StringBuilder(1000);
return state;
}
-
+
// Process each data under respective aggregation conditions for each
tablet
public void add(State state, Double val, int scale) {
- try {
- if (val != null && state.isFirst) {
-
state.stringBuilder.append(scale).append(",").append(val).append(",");
- state.isFirst = false;
- } else if (val != null) {
- state.stringBuilder.append(val).append(",");
- }
- } catch (Exception e) {
- // If it cannot be guaranteed that there will be no exceptions, it
is recommended to maximize exception capture in each method, as handling of
exceptions thrown by Java is currently not supported
- log.info("Data acquisition exception: " + e.getMessage());
+ if (val != null && state.isFirst) {
+
state.stringBuilder.append(scale).append(",").append(val).append(",");
+ state.isFirst = false;
+ } else if (val != null) {
+ state.stringBuilder.append(val).append(",");
}
}
-
+
// Data needs to be output for aggregation after processing
- public void serialize(State state, DataOutputStream out) {
- try {
- // Currently, only DataOutputStream is provided. If serialization
of objects is required, methods such as concatenating strings, converting to
JSON, or serializing into byte arrays can be considered
- // If the State object needs to be serialized, it may be necessary
to implement a serialization interface for the State inner class
- // Ultimately, everything needs to be transmitted via
DataOutputStream
- out.writeUTF(state.stringBuilder.toString());
- } catch (Exception e) {
- log.info("Serialization exception: " + e.getMessage());
- }
+ public void serialize(State state, DataOutputStream out) throws
IOException {
+ // Currently, only DataOutputStream is provided. If serialization of
objects is required, methods such as concatenating strings, converting to JSON,
or serializing into byte arrays can be considered
+ // If the State object needs to be serialized, it may be necessary to
implement a serialization interface for the State inner class
+ // Ultimately, everything needs to be transmitted via DataOutputStream
+ out.writeUTF(state.stringBuilder.toString());
}
-
+
// Obtain the output data from the data processing execution unit
- public void deserialize(State state, DataInputStream in) {
- try {
- String string = in.readUTF();
- state.scale = Integer.parseInt(String.valueOf(string.charAt(0)));
- StringBuilder stringBuilder = new
StringBuilder(string.substring(2));
- state.stringBuilder = stringBuilder;
- } catch (Exception e) {
- log.info("Deserialization exception: " + e.getMessage());
- }
+ public void deserialize(State state, DataInputStream in) throws
IOException {
+ String string = in.readUTF();
+ state.scale = Integer.parseInt(String.valueOf(string.charAt(0)));
+ StringBuilder stringBuilder = new StringBuilder(string.substring(2));
+ state.stringBuilder = stringBuilder;
}
-
+
// The aggregation execution unit merges the processing results of data
under certain aggregation conditions for a given key. The state1 parameter is
the initialized instance during the first merge of each key
public void merge(State state1, State state2) {
- try {
- state1.scale = state2.scale;
- state1.stringBuilder.append(state2.stringBuilder.toString());
- } catch (Exception e) {
- log.info("Merge result exception: " + e.getMessage());
- }
+ state1.scale = state2.scale;
+ state1.stringBuilder.append(state2.stringBuilder.toString());
}
-
+
// Output the final result after merging the data for each key
public Double getValue(State state) {
- try {
- String[] strings = state.stringBuilder.toString().split(",");
- double[] doubles = new double[strings.length];
- for (int i = 0; i < strings.length - 1; i++) {
- doubles[i] = Double.parseDouble(strings[i + 1]);
- }
-
- Arrays.sort(doubles);
- double n = doubles.length;
- if (n == 0) {
- return 0.0;
- }
- double index = (n - 1) / 2.0;
-
- int low = (int) Math.floor(index);
- int high = (int) Math.ceil(index);
-
- double value = low == high ? (doubles[low] + doubles[high]) / 2 :
doubles[high];
-
- BigDecimal decimal = new BigDecimal(value);
- return decimal.setScale(state.scale,
BigDecimal.ROUND_HALF_UP).doubleValue();
- } catch (Exception e) {
- log.info("Calculation exception: " + e.getMessage());
+ String[] strings = state.stringBuilder.toString().split(",");
+ double[] doubles = new double[strings.length];
+ for (int i = 0; i < strings.length - 1; i++) {
+ doubles[i] = Double.parseDouble(strings[i + 1]);
}
- return 0.0;
+
+ Arrays.sort(doubles);
+ double n = doubles.length;
+ if (n == 0) {
+ return 0.0;
+ }
+ double index = (n - 1) / 2.0;
+
+ int low = (int) Math.floor(index);
+ int high = (int) Math.ceil(index);
+
+ double value = low == high ? (doubles[low] + doubles[high]) / 2 :
doubles[high];
+
+ BigDecimal decimal = new BigDecimal(value);
+ return decimal.setScale(state.scale,
BigDecimal.ROUND_HALF_UP).doubleValue();
}
-
+
// Executed after each execution unit completes
public void destroy(State state) {
}
}
```
+
+</details>
-## Best Practices
-
-### Loading static variables
-Currently, in Doris, executing a UDF function, e.g., `select udf(col) from
table`, will load the udf.jar package for each concurrent instance, and unload
the udf.jar package when the instance ends. If the udf.jar file needs to load a
file of several hundred MBs, the memory usage will increase sharply due to
concurrency, potentially leading to OOM (Out of Memory).
+2. Register and create the Java-UDAF function in Doris. For more syntax
details, please refer to [CREATE
FUNCTION](../../sql-manual/sql-statements/Data-Definition-Statements/Create/CREATE-FUNCTION.md).
-The solution is to split the resource loading code, generate a separate jar
package, and have other packages directly reference this resource jar package.
+ ```sql
+ CREATE AGGREGATE FUNCTION simple_demo(INT) RETURNS INT PROPERTIES (
+ "file"="file:///pathTo/java-udaf.jar",
+ "symbol"="org.apache.doris.udf.SimpleDemo",
+ "always_nullable"="true",
+ "type"="JAVA_UDF"
+ );
+ ```
-Assume the files have been split into DictLibrary and FunctionUdf.
+3. When using Java-UDAF, you can perform aggregation either by grouping or by
aggregating all results:
-1. Compile the DictLibrary file separately to generate an independent jar
package, resulting in a resource file DictLibrary.jar:
+ ```sql
+ select simple_demo(id) from test_table group by id;
+ +-----------------+
+ | simple_demo(id) |
+ +-----------------+
+ | 1 |
+ | 6 |
+ +-----------------+
+ ```
- ```shell
- javac ./DictLibrary.java
- jar -cf ./DictLibrary.jar ./DictLibrary.class
+ ```sql
+ select simple_demo(id) from test_table;
+ +-----------------+
+ | simple_demo(id) |
+ +-----------------+
+ | 7 |
+ +-----------------+
```
- ```java
- public class DictLibrary {
- private static HashMap<String, String> res = new HashMap<>();
-
- static {
- // suppose we built this dictionary from a certain local file.
- res.put("key1", "value1");
- res.put("key2", "value2");
- res.put("key3", "value3");
- res.put("0", "value4");
- res.put("1", "value5");
- res.put("2", "value6");
- }
+### Introduction to Java-UDTF Example
+
+:::tip
+UDTF is supported starting from Doris version 3.0.
+:::
+
+1. Similar to UDFs, UDTFs require users to implement an `evaluate` method.
However, the return value of a UDTF must be of the Array type.
- public static String evaluate(String key) {
- if (key == null) {
+    ```java
+ public class UDTFStringTest {
+ public ArrayList<String> evaluate(String value, String separator) {
+ if (value == null || separator == null) {
return null;
+ } else {
+ return new ArrayList<>(Arrays.asList(value.split(separator)));
}
- return res.get(key);
}
}
```
- ```java
- public class FunctionUdf {
- public String evaluate(String key) {
- String value = DictLibrary.evaluate(key);
- return value;
+2. Register and create the Java-UDTF function in Doris. Two UDTF functions will be registered: the second one carries the `_outer` suffix appended to the function name and provides special handling for the case where the result is 0 rows. For more details, refer to [OUTER combinator](../../sql-manual/sql-functions/table-functions/explode-numbers-outer.md).
+For more syntax details, please refer to [CREATE FUNCTION](../../sql-manual/sql-statements/Data-Definition-Statements/Create/CREATE-FUNCTION.md).
+
+ ```sql
+    CREATE TABLES FUNCTION java_udtf(string, string) RETURNS array<string> PROPERTIES (
+ "file"="file:///pathTo/java-udtf.jar",
+ "symbol"="org.apache.doris.udf.demo.UDTFStringTest",
+ "always_nullable"="true",
+ "type"="JAVA_UDF"
+ );
+ ```
+
+3. When using a Java-UDTF in Doris, it must be combined with [`Lateral View`](../lateral-view.md) to achieve the row-to-column transformation:
+
+ ```sql
+    select id, str, e1 from test_table lateral view java_udtf(str, ',') tmp as e1;
+ +------+-------+------+
+ | id | str | e1 |
+ +------+-------+------+
+ | 1 | a,b,c | a |
+ | 1 | a,b,c | b |
+ | 1 | a,b,c | c |
+ | 6 | d,e | d |
+ | 6 | d,e | e |
+ +------+-------+------+
+ ```
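The difference between the plain function and its `_outer` variant can be sketched with a hypothetical query (this assumes the UDTF above was registered under the name `java_udtf` and that a row with a NULL `str` has been inserted):

```sql
insert into test_table values (9, 999.99, null);

-- evaluate returns null for NULL input (0 rows), so the plain lateral view drops the row
select id, e1 from test_table lateral view java_udtf(str, ',') tmp as e1 where id = 9;

-- the _outer variant keeps the row and fills e1 with NULL
select id, e1 from test_table lateral view java_udtf_outer(str, ',') tmp as e1 where id = 9;
```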
+
+## Best Practices
+
+*Loading static variables*
+
+Currently, in Doris, executing a UDF, e.g., `select udf(col) from table`, loads the udf.jar package once per concurrent instance and unloads it when the instance finishes.
+
+If the udf.jar file needs to load a file of several hundred MBs, the memory
usage will increase sharply due to concurrency, potentially leading to OOM (Out
of Memory).
+
+Similarly, if you want to use a connection pool, this per-instance loading
prevents you from initializing the pool just once in a static block.
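To make the cost concrete, here is a hypothetical UDF whose static initializer stands in for building a connection pool (all class and field names are invented for illustration). The counter records how many times the "pool" is constructed per class load; with per-query JAR loading, this happens once per concurrent instance, while the cached loading of Solution 2 runs it only once:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical UDF illustrating one-time static initialization.
// With per-instance JAR loading, the static initializer runs for every
// concurrent instance; with static cache loading it runs once per BE.
public class PooledUdf {
    // Counts how many times the "pool" was built in this class load.
    static final AtomicInteger initCount = new AtomicInteger();

    // Placeholder for an expensive shared resource, e.g. a connection pool.
    static final Object pool = createPool();

    static Object createPool() {
        initCount.incrementAndGet();
        return new Object();
    }

    public String evaluate(String key) {
        return key == null ? null : key + "#pool";
    }
}
```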
+
+Here are two solutions; the second requires Doris 3.0 or later.
+
+*Solution 1:*
+
+The solution is to split the resource loading code, generate a separate jar
package, and have other packages directly reference this resource jar package.
+
+Assume the files have been split into `DictLibrary` and `FunctionUdf`.
+
+```java
+public class DictLibrary {
+ private static HashMap<String, String> res = new HashMap<>();
+
+ static {
+ // suppose we built this dictionary from a certain local file.
+ res.put("key1", "value1");
+ res.put("key2", "value2");
+ res.put("key3", "value3");
+ res.put("0", "value4");
+ res.put("1", "value5");
+ res.put("2", "value6");
+ }
+
+ public static String evaluate(String key) {
+ if (key == null) {
+ return null;
}
+ return res.get(key);
}
+}
+```
+
+```java
+public class FunctionUdf {
+ public String evaluate(String key) {
+ String value = DictLibrary.evaluate(key);
+ return value;
+ }
+}
+```
+
+1. Compile the DictLibrary file separately to generate an independent jar
package, resulting in a resource file DictLibrary.jar:
+
+ ```shell
+ javac ./DictLibrary.java
+ jar -cf ./DictLibrary.jar ./DictLibrary.class
```
2. Then compile the FunctionUdf file, directly referencing the resource
package from the previous step, resulting in the FunctionUdf.jar package:
@@ -395,28 +468,76 @@ Assume the files have been split into DictLibrary and
FunctionUdf.
jar -cvf ./FunctionUdf.jar ./FunctionUdf.class
```
-3. After the above two steps, you will get two jar packages. To allow the
resource jar package to be referenced by all concurrent instances, place it in
the deployment path `fe/custom_lib` 和 `be/custom_lib`. After the restarting, it
will be loaded with the JVM startup.
+3. After the above two steps, you will get two jar packages. To allow the
resource jar package to be referenced by all concurrent instances, place it in
the deployment path `be/custom_lib`. After a restart, it will be loaded at JVM
startup, so the resources are loaded when the service starts and released when
the service stops.
4. Finally, use the `create function` statement to create a UDF function
```sql
CREATE FUNCTION java_udf_dict(string) RETURNS string PROPERTIES (
+ "file"="file:///pathTo/FunctionUdf.jar",
"symbol"="org.apache.doris.udf.FunctionUdf",
"always_nullable"="true",
"type"="JAVA_UDF"
);
```
-In this loading mode, both FunctionUdf.jar and DictLibrary.jar are in the
custom_lib path of FE and BE. This way, the packages will be loaded and
released with the service startup and shutdown.
+*Solution 2:*
-You can also customize the path to `FunctionUdf.jar` using `file:///`, but
only under `custom_lib`.
+The BE (Backend) caches the JAR file globally, with a configurable expiration
and eviction time. When creating a function, two additional properties are
supported:
-## Usage Notes
+- `static_load`: defines whether to use the static cache loading method.
+- `expiration_time`: defines the expiration time of the JAR file, in minutes.
+If the static cache loading method is used, the UDF instance will be cached
after the first call and initialization. On subsequent calls to the UDF, the
system will first search in the cache. If not found, the initialization process
will be triggered.
-1. Complex data types (HLL, Bitmap) are not supported.
+Additionally, a background thread regularly checks the cache. If the function
has not been called within the configured expiration time, it will be evicted
from the cache. If the function is called, the cache timestamp will be
automatically updated.
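The eviction behavior described above can be pictured as a timestamp-keyed cache: each call refreshes the entry's timestamp, and a background sweep drops entries idle longer than the expiration window. A simplified, self-contained sketch (illustrative only, not Doris internals; all names are invented):

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Simplified time-based cache mirroring the static_load / expiration_time
// behavior described above. Illustrative only, not Doris internals.
public class UdfCache {
    static class CacheEntry {
        final Object instance;
        long lastUsedMs;
        CacheEntry(Object instance, long now) {
            this.instance = instance;
            this.lastUsedMs = now;
        }
    }

    private final Map<String, CacheEntry> cache = new HashMap<>();
    private final long expirationMs;

    public UdfCache(long expirationMs) {
        this.expirationMs = expirationMs;
    }

    // On a call: search the cache first; initialize on a miss,
    // refresh the timestamp on a hit.
    public Object getOrInit(String name, long nowMs) {
        CacheEntry e = cache.get(name);
        if (e == null) {
            e = new CacheEntry(new Object(), nowMs); // initialization process
            cache.put(name, e);
        } else {
            e.lastUsedMs = nowMs;                    // timestamp refreshed
        }
        return e.instance;
    }

    // Background sweep: evict entries idle for longer than the expiration time.
    public void evictExpired(long nowMs) {
        Iterator<Map.Entry<String, CacheEntry>> it = cache.entrySet().iterator();
        while (it.hasNext()) {
            if (nowMs - it.next().getValue().lastUsedMs > expirationMs) {
                it.remove();
            }
        }
    }

    public boolean contains(String name) {
        return cache.containsKey(name);
    }
}
```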
-2. Users are currently allowed to specify the maximum JVM heap size. The
configuration item is the `-Xmx` part of `JAVA_OPTS` in `be.conf`. The default
is 1024m. If you need to aggregate data, it is recommended to increase this
value to enhance performance and reduce the risk of memory overflow.
+```java
+public class Print extends UDF {
+ static Integer val = 0;
+ public Integer evaluate() {
+ val = val + 1;
+ return val;
+ }
+}
+```
-3. For Char type UDFs, use the String type when creating the function.
+```sql
+CREATE FUNCTION print_12() RETURNS int
+PROPERTIES (
+ "file" = "file:///path/to/java-udf-demo-jar-with-dependencies.jar",
+ "symbol" = "org.apache.doris.udf.Print",
+ "always_nullable"="true",
+ "type" = "JAVA_UDF",
+ "static_load" = "true", -- default value is false
+ "expiration_time" = "60" -- default value is 360 minutes
+);
+```
-4. Due to issues with JVM loading classes with the same name, do not use
multiple classes with the same name as UDF implementations simultaneously. If
you want to update a UDF with a class of the same name, you need to restart BE
to reload the classpath.
+Calling the function repeatedly, the result keeps incrementing, which proves
that the loaded JAR file is not being unloaded and reloaded, and that the
static variable is not re-initialized to 0.
+
+```sql
+mysql [test_query_qa]>select print_12();
++------------+
+| print_12() |
++------------+
+| 1 |
++------------+
+1 row in set (0.40 sec)
+
+mysql [test_query_qa]>select print_12();
++------------+
+| print_12() |
++------------+
+| 2 |
++------------+
+1 row in set (0.03 sec)
+
+mysql [test_query_qa]>select print_12();
++------------+
+| print_12() |
++------------+
+| 3 |
++------------+
+1 row in set (0.04 sec)
+
+```
diff --git
a/versioned_docs/version-3.0/query-data/udf/java-user-defined-function.md
b/versioned_docs/version-3.0/query-data/udf/java-user-defined-function.md
index be614f39b1b..3c0d91b2c6b 100644
--- a/versioned_docs/version-3.0/query-data/udf/java-user-defined-function.md
+++ b/versioned_docs/version-3.0/query-data/udf/java-user-defined-function.md
@@ -1,6 +1,6 @@
---
{
-"title": "Java UDF",
+"title": "Java UDF, UDAF, UDTF",
"language": "en"
}
---
@@ -25,76 +25,15 @@ under the License.
-->
-## Introduction to Java UDF
+## Overview
+Java UDF provides a Java interface for users to implement user-defined
functions (UDFs) conveniently using the Java programming language.
+Doris supports the use of Java to develop UDFs, UDAFs, and UDTFs. Unless
otherwise specified, "UDF" in the following text refers to all types of
user-defined functions.
-Java UDF provides users with a Java interface for writing UDFs, making it
convenient for users to execute custom functions using the Java language.
+1. Java UDF: A Java UDF is a commonly used scalar function, where each input
row produces a corresponding output row. Common examples include ABS and
LENGTH. Notably, Hive UDFs can be directly migrated to Doris, which is
convenient for users.
-Doris supports writing UDFs, UDAFs, and UDTFs using JAVA. Unless otherwise
specified, UDF is used as a general term for all user-defined functions in the
following text.
+2. Java UDAF: A Java UDAF is a user-defined aggregate function that aggregates
multiple input rows into a single output row. Common examples include MIN, MAX,
and COUNT.
-## Creating UDF
-
-The implemented jar package can be placed locally or stored on a remote server
for download via HTTP, but each FE and BE node must be able to access the jar
package.
-
-Otherwise, an error message `Couldn't open file ......` will be returned.
-
-For more syntax help, refer to [CREATE
FUNCTION](../../sql-manual/sql-statements/function/CREATE-FUNCTION.md).
-
-### UDF
-
-```sql
-CREATE FUNCTION java_udf_add_one(int) RETURNS int PROPERTIES (
- "file"="file:///path/to/java-udf-demo-jar-with-dependencies.jar",
- "symbol"="org.apache.doris.udf.AddOne",
- "always_nullable"="true",
- "type"="JAVA_UDF"
-);
-```
-
-### UDAF
-
-```sql
-CREATE AGGREGATE FUNCTION middle_quantiles(DOUBLE,INT) RETURNS DOUBLE
PROPERTIES (
- "file"="file:///pathTo/java-udaf.jar",
- "symbol"="org.apache.doris.udf.demo.MiddleNumberUDAF",
- "always_nullable"="true",
- "type"="JAVA_UDF"
-);
-```
-
-
-### UDTF
-
-:::tip
-UDTF is supported starting from Doris version 3.0.
-:::
-
-Similar to UDFs, UDTFs require users to implement an `evaluate` method.
However, the return value of a UDTF must be of the Array type.
-
-Additionally, table functions in Doris may exhibit different behaviors due to
the `_outer` suffix. For more details, refer to [OUTER
combinator](../../sql-manual/sql-functions/table-functions/explode-numbers-outer.md).
-
-```java
-public class UDTFStringTest {
- public ArrayList<String> evaluate(String value, String separator) {
- if (value == null || separator == null) {
- return null;
- } else {
- return new ArrayList<>(Arrays.asList(value.split(separator)));
- }
- }
-}
-```
-
-## Using UDF
-
-To utilize UDFs, users must possess the `SELECT` privilege for the
corresponding database.
-
-The usage of UDFs is identical to standard functions, with the primary
distinction being that built-in functions have a global scope, while UDFs are
scoped within the DB.
-
-When the session is linked within the database, directly using the UDF name
will search for the corresponding UDF within the current DB. Otherwise, users
must explicitly specify the UDF's database name, for example, `dbName.funcName`.
-
-## Dropping UDF
-
-If a UDF is no longer needed, it can be dropped using the following command,
as detailed in [DROP
FUNCTION](../../sql-manual/sql-statements/function/DROP-FUNCTION).
+3. Java UDTF: A Java UDTF is a user-defined table function, where a single
input row can generate one or multiple output rows. In Doris, UDTFs must be
used with Lateral View to achieve row-to-column transformations. Common
examples include EXPLODE and EXPLODE_SPLIT.
## Type Correspondence
@@ -125,36 +64,92 @@ If a UDF is no longer needed, it can be dropped using the
following command, as
When creating functions, avoid using `varchar` in place of `string`, as this
may cause the function to fail.
:::
-## Writing UDF
+## Usage Notes
+
+1. Complex data types (HLL, Bitmap) are not supported.
+
+2. Users are currently allowed to specify the maximum JVM heap size. The
configuration item is the `-Xmx` part of `JAVA_OPTS` in `be.conf`. The default
is 1024m. If you need to aggregate data, it is recommended to increase this
value to enhance performance and reduce the risk of memory overflow.
+
+3. Due to issues with JVM loading classes with the same name, do not use
multiple classes with the same name as UDF implementations simultaneously. If
you want to update a UDF with a class of the same name, you need to restart BE
to reload the classpath.
+
+
+## Getting Started
This section mainly introduces how to develop a Java UDF. Examples are
provided in `samples/doris-demo/java-udf-demo/` for reference. Click
[here](https://github.com/apache/doris/tree/master/samples/doris-demo/java-udf-demo)
to view details.
-When writing a UDF in Java, the main entry point must be the `evaluate`
function. This is consistent with other engines like Hive. In this example, we
write an `AddOne` UDF to perform an increment operation on integer inputs.
+The usage of UDFs is identical to standard functions, with the primary
distinction being that built-in functions have a global scope, while UDFs are
scoped within the DB.
+
+When the session is linked within the database, directly using the UDF name
will search for the corresponding UDF within the current DB. Otherwise, users
must explicitly specify the UDF's database name, for example, `dbName.funcName`.
-It is worth mentioning that this example not only supports Java UDFs in Doris
but is also a UDF supported by Hive. This means that Hive UDFs can be directly
migrated to Doris.
+In the following sections, examples will use the table `test_table`. The
corresponding table creation script is as follows:
-Additionally, if the UDF being defined needs to load large resource files or
if you want to define global static variables, you can refer to the static
variable loading method described at the bottom of the document.
+```sql
+CREATE TABLE `test_table` (
+ id int NULL,
+ d1 double NULL,
+ str string NULL
+) ENGINE=OLAP
+DUPLICATE KEY(`id`)
+DISTRIBUTED BY HASH(`id`) BUCKETS 1
+PROPERTIES (
+"replication_num" = "1");
+
+insert into test_table values (1, 111.11, "a,b,c");
+insert into test_table values (6, 666.66, "d,e");
+```
-### UDF
-```java
-public class AddOne extends UDF {
- public Integer evaluate(Integer value) {
- return value == null ? null : value + 1;
+### Introduction to Java-UDF Example
+When writing a UDF in Java, the main entry point must be the `evaluate`
function. This is consistent with other engines like Hive. In this example, we
write an `AddOne` UDF to perform an increment operation on integer inputs.
+
+1. Write the corresponding Java code and package it into a JAR file.
+
+ ```java
+ public class AddOne extends UDF {
+ public Integer evaluate(Integer value) {
+ return value == null ? null : value + 1;
+ }
}
-}
-```
+ ```
+
+2. Register and create the Java-UDF function in Doris. For more details on the
syntax, refer to [CREATE
FUNCTION](../../sql-manual/sql-statements/Data-Definition-Statements/Create/CREATE-FUNCTION.md).
+
+ ```sql
+ CREATE FUNCTION java_udf_add_one(int) RETURNS int PROPERTIES (
+ "file"="file:///path/to/java-udf-demo-jar-with-dependencies.jar",
+ "symbol"="org.apache.doris.udf.AddOne",
+ "always_nullable"="true",
+ "type"="JAVA_UDF"
+ );
+ ```
+
+3. To use a UDF, users must have the `SELECT` privilege on the corresponding
database. To verify that the UDF was registered successfully, you can use the
[SHOW
FUNCTIONS](../../sql-manual/sql-statements/Show-Statements/SHOW-FUNCTIONS.md)
command.
+
+ ``` sql
+ select id,java_udf_add_one(id) from test_table;
+ +------+----------------------+
+ | id | java_udf_add_one(id) |
+ +------+----------------------+
+ | 1 | 2 |
+ | 6 | 7 |
+ +------+----------------------+
+ ```
+
+4. If a UDF is no longer needed, it can be dropped with `DROP FUNCTION`, as
detailed in [DROP
FUNCTION](../../sql-manual/sql-statements/Data-Definition-Statements/Drop/DROP-FUNCTION).
+
+Additionally, if your UDF requires loading large resource files or defining
global static variables, you can refer to the method for loading static
variables described later in this document.
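Because the function is registered with `"always_nullable"="true"`, the `evaluate` method must handle NULL inputs itself, as `AddOne` does with its null check. This can be verified outside Doris with a standalone copy (hypothetical class name; the Hive `UDF` base class is omitted since Doris only looks up the `evaluate` method):

```java
// Standalone copy of the AddOne example for testing outside Doris.
// Extending Hive's UDF base class is optional here, so it is omitted
// to keep the snippet self-contained.
public class AddOneStandalone {
    public Integer evaluate(Integer value) {
        // With "always_nullable"="true", NULL input must yield NULL output.
        return value == null ? null : value + 1;
    }
}
```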
-### UDAF
+### Introduction to Java-UDAF Example
-When writing a UDAF using Java code, there are some required functions (marked
as required) and an inner class State that must be implemented. Below is a
specific example to illustrate.
+When writing a `UDAF` using Java, there are some functions that must be
implemented (marked as required) along with an internal class State. The
following example will illustrate how to implement them.
-**Example 1**
+1. Write the corresponding Java UDAF code and package it into a JAR file.
-The following SimpleDemo will implement a simple function similar to sum, with
the input parameter being INT and the output parameter being INT.
+<details>
+<summary> Example 1: SimpleDemo will implement a simple function similar to
sum, where the input parameter is INT and the output parameter is
INT.</summary>
```java
-package org.apache.doris.udf.demo;
+package org.apache.doris.udf;
import java.io.DataInputStream;
import java.io.DataOutputStream;
@@ -200,26 +195,15 @@ public class SimpleDemo {
}
/*required*/
- public void serialize(State state, DataOutputStream out) {
+ public void serialize(State state, DataOutputStream out) throws Exception {
/* serialize some data into buffer */
- try {
- out.writeInt(state.sum);
- } catch (Exception e) {
- /* Do not throw exceptions */
- log.info(e.getMessage());
- }
+ out.writeInt(state.sum);
}
/*required*/
- public void deserialize(State state, DataInputStream in) {
+ public void deserialize(State state, DataInputStream in) throws Exception {
/* deserialize get data from buffer before you put */
- int val = 0;
- try {
- val = in.readInt();
- } catch (Exception e) {
- /* Do not throw exceptions */
- log.info(e.getMessage());
- }
+ int val = in.readInt();
state.sum = val;
}
@@ -236,24 +220,28 @@ public class SimpleDemo {
return state.sum;
}
}
-
```
-**Example 2**
+</details>
+
+
+<details>
+<summary> Example 2: MedianUDAF is a function that calculates the median. The
input types are (DOUBLE, INT), and the output type is DOUBLE. </summary>
```java
package org.apache.doris.udf.demo;
-
+
import java.io.DataInputStream;
-import java.io.DataOutputStream;
+import java.io.DataOutputStream;
+import java.io.IOException;
import java.math.BigDecimal;
import java.util.Arrays;
import java.util.logging.Logger;
-
+
/* UDAF to calculate the median */
public class MedianUDAF {
Logger log = Logger.getLogger("MedianUDAF");
-
+
// State storage
public static class State {
// Precision of the return result
@@ -263,7 +251,7 @@ public class MedianUDAF {
// Data storage
public StringBuilder stringBuilder;
}
-
+
// Initialize the state
public State create() {
State state = new State();
@@ -271,138 +259,206 @@ public class MedianUDAF {
state.stringBuilder = new StringBuilder(1000);
return state;
}
-
+
// Process each data under respective aggregation conditions for each
tablet
public void add(State state, Double val, int scale) {
- try {
- if (val != null && state.isFirst) {
-
state.stringBuilder.append(scale).append(",").append(val).append(",");
- state.isFirst = false;
- } else if (val != null) {
- state.stringBuilder.append(val).append(",");
- }
- } catch (Exception e) {
- // If it cannot be guaranteed that there will be no exceptions, it
is recommended to maximize exception capture in each method, as handling of
exceptions thrown by Java is currently not supported
- log.info("Data acquisition exception: " + e.getMessage());
+ if (val != null && state.isFirst) {
+
state.stringBuilder.append(scale).append(",").append(val).append(",");
+ state.isFirst = false;
+ } else if (val != null) {
+ state.stringBuilder.append(val).append(",");
}
}
-
+
// Data needs to be output for aggregation after processing
- public void serialize(State state, DataOutputStream out) {
- try {
- // Currently, only DataOutputStream is provided. If serialization
of objects is required, methods such as concatenating strings, converting to
JSON, or serializing into byte arrays can be considered
- // If the State object needs to be serialized, it may be necessary
to implement a serialization interface for the State inner class
- // Ultimately, everything needs to be transmitted via
DataOutputStream
- out.writeUTF(state.stringBuilder.toString());
- } catch (Exception e) {
- log.info("Serialization exception: " + e.getMessage());
- }
+ public void serialize(State state, DataOutputStream out) throws
IOException {
+ // Currently, only DataOutputStream is provided. If serialization of
objects is required, methods such as concatenating strings, converting to JSON,
or serializing into byte arrays can be considered
+ // If the State object needs to be serialized, it may be necessary to
implement a serialization interface for the State inner class
+ // Ultimately, everything needs to be transmitted via DataOutputStream
+ out.writeUTF(state.stringBuilder.toString());
}
-
+
// Obtain the output data from the data processing execution unit
- public void deserialize(State state, DataInputStream in) {
- try {
- String string = in.readUTF();
- state.scale = Integer.parseInt(String.valueOf(string.charAt(0)));
- StringBuilder stringBuilder = new
StringBuilder(string.substring(2));
- state.stringBuilder = stringBuilder;
- } catch (Exception e) {
- log.info("Deserialization exception: " + e.getMessage());
- }
+ public void deserialize(State state, DataInputStream in) throws
IOException {
+ String string = in.readUTF();
+ state.scale = Integer.parseInt(String.valueOf(string.charAt(0)));
+ StringBuilder stringBuilder = new StringBuilder(string.substring(2));
+ state.stringBuilder = stringBuilder;
}
-
+
// The aggregation execution unit merges the processing results of data
under certain aggregation conditions for a given key. The state1 parameter is
the initialized instance during the first merge of each key
public void merge(State state1, State state2) {
- try {
- state1.scale = state2.scale;
- state1.stringBuilder.append(state2.stringBuilder.toString());
- } catch (Exception e) {
- log.info("Merge result exception: " + e.getMessage());
- }
+ state1.scale = state2.scale;
+ state1.stringBuilder.append(state2.stringBuilder.toString());
}
-
+
// Output the final result after merging the data for each key
public Double getValue(State state) {
- try {
- String[] strings = state.stringBuilder.toString().split(",");
- double[] doubles = new double[strings.length];
- for (int i = 0; i < strings.length - 1; i++) {
- doubles[i] = Double.parseDouble(strings[i + 1]);
- }
-
- Arrays.sort(doubles);
- double n = doubles.length;
- if (n == 0) {
- return 0.0;
- }
- double index = (n - 1) / 2.0;
-
- int low = (int) Math.floor(index);
- int high = (int) Math.ceil(index);
-
- double value = low == high ? (doubles[low] + doubles[high]) / 2 :
doubles[high];
-
- BigDecimal decimal = new BigDecimal(value);
- return decimal.setScale(state.scale,
BigDecimal.ROUND_HALF_UP).doubleValue();
- } catch (Exception e) {
- log.info("Calculation exception: " + e.getMessage());
+ String[] strings = state.stringBuilder.toString().split(",");
+ double[] doubles = new double[strings.length];
+ for (int i = 0; i < strings.length - 1; i++) {
+ doubles[i] = Double.parseDouble(strings[i + 1]);
+ }
+
+ Arrays.sort(doubles);
+ double n = doubles.length;
+ if (n == 0) {
+ return 0.0;
}
- return 0.0;
+ double index = (n - 1) / 2.0;
+
+ int low = (int) Math.floor(index);
+ int high = (int) Math.ceil(index);
+
+ double value = low == high ? (doubles[low] + doubles[high]) / 2 :
doubles[high];
+
+ BigDecimal decimal = new BigDecimal(value);
+ return decimal.setScale(state.scale,
BigDecimal.ROUND_HALF_UP).doubleValue();
}
-
+
// Executed after each execution unit completes
public void destroy(State state) {
}
}
```
+
+</details>
-## Best Practices
-
-### Loading static variables
-Currently, in Doris, executing a UDF function, e.g., `select udf(col) from
table`, will load the udf.jar package for each concurrent instance, and unload
the udf.jar package when the instance ends. If the udf.jar file needs to load a
file of several hundred MBs, the memory usage will increase sharply due to
concurrency, potentially leading to OOM (Out of Memory).
+2. Register and create the Java-UDAF function in Doris. For more syntax
details, please refer to [CREATE
FUNCTION](../../sql-manual/sql-statements/Data-Definition-Statements/Create/CREATE-FUNCTION.md).
-The solution is to split the resource loading code, generate a separate jar
package, and have other packages directly reference this resource jar package.
+ ```sql
+ CREATE AGGREGATE FUNCTION simple_demo(INT) RETURNS INT PROPERTIES (
+ "file"="file:///pathTo/java-udaf.jar",
+ "symbol"="org.apache.doris.udf.SimpleDemo",
+ "always_nullable"="true",
+ "type"="JAVA_UDF"
+ );
+ ```
-Assume the files have been split into DictLibrary and FunctionUdf.
+3. When using Java-UDAF, you can perform aggregation either by grouping or by
aggregating all results:
-1. Compile the DictLibrary file separately to generate an independent jar
package, resulting in a resource file DictLibrary.jar:
+ ```sql
+ select simple_demo(id) from test_table group by id;
+ +-----------------+
+ | simple_demo(id) |
+ +-----------------+
+ | 1 |
+ | 6 |
+ +-----------------+
+ ```
- ```shell
- javac ./DictLibrary.java
- jar -cf ./DictLibrary.jar ./DictLibrary.class
+ ```sql
+ select simple_demo(id) from test_table;
+ +-----------------+
+ | simple_demo(id) |
+ +-----------------+
+ | 7 |
+ +-----------------+
```
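Doris drives a UDAF through create, add, serialize/deserialize, merge, and getValue. The whole-table result of 7 above can be reproduced in plain Java by simulating two tablets holding the `id` values 1 and 6 (a hypothetical harness mirroring SimpleDemo's sum logic, not Doris code):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Minimal re-creation of the SimpleDemo sum UDAF, driven the way Doris
// would: two "tablets" each accumulate locally, one state is serialized,
// shipped, deserialized, merged, and the final value read out.
public class UdafLifecycle {
    public static class State { public int sum = 0; }

    public State create() { return new State(); }
    public void add(State s, Integer v) { if (v != null) s.sum += v; }
    public void serialize(State s, DataOutputStream out) throws IOException {
        out.writeInt(s.sum);
    }
    public void deserialize(State s, DataInputStream in) throws IOException {
        s.sum = in.readInt();
    }
    public void merge(State dst, State src) { dst.sum += src.sum; }
    public Integer getValue(State s) { return s.sum; }

    // Simulate aggregating {1} on one tablet and {6} on another.
    public static int run() throws IOException {
        UdafLifecycle udaf = new UdafLifecycle();
        State a = udaf.create();
        udaf.add(a, 1);
        State b = udaf.create();
        udaf.add(b, 6);

        // Ship tablet b's state to tablet a's node as raw bytes.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        udaf.serialize(b, new DataOutputStream(buf));
        State shipped = udaf.create();
        udaf.deserialize(shipped,
                new DataInputStream(new ByteArrayInputStream(buf.toByteArray())));

        udaf.merge(a, shipped);
        return udaf.getValue(a);
    }
}
```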
- ```java
- public class DictLibrary {
- private static HashMap<String, String> res = new HashMap<>();
-
- static {
- // suppose we built this dictionary from a certain local file.
- res.put("key1", "value1");
- res.put("key2", "value2");
- res.put("key3", "value3");
- res.put("0", "value4");
- res.put("1", "value5");
- res.put("2", "value6");
- }
+### Introduction to Java-UDTF Example
+
+:::tip
+UDTF is supported starting from Doris version 3.0.
+:::
- public static String evaluate(String key) {
- if (key == null) {
+1. Similar to UDFs, UDTFs require users to implement an `evaluate` method.
However, the return value of a UDTF must be of the Array type.
+
+ ```java
+ public class UDTFStringTest {
+ public ArrayList<String> evaluate(String value, String separator) {
+ if (value == null || separator == null) {
return null;
+ } else {
+ return new ArrayList<>(Arrays.asList(value.split(separator)));
}
- return res.get(key);
}
}
```
- ```java
- public class FunctionUdf {
- public String evaluate(String key) {
- String value = DictLibrary.evaluate(key);
- return value;
+2. Register and create the Java-UDTF function in Doris. This registers two
UDTF functions: the base version and its `_outer` variant. Table functions in
Doris may exhibit different behaviors due to the `_outer` suffix. For more
details, refer to [OUTER
combinator](../../sql-manual/sql-functions/table-functions/explode-numbers-outer.md).
+For more syntax details, please refer to [CREATE
FUNCTION](../../sql-manual/sql-statements/Data-Definition-Statements/Create/CREATE-FUNCTION.md).
+
+ ```sql
+ CREATE TABLES FUNCTION java_udtf(string, string) RETURNS array<string>
PROPERTIES (
+ "file"="file:///pathTo/java-udtf.jar",
+ "symbol"="org.apache.doris.udf.demo.UDTFStringTest",
+ "always_nullable"="true",
+ "type"="JAVA_UDF"
+ );
+ ```
+
+3. In Doris, UDTFs must be used together with [`Lateral
View`](../lateral-view.md) to achieve the row-to-column transformation:
+
+ ```sql
+ select id, str, e1 from test_table lateral view java_udtf(str,',') tmp as
e1;
+ +------+-------+------+
+ | id | str | e1 |
+ +------+-------+------+
+ | 1 | a,b,c | a |
+ | 1 | a,b,c | b |
+ | 1 | a,b,c | c |
+ | 6 | d,e | d |
+ | 6 | d,e | e |
+ +------+-------+------+
+ ```
+
+## Best Practices
+
+*Loading static variables*
+
+Currently, in Doris, executing a UDF function, e.g., `select udf(col) from
table`, loads the udf.jar package for each concurrent instance and unloads it
when the instance finishes.
+
+If the udf.jar file needs to load a file of several hundred MBs, the memory
usage will increase sharply due to concurrency, potentially leading to OOM (Out
of Memory).
+
+Similarly, if you want to use a connection pool, this per-instance loading
prevents you from initializing the pool just once in a static block.
+
+Here are two solutions; the second requires Doris 3.0 or later.
+
+*Solution 1:*
+
+The solution is to split the resource loading code, generate a separate jar
package, and have other packages directly reference this resource jar package.
+
+Assume the files have been split into `DictLibrary` and `FunctionUdf`.
+
+```java
+public class DictLibrary {
+ private static HashMap<String, String> res = new HashMap<>();
+
+ static {
+ // suppose we built this dictionary from a certain local file.
+ res.put("key1", "value1");
+ res.put("key2", "value2");
+ res.put("key3", "value3");
+ res.put("0", "value4");
+ res.put("1", "value5");
+ res.put("2", "value6");
+ }
+
+ public static String evaluate(String key) {
+ if (key == null) {
+ return null;
}
+ return res.get(key);
+ }
+}
+```
+
+```java
+public class FunctionUdf {
+ public String evaluate(String key) {
+ String value = DictLibrary.evaluate(key);
+ return value;
}
+}
+```
+
+1. Compile the DictLibrary file separately to generate an independent jar
package, resulting in a resource file DictLibrary.jar:
+
+ ```shell
+ javac ./DictLibrary.java
+ jar -cf ./DictLibrary.jar ./DictLibrary.class
```
2. Then compile the FunctionUdf file, directly referencing the resource
package from the previous step, resulting in the FunctionUdf.jar package:
@@ -412,28 +468,76 @@ Assume the files have been split into DictLibrary and
FunctionUdf.
jar -cvf ./FunctionUdf.jar ./FunctionUdf.class
```
-3. After the above two steps, you will get two jar packages. To allow the
resource jar package to be referenced by all concurrent instances, place it in
the deployment path `fe/custom_lib` 和 `be/custom_lib`. After the restarting, it
will be loaded with the JVM startup.
+3. After the above two steps, you will get two jar packages. To allow the
resource jar package to be referenced by all concurrent instances, place it in
the deployment path `be/custom_lib`. After a restart, it will be loaded at JVM
startup, so the resources are loaded when the service starts and released when
the service stops.
4. Finally, use the `create function` statement to create a UDF function
```sql
CREATE FUNCTION java_udf_dict(string) RETURNS string PROPERTIES (
+ "file"="file:///pathTo/FunctionUdf.jar",
"symbol"="org.apache.doris.udf.FunctionUdf",
"always_nullable"="true",
"type"="JAVA_UDF"
);
```
-In this loading mode, both FunctionUdf.jar and DictLibrary.jar are in the
custom_lib path of FE and BE. This way, the packages will be loaded and
released with the service startup and shutdown.
+*Solution 2:*
-You can also customize the path to `FunctionUdf.jar` using `file:///`, but
only under `custom_lib`.
+The BE (Backend) caches the JAR file globally, with a configurable expiration
and eviction time. When creating a function, two additional properties are
supported:
-## Usage Notes
+- `static_load`: defines whether to use the static cache loading method.
+- `expiration_time`: defines the expiration time of the JAR file, in minutes.
+If the static cache loading method is used, the UDF instance will be cached
after the first call and initialization. On subsequent calls to the UDF, the
system will first search in the cache. If not found, the initialization process
will be triggered.
-1. Complex data types (HLL, Bitmap) are not supported.
+Additionally, a background thread regularly checks the cache. If the function
has not been called within the configured expiration time, it will be evicted
from the cache. If the function is called, the cache timestamp will be
automatically updated.
-2. Users are currently allowed to specify the maximum JVM heap size. The
configuration item is the `-Xmx` part of `JAVA_OPTS` in `be.conf`. The default
is 1024m. If you need to aggregate data, it is recommended to increase this
value to enhance performance and reduce the risk of memory overflow.
+```java
+public class Print extends UDF {
+ static Integer val = 0;
+ public Integer evaluate() {
+ val = val + 1;
+ return val;
+ }
+}
+```
+
+```sql
+CREATE FUNCTION print_12() RETURNS int
+PROPERTIES (
+ "file" = "file:///path/to/java-udf-demo-jar-with-dependencies.jar",
+ "symbol" = "org.apache.doris.udf.Print",
+ "always_nullable"="true",
+ "type" = "JAVA_UDF",
+ "static_load" = "true", -- default value is false
+ "expiration_time" = "60" -- default value is 360 minutes
+);
+```
-3. For Char type UDFs, use the String type when creating the function.
+Calling the function repeatedly, the result keeps incrementing, which proves
that the loaded JAR file is not being unloaded and reloaded, and that the
static variable is not re-initialized to 0.
-4. Due to issues with JVM loading classes with the same name, do not use
multiple classes with the same name as UDF implementations simultaneously. If
you want to update a UDF with a class of the same name, you need to restart BE
to reload the classpath.
+```sql
+mysql [test_query_qa]>select print_12();
++------------+
+| print_12() |
++------------+
+| 1 |
++------------+
+1 row in set (0.40 sec)
+
+mysql [test_query_qa]>select print_12();
++------------+
+| print_12() |
++------------+
+| 2 |
++------------+
+1 row in set (0.03 sec)
+
+mysql [test_query_qa]>select print_12();
++------------+
+| print_12() |
++------------+
+| 3 |
++------------+
+1 row in set (0.04 sec)
+
+```
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]