This is an automated email from the ASF dual-hosted git repository.
wuchunfu pushed a commit to branch dev
in repository https://gitbox.apache.org/repos/asf/seatunnel.git
The following commit(s) were added to refs/heads/dev by this push:
new f3ca5a4faf [Feature] [Zeta] Optimized llm doc && add DOUBAO LLM (#7584)
f3ca5a4faf is described below
commit f3ca5a4faf399a233cae104dd616e4aa36f4b1dd
Author: corgy-w <[email protected]>
AuthorDate: Thu Sep 5 14:21:45 2024 +0800
[Feature] [Zeta] Optimized llm doc && add DOUBAO LLM (#7584)
---
docs/en/transform-v2/llm.md | 156 +++++++++++++++++++--
docs/zh/transform-v2/llm.md | 156 +++++++++++++++++++--
.../src/test/resources/llm_transform_custom.conf | 17 ++-
.../transform/nlpmodel/ModelProvider.java | 4 +-
.../transform/nlpmodel/llm/LLMTransform.java | 2 +-
5 files changed, 307 insertions(+), 28 deletions(-)
diff --git a/docs/en/transform-v2/llm.md b/docs/en/transform-v2/llm.md
index b2142c4297..8caaad00a0 100644
--- a/docs/en/transform-v2/llm.md
+++ b/docs/en/transform-v2/llm.md
@@ -10,19 +10,23 @@ more.
## Options
-| name | type | required | default value |
-|------------------|--------|----------|---------------|
-| model_provider | enum | yes | |
-| output_data_type | enum | no | String |
-| prompt | string | yes | |
-| model | string | yes | |
-| api_key | string | yes | |
-| api_path | string | no | |
+| name | type | required | default value |
+|------------------------|--------|----------|---------------|
+| model_provider | enum | yes | |
+| output_data_type | enum | no | String |
+| prompt | string | yes | |
+| model | string | yes | |
+| api_key | string | yes | |
+| api_path | string | no | |
+| custom_config | map | no | |
+| custom_response_parse | string | no | |
+| custom_request_headers | map | no | |
+| custom_request_body | map | no | |
### model_provider
The model provider to use. The available options are:
-OPENAI
+OPENAI, DOUBAO, CUSTOM
### output_data_type
@@ -74,6 +78,61 @@ If you use OpenAI model, please refer https://platform.openai.com/docs/api-refer
The API path to use for the model provider. In most cases, you do not need to change this configuration. If you
are using an API agent's service, you may need to configure it to the agent's API address.
+### custom_config
+
+The `custom_config` option allows you to provide additional custom configurations for the model. This is a map where you
+can define various settings that might be required by the specific model you're using.
+
+### custom_response_parse
+
+The `custom_response_parse` option allows you to specify how to parse the model's response. You can use JsonPath to
+extract the specific data you need from the response. For example, by using `$.choices[*].message.content`, you can
+extract the `content` field values from the following JSON. For more details on using JsonPath, please refer to
+the [JsonPath Getting Started guide](https://github.com/json-path/JsonPath?tab=readme-ov-file#getting-started).
+
+```json
+{
+ "id": "chatcmpl-9s4hoBNGV0d9Mudkhvgzg64DAWPnx",
+ "object": "chat.completion",
+ "created": 1722674828,
+ "model": "gpt-4o-mini",
+ "choices": [
+ {
+ "index": 0,
+ "message": {
+ "role": "assistant",
+ "content": "[\"Chinese\"]"
+ },
+ "logprobs": null,
+ "finish_reason": "stop"
+ }
+ ],
+ "usage": {
+ "prompt_tokens": 107,
+ "completion_tokens": 3,
+ "total_tokens": 110
+ },
+ "system_fingerprint": "fp_0f03d4f0ee",
+ "code": 0,
+ "msg": "ok"
+}
+```
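As a quick illustration of what that JsonPath expression selects, here is a minimal Python sketch applied to a trimmed copy of the response above. This is hand-rolled navigation for illustration only; the transform itself uses a JsonPath library, not this code.

```python
# Minimal sketch of what `$.choices[*].message.content` selects from the
# response shown above (hand-rolled; the transform uses a JsonPath library).
response = {
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "[\"Chinese\"]"},
            "finish_reason": "stop",
        }
    ]
}

# `$.choices[*]` walks every element of the `choices` array, then
# `.message.content` drills into each element.
contents = [choice["message"]["content"] for choice in response["choices"]]
print(contents)  # a one-element list holding the assistant's content string
```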
+
+### custom_request_headers
+
+The `custom_request_headers` option allows you to define custom headers that should be included in the request sent to
+the model's API. This is useful if the API requires additional headers beyond the standard ones, such as authorization
+tokens, content types, etc.
+
+### custom_request_body
+
+The `custom_request_body` option supports placeholders:
+
+- `${model}`: Placeholder for the model name.
+- `${input}`: Placeholder for the input value; the type of the body value also determines the request body type.
+  Example: `"${input}"` -> "input"
+- `${prompt}`: Placeholder for LLM model prompts.
+
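The placeholder substitution described above can be sketched as follows. This is an assumption about equivalent behavior, not SeaTunnel's actual implementation, and `fill` is a hypothetical helper introduced only for this example.

```python
import json

# Hypothetical sketch of how `${model}`, `${prompt}` and `${input}` could be
# filled into a custom_request_body template (assumption: the transform
# performs an equivalent substitution internally).
body_template = {
    "model": "${model}",
    "messages": [
        {"role": "system", "content": "${prompt}"},
        {"role": "user", "content": "${input}"},
    ],
}

def fill(node, values):
    """Recursively replace ${...} placeholders in a nested body template."""
    if isinstance(node, dict):
        return {k: fill(v, values) for k, v in node.items()}
    if isinstance(node, list):
        return [fill(v, values) for v in node]
    if isinstance(node, str):
        for key, val in values.items():
            node = node.replace("${%s}" % key, val)
        return node
    return node

body = fill(body_template, {
    "model": "gpt-4o-mini",
    "prompt": "Determine whether someone is Chinese or American by their name",
    "input": "Jia Fan",
})
print(json.dumps(body))
```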
### common options [string]
Transform plugin common parameters, please refer to [Transform Plugin](common-options.md) for details
@@ -122,3 +181,82 @@ sink {
}
```
+### Customize the LLM model
+
+```hocon
+env {
+ job.mode = "BATCH"
+}
+
+source {
+ FakeSource {
+ row.num = 5
+ schema = {
+ fields {
+ id = "int"
+ name = "string"
+ }
+ }
+ rows = [
+ {fields = [1, "Jia Fan"], kind = INSERT}
+ {fields = [2, "Hailin Wang"], kind = INSERT}
+ {fields = [3, "Tomas"], kind = INSERT}
+ {fields = [4, "Eric"], kind = INSERT}
+ {fields = [5, "Guangdong Liu"], kind = INSERT}
+ ]
+ result_table_name = "fake"
+ }
+}
+
+transform {
+ LLM {
+ source_table_name = "fake"
+ model_provider = CUSTOM
+ model = gpt-4o-mini
+ api_key = sk-xxx
+ prompt = "Determine whether someone is Chinese or American by their name"
+ openai.api_path = "http://mockserver:1080/v1/chat/completions"
+ custom_config={
+ custom_response_parse = "$.choices[*].message.content"
+ custom_request_headers = {
+ Content-Type = "application/json"
+ Authorization = "Bearer xxxxxxxx"
+ }
+ custom_request_body ={
+ model = "${model}"
+ messages = [
+ {
+ role = "system"
+ content = "${prompt}"
+ },
+ {
+ role = "user"
+ content = "${input}"
+ }]
+ }
+ }
+ result_table_name = "llm_output"
+ }
+}
+
+sink {
+ Assert {
+ source_table_name = "llm_output"
+ rules =
+ {
+ field_rules = [
+ {
+ field_name = llm_output
+ field_type = string
+ field_value = [
+ {
+ rule_type = NOT_NULL
+ }
+ ]
+ }
+ ]
+ }
+ }
+}
+```
+
diff --git a/docs/zh/transform-v2/llm.md b/docs/zh/transform-v2/llm.md
index 3a6904b667..5efcf47125 100644
--- a/docs/zh/transform-v2/llm.md
+++ b/docs/zh/transform-v2/llm.md
@@ -8,19 +8,23 @@
## Options
-| name             | type   | required | default value |
-|------------------|--------|----------|---------------|
-| model_provider   | enum   | yes      |               |
-| output_data_type | enum   | no       | String        |
-| prompt           | string | yes      |               |
-| model            | string | yes      |               |
-| api_key          | string | yes      |               |
-| api_path         | string | no       |               |
+| name                   | type   | required | default value |
+|------------------------|--------|----------|---------------|
+| model_provider         | enum   | yes      |               |
+| output_data_type       | enum   | no       | String        |
+| prompt                 | string | yes      |               |
+| model                  | string | yes      |               |
+| api_key                | string | yes      |               |
+| api_path               | string | no       |               |
+| custom_config          | map    | no       |               |
+| custom_response_parse  | string | no       |               |
+| custom_request_headers | map    | no       |               |
+| custom_request_body    | map    | no       |               |
### model_provider
The model provider to use. The available options are:
-OPENAI
+OPENAI, DOUBAO, CUSTOM
### output_data_type
@@ -59,7 +63,8 @@ Determine whether someone is Chinese or American by their name
### model
The model to use. Different model providers have different models. For example, the OpenAI model can be `gpt-4o-mini`.
-If you use an OpenAI model, please refer to https://platform.openai.com/docs/models/model-endpoint-compatibility for the `/v1/chat/completions` endpoint.
+If you use an OpenAI model, please refer to https://platform.openai.com/docs/models/model-endpoint-compatibility
+for the `/v1/chat/completions` endpoint.
### api_key
@@ -70,6 +75,57 @@ Determine whether someone is Chinese or American by their name
The API path to use for the model provider. In most cases, you do not need to change this configuration. If you use an API proxy's service, you may need to configure it to the proxy's API address.
+### custom_config
+
+The `custom_config` option allows you to provide additional custom configurations for the model. This is a map where you can define various settings the specific model may require.
+
+### custom_response_parse
+
+The `custom_response_parse` option allows you to specify how to parse the model's response. You can use JsonPath to
+extract the specific data you need from the response. For example, use `$.choices[*].message.content` to extract the
+`content` field values from the following JSON. For details on using JsonPath, see the [JsonPath Getting Started guide](https://github.com/json-path/JsonPath?tab=readme-ov-file#getting-started).
+
+```json
+{
+ "id": "chatcmpl-9s4hoBNGV0d9Mudkhvgzg64DAWPnx",
+ "object": "chat.completion",
+ "created": 1722674828,
+ "model": "gpt-4o-mini",
+ "choices": [
+ {
+ "index": 0,
+ "message": {
+ "role": "assistant",
+ "content": "[\"Chinese\"]"
+ },
+ "logprobs": null,
+ "finish_reason": "stop"
+ }
+ ],
+ "usage": {
+ "prompt_tokens": 107,
+ "completion_tokens": 3,
+ "total_tokens": 110
+ },
+ "system_fingerprint": "fp_0f03d4f0ee",
+ "code": 0,
+ "msg": "ok"
+}
+```
+
+### custom_request_headers
+
+The `custom_request_headers` option allows you to define custom headers to include in requests sent to the model's API.
+This is useful when the API requires headers beyond the standard ones, such as authorization tokens or content types.
+
+### custom_request_body
+
+The `custom_request_body` option supports placeholders:
+
+- `${model}`: Placeholder for the model name.
+- `${input}`: Placeholder for the input value; the type of the body value also determines the request body type.
+  Example: `"${input}"` -> "input".
+- `${prompt}`: Placeholder for LLM model prompts.
+
### common options [string]
Transform plugin common parameters, please refer to [Transform Plugin](common-options.md) for details
@@ -118,3 +174,83 @@ sink {
}
```
+### Customize the LLM model
+
+```hocon
+env {
+ job.mode = "BATCH"
+}
+
+source {
+ FakeSource {
+ row.num = 5
+ schema = {
+ fields {
+ id = "int"
+ name = "string"
+ }
+ }
+ rows = [
+ {fields = [1, "Jia Fan"], kind = INSERT}
+ {fields = [2, "Hailin Wang"], kind = INSERT}
+ {fields = [3, "Tomas"], kind = INSERT}
+ {fields = [4, "Eric"], kind = INSERT}
+ {fields = [5, "Guangdong Liu"], kind = INSERT}
+ ]
+ result_table_name = "fake"
+ }
+}
+
+transform {
+ LLM {
+ source_table_name = "fake"
+ model_provider = CUSTOM
+ model = gpt-4o-mini
+ api_key = sk-xxx
+ prompt = "Determine whether someone is Chinese or American by their name"
+ openai.api_path = "http://mockserver:1080/v1/chat/completions"
+ custom_config={
+ custom_response_parse = "$.choices[*].message.content"
+ custom_request_headers = {
+ Content-Type = "application/json"
+ Authorization = "Bearer xxxxxxxx"
+ }
+ custom_request_body ={
+ model = "${model}"
+ messages = [
+ {
+ role = "system"
+ content = "${prompt}"
+ },
+ {
+ role = "user"
+ content = "${input}"
+ }]
+ }
+ }
+ result_table_name = "llm_output"
+ }
+}
+
+sink {
+ Assert {
+ source_table_name = "llm_output"
+ rules =
+ {
+ field_rules = [
+ {
+ field_name = llm_output
+ field_type = string
+ field_value = [
+ {
+ rule_type = NOT_NULL
+ }
+ ]
+ }
+ ]
+ }
+ }
+}
+```
+
+
diff --git a/seatunnel-e2e/seatunnel-transforms-v2-e2e/seatunnel-transforms-v2-e2e-part-1/src/test/resources/llm_transform_custom.conf b/seatunnel-e2e/seatunnel-transforms-v2-e2e/seatunnel-transforms-v2-e2e-part-1/src/test/resources/llm_transform_custom.conf
index ac9e58addb..8f23fa9c1b 100644
--- a/seatunnel-e2e/seatunnel-transforms-v2-e2e/seatunnel-transforms-v2-e2e-part-1/src/test/resources/llm_transform_custom.conf
+++ b/seatunnel-e2e/seatunnel-transforms-v2-e2e/seatunnel-transforms-v2-e2e-part-1/src/test/resources/llm_transform_custom.conf
@@ -53,16 +53,19 @@ transform {
custom_config={
custom_response_parse = "$.choices[*].message.content"
custom_request_headers = {
- 111 = 222
+ Content-Type = "application/json"
+ Authorization = "Bearer b2e66711-10ed-495c-9f27-f233a8db09c2"
}
custom_request_body ={
model = "${model}"
- messages = [{
- role = "system"
- content = "${prompt}"
- },{
- role = "user"
- content = "${input}"
+ messages = [
+ {
+ role = "system"
+ content = "${prompt}"
+ },
+ {
+ role = "user"
+ content = "${input}"
}]
}
}
diff --git a/seatunnel-transforms-v2/src/main/java/org/apache/seatunnel/transform/nlpmodel/ModelProvider.java b/seatunnel-transforms-v2/src/main/java/org/apache/seatunnel/transform/nlpmodel/ModelProvider.java
index a5ab4dc84a..c14877816f 100644
--- a/seatunnel-transforms-v2/src/main/java/org/apache/seatunnel/transform/nlpmodel/ModelProvider.java
+++ b/seatunnel-transforms-v2/src/main/java/org/apache/seatunnel/transform/nlpmodel/ModelProvider.java
@@ -21,7 +21,9 @@ import org.apache.commons.lang3.StringUtils;
public enum ModelProvider {
OPENAI("https://api.openai.com/v1/chat/completions",
"https://api.openai.com/v1/embeddings"),
- DOUBAO("", "https://ark.cn-beijing.volces.com/api/v3/embeddings"),
+ DOUBAO(
+ "https://ark.cn-beijing.volces.com/api/v3/chat/completions",
+ "https://ark.cn-beijing.volces.com/api/v3/embeddings"),
QIANFAN("", "https://aip.baidubce.com/rpc/2.0/ai_custom/v1/wenxinworkshop/embeddings"),
CUSTOM("", ""),
LOCAL("", "");
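For readers skimming the diff: the new DOUBAO entry means the provider now has a default chat-completions endpoint. A rough Python sketch of the intended fallback, based on how `usedLLMPath` is called from LLMTransform (an assumption about behavior, not the Java code itself):

```python
# Hypothetical sketch (not SeaTunnel code): each provider carries a default
# LLM endpoint, and an explicitly configured api_path takes precedence.
DEFAULT_LLM_PATH = {
    "OPENAI": "https://api.openai.com/v1/chat/completions",
    "DOUBAO": "https://ark.cn-beijing.volces.com/api/v3/chat/completions",
    "CUSTOM": "",
}

def used_llm_path(provider, configured_path=None):
    # A configured path wins; otherwise fall back to the provider default.
    if configured_path:
        return configured_path
    return DEFAULT_LLM_PATH.get(provider, "")

print(used_llm_path("DOUBAO"))
```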
diff --git a/seatunnel-transforms-v2/src/main/java/org/apache/seatunnel/transform/nlpmodel/llm/LLMTransform.java b/seatunnel-transforms-v2/src/main/java/org/apache/seatunnel/transform/nlpmodel/llm/LLMTransform.java
index 10bde8df51..92db061ccc 100644
--- a/seatunnel-transforms-v2/src/main/java/org/apache/seatunnel/transform/nlpmodel/llm/LLMTransform.java
+++ b/seatunnel-transforms-v2/src/main/java/org/apache/seatunnel/transform/nlpmodel/llm/LLMTransform.java
@@ -92,6 +92,7 @@ public class LLMTransform extends SingleFieldOutputTransform {
.CUSTOM_RESPONSE_PARSE));
break;
case OPENAI:
+ case DOUBAO:
model =
new OpenAIModel(
inputCatalogTable.getSeaTunnelRowType(),
@@ -102,7 +103,6 @@ public class LLMTransform extends SingleFieldOutputTransform {
provider.usedLLMPath(config.get(LLMTransformConfig.API_PATH)));
break;
case QIANFAN:
- case DOUBAO:
default:
throw new IllegalArgumentException("Unsupported model provider: " + provider);
}