Re: [PR] [Feature][Transform] Add embedding transform [seatunnel]

via GitHub Mon, 02 Sep 2024 05:14:48 -0700


Hisoka-X commented on code in PR #7534:
URL: https://github.com/apache/seatunnel/pull/7534#discussion_r1740810870



##########
docs/zh/transform-v2/embedding.md:
##########
@@ -0,0 +1,356 @@
+# Embedding
+
+> Embedding Transform Plugin
+
+## 描述
+
+`Embedding` 转换插件利用 embedding 模型将文本数据转换为向量化表示。此转换可以应用于各种字段。该插件支持多种模型提供商，并且可以与不同的API集成。
+
+## 配置选项
+
+| 名称                       | 类型     | 是否必填 | 默认值 | 描述 |
+|--------------------------|--------|------|-----|------------------------------------------------------------------|
+| model_provider           | enum   | 是    | -   | embedding模型的提供商。可选项包括 `QIANFAN`、`OPENAI` 等。 |
+| api_key                  | string | 是    | -   | 用于验证embedding服务的API密钥。 |
+| secret_key               | string | 是    | -   | 用于额外验证的密钥。一些提供商可能需要此密钥进行安全的API请求。 |
+| single_vectorized_input_number | int    | 否    | 1   | 单次请求向量化的输入数量。默认值为1。 |
+| vectorization_fields     | map    | 是    | -   | 输入字段和相应的输出向量字段之间的映射。 |
+| model                    | string | 是    | -   | 要使用的具体embedding模型。例如，如果提供商为OPENAI，可以指定 `text-embedding-3-small`。 |
+| api_path                 | string | 否    | -   | embedding服务的API。通常由模型提供商提供。 |
+| oauth_path               | string | 否    | -   | oauth 服务的 API 。 |
+| custom_config            | map    | 否    |     | 模型的自定义配置。 |
+| custom_response_parse    | string | 否    |     | 使用 JsonPath 解析模型响应的方式。示例：`$.choices[*].message.content`。 |
+| custom_request_headers   | map    | 否    |     | 发送到模型的请求的自定义头信息。 |
+| custom_request_body      | map    | 否    |     | 请求体的自定义配置。支持占位符如 `${model}`、`${input}`、`${prompt}`。 |
+
+### embedding_model_provider
+
+用于生成 embedding 的模型提供商。常见选项可能包括 `QIANFAN`、`OPENAI` 等。根据提供商的不同，可用的模型和API路径也可能不同。
+
+### api_key
+
+用于验证 embedding 服务请求的API密钥。通常由模型提供商在你注册他们的服务时提供。
+
+### secret_key
+
+用于额外验证的密钥。一些提供商可能要求此密钥以确保API请求的安全性。
+
+### single_vectorized_input_number
+
+指定单次请求向量化的输入数量。默认值为1。根据处理能力和模型提供商的API限制进行调整。
+
+### vectorization_fields
+
+输入字段和相应的输出向量字段之间的映射。这使得插件可以理解要向量化的文本字段以及如何存储生成的向量。
+
+```hocon
+vectorization_fields {
+    book_intro_vector = book_intro
+    author_biography_vector  = author_biography
+}
+```
+
+### model
+
+要使用的具体 embedding 模型。这取决于`embedding_model_provider`。例如，如果使用 OPENAI ，可以指定 `text-embedding-3-small`。
+
+### api_path
+
+用于向 embedding 服务发送请求的API。根据提供商和所用模型的不同可能有所变化。通常由模型提供商提供。
+
+### oauth_path
+
+用于向oauth服务发送请求的API,获取对应的认证信息。根据提供商和所用模型的不同可能有所变化。通常由模型提供商提供。
+
+### custom_config
+
+`custom_config` 选项允许您为模型提供额外的自定义配置。这是一个映射，您可以在其中定义特定模型可能需要的各种设置。
+
+### custom_response_parse
+
+`custom_response_parse` 选项允许您指定如何解析模型的响应。您可以使用 JsonPath

Review Comment:
   How about linking to the JsonPath documentation, so users know how to configure it?
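
For readers unfamiliar with JsonPath, the example expression from the options table, `$.choices[*].message.content`, selects the `content` value under `message` for every element of the top-level `choices` array. The sketch below reproduces that selection in plain Python. It is a hand-written equivalent for illustration only, not the plugin's parser, and the sample response and the function name `select_choice_contents` are made up for this example.

```python
# Hypothetical response shape, invented for illustration only.
response = {
    "choices": [
        {"message": {"content": "first"}},
        {"message": {"content": "second"}},
    ]
}


def select_choice_contents(doc):
    """Hand-written equivalent of the JsonPath `$.choices[*].message.content`.

    Elements that lack the `message.content` path are skipped rather than
    raising, so a response with no matches yields an empty list.
    """
    results = []
    for choice in doc.get("choices", []):
        message = choice.get("message") if isinstance(choice, dict) else None
        if isinstance(message, dict) and "content" in message:
            results.append(message["content"])
    return results


contents = select_choice_contents(response)
print(contents)
```

The result is always a list because `[*]` is a wildcard over array elements, so a single-element `choices` array still yields a one-item list.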



##########
docs/en/transform-v2/embedding.md:
##########
@@ -0,0 +1,366 @@
+# Embedding
+
+> Embedding Transform Plugin
+
+## Description
+
+The `Embedding` transform plugin leverages embedding models to convert text data into vectorized representations. This
+transformation can be applied to various fields. The plugin supports multiple model providers and can be integrated with
+different API endpoints.
+
+## Options
+
+| Name                           | Type   | Required | Default Value | Description |
+|--------------------------------|--------|----------|---------------|-------------------------------------------------------------------------------------------------------------|
+| model_provider                 | enum   | yes      | -             | The model provider for embedding. Options may include `QIANFAN`, `OPENAI`, etc. |
+| api_key                        | string | yes      | -             | The API key required to authenticate with the embedding service. |
+| secret_key                     | string | yes      | -             | The secret key required for additional authentication with the embedding service. |
+| single_vectorized_input_number | int    | no       | 1             | The number of inputs vectorized in one request. Default is 1. |
+| vectorization_fields           | map    | yes      | -             | A mapping between input fields and their corresponding output vector fields. |
+| model                          | string | yes      | -             | The specific model to use for embedding (e.g: `text-embedding-3-small` for OPENAI). |
+| api_path                       | string | no       | -             | The API endpoint for the embedding service. Typically provided by the model provider. |
+| oauth_path                     | string | no       | -             | The API endpoint for the oauth service. |
+| custom_config                  | map    | no       |               | Custom configurations for the model. |
+| custom_response_parse          | string | no       |               | Specifies how to parse the response from the model using JsonPath. Example: `$.choices[*].message.content`. |
+| custom_request_headers         | map    | no       |               | Custom headers for the request to the model. |
+| custom_request_body            | map    | no       |               | Custom body for the request. Supports placeholders like `${model}`, `${input}`, `${prompt}`. |
+
+### model_provider
+
+The model provider to use for generating embeddings. Common options might include `QIANFAN`, `OPENAI`, etc. Depending on

Review Comment:
   Why not list all supported models?


