loupipalien opened a new issue, #9611:
URL: https://github.com/apache/seatunnel/issues/9611

   ### Search before asking
   
   - [x] I had searched in the [issues](https://github.com/apache/seatunnel/issues?q=is%3Aissue+label%3A%22bug%22) and found no similar issues.
   
   
   ### What happened
   
   The `AbstractModel#vectorization` method
   ([link](https://github.com/apache/seatunnel/blob/dev/seatunnel-transforms-v2/src/main/java/org/apache/seatunnel/transform/nlpmodel/embedding/remote/AbstractModel.java#L42))
   returns each vector as an array of doubles (8 bytes per element), but the Elasticsearch sink
   ([link](https://github.com/apache/seatunnel/blob/dev/seatunnel-connectors-v2/connector-elasticsearch/src/main/java/org/apache/seatunnel/connectors/seatunnel/elasticsearch/serialize/ElasticsearchRowSerializer.java#L221))
   reads vector fields as an array of floats (4 bytes per element). As a result, the vector written to Elasticsearch has twice the expected number of dimensions, and half of its values are zeros.
   
   <img width="2824" height="670" alt="Image" src="https://github.com/user-attachments/assets/30162cd0-8c28-4bc9-bfab-a4e0b7650d07" />
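   
   The mismatch can be reproduced outside SeaTunnel with plain `java.nio.ByteBuffer`. The following is a minimal sketch, not the project's code: it assumes the vector travels between the transform and the sink as a byte buffer, and the class and method names (`VectorWidthMismatch`, `encodeAsDoubles`, `decodeAsFloats`, `decodeDoublesToFloats`) are hypothetical.
   
   ```java
   import java.nio.ByteBuffer;
   import java.util.Arrays;
   
   public class VectorWidthMismatch {
   
       // Writer side (assumed): pack each element as an 8-byte double.
       static ByteBuffer encodeAsDoubles(double[] vector) {
           ByteBuffer buf = ByteBuffer.allocate(vector.length * Double.BYTES);
           for (double v : vector) {
               buf.putDouble(v);
           }
           buf.flip(); // switch the buffer from writing to reading
           return buf;
       }
   
       // Reader side (assumed): interpret the same bytes as 4-byte floats.
       // Every double is split into two floats, so the length doubles.
       static float[] decodeAsFloats(ByteBuffer buf) {
           ByteBuffer view = buf.duplicate(); // leave the caller's position untouched
           float[] out = new float[view.remaining() / Float.BYTES];
           for (int i = 0; i < out.length; i++) {
               out[i] = view.getFloat();
           }
           return out;
       }
   
       // One possible fix: read 8-byte doubles and narrow each one to a float.
       static float[] decodeDoublesToFloats(ByteBuffer buf) {
           ByteBuffer view = buf.duplicate();
           float[] out = new float[view.remaining() / Double.BYTES];
           for (int i = 0; i < out.length; i++) {
               out[i] = (float) view.getDouble();
           }
           return out;
       }
   
       public static void main(String[] args) {
           ByteBuffer buf = encodeAsDoubles(new double[] {0.5, -0.25, 1.0});
           // Mismatched read: 6 elements instead of 3
           System.out.println(Arrays.toString(decodeAsFloats(buf)));
           // Matching read: the original 3 values
           System.out.println(Arrays.toString(decodeDoublesToFloats(buf)));
       }
   }
   ```
   
   For these sample values the mismatched read prints `[1.75, 0.0, -1.625, 0.0, 1.875, 0.0]`: the upper four bytes of each double decode to an unrelated float and the lower four bytes decode to `0.0`. The matching read prints `[0.5, -0.25, 1.0]`. An alternative fix would be to have the writer emit floats in the first place; which side should change is a decision for the maintainers.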
   
   ### SeaTunnel Version
   
   dev/2.3.12-SNAPSHOT
   
   ### SeaTunnel Config
   
   ```conf
   env {
     parallelism = 1
     job.mode = "BATCH"
   }
   
   source {
     S3File {
       path = "/seatunnel/test_csv_data.csv"
       bucket = "s3a://ltchen"
       fs.s3a.endpoint="tos-s3-cn-beijing.volces.com"
       fs.s3a.aws.credentials.provider="org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider"
       file_format_type = "csv"
       access_key="xxx"
       secret_key="xxx",
       csv_use_header_line = true,
       field_delimiter = ","
       schema={
           fields {
               code = int
               data = string
               success = boolean
           }
       }
     }
   }
   
   transform {
     Embedding {
       model_provider = "DOUBAO"
       model = "doubao-embedding-text-240715"
       api_key = "xxx"
       secret_key = "xxx"
       vectorization_fields {
           data_vector = data
       },
       custom_config={
         custom_response_parse = "$.data[*].embedding"
         custom_request_headers = {
             "Content-Type"= "application/json"
             "Authorization"= "Bearer ${api_key}"
         }
         custom_request_body ={
             model = "${model}"
             input = ["${input}"]
         }
       }
     }
   }
   
   sink {
       Elasticsearch {
           hosts = ["http://127.0.0.1:9200"]
           index = "seatunnel-ltchen-embedding"
           schema_save_mode="RECREATE_SCHEMA"
           data_save_mode="APPEND_DATA"
           vectorization_fields = ["data_vector"]
       }
   }
   ```
   
   ### Running Command
   
   ```shell
   bin/seatunnel.sh -c jobs/s32es_embedding.conf -m local
   ```
   
   ### Error Exception
   
   ```log
   no exception
   ```
   
   ### Zeta or Flink or Spark Version
   
   _No response_
   
   ### Java or Scala Version
   
   _No response_
   
   ### Screenshots
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [x] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)
   

