corgy-w commented on issue #7192:
URL: https://github.com/apache/seatunnel/issues/7192#issuecomment-2351419854

   # Weaviate
   
   ## Data Format
   
   Let's consider the vector data format from Milvus as follows:
   
   | Field          | Type         | Example     |
   | -------------- | ------------ | ----------- |
   | book_name      | string       | andersen    |
   | book_id        | int64        | 1           |
   | summary        | string       | fairy tale  |
   | summary_vector | float_vector | .....       |
   | name_vector    | float_vector | .....       |
   
   When synchronizing to Weaviate, several challenges arise.
   
   First, let’s refer to the [Weaviate data format](https://weaviate.io/developers/weaviate/manage-data/collections):
   
   ```
   Schema
     ├── Class 1
     │     ├── Property 1
     │     ├── Property 2
     │     └── Object/Instance 1, Object/Instance 2, ...
     ├── Class 2
     │     ├── Property 1
     │     ├── Property 2
     │     └── Object/Instance 1, Object/Instance 2, ...
     └── Class 3
           ├── Property 1
           ├── Property 2
           └── Object/Instance 1, Object/Instance 2, ...
   ```
   
   In this mapping:
   
   - The Milvus collection (table) name corresponds to the Weaviate class name.
   - Milvus field name corresponds to the Weaviate property name.
   
   Vectors in Weaviate are treated as class properties, where a class can have 
one vector field representing a specific attribute.
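
   As a rough illustration of this mapping, the sketch below splits a Milvus-style schema into a class name, scalar properties, and candidate vector fields. `WeaviateClassSketch`, `fromMilvus`, and the table name `book` are hypothetical names used only for this example; none of them are Weaviate, Milvus, or SeaTunnel APIs.

   ```java
   import java.util.ArrayList;
   import java.util.LinkedHashMap;
   import java.util.List;
   import java.util.Map;

   // Hypothetical holder for the mapped schema; not a Weaviate or SeaTunnel type.
   final class WeaviateClassSketch {
       final String className;
       // property name -> source type, in the original field order
       final Map<String, String> properties = new LinkedHashMap<>();
       // fields of type float_vector; only one of them can become the class vector
       final List<String> vectorFields = new ArrayList<>();

       WeaviateClassSketch(String className) {
           this.className = className;
       }

       // Maps a Milvus-style schema (field name -> type name) onto class and property names.
       static WeaviateClassSketch fromMilvus(String tableName, Map<String, String> fields) {
           WeaviateClassSketch cls = new WeaviateClassSketch(tableName);
           for (Map.Entry<String, String> field : fields.entrySet()) {
               if ("float_vector".equals(field.getValue())) {
                   cls.vectorFields.add(field.getKey());
               } else {
                   cls.properties.put(field.getKey(), field.getValue());
               }
           }
           return cls;
       }
   }
   ```

   For the table above, this would give three properties (`book_name`, `book_id`, `summary`) and two candidate vector fields, which is exactly where the challenges discussed next come from.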
   
   ### Preliminary Design:
   
   1. **As Sink**
   
      1. **Data without existing vector fields**:
   
         Weaviate provides built-in vectorizer modules (for example, OpenAI), so you can specify a field from which vectors are generated automatically when the data is synchronized into Weaviate.
   
      2. **Data with one vector field**:
   
         You can manually specify which field is the primary attribute and 
which one holds the vector data.
   
         Example:
   
         If `book_name` is the main field and `name_vector` is the vector field:
   
         ```java
         float[] customVector = {0.1f, 0.2f, 0.3f};
         object.set(bookName, true);  // true indicates the vector field
         // manually set the vector, bypassing Weaviate auto-vectorization
         object.setVector(customVector);
         ```
   
      3. **Data with multiple vector fields**:
   
         In cases where there are two or more vector fields, it’s necessary to 
ignore additional vectors beyond the configured field.
   
   2. **As Source**
   
      1. You need to specify which field within the class corresponds to the 
primary vector field; otherwise, data retrieval won’t work as expected.
   
         The vector field name should be configured appropriately for data 
access.
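
   The three sink cases in the list above could be handled by one conversion step, sketched below under some assumptions: a row is a plain `Map`, vector values arrive as `float[]`, and the configured vector field is passed in as a string. `WeaviateObjectSketch` and `fromRow` are hypothetical names for illustration, not part of the Weaviate client or SeaTunnel.

   ```java
   import java.util.LinkedHashMap;
   import java.util.Map;

   // Hypothetical result type: scalar properties plus at most one vector.
   final class WeaviateObjectSketch {
       final Map<String, Object> properties;
       final float[] vector; // null means: leave vectorization to Weaviate (case 1)

       WeaviateObjectSketch(Map<String, Object> properties, float[] vector) {
           this.properties = properties;
           this.vector = vector;
       }

       // vectorField may be null when no vector from the row should be used.
       static WeaviateObjectSketch fromRow(Map<String, Object> row, String vectorField) {
           Map<String, Object> props = new LinkedHashMap<>();
           float[] vector = null;
           for (Map.Entry<String, Object> entry : row.entrySet()) {
               Object value = entry.getValue();
               if (value instanceof float[]) {
                   // Cases 2 and 3: keep only the configured vector, ignore the others.
                   if (entry.getKey().equals(vectorField)) {
                       vector = (float[]) value;
                   }
               } else {
                   props.put(entry.getKey(), value);
               }
           }
           if (vectorField != null && vector == null) {
               throw new IllegalArgumentException(
                       "Configured vector field not found: " + vectorField);
           }
           return new WeaviateObjectSketch(props, vector);
       }
   }
   ```

   With `name_vector` configured, a row from the example table would keep `book_name`, `book_id`, and `summary` as properties, use `name_vector` as the object vector, and drop `summary_vector`. Failing fast on a missing configured field is a design choice of this sketch, not something the proposal prescribes.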
   
   ---
   
   ## Application Scenarios
   
   **Weaviate for fast embedded vector processing, Milvus for large-scale vector management**: Weaviate provides built-in vectorizers (such as transformers, Cohere, OpenAI) that automatically convert text, images, and other data into vectors, and it manages the entire process of storage and querying. This makes it especially suited for small-scale data analysis and embedded vector processing. Projects that require a quick setup and simplified data processing workflows benefit from Weaviate’s all-in-one approach, which makes embedding transformations easy to implement. **Therefore, I prefer the first sink case (data without existing vector fields).**
   
   On the other hand, **Milvus excels in managing large-scale vector data**. It 
delegates the vector generation process to external tools, allowing users 
greater flexibility in controlling vector creation and insertion, supporting 
efficient vector retrieval and complex queries. This makes Milvus ideal for 
handling large volumes of vector data, where custom control over the data 
pipeline is essential. 

