corgy-w commented on issue #7192: URL: https://github.com/apache/seatunnel/issues/7192#issuecomment-2351419854
# Weaviate ## Data Format Let's consider the vector data format from Milvus as follows: | Field | Type | Description | | -------------- | ------------ | ----------- | | book_name | string | andersen | | book_id | int64 | 1 | | summary | string | fairy tale | | summary_vector | float_vector | ..... | | name_vector | float_vector | ..... | When synchronizing to Weaviate, several challenges arise. First, let’s refer to the [Weaviate data format](https://weaviate.io/developers/weaviate/manage-data/collections): ``` Schema ├── Class 1 │ ├── Property 1 │ ├── Property 2 │ └── Object/Instance 1, Object/Instance 2, ... ├── Class 2 │ ├── Property 1 │ ├── Property 2 │ └── Object/Instance 1, Object/Instance 2, ... └── Class 3 ├── Property 1 ├── Property 2 └── Object/Instance 1, Object/Instance 2, ... ``` In this mapping: - Milvus table name corresponds to the Weaviate class name. - Milvus field name corresponds to the Weaviate property name. Vectors in Weaviate are treated as class properties, where a class can have one vector field representing a specific attribute. ### Preliminary Design: 1. **As Sink** 1. **Data without existing vector fields**: Weaviate's vector database has an embedded OpenAI vector transformation engine, allowing you to specify a field to automatically generate vectors, which can then be synchronized into Weaviate. 2. **Data with one vector field**: You can manually specify which field is the primary attribute and which one holds the vector data. Example: If `book_name` is the main field and `name_vector` is the vector field: ```java float[] customVector = {0.1f, 0.2f, 0.3f}; object.set(bookName, true); // true indicates the vector field object.setVector(customVector); // manually set the vector, bypassing Weaviate auto-vectorization ``` 3. **Data with multiple vector fields**: In cases where there are two or more vector fields, it’s necessary to ignore additional vectors beyond the configured field. 2. **As Source** 1. You need to specify which field within the class corresponds to the primary vector field; otherwise, data retrieval won’t work as expected. The vector field name should be configured appropriately for data access. --- ## Application Scenarios **Weaviate for fast embedded vector processing, Milvus for large-scale vector management**: Weaviate provides built-in vectorizers (such as transformers, Cohere, OpenAI) that automatically convert text, images, and other data into vectors, and manage the entire process of storage and querying. This makes it especially suited for small-scale data analysis and embedded vector processing. Projects that require a quick setup and simplified data processing workflows benefit from Weaviate’s all-in-one approach, allowing easy implementation of embedding transformations.**Therefore, I prefer the first use of sink** On the other hand, **Milvus excels in managing large-scale vector data**. It delegates the vector generation process to external tools, allowing users greater flexibility in controlling vector creation and insertion, supporting efficient vector retrieval and complex queries. This makes Milvus ideal for handling large volumes of vector data, where custom control over the data pipeline is essential. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
