(fluss) 03/03: [docs] Add array type support documentation for Lance integration (#2579)

jark Sat, 07 Feb 2026 19:27:34 -0800

This is an automated email from the ASF dual-hosted git repository.

jark pushed a commit to branch release-0.9
in repository https://gitbox.apache.org/repos/asf/fluss.git


commit a1494f5a427ee567095e4a5ea71b78a0214bd60c
Author: ForwardXu <[email protected]>
AuthorDate: Sun Feb 8 11:18:06 2026 +0800

    [docs] Add array type support documentation for Lance integration (#2579)
---
 .../integrate-data-lakes/lance.md                  | 158 ++++++++++++++++++---
 1 file changed, 140 insertions(+), 18 deletions(-)

diff --git a/website/docs/streaming-lakehouse/integrate-data-lakes/lance.md b/website/docs/streaming-lakehouse/integrate-data-lakes/lance.md
index af435973e..9475bfc08 100644
--- a/website/docs/streaming-lakehouse/integrate-data-lakes/lance.md
+++ b/website/docs/streaming-lakehouse/integrate-data-lakes/lance.md
@@ -108,21 +108,143 @@ Lance internally stores data in Arrow format.
 When integrating with Lance, Fluss automatically converts between Fluss data types and Lance data types.  
 The following table shows the mapping between [Fluss data types](table-design/data-types.md) and Lance data types:
 
-| Fluss Data Type               | Lance Data Type |
-|-------------------------------|-----------------|
-| BOOLEAN                       | Bool            |
-| TINYINT                       | Int8            |
-| SMALLINT                      | Int16           |
-| INT                           | Int32           |
-| BIGINT                        | Int64           |
-| FLOAT                         | Float32         |
-| DOUBLE                        | Float64         |
-| DECIMAL                       | Decimal128      |
-| STRING                        | Utf8            |
-| CHAR                          | Utf8            |
-| DATE                          | Date            |
-| TIME                          | Time            |
-| TIMESTAMP                     | Timestamp       |
-| TIMESTAMP WITH LOCAL TIMEZONE | Timestamp       |
-| BINARY                        | FixedSizeBinary |
-| BYTES                         | Binary          |
\ No newline at end of file
+| Fluss Data Type               | Lance Data Type           |
+|-------------------------------|---------------------------|
+| BOOLEAN                       | Bool                      |
+| TINYINT                       | Int8                      |
+| SMALLINT                      | Int16                     |
+| INT                           | Int32                     |
+| BIGINT                        | Int64                     |
+| FLOAT                         | Float32                   |
+| DOUBLE                        | Float64                   |
+| DECIMAL                       | Decimal128                |
+| STRING                        | Utf8                      |
+| CHAR                          | Utf8                      |
+| DATE                          | Date                      |
+| TIME                          | Time                      |
+| TIMESTAMP                     | Timestamp                 |
+| TIMESTAMP WITH LOCAL TIMEZONE | Timestamp                 |
+| BINARY                        | FixedSizeBinary           |
+| BYTES                         | Binary                    |
+| ARRAY\<t\>                    | List (or FixedSizeList)   |
+
+## Array Type Support
+
+Fluss supports the `ARRAY<t>` data type, which is particularly useful for machine learning and AI applications, such as storing vector embeddings.
+When tiering data to Lance, Fluss automatically converts `ARRAY` columns to Lance's List type (or FixedSizeList for fixed-dimension arrays).
+
+### Use Cases for Array Type
+
+The array type is especially valuable for:
+- **Vector Embeddings**: Store embeddings generated by machine learning models (e.g., text embeddings, image embeddings)
+- **Multi-dimensional Features**: Store feature vectors for recommendation systems or analytics
+- **Time Series Data**: Store sequences of numerical data points
+- **Batch Processing**: Store collections of related values together
+
+### Creating Tables with Array Columns
+
+You can create a Fluss table with array columns that will be tiered to Lance:
+
+```sql title="Flink SQL"
+CREATE TABLE product_embeddings (
+    product_id BIGINT,
+    product_name STRING,
+    embedding ARRAY<FLOAT>,
+    tags ARRAY<STRING>,
+    PRIMARY KEY (product_id) NOT ENFORCED
+) WITH (
+    'table.datalake.enabled' = 'true',
+    'table.datalake.freshness' = '30s'
+);
+```
+
+### Vector Embedding Example
+
+Here's a complete example of using Lance with Fluss for vector embeddings in a recommendation system:
+
+```sql title="Flink SQL"
+USE CATALOG fluss_catalog;
+
+-- Create a table to store product embeddings
+CREATE TABLE product_vectors (
+    product_id BIGINT,
+    product_name STRING,
+    category STRING,
+    embedding ARRAY<FLOAT>,
+    created_time TIMESTAMP
+) WITH (
+    'table.datalake.enabled' = 'true',
+    'table.datalake.freshness' = '1min',
+    'lance.max_row_per_file' = '1024'
+);
+
+-- Insert sample data with embeddings
+INSERT INTO product_vectors VALUES
+(1, 'Laptop', 'Electronics', ARRAY[0.23, 0.45, 0.67, 0.12, 0.89], CURRENT_TIMESTAMP),
+(2, 'Phone', 'Electronics', ARRAY[0.21, 0.43, 0.65, 0.15, 0.87], CURRENT_TIMESTAMP),
+(3, 'Book', 'Media', ARRAY[0.78, 0.32, 0.11, 0.54, 0.23], CURRENT_TIMESTAMP);
+```
+
+### Reading Array Data from Lance
+
+Once data is tiered to Lance, you can use Lance's vector search capabilities to perform similarity searches:
+
+```python title="Lance Python with Vector Search"
+import lance
+import numpy as np
+
+# Connect to the Lance dataset
+ds = lance.dataset("s3://my-bucket/my_database/product_vectors.lance")
+
+# Read all data
+df = ds.to_table().to_pandas()
+print(df)
+
+# Perform vector similarity search
+# Query vector (e.g., embedding of a search query)
+query_vector = np.array([0.22, 0.44, 0.66, 0.13, 0.88], dtype=np.float32)
+
+# Search for the k nearest products by embedding (a vector index, if one exists, speeds this up)
+results = ds.to_table(
+    nearest={
+        "column": "embedding",
+        "q": query_vector,
+        "k": 5
+    }
+).to_pandas()
+
+print("Top 5 similar products:")
+print(results[['product_id', 'product_name', 'category']])
+```
+
+### Advanced Usage with Fixed-Size Arrays
+
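+For optimal performance with vector embeddings, you can use fixed-size arrays, that is, arrays whose length is the same in every row; the column is still declared as `ARRAY<FLOAT>`: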
+
+```sql title="Flink SQL"
+CREATE TABLE image_embeddings (
+    image_id BIGINT,
+    image_url STRING,
+    embedding ARRAY<FLOAT>,
+    metadata ROW<width INT, height INT, format STRING>
+) WITH (
+    'table.datalake.enabled' = 'true',
+    'table.datalake.freshness' = '30s'
+);
+```
+
+### Best Practices
+
+When working with array columns in Lance:
+
+1. **Fixed Dimensions**: Keep embedding dimensions consistent across all rows for better performance
+2. **Data Type**: Use `FLOAT` or `DOUBLE` for numerical embeddings, depending on precision requirements
+3. **Batch Size**: Adjust `lance.max_row_per_file` based on your embedding size and query patterns
+4. **Freshness**: Set `table.datalake.freshness` appropriately for your use case (balance between latency and write efficiency)
+
+### Performance Considerations
+
+- Lance stores arrays efficiently using Arrow's columnar format
+- Fixed-size arrays (when every row's array has the same number of elements) can be stored more efficiently
+- Lance supports vector indexes for fast similarity search on array columns
+- For large embeddings (e.g., 1024+ dimensions), consider increasing Flink Task Manager's off-heap memory
\ No newline at end of file
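
Note on the "Fixed Dimensions" best practice in the patch above: a minimal sketch, not part of the commit, of how a producer might verify that every embedding has the same length before writing rows to Fluss. The helper name `check_embedding_dims` is hypothetical, and the sample values are copied from the `product_vectors` INSERT statement.

```python
from typing import Iterable, Optional, Sequence


def check_embedding_dims(rows: Iterable[Sequence[float]]) -> int:
    """Return the shared embedding length, or raise ValueError on a mismatch.

    Hypothetical helper for illustration; it is not a Fluss or Lance API.
    """
    expected: Optional[int] = None
    for index, embedding in enumerate(rows):
        if expected is None:
            # The first row fixes the dimension every later row must match.
            expected = len(embedding)
        elif len(embedding) != expected:
            raise ValueError(
                f"row {index}: expected {expected} dimensions, got {len(embedding)}"
            )
    if expected is None:
        raise ValueError("no embeddings supplied")
    return expected


# Sample embeddings taken from the INSERT statement in the patch.
sample_embeddings = [
    [0.23, 0.45, 0.67, 0.12, 0.89],
    [0.21, 0.43, 0.65, 0.15, 0.87],
    [0.78, 0.32, 0.11, 0.54, 0.23],
]
dims = check_embedding_dims(sample_embeddings)
print(dims)
```

Running such a check on the producer side surfaces ragged embeddings before they reach the table, which is where the patch says consistent dimensions matter.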

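Note on the similarity search in "Reading Array Data from Lance": the following is a rough, dependency-free sketch, not part of the commit, of what a k-nearest-neighbour query computes. It assumes Euclidean (L2) distance, which is believed to be Lance's default metric but is not stated in the patch. The function names `l2_distance` and `nearest` are hypothetical, and the data is copied from the patch's sample rows and query vector.

```python
import math
from typing import List, Sequence, Tuple


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(
    rows: Sequence[Tuple[int, Sequence[float]]],
    query: Sequence[float],
    k: int,
) -> List[Tuple[int, float]]:
    """Brute-force search: score every row, then keep the k closest."""
    scored = [(product_id, l2_distance(vec, query)) for product_id, vec in rows]
    scored.sort(key=lambda item: item[1])
    return scored[:k]


# (product_id, embedding) pairs from the INSERT statement in the patch.
products = [
    (1, [0.23, 0.45, 0.67, 0.12, 0.89]),  # Laptop
    (2, [0.21, 0.43, 0.65, 0.15, 0.87]),  # Phone
    (3, [0.78, 0.32, 0.11, 0.54, 0.23]),  # Book
]
query_vector = [0.22, 0.44, 0.66, 0.13, 0.88]

top = nearest(products, query_vector, k=2)
print([product_id for product_id, _ in top])  # prints [1, 2]
```

A full scan like this costs time proportional to the number of rows, which is the cost a vector index is meant to avoid on large tables.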