This is an automated email from the ASF dual-hosted git repository.
jark pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/fluss.git
The following commit(s) were added to refs/heads/main by this push:
new cf8bb23e2 [docs] Add array type support documentation for Lance
integration (#2579)
cf8bb23e2 is described below
commit cf8bb23e2fc26ff4624ac75217e5279f0a6bf4f6
Author: ForwardXu <[email protected]>
AuthorDate: Sun Feb 8 11:18:06 2026 +0800
[docs] Add array type support documentation for Lance integration (#2579)
---
.../integrate-data-lakes/lance.md | 158 ++++++++++++++++++---
1 file changed, 140 insertions(+), 18 deletions(-)
diff --git a/website/docs/streaming-lakehouse/integrate-data-lakes/lance.md
b/website/docs/streaming-lakehouse/integrate-data-lakes/lance.md
index af435973e..9475bfc08 100644
--- a/website/docs/streaming-lakehouse/integrate-data-lakes/lance.md
+++ b/website/docs/streaming-lakehouse/integrate-data-lakes/lance.md
@@ -108,21 +108,143 @@ Lance internally stores data in Arrow format.
When integrating with Lance, Fluss automatically converts between Fluss data
types and Lance data types.
The following table shows the mapping between [Fluss data
types](table-design/data-types.md) and Lance data types:
-| Fluss Data Type | Lance Data Type |
-|-------------------------------|-----------------|
-| BOOLEAN | Bool |
-| TINYINT | Int8 |
-| SMALLINT | Int16 |
-| INT | Int32 |
-| BIGINT | Int64 |
-| FLOAT | Float32 |
-| DOUBLE | Float64 |
-| DECIMAL | Decimal128 |
-| STRING | Utf8 |
-| CHAR | Utf8 |
-| DATE | Date |
-| TIME | Time |
-| TIMESTAMP | Timestamp |
-| TIMESTAMP WITH LOCAL TIMEZONE | Timestamp |
-| BINARY | FixedSizeBinary |
-| BYTES | Binary |
\ No newline at end of file
+| Fluss Data Type | Lance Data Type |
+|-------------------------------|---------------------------|
+| BOOLEAN | Bool |
+| TINYINT | Int8 |
+| SMALLINT | Int16 |
+| INT | Int32 |
+| BIGINT | Int64 |
+| FLOAT | Float32 |
+| DOUBLE | Float64 |
+| DECIMAL | Decimal128 |
+| STRING | Utf8 |
+| CHAR | Utf8 |
+| DATE | Date |
+| TIME | Time |
+| TIMESTAMP | Timestamp |
+| TIMESTAMP WITH LOCAL TIMEZONE | Timestamp |
+| BINARY | FixedSizeBinary |
+| BYTES | Binary |
+| ARRAY\<t\> | List (or FixedSizeList) |
+
+## Array Type Support
+
+Fluss supports the `ARRAY<t>` data type, which is particularly useful for
machine learning and AI applications, such as storing vector embeddings.
+When tiering data to Lance, Fluss automatically converts `ARRAY` columns to
Lance's List type (or FixedSizeList for fixed-dimension arrays).
+
+### Use Cases for Array Type
+
+The array type is especially valuable for:
+- **Vector Embeddings**: Store embeddings generated by machine learning models
(e.g., text embeddings, image embeddings)
+- **Multi-dimensional Features**: Store feature vectors for recommendation
systems or analytics
+- **Time Series Data**: Store sequences of numerical data points
+- **Batch Processing**: Store collections of related values together
+
+### Creating Tables with Array Columns
+
+You can create a Fluss table with array columns that will be tiered to Lance:
+
+```sql title="Flink SQL"
+CREATE TABLE product_embeddings (
+ product_id BIGINT,
+ product_name STRING,
+ embedding ARRAY<FLOAT>,
+ tags ARRAY<STRING>,
+ PRIMARY KEY (product_id) NOT ENFORCED
+) WITH (
+ 'table.datalake.enabled' = 'true',
+ 'table.datalake.freshness' = '30s'
+);
+```
+
+### Vector Embedding Example
+
+Here's a complete example of using Lance with Fluss for vector embeddings in a
recommendation system:
+
+```sql title="Flink SQL"
+USE CATALOG fluss_catalog;
+
+-- Create a table to store product embeddings
+CREATE TABLE product_vectors (
+ product_id BIGINT,
+ product_name STRING,
+ category STRING,
+ embedding ARRAY<FLOAT>,
+ created_time TIMESTAMP
+) WITH (
+ 'table.datalake.enabled' = 'true',
+ 'table.datalake.freshness' = '1min',
+ 'lance.max_row_per_file' = '1024'
+);
+
+-- Insert sample data with embeddings
+INSERT INTO product_vectors VALUES
+(1, 'Laptop', 'Electronics', ARRAY[0.23, 0.45, 0.67, 0.12, 0.89],
CURRENT_TIMESTAMP),
+(2, 'Phone', 'Electronics', ARRAY[0.21, 0.43, 0.65, 0.15, 0.87],
CURRENT_TIMESTAMP),
+(3, 'Book', 'Media', ARRAY[0.78, 0.32, 0.11, 0.54, 0.23], CURRENT_TIMESTAMP);
+```
+
+### Reading Array Data from Lance
+
+Once data is tiered to Lance, you can use Lance's vector search capabilities
to perform similarity searches:
+
+```python title="Lance Python with Vector Search"
+import lance
+import numpy as np
+
+# Connect to the Lance dataset
+ds = lance.dataset("s3://my-bucket/my_database/product_vectors.lance")
+
+# Read all data
+df = ds.to_table().to_pandas()
+print(df)
+
+# Perform vector similarity search
+# Query vector (e.g., embedding of a search query)
+query_vector = np.array([0.22, 0.44, 0.66, 0.13, 0.88], dtype=np.float32)  # float32 matches the FLOAT column
+
+# Search for the nearest neighbors (this example does not create a vector index)
+results = ds.to_table(
+ nearest={
+ "column": "embedding",
+ "q": query_vector,
+ "k": 5
+ }
+).to_pandas()
+
+print("Top 5 similar products:")
+print(results[['product_id', 'product_name', 'category']])
+```
+
+### Advanced Usage with Fixed-Size Arrays
+
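+For optimal performance with vector embeddings, keep every array in the column the same length. Note that the DDL below still declares a plain `ARRAY<FLOAT>`; it does not itself fix the dimension: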
+
+```sql title="Flink SQL"
+CREATE TABLE image_embeddings (
+ image_id BIGINT,
+ image_url STRING,
+ embedding ARRAY<FLOAT>,
+ metadata ROW<width INT, height INT, format STRING>
+) WITH (
+ 'table.datalake.enabled' = 'true',
+ 'table.datalake.freshness' = '30s'
+);
+```
+
+### Best Practices
+
+When working with array columns in Lance:
+
+1. **Fixed Dimensions**: Keep embedding dimensions consistent across all rows
for better performance
+2. **Data Type**: Use `FLOAT` or `DOUBLE` for numerical embeddings, depending
on precision requirements
+3. **Batch Size**: Adjust `lance.max_row_per_file` based on your embedding
size and query patterns
+4. **Freshness**: Set `table.datalake.freshness` appropriately for your use
case (balance between latency and write efficiency)
+
+### Performance Considerations
+
+- Lance stores arrays efficiently using Arrow's columnar format
+- Arrays of a fixed length (when every row's array has the same number of elements) can be stored
more efficiently
+- Lance supports vector indexes for fast similarity search on array columns
+- For large embeddings (e.g., 1024+ dimensions), consider increasing the Flink
TaskManager's off-heap memory
\ No newline at end of file
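
As a rough illustration of what the `nearest` query in the Python example of the diff above computes, the following standalone sketch ranks the three sample rows from the `INSERT` statement by squared L2 distance to the query vector. It uses plain Python rather than Lance, the helper names `squared_l2` and `top_k` are hypothetical, and L2 is assumed as the distance metric; it is not how Lance implements the search.

```python
# Sample rows copied from the INSERT statement in the diff above
# (product_id, product_name, embedding).
PRODUCTS = [
    (1, "Laptop", [0.23, 0.45, 0.67, 0.12, 0.89]),
    (2, "Phone", [0.21, 0.43, 0.65, 0.15, 0.87]),
    (3, "Book", [0.78, 0.32, 0.11, 0.54, 0.23]),
]


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query, rows, k):
    """Return the k rows closest to `query`, nearest first (brute force)."""
    return sorted(rows, key=lambda row: squared_l2(query, row[2]))[:k]


# Same query vector as in the Lance example.
query_vector = [0.22, 0.44, 0.66, 0.13, 0.88]
ranked = top_k(query_vector, PRODUCTS, k=2)
print([name for _, name, _ in ranked])
```

The dimension check in `squared_l2` reflects the first best practice listed in the diff: an embedding whose length differs from the query's cannot be compared with it, which is one reason to keep dimensions consistent across rows.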