This is an automated email from the ASF dual-hosted git repository.

jark pushed a commit to branch release-0.9
in repository https://gitbox.apache.org/repos/asf/fluss.git

commit a1494f5a427ee567095e4a5ea71b78a0214bd60c
Author: ForwardXu <[email protected]>
AuthorDate: Sun Feb 8 11:18:06 2026 +0800

    [docs] Add array type support documentation for Lance integration (#2579)
---
 .../integrate-data-lakes/lance.md                  | 158 ++++++++++++++++++---
 1 file changed, 140 insertions(+), 18 deletions(-)

diff --git a/website/docs/streaming-lakehouse/integrate-data-lakes/lance.md b/website/docs/streaming-lakehouse/integrate-data-lakes/lance.md
index af435973e..9475bfc08 100644
--- a/website/docs/streaming-lakehouse/integrate-data-lakes/lance.md
+++ b/website/docs/streaming-lakehouse/integrate-data-lakes/lance.md
@@ -108,21 +108,143 @@ Lance internally stores data in Arrow format.
 When integrating with Lance, Fluss automatically converts between Fluss data types and Lance data types.
 The following table shows the mapping between [Fluss data types](table-design/data-types.md) and Lance data types:
 
-| Fluss Data Type               | Lance Data Type |
-|-------------------------------|-----------------|
-| BOOLEAN                       | Bool            |
-| TINYINT                       | Int8            |
-| SMALLINT                      | Int16           |
-| INT                           | Int32           |
-| BIGINT                        | Int64           |
-| FLOAT                         | Float32         |
-| DOUBLE                        | Float64         |
-| DECIMAL                       | Decimal128      |
-| STRING                        | Utf8            |
-| CHAR                          | Utf8            |
-| DATE                          | Date            |
-| TIME                          | Time            |
-| TIMESTAMP                     | Timestamp       |
-| TIMESTAMP WITH LOCAL TIMEZONE | Timestamp       |
-| BINARY                        | FixedSizeBinary |
-| BYTES                         | Binary          |
\ No newline at end of file
+| Fluss Data Type               | Lance Data Type           |
+|-------------------------------|---------------------------|
+| BOOLEAN                       | Bool                      |
+| TINYINT                       | Int8                      |
+| SMALLINT                      | Int16                     |
+| INT                           | Int32                     |
+| BIGINT                        | Int64                     |
+| FLOAT                         | Float32                   |
+| DOUBLE                        | Float64                   |
+| DECIMAL                       | Decimal128                |
+| STRING                        | Utf8                      |
+| CHAR                          | Utf8                      |
+| DATE                          | Date                      |
+| TIME                          | Time                      |
+| TIMESTAMP                     | Timestamp                 |
+| TIMESTAMP WITH LOCAL TIMEZONE | Timestamp                 |
+| BINARY                        | FixedSizeBinary           |
+| BYTES                         | Binary                    |
+| ARRAY\<t\>                    | List (or FixedSizeList)   |
+
+## Array Type Support
+
+Fluss supports the `ARRAY<t>` data type, which is particularly useful for machine learning and AI applications, such as storing vector embeddings.
+When tiering data to Lance, Fluss automatically converts `ARRAY` columns to Lance's List type (or FixedSizeList for fixed-dimension arrays).
+
+### Use Cases for Array Type
+
+The array type is especially valuable for:
+- **Vector Embeddings**: Store embeddings generated by machine learning models (e.g., text embeddings, image embeddings)
+- **Multi-dimensional Features**: Store feature vectors for recommendation systems or analytics
+- **Time Series Data**: Store sequences of numerical data points
+- **Batch Processing**: Store collections of related values together
+
+### Creating Tables with Array Columns
+
+You can create a Fluss table with array columns that will be tiered to Lance:
+
+```sql title="Flink SQL"
+CREATE TABLE product_embeddings (
+    product_id BIGINT,
+    product_name STRING,
+    embedding ARRAY<FLOAT>,
+    tags ARRAY<STRING>,
+    PRIMARY KEY (product_id) NOT ENFORCED
+) WITH (
+    'table.datalake.enabled' = 'true',
+    'table.datalake.freshness' = '30s'
+);
+```
+
+### Vector Embedding Example
+
+Here's a complete example of using Lance with Fluss for vector embeddings in a recommendation system:
+
+```sql title="Flink SQL"
+USE CATALOG fluss_catalog;
+
+-- Create a table to store product embeddings
+CREATE TABLE product_vectors (
+    product_id BIGINT,
+    product_name STRING,
+    category STRING,
+    embedding ARRAY<FLOAT>,
+    created_time TIMESTAMP
+) WITH (
+    'table.datalake.enabled' = 'true',
+    'table.datalake.freshness' = '1min',
+    'lance.max_row_per_file' = '1024'
+);
+
+-- Insert sample data with embeddings
+INSERT INTO product_vectors VALUES
+(1, 'Laptop', 'Electronics', ARRAY[0.23, 0.45, 0.67, 0.12, 0.89], CURRENT_TIMESTAMP),
+(2, 'Phone', 'Electronics', ARRAY[0.21, 0.43, 0.65, 0.15, 0.87], CURRENT_TIMESTAMP),
+(3, 'Book', 'Media', ARRAY[0.78, 0.32, 0.11, 0.54, 0.23], CURRENT_TIMESTAMP);
+```
+
+### Reading Array Data from Lance
+
+Once data is tiered to Lance, you can use Lance's vector search capabilities to perform similarity searches:
+
+```python title="Lance Python with Vector Search"
+import lance
+import numpy as np
+
+# Connect to the Lance dataset
+ds = lance.dataset("s3://my-bucket/my_database/product_vectors.lance")
+
+# Read all data
+df = ds.to_table().to_pandas()
+print(df)
+
+# Perform vector similarity search
+# Query vector (e.g., embedding of a search query)
+query_vector = np.array([0.22, 0.44, 0.66, 0.13, 0.88])
+
+# Search for similar products using Lance's vector index
+results = ds.to_table(
+    nearest={
+        "column": "embedding",
+        "q": query_vector,
+        "k": 5
+    }
+).to_pandas()
+
+print("Top 5 similar products:")
+print(results[['product_id', 'product_name', 'category']])
+```
+
+### Advanced Usage with Fixed-Size Arrays
+
+For optimal performance with vector embeddings, you can use fixed-size arrays:
+
+```sql title="Flink SQL"
+CREATE TABLE image_embeddings (
+    image_id BIGINT,
+    image_url STRING,
+    embedding ARRAY<FLOAT>,
+    metadata ROW<width INT, height INT, format STRING>
+) WITH (
+    'table.datalake.enabled' = 'true',
+    'table.datalake.freshness' = '30s'
+);
+```
+
+### Best Practices
+
+When working with array columns in Lance:
+
+1. **Fixed Dimensions**: Keep embedding dimensions consistent across all rows for better performance
+2. **Data Type**: Use `FLOAT` or `DOUBLE` for numerical embeddings, depending on precision requirements
+3. **Batch Size**: Adjust `lance.max_row_per_file` based on your embedding size and query patterns
+4. **Freshness**: Set `table.datalake.freshness` appropriately for your use case (balance between latency and write efficiency)
+
+### Performance Considerations
+
+- Lance stores arrays efficiently using Arrow's columnar format
+- Fixed-size arrays (when all elements have the same length) can be stored more efficiently
+- Lance supports vector indexes for fast similarity search on array columns
+- For large embeddings (e.g., 1024+ dimensions), consider increasing Flink Task Manager's off-heap memory
\ No newline at end of file
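
Reviewer note on the added "Reading Array Data from Lance" and "Best Practices" sections: the following is a minimal, standard-library-only Python sketch of what a `nearest` query conceptually computes, and of the "Fixed Dimensions" check. It is not part of the commit and is not Lance's implementation. It assumes Euclidean (L2) distance and an exhaustive scan rather than a vector index. The helper names `check_dimensions`, `l2_distance`, and `nearest` are hypothetical, and the rows are copied from the `INSERT` statement in the diff.

```python
import math

# Sample rows copied from the INSERT statement in the diff:
# (product_id, product_name, embedding)
PRODUCTS = [
    (1, "Laptop", [0.23, 0.45, 0.67, 0.12, 0.89]),
    (2, "Phone", [0.21, 0.43, 0.65, 0.15, 0.87]),
    (3, "Book", [0.78, 0.32, 0.11, 0.54, 0.23]),
]


def check_dimensions(rows):
    """Return the shared embedding length, or raise if rows disagree (hypothetical helper)."""
    dims = {len(embedding) for _, _, embedding in rows}
    if len(dims) != 1:
        raise ValueError(f"inconsistent embedding dimensions: {sorted(dims)}")
    return dims.pop()


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(rows, query, k):
    """Brute-force k-nearest-neighbour search by L2 distance (assumed metric)."""
    if len(query) != check_dimensions(rows):
        raise ValueError("query dimension does not match the stored embeddings")
    ranked = sorted(rows, key=lambda row: l2_distance(row[2], query))
    return ranked[:k]


# Same query vector as the Lance Python example in the diff.
query_vector = [0.22, 0.44, 0.66, 0.13, 0.88]
top = nearest(PRODUCTS, query_vector, k=2)
for product_id, name, embedding in top:
    print(product_id, name, round(l2_distance(embedding, query_vector), 4))
```

With the three sample rows, the two closest products to the query vector are Laptop and then Phone. A real Lance dataset would answer the same question through `ds.to_table(nearest=...)` as shown in the diff; the exhaustive scan here is only meant to make the ranking easy to verify by hand.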
