jerryshao commented on code in PR #9173: URL: https://github.com/apache/gravitino/pull/9173#discussion_r2605063661
##########
docs/lakehouse-generic-lance-table.md:
##########
@@ -0,0 +1,327 @@
+---
+title: "Generic lakehouse catalog with Lance"
+slug: /lakehouse-generic-catalog-with-lance
+keywords:
+- lakehouse
+- lance
+- metadata
+- generic catalog
+- file system
+license: "This software is licensed under the Apache License version 2."
+---
+
+import Tabs from '@theme/Tabs';
+import TabItem from '@theme/TabItem';
+
+
+## Overview
+
+This document describes how to use Apache Gravitino to manage a generic lakehouse catalog using Lance as the underlying table format.
+
+
+## Table Management
+
+### Supported Operations
+
+For Lance tables in a Generic Lakehouse Catalog, the following table summarizes supported operations:
+
+| Operation | Support Status |
+|-----------|----------------|
+| List | ✅ Full |
+| Load | ✅ Full |
+| Alter | No support now |
+| Create | ✅ Full |
+| Register | ✅ Full |
+| Drop | ✅ Full |
+| Truncate | ✅ Full |
+
+:::note Feature Limitations
+- **Partitioning:** Not currently supported
+- **Sort Orders:** Not currently supported
+- **Distributions:** Not currently supported
+  :::
+
+### Data Type Mappings
+
+Lance uses Apache Arrow for table schemas.
The following table shows type mappings between Gravitino and Arrow:
+
+| Gravitino Type | Arrow Type |
+|----------------------------------|-----------------------------------------|
+| `Struct` | `Struct` |
+| `Map` | `Map` |
+| `List` | `Array` |
+| `Boolean` | `Boolean` |
+| `Byte` | `Int8` |
+| `Short` | `Int16` |
+| `Integer` | `Int32` |
+| `Long` | `Int64` |
+| `Float` | `Float` |
+| `Double` | `Double` |
+| `String` | `Utf8` |
+| `Binary` | `Binary` |
+| `Decimal(p, s)` | `Decimal(p, s)` (128-bit) |
+| `Date` | `Date` |
+| `Timestamp`/`Timestamp(6)` | `TimestampType withoutZone` |
+| `Timestamp(0)` | `TimestampType Second withoutZone` |
+| `Timestamp(3)` | `TimestampType Millisecond withoutZone` |
+| `Timestamp(9)` | `TimestampType Nanosecond withoutZone` |
+| `Timestamp_tz`/`Timestamp_tz(6)` | `TimestampType Microsecond withUtc` |
+| `Timestamp_tz(0)` | `TimestampType Second withUtc` |
+| `Timestamp_tz(3)` | `TimestampType Millisecond withUtc` |
+| `Timestamp_tz(9)` | `TimestampType Nanosecond withUtc` |
+| `Time`/`Time(9)` | `Time Nanosecond` |
+| `Null` | `Null` |
+| `Fixed(n)` | `Fixed-Size Binary(n)` |
+| `Interval_year` | `Interval(YearMonth)` |
+| `Interval_day` | `Duration(Microsecond)` |
+| `External(arrow_field_json_str)` | Any Arrow Field |
+
+### External Type Support
+
+For Arrow types not natively mapped in Gravitino, use the `External(arrow_field_json_str)` type, which accepts a JSON string representation of an Arrow `Field`.
+
+**Requirements:**
+- JSON must conform to Apache Arrow [Field specification](https://github.com/apache/arrow-java/blob/ed81e5981a2bee40584b3a411ed755cb4cc5b91f/vector/src/main/java/org/apache/arrow/vector/types/pojo/Field.java#L80C1-L86C68)
+- `name` attribute must match column name exactly
+- `nullable` attribute must match column nullability
+- `children` array:
+  - Empty for primitive types
+  - Contains child field definitions for complex types (Struct, List)
+
+**Examples:**
+
+| Arrow Type | External Type Definition |
+|-------------------|--------------------------|
+| `Large Utf8` | `External("{\"name\":\"col_name\",\"nullable\":true,\"type\":{\"name\":\"largeutf8\"},\"children\":[]}")` |
+| `Large Binary` | `External("{\"name\":\"col_name\",\"nullable\":true,\"type\":{\"name\":\"largebinary\"},\"children\":[]}")` |
+| `Large List` | `External("{\"name\":\"col_name\",\"nullable\":true,\"type\":{\"name\":\"largelist\"},\"children\":[{\"name\":\"element\",\"nullable\":true,\"type\":{\"name\":\"int\",\"bitWidth\":32,\"isSigned\":true},\"children\":[]}]}")` |
+| `Fixed-Size List` | `External("{\"name\":\"col_name\",\"nullable\":true,\"type\":{\"name\":\"fixedsizelist\",\"listSize\":10},\"children\":[{\"name\":\"element\",\"nullable\":true,\"type\":{\"name\":\"int\",\"bitWidth\":32,\"isSigned\":true},\"children\":[]}]}")` |
+
+### Table Properties
+
+Required and optional properties for tables in a Generic Lakehouse Catalog:
+
+| Property | Description | Default | Required | Since Version |
+|-----------------------|-------------|----------|--------------|---------------|
+| `format` | Table format: `lance`, `iceberg`, etc. (currently only `lance` is fully supported) | (none) | Yes | 1.1.0 |
+| `location` | Storage path for table metadata and data, Lance currently supports: S3, GCS, OSS, AZ, File, Memory and file-object-store. | (none) | Conditional* | 1.1.0 |
+| `external` | Whether the data directory is an external location. If it's `true`, dropping a table will only remove metadata in Gravitino and will not delete the data directory, and purge table will delete both. For a non-external table, dropping will drop both. | false | No | 1.1.0 |
+| `lance.creation-mode` | Create mode: for create table, it can be `CREATE`, `EXIST_OK` or `OVERWRITE`. and it should be `CREATE` or `OVERWRITE` for registering tables | `CREATE` | No | 1.1.0 |
+| `lance.register` | Whether it is a register table operation. This API will not create data directory actually and it's the user's responsibility to create and manage the data directory. | false | No | 1.1.0 |
+| `lance.storage.xxxx` | Any additional storage-specific properties required by Lance format (e.g., S3 credentials, HDFS configs). Replace `xxxx` with actual property names. For example, we can use `lance.storage.aws_access_key_id` to set S3 aws_access_key_id when using a S3 location, for detail, please refer to https://lancedb.com/docs/storage/integrations/ | (none) | No | 1.1.0 |
+
+
+**Location Requirement:** Must be specified at catalog, schema, or table level. See [Location Resolution](./lakehouse-generic-catalog.md#key-property-location).
+
+You may also set additional properties specific to your lakehouse format or custom requirements.
+
+### Index Support
+
+Index capabilities vary by lakehouse format. The following table shows Lance format support:
+
+| Index Type | Description | Lance Support |
+|------------|-------------|---------------|
+| SCALAR | Optimizes searches on scalar data types (integers, floats, etc.) | ✅ |
+| VECTOR | Optimizes similarity searches in high-dimensional vector spaces | ✅ |
+| BTREE | Balanced tree for sorted data with logarithmic search/insert/delete complexity | ✅ |
+| INVERTED | Full-text search optimization through term-to-location mapping | ✅ |
+| IVF_FLAT | Vector search with inverted file and flat quantization | ✅ |
+| IVF_SQ | Vector search with scalar quantization for memory efficiency | ✅ |
+| IVF_PQ | Vector search with product quantization balancing accuracy and memory | ✅ |
+
+:::caution Index Creation Limitation
+**Lance tables do not support index creation during table creation.** You must:
+1. Create the table first
+2. Then create indexes on the created table

Review Comment:
How do users do this?
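
A side note on the `External(arrow_field_json_str)` examples quoted in the diff: the escaped JSON is easy to get wrong by hand. The following is a minimal sketch of generating such strings with Python's standard `json` module. The helper name `external_type` is hypothetical and not part of Gravitino; the code only formats the string notation shown in the quoted table and does not call any Gravitino or Arrow API.

```python
import json


def external_type(column_name, nullable, arrow_type, children=None):
    """Format an External(...) definition in the notation of the quoted table.

    Hypothetical helper for illustration only; it does not validate the
    field against Arrow and does not talk to Gravitino.
    """
    field = {
        "name": column_name,         # must match the column name exactly
        "nullable": nullable,        # must match the column nullability
        "type": arrow_type,          # Arrow type descriptor, e.g. {"name": "largeutf8"}
        "children": children or [],  # empty for primitive types
    }
    # Compact separators reproduce the spacing used in the quoted examples.
    field_json = json.dumps(field, separators=(",", ":"))
    # A second dumps() quotes the JSON text and escapes its inner quotes.
    return f"External({json.dumps(field_json)})"


# Primitive type: no children.
large_utf8 = external_type("col_name", True, {"name": "largeutf8"})

# Complex type: one child field describing the list element.
element = {
    "name": "element",
    "nullable": True,
    "type": {"name": "int", "bitWidth": 32, "isSigned": True},
    "children": [],
}
fixed_size_list = external_type(
    "col_name", True, {"name": "fixedsizelist", "listSize": 10}, [element]
)

print(large_utf8)
print(fixed_size_list)
```

Building the field as a dictionary keeps the `name`, `nullable`, and `children` requirements visible in one place, and leaves the escaping to `json.dumps` instead of manual backslashes.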
