jerryshao commented on code in PR #9623:
URL: https://github.com/apache/gravitino/pull/9623#discussion_r2697867074

##########
docs/lance-rest-integration.md:
##########
@@ -0,0 +1,245 @@
+---
+title: "Lance REST Integration with Spark and Ray"
+slug: /lance-rest-integration
+keywords:
+  - lance
+  - lance-rest
+  - spark
+  - ray
+  - integration
+license: "This software is licensed under the Apache License version 2."
+---
+
+## Overview
+
+This guide provides comprehensive instructions for integrating the Apache Gravitino Lance REST service with data processing engines that support the Lance format, including Apache Spark via the [Lance Spark connector](https://lance.org/integrations/spark/) and Ray via the [Lance Ray connector](https://lance.org/integrations/ray/).
+
+This documentation assumes familiarity with the Lance REST service setup as described in the [Lance REST Service](./lance-rest-service) documentation.
+
+## Compatibility Matrix
+
+The following table outlines the tested compatibility between Gravitino versions and Lance connector versions:
+
+| Gravitino Version (Lance REST) | Supported lance-spark Versions | Supported lance-ray Versions |
+|--------------------------------|--------------------------------|------------------------------|
+| 1.1.1                          | 0.0.10 – 0.0.15                | 0.0.6 – 0.0.8                |
+
+:::note
+- These version ranges represent combinations expected to be compatible based on API stability and feature sets.
+- While broad compatibility is anticipated within these ranges, only select versions have been explicitly tested.
+- We strongly recommend validating specific connector versions in your development environment before production deployment.
+- As the Lance ecosystem evolves rapidly, API changes may introduce breaking changes between versions.
+:::
+
+### Why Maintain a Compatibility Matrix?
+
+The Lance ecosystem is under active development, with frequent updates to APIs and features. Gravitino's Lance REST service depends on specific connector behaviors to ensure reliable operation.
+Using incompatible versions may result in:
+
+- Runtime errors or exceptions
+- Data corruption or loss
+- Unexpected behavior in query execution
+- Performance degradation
+
+## Prerequisites
+
+Before proceeding, ensure the following requirements are met:
+
+1. **Gravitino Server**: A running Gravitino server instance with the Lance REST service enabled
+   - Default endpoint: `http://localhost:9101/lance`
+
+2. **Lance Catalog**: A Lance catalog created in Gravitino using either:
+   - Lance REST namespace API (`CreateNamespace` operation - see [Lance REST Service documentation](./lance-rest-service.md))
+   - Gravitino REST API, for more, please refer to [lakehouse-generic-catalog](./lakehouse-generic-catalog.md)
+   - Example catalog name: `lance_catalog`
+
+3. **Lance Spark Bundle** (for Spark integration):
+   - Downloaded `lance-spark` bundle JAR matching your Apache Spark version
+   - Note the absolute file path for configuration
+
+4. **Python Dependencies**:
+   - For Spark integration: `pyspark`
+   - For Ray integration: `ray`, `lance-namespace`, `lance-ray`
+
+## Spark Integration
+
+### Configuration
+
+The following example demonstrates how to configure a PySpark session to interact with Lance REST and perform table operations using Spark SQL.
+
+```python
+from pyspark.sql import SparkSession
+import os
+import logging
+
+# Configure logging for debugging
+logging.basicConfig(level=logging.INFO)
+
+# Configure Spark to use the lance-spark bundle
+# Replace /path/to/lance-spark-bundle-3.5_2.12-0.0.15.jar with your actual JAR path
+os.environ["PYSPARK_SUBMIT_ARGS"] = (
+    "--jars /path/to/lance-spark-bundle-3.5_2.12-0.0.15.jar "
+    "--conf \"spark.driver.extraJavaOptions=--add-opens=java.base/sun.nio.ch=ALL-UNNAMED\" "
+    "--conf \"spark.executor.extraJavaOptions=--add-opens=java.base/sun.nio.ch=ALL-UNNAMED\" "
+    "--master local[1] pyspark-shell"
+)
+
+# Initialize Spark session with Lance REST catalog configuration
+# Note: The catalog "lance_catalog" must exist in Gravitino before running this code, you can create
+# it via Lance REST API `CreateNamespace` or Gravitino REST API `CreateCatalog`.
+spark = SparkSession.builder \
+    .appName("lance_rest_integration") \
+    .config("spark.sql.catalog.lance", "com.lancedb.lance.spark.LanceNamespaceSparkCatalog") \
+    .config("spark.sql.catalog.lance.impl", "rest") \
+    .config("spark.sql.catalog.lance.uri", "http://localhost:9101/lance") \
+    .config("spark.sql.catalog.lance.parent", "lance_catalog") \
+    .config("spark.sql.defaultCatalog", "lance") \
+    .getOrCreate()
+
+# Enable debug logging for troubleshooting
+spark.sparkContext.setLogLevel("DEBUG")
+
+# Create schema (database)
+spark.sql("CREATE DATABASE IF NOT EXISTS sales")
+
+# Create Lance table with explicit location
+spark.sql("""
+    CREATE TABLE sales.orders (
+        id INT,
+        score FLOAT
+    )
+    USING lance
+    LOCATION '/tmp/sales/orders.lance/'
+    TBLPROPERTIES ('format' = 'lance')
+""")
+
+# Insert sample data
+spark.sql("INSERT INTO sales.orders VALUES (1, 1.1)")
+
+# Query data
+spark.sql("SELECT * FROM sales.orders").show()
+```
+
+### Storage Location Configuration
+
+#### Local Storage
+
+The `LOCATION` clause in the `CREATE TABLE` statement is optional. When omitted, lance-spark automatically determines an appropriate storage location based on catalog properties.
+For detailed information on location resolution logic, refer to the [Lakehouse Generic Catalog documentation](./lakehouse-generic-catalog.md#key-property-location).

Review Comment:
   Why does this section belong under "Local Storage"? Also, as I recall, an external table must specify its location. Is that right?

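To make the question concrete, here is a minimal sketch of the two statement forms being compared. `build_create_table` is a hypothetical helper, not part of lance-spark or Gravitino; it only assembles the SQL text from the doc's example and makes no claim about which form the connector accepts for external tables.

```python
from typing import Optional


def build_create_table(table: str, location: Optional[str] = None) -> str:
    """Assemble the doc's CREATE TABLE statement, with or without LOCATION.

    Hypothetical helper for illustration only: it builds SQL text and does
    not talk to Spark, lance-spark, or Gravitino.
    """
    lines = [
        f"CREATE TABLE {table} (",
        "    id INT,",
        "    score FLOAT",
        ")",
        "USING lance",
    ]
    if location is not None:
        # Explicit location, as in the doc's Spark example.
        lines.append(f"LOCATION '{location}'")
    lines.append("TBLPROPERTIES ('format' = 'lance')")
    return "\n".join(lines)


# Form 1: explicit location, as shown in the Spark example.
with_location = build_create_table("sales.orders", "/tmp/sales/orders.lance/")

# Form 2: LOCATION omitted, which the doc describes as optional.
without_location = build_create_table("sales.orders")

print(with_location)
print()
print(without_location)
```

If the second form is not valid for external tables, the "optional" wording in the doc would need a qualifier.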
