Copilot commented on code in PR #9623: URL: https://github.com/apache/gravitino/pull/9623#discussion_r2696757612
########## docs/lance-rest-integration.md: ########## @@ -0,0 +1,134 @@ +--- +title: "Lance REST integration with Spark and Ray" +slug: /lance-rest-integration +keywords: + - lance + - lance-rest + - spark + - ray + - integration +license: "This software is licensed under the Apache License version 2." +--- + +## Overview + +This guide shows how to use the Lance REST service from Apache Gravitino with the [Lance Spark connector](https://lance.org/integrations/spark/) (`lance-spark`), the [Lance Ray connector](https://lance.org/integrations/ray/) (`lance-ray`) and other data processing engines that support Lance format . It builds on the Lance REST service setup described in [Lance REST service](./lance-rest-service). + +## Compatibility matrix + +| Gravitino version (Lance REST) | Supported lance-spark versions | Supported lance-ray versions | +|--------------------------------|--------------------------------|------------------------------| +| 1.1.1 | 0.0.10 – 0.0.15 | 0.0.6 – 0.0.8 | + +These version ranges represent combinations that are expected to be compatible. Only a subset of versions within each range may have been explicitly tested, so you should verify a specific connector version in your own environment. + +Why does we need to maintain compatibility matrix? As Lance and Lance connectors are actively developed, some APIs and features may change over time. Gravitino's Lance REST service relies on specific versions of these connectors to ensure seamless integration and functionality. Using incompatible versions may lead to unexpected behavior or errors. Review Comment: There's a grammatical error in this sentence. "Why does we need" should be "Why do we need". The subject "we" requires the plural form of the verb. ```suggestion Why do we need to maintain a compatibility matrix? As Lance and Lance connectors are actively developed, some APIs and features may change over time. Gravitino's Lance REST service relies on specific versions of these connectors to ensure seamless integration and functionality. Using incompatible versions may lead to unexpected behavior or errors. ``` ########## docs/lance-rest-integration.md: ########## @@ -0,0 +1,134 @@ +--- +title: "Lance REST integration with Spark and Ray" +slug: /lance-rest-integration +keywords: + - lance + - lance-rest + - spark + - ray + - integration +license: "This software is licensed under the Apache License version 2." +--- + +## Overview + +This guide shows how to use the Lance REST service from Apache Gravitino with the [Lance Spark connector](https://lance.org/integrations/spark/) (`lance-spark`), the [Lance Ray connector](https://lance.org/integrations/ray/) (`lance-ray`) and other data processing engines that support Lance format . It builds on the Lance REST service setup described in [Lance REST service](./lance-rest-service). + +## Compatibility matrix + +| Gravitino version (Lance REST) | Supported lance-spark versions | Supported lance-ray versions | +|--------------------------------|--------------------------------|------------------------------| +| 1.1.1 | 0.0.10 – 0.0.15 | 0.0.6 – 0.0.8 | + +These version ranges represent combinations that are expected to be compatible. Only a subset of versions within each range may have been explicitly tested, so you should verify a specific connector version in your own environment. + +Why does we need to maintain compatibility matrix? As Lance and Lance connectors are actively developed, some APIs and features may change over time. 
Gravitino's Lance REST service relies on specific versions of these connectors to ensure seamless integration and functionality. Using incompatible versions may lead to unexpected behavior or errors. + + +## Prerequisites + +- Gravitino server running with Lance REST service enabled (default endpoint: `http://localhost:9101/lance`). +- A Lance catalog created in Gravitino via Lance REST namespace API(see `CreateNamespace` in [docs](./lance-rest-service.md)) or Gravitino REST API, for example `lance_catalog`. Review Comment: There's a spacing issue. There should be a space after "API" and before the parenthesis. It should read "Lance REST namespace API (see" instead of "Lance REST namespace API(see". ```suggestion - A Lance catalog created in Gravitino via Lance REST namespace API (see `CreateNamespace` in [docs](./lance-rest-service.md)) or Gravitino REST API, for example `lance_catalog`. ``` ########## docs/lance-rest-integration.md: ########## @@ -0,0 +1,134 @@ +--- +title: "Lance REST integration with Spark and Ray" +slug: /lance-rest-integration +keywords: + - lance + - lance-rest + - spark + - ray + - integration +license: "This software is licensed under the Apache License version 2." +--- + +## Overview + +This guide shows how to use the Lance REST service from Apache Gravitino with the [Lance Spark connector](https://lance.org/integrations/spark/) (`lance-spark`), the [Lance Ray connector](https://lance.org/integrations/ray/) (`lance-ray`) and other data processing engines that support Lance format . It builds on the Lance REST service setup described in [Lance REST service](./lance-rest-service). + +## Compatibility matrix + +| Gravitino version (Lance REST) | Supported lance-spark versions | Supported lance-ray versions | +|--------------------------------|--------------------------------|------------------------------| +| 1.1.1 | 0.0.10 – 0.0.15 | 0.0.6 – 0.0.8 | + +These version ranges represent combinations that are expected to be compatible. Only a subset of versions within each range may have been explicitly tested, so you should verify a specific connector version in your own environment. + +Why does we need to maintain compatibility matrix? As Lance and Lance connectors are actively developed, some APIs and features may change over time. Gravitino's Lance REST service relies on specific versions of these connectors to ensure seamless integration and functionality. Using incompatible versions may lead to unexpected behavior or errors. + + +## Prerequisites + +- Gravitino server running with Lance REST service enabled (default endpoint: `http://localhost:9101/lance`). +- A Lance catalog created in Gravitino via Lance REST namespace API(see `CreateNamespace` in [docs](./lance-rest-service.md)) or Gravitino REST API, for example `lance_catalog`. +- Downloaded `lance-spark` bundle JAR that matches your Spark version (set the absolute path in the examples below). +- Python environments with required packages: + - Spark: `pyspark` + - Ray: `ray`, `lance-namespace`, `lance-ray` + +## Using Lance REST with Spark + +The example below starts a local PySpark session that talks to Lance REST and creates a table through Spark SQL. + +```python +from pyspark.sql import SparkSession +import os +import logging +logging.basicConfig(level=logging.INFO) + +# Point to your downloaded lance-spark bundle. 
+os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/lance-spark-bundle-3.5_2.12-0.0.15.jar --conf \"spark.driver.extraJavaOptions=--add-opens=java.base/sun.nio.ch=ALL-UNNAMED\" --conf \"spark.executor.extraJavaOptions=--add-opens=java.base/sun.nio.ch=ALL-UNNAMED\" --master local[1] pyspark-shell" + +# Create the Lance catalog named "lance_catalog" in Gravitino beforehand. +spark = SparkSession.builder \ + .appName("lance_rest_example") \ + .config("spark.sql.catalog.lance", "com.lancedb.lance.spark.LanceNamespaceSparkCatalog") \ + .config("spark.sql.catalog.lance.impl", "rest") \ + .config("spark.sql.catalog.lance.uri", "http://localhost:9101/lance") \ + .config("spark.sql.catalog.lance.parent", "lance_catalog") \ + .config("spark.sql.defaultCatalog", "lance") \ + .getOrCreate() + +spark.sparkContext.setLogLevel("DEBUG") + +# Create schema and table, write, then read data. +spark.sql("create database schema") +spark.sql(""" +create table schema.sample(id int, score float) +USING lance +LOCATION '/tmp/schema/sample.lance/' +TBLPROPERTIES ('format' = 'lance') +""") +spark.sql(""" +insert into schema.sample values(1, 1.1) +""") +spark.sql("select * from schema.sample").show() +``` + +:::note +The line `LOCATION '/tmp/schema/sample.lance/'` is optional, if not specified, lance-spark will use try to calculate the location automatically. About the location resolution logic, please refer to the [documentation](./lakehouse-generic-catalog.md#catalog-properties) +::: + +The storage location in the example above is local path, if you want to use cloud storage, please refer to the following MinIO example: Review Comment: There's a grammatical issue in this sentence. "The storage location in the example above is local path" should be "The storage location in the example above is a local path". The indefinite article "a" is missing before "local path". ```suggestion The storage location in the example above is a local path, if you want to use cloud storage, please refer to the following MinIO example: ``` ########## docs/lance-rest-integration.md: ########## @@ -0,0 +1,134 @@ +--- +title: "Lance REST integration with Spark and Ray" +slug: /lance-rest-integration +keywords: + - lance + - lance-rest + - spark + - ray + - integration +license: "This software is licensed under the Apache License version 2." +--- + +## Overview + +This guide shows how to use the Lance REST service from Apache Gravitino with the [Lance Spark connector](https://lance.org/integrations/spark/) (`lance-spark`), the [Lance Ray connector](https://lance.org/integrations/ray/) (`lance-ray`) and other data processing engines that support Lance format . It builds on the Lance REST service setup described in [Lance REST service](./lance-rest-service). + +## Compatibility matrix + +| Gravitino version (Lance REST) | Supported lance-spark versions | Supported lance-ray versions | +|--------------------------------|--------------------------------|------------------------------| +| 1.1.1 | 0.0.10 – 0.0.15 | 0.0.6 – 0.0.8 | + +These version ranges represent combinations that are expected to be compatible. Only a subset of versions within each range may have been explicitly tested, so you should verify a specific connector version in your own environment. + +Why does we need to maintain compatibility matrix? As Lance and Lance connectors are actively developed, some APIs and features may change over time. Gravitino's Lance REST service relies on specific versions of these connectors to ensure seamless integration and functionality. 
Using incompatible versions may lead to unexpected behavior or errors. + + +## Prerequisites + +- Gravitino server running with Lance REST service enabled (default endpoint: `http://localhost:9101/lance`). +- A Lance catalog created in Gravitino via Lance REST namespace API(see `CreateNamespace` in [docs](./lance-rest-service.md)) or Gravitino REST API, for example `lance_catalog`. +- Downloaded `lance-spark` bundle JAR that matches your Spark version (set the absolute path in the examples below). +- Python environments with required packages: + - Spark: `pyspark` + - Ray: `ray`, `lance-namespace`, `lance-ray` + +## Using Lance REST with Spark + +The example below starts a local PySpark session that talks to Lance REST and creates a table through Spark SQL. + +```python +from pyspark.sql import SparkSession +import os +import logging +logging.basicConfig(level=logging.INFO) + +# Point to your downloaded lance-spark bundle. +os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/lance-spark-bundle-3.5_2.12-0.0.15.jar --conf \"spark.driver.extraJavaOptions=--add-opens=java.base/sun.nio.ch=ALL-UNNAMED\" --conf \"spark.executor.extraJavaOptions=--add-opens=java.base/sun.nio.ch=ALL-UNNAMED\" --master local[1] pyspark-shell" + +# Create the Lance catalog named "lance_catalog" in Gravitino beforehand. +spark = SparkSession.builder \ + .appName("lance_rest_example") \ + .config("spark.sql.catalog.lance", "com.lancedb.lance.spark.LanceNamespaceSparkCatalog") \ + .config("spark.sql.catalog.lance.impl", "rest") \ + .config("spark.sql.catalog.lance.uri", "http://localhost:9101/lance") \ + .config("spark.sql.catalog.lance.parent", "lance_catalog") \ + .config("spark.sql.defaultCatalog", "lance") \ + .getOrCreate() + +spark.sparkContext.setLogLevel("DEBUG") + +# Create schema and table, write, then read data. +spark.sql("create database schema") +spark.sql(""" +create table schema.sample(id int, score float) +USING lance +LOCATION '/tmp/schema/sample.lance/' +TBLPROPERTIES ('format' = 'lance') +""") +spark.sql(""" +insert into schema.sample values(1, 1.1) +""") +spark.sql("select * from schema.sample").show() +``` + +:::note +The line `LOCATION '/tmp/schema/sample.lance/'` is optional, if not specified, lance-spark will use try to calculate the location automatically. About the location resolution logic, please refer to the [documentation](./lakehouse-generic-catalog.md#catalog-properties) Review Comment: There's a typo in the note text. "will use try to calculate" should be "will try to calculate". The word "use" should be removed. ```suggestion The line `LOCATION '/tmp/schema/sample.lance/'` is optional, if not specified, lance-spark will try to calculate the location automatically. About the location resolution logic, please refer to the [documentation](./lakehouse-generic-catalog.md#catalog-properties) ``` ########## docs/lance-rest-integration.md: ########## @@ -0,0 +1,134 @@ +--- +title: "Lance REST integration with Spark and Ray" +slug: /lance-rest-integration +keywords: + - lance + - lance-rest + - spark + - ray + - integration +license: "This software is licensed under the Apache License version 2." +--- + +## Overview + +This guide shows how to use the Lance REST service from Apache Gravitino with the [Lance Spark connector](https://lance.org/integrations/spark/) (`lance-spark`), the [Lance Ray connector](https://lance.org/integrations/ray/) (`lance-ray`) and other data processing engines that support Lance format . 
It builds on the Lance REST service setup described in [Lance REST service](./lance-rest-service). + +## Compatibility matrix + +| Gravitino version (Lance REST) | Supported lance-spark versions | Supported lance-ray versions | +|--------------------------------|--------------------------------|------------------------------| +| 1.1.1 | 0.0.10 – 0.0.15 | 0.0.6 – 0.0.8 | + +These version ranges represent combinations that are expected to be compatible. Only a subset of versions within each range may have been explicitly tested, so you should verify a specific connector version in your own environment. + +Why does we need to maintain compatibility matrix? As Lance and Lance connectors are actively developed, some APIs and features may change over time. Gravitino's Lance REST service relies on specific versions of these connectors to ensure seamless integration and functionality. Using incompatible versions may lead to unexpected behavior or errors. + + +## Prerequisites + +- Gravitino server running with Lance REST service enabled (default endpoint: `http://localhost:9101/lance`). +- A Lance catalog created in Gravitino via Lance REST namespace API(see `CreateNamespace` in [docs](./lance-rest-service.md)) or Gravitino REST API, for example `lance_catalog`. +- Downloaded `lance-spark` bundle JAR that matches your Spark version (set the absolute path in the examples below). +- Python environments with required packages: + - Spark: `pyspark` + - Ray: `ray`, `lance-namespace`, `lance-ray` + +## Using Lance REST with Spark + +The example below starts a local PySpark session that talks to Lance REST and creates a table through Spark SQL. + +```python +from pyspark.sql import SparkSession +import os +import logging +logging.basicConfig(level=logging.INFO) + +# Point to your downloaded lance-spark bundle. +os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/lance-spark-bundle-3.5_2.12-0.0.15.jar --conf \"spark.driver.extraJavaOptions=--add-opens=java.base/sun.nio.ch=ALL-UNNAMED\" --conf \"spark.executor.extraJavaOptions=--add-opens=java.base/sun.nio.ch=ALL-UNNAMED\" --master local[1] pyspark-shell" + +# Create the Lance catalog named "lance_catalog" in Gravitino beforehand. +spark = SparkSession.builder \ + .appName("lance_rest_example") \ + .config("spark.sql.catalog.lance", "com.lancedb.lance.spark.LanceNamespaceSparkCatalog") \ + .config("spark.sql.catalog.lance.impl", "rest") \ + .config("spark.sql.catalog.lance.uri", "http://localhost:9101/lance") \ + .config("spark.sql.catalog.lance.parent", "lance_catalog") \ + .config("spark.sql.defaultCatalog", "lance") \ + .getOrCreate() + +spark.sparkContext.setLogLevel("DEBUG") + +# Create schema and table, write, then read data. +spark.sql("create database schema") +spark.sql(""" +create table schema.sample(id int, score float) +USING lance +LOCATION '/tmp/schema/sample.lance/' +TBLPROPERTIES ('format' = 'lance') +""") +spark.sql(""" +insert into schema.sample values(1, 1.1) +""") +spark.sql("select * from schema.sample").show() +``` + +:::note +The line `LOCATION '/tmp/schema/sample.lance/'` is optional, if not specified, lance-spark will use try to calculate the location automatically. About the location resolution logic, please refer to the [documentation](./lakehouse-generic-catalog.md#catalog-properties) Review Comment: The documentation link reference anchor might be incorrect. 
The text references "location resolution logic" and links to `./lakehouse-generic-catalog.md#catalog-properties`, but based on the target file structure, it should probably link to `./lakehouse-generic-catalog.md#key-property-location` to point directly to the section explaining the location resolution hierarchy. ```suggestion The line `LOCATION '/tmp/schema/sample.lance/'` is optional, if not specified, lance-spark will use try to calculate the location automatically. About the location resolution logic, please refer to the [documentation](./lakehouse-generic-catalog.md#key-property-location) ``` ########## docs/lance-rest-integration.md: ########## @@ -0,0 +1,134 @@ +--- +title: "Lance REST integration with Spark and Ray" +slug: /lance-rest-integration +keywords: + - lance + - lance-rest + - spark + - ray + - integration +license: "This software is licensed under the Apache License version 2." +--- + +## Overview + +This guide shows how to use the Lance REST service from Apache Gravitino with the [Lance Spark connector](https://lance.org/integrations/spark/) (`lance-spark`), the [Lance Ray connector](https://lance.org/integrations/ray/) (`lance-ray`) and other data processing engines that support Lance format . It builds on the Lance REST service setup described in [Lance REST service](./lance-rest-service). + +## Compatibility matrix + +| Gravitino version (Lance REST) | Supported lance-spark versions | Supported lance-ray versions | +|--------------------------------|--------------------------------|------------------------------| +| 1.1.1 | 0.0.10 – 0.0.15 | 0.0.6 – 0.0.8 | + +These version ranges represent combinations that are expected to be compatible. Only a subset of versions within each range may have been explicitly tested, so you should verify a specific connector version in your own environment. + +Why does we need to maintain compatibility matrix? As Lance and Lance connectors are actively developed, some APIs and features may change over time. Gravitino's Lance REST service relies on specific versions of these connectors to ensure seamless integration and functionality. Using incompatible versions may lead to unexpected behavior or errors. + + +## Prerequisites + +- Gravitino server running with Lance REST service enabled (default endpoint: `http://localhost:9101/lance`). +- A Lance catalog created in Gravitino via Lance REST namespace API(see `CreateNamespace` in [docs](./lance-rest-service.md)) or Gravitino REST API, for example `lance_catalog`. +- Downloaded `lance-spark` bundle JAR that matches your Spark version (set the absolute path in the examples below). +- Python environments with required packages: + - Spark: `pyspark` + - Ray: `ray`, `lance-namespace`, `lance-ray` + +## Using Lance REST with Spark + +The example below starts a local PySpark session that talks to Lance REST and creates a table through Spark SQL. + +```python +from pyspark.sql import SparkSession +import os +import logging +logging.basicConfig(level=logging.INFO) + +# Point to your downloaded lance-spark bundle. +os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/lance-spark-bundle-3.5_2.12-0.0.15.jar --conf \"spark.driver.extraJavaOptions=--add-opens=java.base/sun.nio.ch=ALL-UNNAMED\" --conf \"spark.executor.extraJavaOptions=--add-opens=java.base/sun.nio.ch=ALL-UNNAMED\" --master local[1] pyspark-shell" + +# Create the Lance catalog named "lance_catalog" in Gravitino beforehand. 
+spark = SparkSession.builder \ + .appName("lance_rest_example") \ + .config("spark.sql.catalog.lance", "com.lancedb.lance.spark.LanceNamespaceSparkCatalog") \ + .config("spark.sql.catalog.lance.impl", "rest") \ + .config("spark.sql.catalog.lance.uri", "http://localhost:9101/lance") \ + .config("spark.sql.catalog.lance.parent", "lance_catalog") \ + .config("spark.sql.defaultCatalog", "lance") \ + .getOrCreate() + +spark.sparkContext.setLogLevel("DEBUG") + +# Create schema and table, write, then read data. +spark.sql("create database schema") +spark.sql(""" +create table schema.sample(id int, score float) +USING lance +LOCATION '/tmp/schema/sample.lance/' +TBLPROPERTIES ('format' = 'lance') +""") +spark.sql(""" +insert into schema.sample values(1, 1.1) +""") +spark.sql("select * from schema.sample").show() +``` + +:::note +The line `LOCATION '/tmp/schema/sample.lance/'` is optional, if not specified, lance-spark will use try to calculate the location automatically. About the location resolution logic, please refer to the [documentation](./lakehouse-generic-catalog.md#catalog-properties) +::: + +The storage location in the example above is local path, if you want to use cloud storage, please refer to the following MinIO example: + +```python +spark.sql(""" +create table schema.sample(id int, score float) +USING lance +LOCATION 's3://bucket/tmp/schema/sample.lance/' +TBLPROPERTIES ( + 'format' = 'lance', + 'lance.storage.access_key_id' = 'ak', + 'lance.storage.endpoint' = 'http://minio:9000', + 'lance.storage.secret_access_key' = 'sk', + 'lance.storage.allow_http' = 'true' + )""") +``` + +## Using Lance REST with Ray + +The snippet below writes and reads a Lance dataset through the Lance REST namespace. + +```shell +pip install lance-ray +``` +Please note that Ray will also be installed if not already present. Currently lance-ray is only tested with Ray version 2.41.0 to 2.50.0, please ensure Ray version compatibility in your environment. + Review Comment: The compatibility information for Ray versions is stated in prose but not in a machine-readable format that matches the compatibility matrix table. Consider adding Ray versions to the compatibility matrix table (similar to how lance-spark and lance-ray connector versions are listed) to provide a complete picture of tested version combinations. This would help users quickly identify compatible versions for their environment. ```suggestion Please note that Ray will also be installed if not already present. ### Ray compatibility | Component | Tested Ray versions | |------------|---------------------| | `lance-ray` | 2.41.0 – 2.50.0 | Ensure that the Ray version in your environment falls within the tested range shown above. ``` ########## docs/lance-rest-service.md: ########## @@ -399,3 +398,7 @@ ns.create_table(create_table_request, body) </TabItem> </Tabs> + +## Integration with Lance REST + +About using Lance REST service with Apache Spark, Ray and other engine, please refer to [lance-rest-integration](./lance-rest-integration.md) for more details. Review Comment: There's a grammatical issue in this sentence. "About using Lance REST service" should be "About using the Lance REST service" or more naturally "For using the Lance REST service" or "To use the Lance REST service". Also, "other engine" should be "other engines" (plural). ```suggestion To use the Lance REST service with Apache Spark, Ray, and other engines, please refer to [lance-rest-integration](./lance-rest-integration.md) for more details. 
``` ########## docs/lance-rest-integration.md: ########## @@ -0,0 +1,134 @@ +--- +title: "Lance REST integration with Spark and Ray" +slug: /lance-rest-integration +keywords: + - lance + - lance-rest + - spark + - ray + - integration +license: "This software is licensed under the Apache License version 2." +--- + +## Overview + +This guide shows how to use the Lance REST service from Apache Gravitino with the [Lance Spark connector](https://lance.org/integrations/spark/) (`lance-spark`), the [Lance Ray connector](https://lance.org/integrations/ray/) (`lance-ray`) and other data processing engines that support Lance format . It builds on the Lance REST service setup described in [Lance REST service](./lance-rest-service). Review Comment: The link reference to the Lance REST service documentation is incomplete. It should include the `.md` extension for consistency with markdown linking practices, or use just `./lance-rest-service` without the extension if following a different convention. Check line 15 where it says `[Lance REST service](./lance-rest-service)` - this should be `[Lance REST service](./lance-rest-service.md)` to match the actual file name. ```suggestion This guide shows how to use the Lance REST service from Apache Gravitino with the [Lance Spark connector](https://lance.org/integrations/spark/) (`lance-spark`), the [Lance Ray connector](https://lance.org/integrations/ray/) (`lance-ray`) and other data processing engines that support Lance format . It builds on the Lance REST service setup described in [Lance REST service](./lance-rest-service.md). ``` ########## docs/lance-rest-integration.md: ########## @@ -0,0 +1,134 @@ +--- +title: "Lance REST integration with Spark and Ray" +slug: /lance-rest-integration +keywords: + - lance + - lance-rest + - spark + - ray + - integration +license: "This software is licensed under the Apache License version 2." +--- + +## Overview + +This guide shows how to use the Lance REST service from Apache Gravitino with the [Lance Spark connector](https://lance.org/integrations/spark/) (`lance-spark`), the [Lance Ray connector](https://lance.org/integrations/ray/) (`lance-ray`) and other data processing engines that support Lance format . It builds on the Lance REST service setup described in [Lance REST service](./lance-rest-service). Review Comment: There's a spacing issue in this sentence. There should be no space before the period. It should be "support Lance format." instead of "support Lance format ." ```suggestion This guide shows how to use the Lance REST service from Apache Gravitino with the [Lance Spark connector](https://lance.org/integrations/spark/) (`lance-spark`), the [Lance Ray connector](https://lance.org/integrations/ray/) (`lance-ray`) and other data processing engines that support Lance format. It builds on the Lance REST service setup described in [Lance REST service](./lance-rest-service). ``` ########## docs/lance-rest-integration.md: ########## @@ -0,0 +1,134 @@ +--- +title: "Lance REST integration with Spark and Ray" +slug: /lance-rest-integration +keywords: + - lance + - lance-rest + - spark + - ray + - integration +license: "This software is licensed under the Apache License version 2." +--- + +## Overview + +This guide shows how to use the Lance REST service from Apache Gravitino with the [Lance Spark connector](https://lance.org/integrations/spark/) (`lance-spark`), the [Lance Ray connector](https://lance.org/integrations/ray/) (`lance-ray`) and other data processing engines that support Lance format . 
It builds on the Lance REST service setup described in [Lance REST service](./lance-rest-service). + +## Compatibility matrix + +| Gravitino version (Lance REST) | Supported lance-spark versions | Supported lance-ray versions | +|--------------------------------|--------------------------------|------------------------------| +| 1.1.1 | 0.0.10 – 0.0.15 | 0.0.6 – 0.0.8 | + +These version ranges represent combinations that are expected to be compatible. Only a subset of versions within each range may have been explicitly tested, so you should verify a specific connector version in your own environment. + +Why does we need to maintain compatibility matrix? As Lance and Lance connectors are actively developed, some APIs and features may change over time. Gravitino's Lance REST service relies on specific versions of these connectors to ensure seamless integration and functionality. Using incompatible versions may lead to unexpected behavior or errors. + + +## Prerequisites + +- Gravitino server running with Lance REST service enabled (default endpoint: `http://localhost:9101/lance`). +- A Lance catalog created in Gravitino via Lance REST namespace API(see `CreateNamespace` in [docs](./lance-rest-service.md)) or Gravitino REST API, for example `lance_catalog`. +- Downloaded `lance-spark` bundle JAR that matches your Spark version (set the absolute path in the examples below). +- Python environments with required packages: + - Spark: `pyspark` + - Ray: `ray`, `lance-namespace`, `lance-ray` + +## Using Lance REST with Spark + +The example below starts a local PySpark session that talks to Lance REST and creates a table through Spark SQL. + +```python +from pyspark.sql import SparkSession +import os +import logging +logging.basicConfig(level=logging.INFO) + +# Point to your downloaded lance-spark bundle. +os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/lance-spark-bundle-3.5_2.12-0.0.15.jar --conf \"spark.driver.extraJavaOptions=--add-opens=java.base/sun.nio.ch=ALL-UNNAMED\" --conf \"spark.executor.extraJavaOptions=--add-opens=java.base/sun.nio.ch=ALL-UNNAMED\" --master local[1] pyspark-shell" + +# Create the Lance catalog named "lance_catalog" in Gravitino beforehand. +spark = SparkSession.builder \ + .appName("lance_rest_example") \ + .config("spark.sql.catalog.lance", "com.lancedb.lance.spark.LanceNamespaceSparkCatalog") \ + .config("spark.sql.catalog.lance.impl", "rest") \ + .config("spark.sql.catalog.lance.uri", "http://localhost:9101/lance") \ + .config("spark.sql.catalog.lance.parent", "lance_catalog") \ + .config("spark.sql.defaultCatalog", "lance") \ + .getOrCreate() + +spark.sparkContext.setLogLevel("DEBUG") + +# Create schema and table, write, then read data. +spark.sql("create database schema") +spark.sql(""" +create table schema.sample(id int, score float) +USING lance +LOCATION '/tmp/schema/sample.lance/' +TBLPROPERTIES ('format' = 'lance') +""") +spark.sql(""" +insert into schema.sample values(1, 1.1) +""") +spark.sql("select * from schema.sample").show() +``` + +:::note +The line `LOCATION '/tmp/schema/sample.lance/'` is optional, if not specified, lance-spark will use try to calculate the location automatically. 
About the location resolution logic, please refer to the [documentation](./lakehouse-generic-catalog.md#catalog-properties) +::: + +The storage location in the example above is local path, if you want to use cloud storage, please refer to the following MinIO example: + +```python +spark.sql(""" +create table schema.sample(id int, score float) +USING lance +LOCATION 's3://bucket/tmp/schema/sample.lance/' +TBLPROPERTIES ( + 'format' = 'lance', + 'lance.storage.access_key_id' = 'ak', + 'lance.storage.endpoint' = 'http://minio:9000', + 'lance.storage.secret_access_key' = 'sk', + 'lance.storage.allow_http' = 'true' + )""") +``` + +## Using Lance REST with Ray + +The snippet below writes and reads a Lance dataset through the Lance REST namespace. + +```shell +pip install lance-ray +``` +Please note that Ray will also be installed if not already present. Currently lance-ray is only tested with Ray version 2.41.0 to 2.50.0, please ensure Ray version compatibility in your environment. + +After installing `lance-ray`, you can run the following Ray script: + +```python +import ray +import lance_namespace as ln +from lance_ray import read_lance, write_lance + +ray.init() + +namespace = ln.connect("rest", {"uri": "http://localhost:9101/lance"}) + +data = ray.data.range(1000).map(lambda row: {"id": row["id"], "value": row["id"] * 2}) + +# Please note that namespace `schema` should also be created via Lance REST API or Gravitino API beforehand. +write_lance(data, namespace=namespace, table_id=["lance_catalog", "schema", "my_table"]) +ray_dataset = read_lance(namespace=namespace, table_id=["lance_catalog", "schema", "my_table"]) + +result = ray_dataset.filter(lambda row: row["value"] < 100).count() +print(f"Filtered count: {result}") +``` + +:::note +- Ensure the target Lance catalog (`lance_catalog`) and schema (`schema`) already exist in Gravitino. +- The table path is represented as `["catalog", "schema", "table"]` when using Lance Ray helpers. +::: + + +## Other engines + +Lance REST can also be used with other engines that support Lance format, such as DuckDB and Pandas. Please refer to the respective [integration documentation](https://lance.org/integrations/datafusion/) for details on how to connect to Lance REST from those engines. Review Comment: The link to Lance integrations documentation appears to be incorrect or misleading. The text mentions "DuckDB and Pandas" but the link points to "datafusion" integration. This link should either point to a general Lance integrations page or be updated to match the engines mentioned in the text. Consider linking to https://lance.org/integrations/ for the general integrations page. ```suggestion Lance REST can also be used with other engines that support Lance format, such as DuckDB and Pandas. Please refer to the respective [integration documentation](https://lance.org/integrations/) for details on how to connect to Lance REST from those engines. ```
