Copilot commented on code in PR #2523:
URL: https://github.com/apache/sedona/pull/2523#discussion_r2561046246


##########
spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/stac/StacUtils.scala:
##########
@@ -36,8 +36,37 @@ object StacUtils {
   // Function to load JSON from URL or service
   def loadStacCollectionToJson(opts: Map[String, String]): String = {
     val urlFull: String = getFullCollectionUrl(opts)
+    val headers: Map[String, String] = parseHeaders(opts)
 
-    loadStacCollectionToJson(urlFull)
+    loadStacCollectionToJson(urlFull, headers)
+  }
+
+  /**
+   * Parse headers from the options map.
+   *
+   * Headers can be provided as a JSON string in the "headers" option.
+   *
+   * @param opts
+   *   The options map that may contain a "headers" key with JSON-encoded headers
+   * @return
+   *   Map of header names to values
+   */
+  def parseHeaders(opts: Map[String, String]): Map[String, String] = {
+    opts.get("headers") match {
+      case Some(headersJson) =>
+        try {
+          val mapper = new ObjectMapper()
+          mapper.registerModule(DefaultScalaModule)
+        val headersMap = mapper.readValue(headersJson, classOf[Map[String, String]])

Review Comment:
   Consider reusing the ObjectMapper instance instead of creating a new one for 
each parseHeaders call. ObjectMapper creation is relatively expensive and this 
method could be called frequently during batch processing. Store a 
static/singleton instance with the DefaultScalaModule already registered.
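
   A minimal sketch of that pattern (illustrative, not code from this PR; it assumes the same `parseHeaders` contract shown in the diff):

   ```scala
   import com.fasterxml.jackson.databind.ObjectMapper
   import com.fasterxml.jackson.module.scala.DefaultScalaModule

   object StacUtils {
     // Hypothetical shared mapper: ObjectMapper is thread-safe once configured,
     // so a single lazily initialized instance avoids per-call construction cost.
     private lazy val headerMapper: ObjectMapper =
       new ObjectMapper().registerModule(DefaultScalaModule)

     def parseHeaders(opts: Map[String, String]): Map[String, String] =
       opts.get("headers") match {
         case Some(headersJson) =>
           headerMapper.readValue(headersJson, classOf[Map[String, String]])
         case None => Map.empty
       }
   }
   ```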



##########
docs/tutorial/files/stac-sedona-spark.md:
##########
@@ -269,21 +271,167 @@ client.get_collection("aster-l1t").save_to_geoparquet(
 
 These examples demonstrate how to use the Client class to search for items in a STAC collection with various filters and return the results as either an iterator of PyStacItem objects or a Spark DataFrame.
 
+### Authentication
+
+Many STAC services require authentication to access their data. The STAC client supports multiple authentication methods including HTTP Basic Authentication, Bearer Token Authentication, and custom headers.
+
+#### Basic Authentication
+
+Basic authentication is commonly used with API keys or username/password combinations. Many services (like Planet Labs) use API keys as the username with an empty password.
+
+```python
+from sedona.spark.stac import Client
+
+# Example 1: Using an API key (common pattern)
+client = Client.open("https://api.example.com/stac/v1")
+client.with_basic_auth("your_api_key_here", "")
+
+# Search for items with authentication
+df = client.search(collection_id="example-collection", max_items=10)
+df.show()
+
+# Example 2: Using username and password
+client = Client.open("https://api.example.com/stac/v1")
+client.with_basic_auth("username", "password")
+
+df = client.search(collection_id="example-collection", max_items=10)
+df.show()
+
+# Example 3: Method chaining
+df = (
+    Client.open("https://api.example.com/stac/v1")
+    .with_basic_auth("your_api_key", "")
+    .search(collection_id="example-collection", max_items=10)
+)
+df.show()
+```
+
+#### Bearer Token Authentication
+
+Bearer token authentication is used with OAuth2 tokens and JWT tokens. Note that some services may only support specific authentication methods.
+
+```python
+from sedona.spark.stac import Client
+
+# Using a bearer token
+client = Client.open("https://api.example.com/stac/v1")
+client.with_bearer_token("your_access_token_here")
+
+df = client.search(collection_id="example-collection", max_items=10)
+df.show()
+
+# Method chaining
+df = (
+    Client.open("https://api.example.com/stac/v1")
+    .with_bearer_token("your_token")
+    .search(collection_id="example-collection", max_items=10)
+)
+df.show()
+```
+
+#### Custom Headers
+
+You can also pass custom headers directly when creating the client, which is useful for services with non-standard authentication requirements.
+
+```python
+from sedona.spark.stac import Client
+
+# Using custom headers
+headers = {"Authorization": "Bearer your_token_here", "X-Custom-Header": 
"custom_value"}
+client = Client.open("https://api.example.com/stac/v1";, headers=headers)
+
+df = client.search(collection_id="example-collection", max_items=10)
+df.show()
+```
+
+#### Authentication with Scala DataSource
+
+When using the STAC data source directly (for example from PySpark or Scala), you can pass authentication headers as a JSON-encoded option:
+
+```python
+import json
+from pyspark.sql import SparkSession
+
+# Prepare authentication headers
+headers = {"Authorization": "Basic <base64_encoded_credentials>"}
+headers_json = json.dumps(headers)
+
+# Load STAC data with authentication
+df = (
+    spark.read.format("stac")
+    .option("headers", headers_json)
+    .load("https://api.example.com/stac/v1/collections/example-collection";)
+)
+
+df.show()
+```
+
+```scala
+// Scala example
+val headersJson = """{"Authorization":"Basic <base64_encoded_credentials>"}"""
+
+val df = sparkSession.read
+  .format("stac")
+  .option("headers", headersJson)
+  .load("https://api.example.com/stac/v1/collections/example-collection";)
+
+df.show()
+```
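+
+One way to produce the `<base64_encoded_credentials>` value is with the JDK's `java.util.Base64` (a sketch with illustrative credentials; substitute your own username or API key):
+
+```scala
+import java.nio.charset.StandardCharsets
+import java.util.Base64
+
+// HTTP Basic authentication expects base64("username:password").
+val username = "your_api_key"
+val password = ""
+val encoded = Base64.getEncoder
+  .encodeToString(s"$username:$password".getBytes(StandardCharsets.UTF_8))
+val headersJson = s"""{"Authorization":"Basic $encoded"}"""
+```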
+
+#### Important Notes
+
+- **Authentication methods are mutually exclusive**: Setting a new authentication method will overwrite any previously set Authorization header.

Review Comment:
   This statement could be misleading. While authentication methods do 
overwrite the Authorization header, users can still set other custom headers 
alongside authentication. Consider rephrasing to: 'Setting a new authentication 
method will overwrite any previously set Authorization header, but other custom 
headers remain unchanged.'
   ```suggestion
   - **Authentication methods are mutually exclusive**: Setting a new authentication method will overwrite any previously set Authorization header, but other custom headers remain unchanged.
   ```



##########
spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/stac/StacPartitionReader.scala:
##########
@@ -159,7 +160,20 @@ class StacPartitionReader(
 
     while (attempt < maxRetries && !success) {
       try {
-        fileContent = Source.fromURL(url).mkString
+        if (headers.isEmpty) {
+          fileContent = Source.fromURL(url).mkString
+        } else {
+          val connection = url.openConnection()
+          headers.foreach { case (key, value) =>
+            connection.setRequestProperty(key, value)
+          }
+          val source = Source.fromInputStream(connection.getInputStream)
+          try {
+            fileContent = source.mkString
+          } finally {
+            source.close()
+          }

Review Comment:
   Same connection resource leak issue as in StacUtils. The URLConnection is 
not properly closed. If an exception occurs after opening the connection but 
before or during the source reading, the connection remains open. Consider 
using a nested try-finally or resource management pattern.
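
   A compact sketch of that pattern, reusing the `url`, `headers`, `fileContent`, and `Source` import already present in this file:

   ```scala
   // Open the connection, then make sure both the source and the underlying
   // HTTP connection are released even if reading fails partway through.
   val connection = url.openConnection()
   headers.foreach { case (key, value) =>
     connection.setRequestProperty(key, value)
   }
   var source: Source = null
   try {
     source = Source.fromInputStream(connection.getInputStream)
     fileContent = source.mkString
   } finally {
     if (source != null) source.close()
     connection match {
       case http: java.net.HttpURLConnection => http.disconnect()
       case _ => // non-HTTP connections expose no disconnect hook
     }
   }
   ```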



##########
spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/stac/StacUtils.scala:
##########
@@ -50,18 +79,39 @@ object StacUtils {
     urlFinal
   }
 
-  // Function to load JSON from URL or service
-  def loadStacCollectionToJson(url: String, maxRetries: Int = 3): String = {
+  // Function to load JSON from URL or service with optional headers
+  def loadStacCollectionToJson(
+      url: String,
+      headers: Map[String, String] = Map.empty,
+      maxRetries: Int = 3): String = {
     var retries = 0
     var success = false
     var result: String = ""
 
     while (retries < maxRetries && !success) {
       try {
         result = if (url.startsWith("s3://") || url.startsWith("s3a://")) {
+          // S3 URLs are handled by Spark
           SparkSession.active.read.textFile(url).collect().mkString("\n")
-        } else {
+        } else if (headers.isEmpty) {
+          // No headers - use the simple Source.fromURL approach for backward compatibility
           Source.fromURL(url).mkString
+        } else {
+          // Headers provided - use URLConnection to set custom headers
+          val connection = new java.net.URL(url).openConnection()
+
+          // Set all custom headers
+          headers.foreach { case (key, value) =>
+            connection.setRequestProperty(key, value)
+          }
+
+          // Read the response
+          val source = Source.fromInputStream(connection.getInputStream)
+          try {
+            source.mkString
+          } finally {
+            source.close()

Review Comment:
   The connection resource is not properly closed. If an exception occurs in 
`connection.getInputStream` or `source.mkString`, the connection will remain 
open. Wrap the connection setup and usage in a try-finally block or use a 
resource management pattern to ensure the connection is closed.
   ```suggestion
             var inputStream: java.io.InputStream = null
             var source: Source = null
             try {
               inputStream = connection.getInputStream
               source = Source.fromInputStream(inputStream)
               source.mkString
             } finally {
               if (source != null) {
                 try source.close() catch { case _: Throwable => }
               }
               if (inputStream != null) {
                 try inputStream.close() catch { case _: Throwable => }
               }
               connection match {
                 case httpConn: java.net.HttpURLConnection =>
                   try httpConn.disconnect() catch { case _: Throwable => }
                 case _ =>
               }
   ```



##########
spark/common/src/test/scala/org/apache/spark/sql/sedona_sql/io/stac/StacDataSourceTest.scala:
##########
@@ -163,6 +163,139 @@ class StacDataSourceTest extends TestBaseScala {
     assert(rowCount == 0)
   }
 
+  it("should load STAC data with public service without authentication") {
+    // This test verifies backward compatibility with public STAC services
+    // Set STAC_PUBLIC_URL environment variable to test (e.g., Microsoft Planetary Computer)
+    // Example: https://planetarycomputer.microsoft.com/api/stac/v1/collections/naip
+    val publicUrl = sys.env.get("STAC_PUBLIC_URL")
+
+    if (publicUrl.isDefined) {
+      val dfStac = sparkSession.read
+        .format("stac")
+        .load(publicUrl.get)
+        .limit(10)
+
+      // Verify we can load data
+      assert(dfStac.count() >= 0, "Failed to load data from public STAC service")
+    } else {
+      // Skip test if environment variable is not set
+      cancel(
+        "Skipping public STAC service test - set STAC_PUBLIC_URL to run (e.g., 
https://planetarycomputer.microsoft.com/api/stac/v1/collections/naip)")
+    }
+  }
+
+  // Authentication tests for remote services
+  it("should load STAC data with bearer token authentication") {
+    // NOTE: Planet Labs API requires Basic Auth (not Bearer token) for collections
+    // Bearer token authentication works with other STAC services
+    // Set STAC_AUTH_URL and STAC_BEARER_TOKEN environment variables to test with a service that supports Bearer tokens

Review Comment:
   [nitpick] This comment is helpful but appears in a test for bearer token 
authentication. Consider moving the Planet Labs specific note to the basic 
authentication test where it's more relevant, or rephrasing to focus on the 
general bearer token pattern being tested here.
   ```suggestion
       // This test verifies loading STAC data from services that support Bearer token authentication.
       // Set STAC_AUTH_URL and STAC_BEARER_TOKEN environment variables to test with a compatible service.
       // Note: Not all STAC services support Bearer tokens; some may require other authentication methods.
   ```



##########
python/tests/stac/test_auth_integration.py:
##########
@@ -0,0 +1,193 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+"""
+Integration tests for STAC authentication with real services.
+
+These tests use environment variables to configure authenticated services.
+Tests are skipped when environment variables are not set.
+
+Environment variables:
+    STAC_PUBLIC_URL - Public STAC service URL (e.g., Microsoft Planetary Computer)
+    STAC_AUTH_URL - Authenticated STAC service URL
+    STAC_USERNAME - Username or API key for basic authentication
+    STAC_PASSWORD - Password (can be empty for API keys)
+    STAC_BEARER_TOKEN - Bearer token for token-based authentication
+    STAC_AUTH_URL_REQUIRE_AUTH - URL requiring authentication (for failure testing)
+"""
+
+import os
+import pytest
+from sedona.spark.stac.client import Client
+
+from tests.test_base import TestBase
+
+
+class TestStacAuthIntegration(TestBase):
+    """
+    Integration tests for authenticated STAC services.
+
+    Tests are skipped when required environment variables are not set.
+    """
+
+    @pytest.mark.skipif(
+        not os.getenv("STAC_PUBLIC_URL"),
+        reason="STAC_PUBLIC_URL not set - skip public service test",
+    )
+    def test_public_service_without_authentication(self):
+        """
+        Test that public STAC services work without authentication.
+
+        This verifies backward compatibility - existing code should work unchanged.
+        Set STAC_PUBLIC_URL environment variable to test.
+        Example: https://planetarycomputer.microsoft.com/api/stac/v1/collections/naip
+        """
+        public_url = os.getenv("STAC_PUBLIC_URL")
+
+        client = Client.open(public_url)
+
+        # Try to load data without authentication
+        df = client.search(max_items=10)

Review Comment:
   [nitpick] The docstring references `Client.search()` but the actual call is 
`client.search()`. While functionally equivalent in Python, for consistency 
with the rest of the docstrings in this file and better clarity, update the 
docstring to use lowercase 'client' when referring to an instance.


