jiayuasu opened a new pull request, #2637: URL: https://github.com/apache/sedona/pull/2637
## Did you read the Contributor Guide?

- Yes, I have read the [Contributor Rules](https://sedona.apache.org/latest/community/rule/) and [Contributor Development Guide](https://sedona.apache.org/latest/community/develop/)

## Is this PR related to a ticket?

- Yes, and the PR name follows the format `[SEDONA-XXX] my subject`. Closes #2360

## What changes were proposed in this PR?

This PR enables the libpostal model data (`spark.sedona.libpostal.dataDir`) to be loaded from remote filesystems such as HDFS, S3 (`s3a://`), GCS (`gs://`), and ABFS (`abfs://`), in addition to local paths.

### Problem

jpostal requires the ~2 GB libpostal model data to reside on the **local** filesystem. In cloud deployments, users must manually pre-install the data on every executor node, which is operationally cumbersome.

### Solution

When the configured `dataDir` points to a remote URI, the data is automatically downloaded to a per-node local cache directory (`java.io.tmpdir/sedona-libpostal-cache/<hash>`) before jpostal is initialized.

Key design decisions:

- **Shared utility**: Introduced `HadoopFileSystemUtils` with `isLocalFileSystem`, `copyFileToLocal`, and `copyDirectoryToLocal` (see the first sketch below). The existing GeoPackage `FileSystemUtils` is refactored to delegate to this shared utility, with zero changes to its 9 call sites.
- **`LibPostalDataLoader`**: New object that resolves remote paths to cached local directories. It uses the SHA-256 hash of the remote URI as the cache key, and a `.sedona_libpostal_complete` marker file to detect partial copies (see the second sketch below).
- **Thread safety**: `ConcurrentHashMap`-based per-cache-key locking with double-checked locking prevents concurrent downloads of the same data. Lock entries are cleaned up after a successful copy.
- **Hadoop credentials**: Uses `SparkHadoopUtil.get.conf` (with a fallback to `new Configuration()`) to inherit `spark.hadoop.*` config (e.g., S3 credentials) from the Spark session.
- **Auto-download disabled for remote paths**: When data is fetched remotely, jpostal's own download mechanism is disabled to avoid redundant network calls.
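For illustration, here is a minimal sketch of what the shared utility could look like. The object and method names come from this description; the parameter order, method bodies, and reliance on Hadoop's `copyToLocalFile` are assumptions, not the PR's actual implementation.

```scala
import java.net.URI

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, LocalFileSystem, Path}

object HadoopFileSystemUtils {

  /** True when the path resolves to the local filesystem (bare paths, file:// URIs). */
  def isLocalFileSystem(conf: Configuration, path: String): Boolean =
    FileSystem.get(new URI(path), conf).isInstanceOf[LocalFileSystem]

  /** Copy a single file from any Hadoop-supported filesystem to a local destination. */
  def copyFileToLocal(conf: Configuration, src: String, dst: String): Unit = {
    val srcPath = new Path(src)
    val fs = srcPath.getFileSystem(conf) // picks up spark.hadoop.* settings from conf
    fs.copyToLocalFile(false, srcPath, new Path(dst), true)
  }

  /** Recursively copy a directory tree to a local destination.
    * Hadoop's copyToLocalFile already recurses into directories, so this sketch
    * reuses it; the real utility presumably adds validation and error handling. */
  def copyDirectoryToLocal(conf: Configuration, src: String, dst: String): Unit =
    copyFileToLocal(conf, src, dst)
}
```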
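And a minimal sketch of the caching and locking scheme, building on the utility sketch above. The cache-directory layout, marker file, SHA-256 key, and per-key lock cleanup come from this description; the `resolveDataDir(conf, dataDir)` signature and method bodies are illustrative assumptions.

```scala
import java.io.File
import java.nio.charset.StandardCharsets
import java.security.MessageDigest
import java.util.concurrent.ConcurrentHashMap

import org.apache.hadoop.conf.Configuration

object LibPostalDataLoader {
  // One lock object per cache key, so different datasets can download in parallel.
  private val locks = new ConcurrentHashMap[String, Object]()
  private val MarkerName = ".sedona_libpostal_complete"

  private def cacheKey(uri: String): String =
    MessageDigest
      .getInstance("SHA-256")
      .digest(uri.getBytes(StandardCharsets.UTF_8))
      .map("%02x".format(_))
      .mkString

  /** Return a local directory for dataDir, downloading once per node if it is remote. */
  def resolveDataDir(conf: Configuration, dataDir: String): String = {
    if (HadoopFileSystemUtils.isLocalFileSystem(conf, dataDir)) return dataDir

    val key = cacheKey(dataDir)
    val cacheDir =
      new File(System.getProperty("java.io.tmpdir"), s"sedona-libpostal-cache/$key")
    val marker = new File(cacheDir, MarkerName)
    if (marker.exists()) return cacheDir.getAbsolutePath // fast path: already cached

    val lock = locks.computeIfAbsent(key, _ => new Object)
    lock.synchronized {
      // Double-checked: another thread may have completed the copy while we waited.
      if (!marker.exists()) {
        HadoopFileSystemUtils.copyDirectoryToLocal(conf, dataDir, cacheDir.getAbsolutePath)
        marker.createNewFile() // absence of the marker flags a partial copy
      }
    }
    locks.remove(key) // clean up the lock entry after a successful copy
    cacheDir.getAbsolutePath
  }
}
```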
### Files changed

| File | Change |
|------|--------|
| `spark/common/.../utils/HadoopFileSystemUtils.scala` | **New**: shared Hadoop FS → local copy utility |
| `spark/common/.../expressions/LibPostalDataLoader.scala` | **New**: remote path resolution with local caching |
| `spark/common/.../expressions/LibPostalUtils.scala` | **Modified**: call `LibPostalDataLoader.resolveDataDir()` before jpostal init |
| `spark/common/.../geopackage/connection/FileSystemUtils.scala` | **Modified**: delegate to shared `HadoopFileSystemUtils` |
| `docs/api/sql/Function.md` | **Modified**: document remote URI support for `ExpandAddress` and `ParseAddress` |
| `spark/common/.../sql/HadoopFileSystemUtilsTest.scala` | **New**: tests for shared utility (local, HDFS via MiniDFSCluster) |
| `spark/common/.../sql/LibPostalDataLoaderTest.scala` | **New**: tests for path detection, caching, and 8-thread concurrent access |

## How was this patch tested?

- 22 new unit tests covering:
    - `isLocalFileSystem`: local paths, `file://` URIs
    - `copyFileToLocal`: local passthrough, HDFS round-trip
    - `copyDirectoryToLocal`: recursive directory tree, error handling
    - `isRemotePath`: 12 path/URI forms (local, relative, `file://`, `hdfs://`, `s3a://`, `s3://`, `gs://`, `abfs://`, `wasb://`, empty, Windows)
    - `resolveDataDir`: local path returned unchanged, HDFS round-trip with subdirectories
    - Concurrent access: 8 threads with a `CyclicBarrier` stress test (sketched below)
- All tests use `MiniDFSCluster` for real HDFS operations
- All 22 tests pass: `mvn test -pl spark/common -Dspark=3.5 -Dscala=2.12 -Dtest=none -DfailIfNoTests=false -DwildcardSuites="org.apache.sedona.sql.HadoopFileSystemUtilsTest,org.apache.sedona.sql.LibPostalDataLoaderTest"`
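The shape of that stress test, as a simplified sketch: it reuses the hypothetical `resolveDataDir` from the loader sketch above, and the HDFS URI below stands in for the `MiniDFSCluster` URI the real test would use.

```scala
import java.util.concurrent.{CyclicBarrier, Executors, TimeUnit}

import org.apache.hadoop.conf.Configuration

val threadCount = 8
val barrier = new CyclicBarrier(threadCount)
val pool = Executors.newFixedThreadPool(threadCount)
val conf = new Configuration()
val remoteUri = "hdfs://localhost:8020/libpostal-data" // placeholder for the MiniDFSCluster URI

// Release all threads at once to maximize contention on the same cache key,
// then verify that every thread resolves to the same local directory.
val results = (1 to threadCount).map { _ =>
  pool.submit(new java.util.concurrent.Callable[String] {
    override def call(): String = {
      barrier.await()
      LibPostalDataLoader.resolveDataDir(conf, remoteUri)
    }
  })
}.map(_.get(60, TimeUnit.SECONDS))

assert(results.toSet.size == 1, "all threads should see the same cached directory")
pool.shutdown()
```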
## Did this PR include necessary documentation updates?

- Yes, I have updated the documentation. Added `!!!Note` blocks to both `ST_ExpandAddress` and `ST_ParseAddress` in `docs/api/sql/Function.md` documenting `spark.sedona.libpostal.dataDir` support for remote URIs (`hdfs://`, `s3a://`, `gs://`, `abfs://`).
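To round off the docs note, a hypothetical end-to-end usage sketch. Only `spark.sedona.libpostal.dataDir` and the function names come from this PR; the bucket, path, and sample address are placeholders, and `SedonaContext.create` is assumed to be the usual Sedona session setup.

```scala
import org.apache.sedona.spark.SedonaContext
import org.apache.spark.sql.SparkSession

// Point dataDir at a remote URI instead of pre-installing ~2 GB on every node.
// Bucket and path below are placeholders.
val spark = SedonaContext.create(
  SparkSession
    .builder()
    .appName("libpostal-remote-data")
    .config("spark.sedona.libpostal.dataDir", "s3a://my-bucket/libpostal/")
    .getOrCreate())

// On first use per executor node, the model is copied to the local cache under
// java.io.tmpdir/sedona-libpostal-cache/<hash>, then jpostal initializes as before.
spark.sql("SELECT ST_ParseAddress('781 Franklin Ave Brooklyn NY')").show(truncate = false)
```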