jiayuasu opened a new pull request, #2655:
URL: https://github.com/apache/sedona/pull/2655
## Problem
When reading shapefiles from S3 using Spark DataSource V2, users see
spurious `FileNotFoundException` warnings like:
```
WARN FileStreamSink: Assume no metadata directory. Error while looking for
metadata directory in the path: s3a://bucket/path/file.???
```
## Root Cause
Spark's `FileTable.fileIndex` lazy val calls `FileStreamSink.hasMetadata()`,
which tries to stat the input paths as directories. For shapefiles,
`ShapefileDataSource.transformPaths()` converts `.shp` paths to glob patterns
(e.g., `file.???`). When `hasMetadata` tries
`fs.getFileStatus(new Path("file.???"))` on S3, it throws
`FileNotFoundException`, which is caught and logged as a WARN. This check is
only relevant for streaming sinks, not batch read-only sources.
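To make the failure mode concrete, here is a minimal sketch that models the two lookups without Spark or Hadoop. Everything in it is hypothetical: `GlobStatDemo`, `toGlobPattern`, `statLiteral`, and `expandGlob` are illustration-only names, the object store is a plain `Set[String]`, and the `.shp` rewrite is an assumption about what `transformPaths()` does based on the `file.???` example above.

```scala
import java.util.regex.Pattern

object GlobStatDemo {
  // Hypothetical stand-in for the `.shp` rewrite: the extension becomes `???`
  // so that sibling files such as `.shx` and `.dbf` match the same pattern.
  def toGlobPattern(path: String): String =
    if (path.toLowerCase.endsWith(".shp")) path.dropRight(3) + "???" else path

  // Models a literal stat: the pattern is looked up as if it were a real key.
  def statLiteral(store: Set[String], path: String): Boolean =
    store.contains(path)

  // Models glob expansion: each `?` matches exactly one non-separator character.
  def expandGlob(store: Set[String], pattern: String): Seq[String] = {
    val regex = pattern.map {
      case '?' => "[^/]"
      case c   => Pattern.quote(c.toString)
    }.mkString
    store.filter(_.matches(regex)).toSeq.sorted
  }
}
```

With a store holding `file.shp`, `file.shx`, and `file.dbf` under `s3a://bucket/path/`, `statLiteral` on the rewritten pattern finds nothing (the analogue of the `FileNotFoundException`), while `expandGlob` returns all three files.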
## Fix
Override `fileIndex` in `ShapefileTable`, `GeoPackageTable`, and
`GeoParquetMetadataTable` to construct the `InMemoryFileIndex` directly,
skipping the `FileStreamSink.hasMetadata()` check.
- `org.ap[…].scala` in `[…]on`: bridge to access package-private
`DataSource.checkAndGlobPathIfNecessary`
- **Modified**: `ShapefileTable.scala` (4 Spark versions) — override
`fileIndex`
- **Modified**: `GeoPackageTable.scala` (4 Spark versions) — override
`fileIndex`
- **Modified**: `GeoParquetMetadataTable.scala` (4 Spark versions) —
override `fileIndex`
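The shape of the override can be sketched with a toy model that uses no Spark classes. This is not the code in this PR: `BaseTable`, `BatchOnlyTable`, `hasMetadata`, `buildIndex`, and the `log` buffer are hypothetical names, and a `Seq[String]` stands in for the real file index. It only illustrates how overriding a `lazy val` in a subclass bypasses a probe that the base implementation always runs.

```scala
import scala.collection.mutable.ArrayBuffer

object FileIndexOverrideDemo {
  // Hypothetical stand-in for a base table whose index first probes for
  // streaming-sink metadata and only then lists the input files.
  class BaseTable(val paths: Seq[String]) {
    // Records which steps ran, so the effect of the override is observable.
    val log: ArrayBuffer[String] = ArrayBuffer.empty[String]

    protected def hasMetadata(): Boolean = {
      log += "metadata-probe" // in the real case this is where the WARN originates
      false
    }

    protected def buildIndex(): Seq[String] = {
      log += "build-index"
      paths
    }

    lazy val fileIndex: Seq[String] =
      if (hasMetadata()) Seq.empty // streaming branch, not modelled here
      else buildIndex()
  }

  // Batch-only table: builds the index directly and never runs the probe.
  class BatchOnlyTable(inputPaths: Seq[String]) extends BaseTable(inputPaths) {
    override lazy val fileIndex: Seq[String] = buildIndex()
  }
}
```

Accessing `fileIndex` on a `BaseTable` logs both `metadata-probe` and `build-index`; on a `BatchOnlyTable` only `build-index` is logged, while the resulting index is the same.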
## Testing
All existing tests pass across all 4 Spark versions (3.4, 3.5, 4.0, 4.1):
- 23 Shapefile tests per version
- 10 GeoPackage tests per version
- 7 GeoParquet Metadata tests per version
Verified that the `Assume no metadata directory` warning is no longer
emitted for shapefile/geopackage/geoparquet reads (only Spark's built-in CSV
reader still produces it for its own `*.csv` glob paths).
Fixes #2650