jiayuasu opened a new pull request, #2654: URL: https://github.com/apache/sedona/pull/2654
## Did you read the Contributor Guide?

- Yes, I have read the [Contributor Rules](https://sedona.apache.org/latest/community/rule/) and [Contributor Developer Guide](https://sedona.apache.org/latest/community/develop/)

## Is this PR related to a ticket?

- Yes, and the PR name follows the format `[GH-XXX] my subject`. Closes #2651

## What changes were proposed in this PR?

When reading GeoPackage files via the DataSource V2 API, the standard `_metadata` hidden column (containing `file_path`, `file_name`, `file_size`, `file_block_start`, `file_block_length`, `file_modification_time`) was missing from the DataFrame, because `GeoPackageTable` did not implement Spark's `SupportsMetadataColumns` interface.

This PR implements `_metadata` support across all four Spark version modules (3.4, 3.5, 4.0, 4.1) by modifying four source files per module:

1. **GeoPackageTable** — Mixes in `SupportsMetadataColumns` and defines the `_metadata` `MetadataColumn` with the standard six-field struct type.
2. **GeoPackageScanBuilder** — Overrides `pruneColumns()` to capture the pruned metadata schema requested by Spark's column pruning optimizer.
3. **GeoPackageScan** — Accepts the `metadataSchema` parameter, overrides `readSchema()` to append metadata fields to the output schema, and passes the schema to the partition reader factory.
4. **GeoPackagePartitionReaderFactory** — Constructs metadata values (path, name, size, block offset/length, modification time) from the `PartitionedFile` and wraps the base reader in a `PartitionReaderWithMetadata` that joins data rows with metadata using `JoinedRow` + `GenerateUnsafeProjection`. Spark's struct pruning is handled correctly by building only the requested sub-fields.

Rough, illustrative sketches of the table-side and reader-side wiring are included at the end of this description.

After this change, users can query `_metadata` on GeoPackage DataFrames just like Parquet/ORC/CSV:

```scala
val df = spark.read.format("geopackage").option("tableName", "my_table").load("/path/to/data.gpkg")
df.select("geometry", "_metadata.file_name", "_metadata.file_size").show()
df.filter($"_metadata.file_name" === "specific.gpkg").show()
```

## How was this patch tested?

8 new test cases were added to `GeoPackageReaderTest` (per Spark version module) covering:

- Schema validation: `_metadata` struct contains all 6 expected fields with correct types
- Hidden column semantics: `_metadata` does not appear in `select(*)` but can be explicitly selected
- Value correctness: `file_path`, `file_name`, `file_size`, `file_block_start`, `file_block_length`, and `file_modification_time` are verified against actual filesystem values using `java.io.File` APIs
- Filtering: `_metadata` fields can be used in `WHERE` clauses
- Projection: `_metadata` fields can be selected alongside data columns

All tests pass on all four Spark versions:

- spark-3.4 (Scala 2.12): 18 tests passed
- spark-3.5 (Scala 2.12): 18 tests passed
- spark-4.0 (Scala 2.13): 18 tests passed
- spark-4.1 (Scala 2.13): 18 tests passed

## Did this PR include necessary documentation updates?

- No, this PR does not affect any public API, so no documentation changes are needed. The `_metadata` column is a standard Spark hidden column that is automatically available to users — no Sedona-specific API changes are introduced.
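For reviewers, here is a minimal sketch of the table-side wiring from item 1 above. It is illustrative only: the trait and member names are hypothetical, not the exact identifiers in the diff (the PR mixes `SupportsMetadataColumns` directly into `GeoPackageTable`). It shows how a DS V2 table exposes the standard six-field `_metadata` struct:

```scala
import org.apache.spark.sql.connector.catalog.{MetadataColumn, SupportsMetadataColumns}
import org.apache.spark.sql.types._

// Hypothetical trait; in the PR, SupportsMetadataColumns is mixed directly into GeoPackageTable.
trait GeoPackageMetadataSupport extends SupportsMetadataColumns {

  // The standard six-field struct that Spark file sources expose as `_metadata`.
  private val metadataStruct: StructType = new StructType()
    .add("file_path", StringType, nullable = false)
    .add("file_name", StringType, nullable = false)
    .add("file_size", LongType, nullable = false)
    .add("file_block_start", LongType, nullable = false)
    .add("file_block_length", LongType, nullable = false)
    .add("file_modification_time", TimestampType, nullable = false)

  // Advertise `_metadata` as a hidden metadata column so the analyzer can resolve it.
  override def metadataColumns(): Array[MetadataColumn] = Array(
    new MetadataColumn {
      override def name: String = "_metadata"
      override def dataType: DataType = metadataStruct
      override def isNullable: Boolean = false
    })
}
```

With this in place, `_metadata` references resolve against the GeoPackage relation, and the pruned struct requested by the optimizer is then handed to the scan builder via `pruneColumns()` (item 2).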

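And a similarly hypothetical sketch of the reader-side pattern from item 4 (class and parameter names are illustrative, not the PR's exact `PartitionReaderWithMetadata`): the per-file metadata row is joined onto every data row with `JoinedRow`, then flattened into a single `UnsafeRow` by a generated projection so downstream operators see one contiguous row matching `readSchema()`:

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{BoundReference, JoinedRow, UnsafeProjection}
import org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection
import org.apache.spark.sql.connector.read.PartitionReader
import org.apache.spark.sql.types.StructType

// Hypothetical wrapper name; mirrors the PartitionReaderWithMetadata idea described above.
class MetadataJoiningReader(
    base: PartitionReader[InternalRow], // reads the GeoPackage data columns
    dataSchema: StructType,             // schema of the rows produced by `base`
    metadataSchema: StructType,         // pruned `_metadata` schema requested by Spark
    metadataRow: InternalRow)           // per-file values built from the PartitionedFile
  extends PartitionReader[InternalRow] {

  private val joined = new JoinedRow()

  // Generated projection that copies (data ++ metadata) into one UnsafeRow.
  private val project: UnsafeProjection = GenerateUnsafeProjection.generate(
    (dataSchema ++ metadataSchema).zipWithIndex.map { case (field, ordinal) =>
      BoundReference(ordinal, field.dataType, field.nullable)
    }.toSeq)

  override def next(): Boolean = base.next()

  override def get(): InternalRow = project(joined(base.get(), metadataRow))

  override def close(): Unit = base.close()
}
```

In this sketch, `metadataRow` is assumed to be built once per file by the reader factory from the `PartitionedFile` (path, length, block start/length, modification time), containing only the sub-fields Spark actually requested after struct pruning.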