jiayuasu opened a new pull request, #2654:
URL: https://github.com/apache/sedona/pull/2654

   ## Did you read the Contributor Guide?
   
   - Yes, I have read the [Contributor 
Rules](https://sedona.apache.org/latest/community/rule/) and [Contributor 
Developer Guide](https://sedona.apache.org/latest/community/develop/)
   
   ## Is this PR related to a ticket?
   
   - Yes, and the PR name follows the format `[GH-XXX] my subject`. Closes #2651
   
   ## What changes were proposed in this PR?
   
   When reading GeoPackage files via the DataSource V2 API, the standard 
`_metadata` hidden column (containing `file_path`, `file_name`, `file_size`, 
`file_block_start`, `file_block_length`, `file_modification_time`) was missing 
from the DataFrame. This is because `GeoPackageTable` did not implement Spark's 
`SupportsMetadataColumns` interface.
   
   This PR implements `_metadata` support across all four Spark version modules 
(3.4, 3.5, 4.0, 4.1) by modifying four source files per module:
   
   1. **GeoPackageTable** — Mixes in `SupportsMetadataColumns` and defines the 
`_metadata` `MetadataColumn` with the standard six-field struct type (a sketch 
follows this list).
   2. **GeoPackageScanBuilder** — Overrides `pruneColumns()` to capture the 
pruned metadata schema requested by Spark's column pruning optimizer.
   3. **GeoPackageScan** — Accepts the `metadataSchema` parameter, overrides 
`readSchema()` to append metadata fields to the output schema, and passes the 
schema to the partition reader factory.
   4. **GeoPackagePartitionReaderFactory** — Constructs the metadata values (path, 
name, size, block offset/length, modification time) from the `PartitionedFile` and 
wraps the base reader in a `PartitionReaderWithMetadata` that joins data rows with 
metadata using `JoinedRow` + `GenerateUnsafeProjection` (sketched after the usage 
example below). Struct pruning is handled correctly by building only the requested 
sub-fields.
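   
   For illustration, the table-side change in item 1 might look roughly like the 
trait below. This is a minimal sketch against Spark's DSv2 
`SupportsMetadataColumns` / `MetadataColumn` interfaces; the trait name and exact 
nullability flags are assumptions, not the actual Sedona code.
   
   ```scala
   import org.apache.spark.sql.connector.catalog.{MetadataColumn, SupportsMetadataColumns}
   import org.apache.spark.sql.types._
   
   // Illustrative sketch only: mixing this into the DSv2 Table implementation lets
   // Spark's analyzer resolve a hidden `_metadata` column on the table.
   trait GeoPackageMetadataColumnsSketch extends SupportsMetadataColumns {
   
     // The six standard file-source metadata fields.
     protected val metadataStructType: StructType = new StructType()
       .add("file_path", StringType, nullable = false)
       .add("file_name", StringType, nullable = false)
       .add("file_size", LongType, nullable = false)
       .add("file_block_start", LongType, nullable = false)
       .add("file_block_length", LongType, nullable = false)
       .add("file_modification_time", TimestampType, nullable = false)
   
     override def metadataColumns(): Array[MetadataColumn] = Array(
       new MetadataColumn {
         override def name(): String = "_metadata"
         override def dataType(): DataType = metadataStructType
         override def isNullable(): Boolean = false
       })
   }
   ```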
   
   After this change, users can query `_metadata` on GeoPackage DataFrames just as 
they can with Parquet/ORC/CSV sources:
   
   ```scala
   val df = spark.read
     .format("geopackage")
     .option("tableName", "my_table")
     .load("/path/to/data.gpkg")
   df.select("geometry", "_metadata.file_name", "_metadata.file_size").show()
   df.filter($"_metadata.file_name" === "specific.gpkg").show()
   ```
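   
   For reference, the row-joining wrapper described in item 4 could be sketched as 
below. All names are illustrative, the full six-field struct is assumed (the actual 
code builds only the pruned sub-fields), and this is not the exact Sedona 
implementation.
   
   ```scala
   import java.util.concurrent.TimeUnit
   
   import org.apache.spark.sql.catalyst.InternalRow
   import org.apache.spark.sql.catalyst.expressions.{BoundReference, JoinedRow}
   import org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection
   import org.apache.spark.sql.connector.read.PartitionReader
   import org.apache.spark.sql.types.StructType
   import org.apache.spark.unsafe.types.UTF8String
   
   // Illustrative sketch only: appends a per-file metadata struct to every data row.
   class PartitionReaderWithMetadataSketch(
       base: PartitionReader[InternalRow],
       dataSchema: StructType,
       metadataSchema: StructType, // e.g. a single `_metadata` struct field
       filePath: String,
       fileName: String,
       fileSize: Long,
       blockStart: Long,
       blockLength: Long,
       modificationTimeMs: Long)
     extends PartitionReader[InternalRow] {
   
     // One constant row holding the `_metadata` struct for this file.
     private val metadataRow: InternalRow = InternalRow(
       InternalRow(
         UTF8String.fromString(filePath),
         UTF8String.fromString(fileName),
         fileSize,
         blockStart,
         blockLength,
         // Spark stores timestamps as microseconds.
         TimeUnit.MILLISECONDS.toMicros(modificationTimeMs)))
   
     private val joinedRow = new JoinedRow()
   
     // Code-generated projection that flattens (data row ++ metadata row) into one UnsafeRow.
     private val fullSchema = StructType(dataSchema.fields ++ metadataSchema.fields)
     private val toUnsafe = GenerateUnsafeProjection.generate(
       fullSchema.zipWithIndex.map { case (field, i) =>
         BoundReference(i, field.dataType, field.nullable)
       })
   
     override def next(): Boolean = base.next()
   
     override def get(): InternalRow = toUnsafe(joinedRow(base.get(), metadataRow))
   
     override def close(): Unit = base.close()
   }
   ```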
   
   ## How was this patch tested?
   
   Eight new test cases were added to `GeoPackageReaderTest` in each Spark version 
module, covering:
   
   - Schema validation: `_metadata` struct contains all 6 expected fields with 
correct types
   - Hidden column semantics: `_metadata` does not appear in `SELECT *` output but 
can be explicitly selected
   - Value correctness: `file_path`, `file_name`, `file_size`, 
`file_block_start`, `file_block_length`, and `file_modification_time` are 
verified against actual filesystem values obtained through `java.io.File` APIs 
(a sketch follows this list)
   - Filtering: `_metadata` fields can be used in `WHERE` clauses
   - Projection: `_metadata` fields can be selected alongside data columns
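   
   A minimal sketch of the value-correctness check, assuming an existing 
`SparkSession` named `spark` and reusing the illustrative path and table name from 
the example above (not the actual test code):
   
   ```scala
   import java.io.File
   
   val gpkgPath = "/path/to/data.gpkg"
   
   val df = spark.read
     .format("geopackage")
     .option("tableName", "my_table")
     .load(gpkgPath)
   
   val row = df
     .select("_metadata.file_name", "_metadata.file_size", "_metadata.file_modification_time")
     .head()
   
   val file = new File(gpkgPath)
   assert(row.getString(0) == file.getName)
   assert(row.getLong(1) == file.length())
   // java.io.File reports the modification time in milliseconds.
   assert(row.getTimestamp(2).getTime == file.lastModified())
   ```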
   
   All tests pass on all four Spark versions:
   - spark-3.4 (Scala 2.12): 18 tests passed
   - spark-3.5 (Scala 2.12): 18 tests passed
   - spark-4.0 (Scala 2.13): 18 tests passed
   - spark-4.1 (Scala 2.13): 18 tests passed
   
   ## Did this PR include necessary documentation updates?
   
   - No. This PR does not affect any public API, so no documentation changes are 
needed. The `_metadata` column is a standard Spark hidden column that is 
automatically available to users; no Sedona-specific API changes are introduced.
   

