keen85 opened a new issue, #54418: URL: https://github.com/apache/spark/issues/54418
### Description

Currently, the [`pyspark.sql.DataFrame.mergeInto`](https://spark.apache.org/docs/4.0.1/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.mergeInto.html) API (introduced in Spark 4.0) only accepts a table name that must be registered in the Spark Catalog. While Delta Lake's standalone Python API allows `DeltaTable.forPath()`, the native PySpark `mergeInto` method lacks a direct way to target a Delta table (or any supported provider) via a URI/path without first registering it as a table in the catalog.

### Motivation

In many modern Data Lake architectures, data engineers often interact with tables directly via their storage paths to:

1. Avoid catalog overhead for transient or landing data.
2. Operate in environments where a shared Hive Metastore or Unity Catalog might not be the primary source of truth for every directory.
3. Simplify CI/CD pipelines where physical paths are parameterized.

Adding support for paths would bring `mergeInto` in line with other PySpark APIs like `spark.read.load(path)` or `df.write.save(path)`.

### Proposed Change

Modify `pyspark.sql.DataFrame.mergeInto(tableName)` to either:

1. **Automatically detect paths:** If the string starts with a protocol (e.g., `abfss://`, `s3://`) or a forward slash, treat it as a path.
2. **Add an optional parameter:** Add a boolean flag or a specific method to distinguish between a catalog table and a path.

**Proposed Syntax Example:**

```python
# Current limitation: requires catalog registration
# df.mergeInto("prod.db.target_table").whenMatchedUpdateAll().execute()

# Proposed: Reference via path directly
path = "abfss://[email protected]/layer/table_name"

# Option A: String detection (similar to SQL: MERGE INTO delta.`path`)
df.mergeInto(table=f"delta.`{path}`").whenMatchedUpdateAll().execute()

# Option B: Explicit parameter
df.mergeInto(path=path, format="delta").whenMatchedUpdateAll().execute()
```
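For illustration only, the detection rule in option 1 of the proposed change could be sketched as a small standalone check. The helper name `looks_like_path` and the regular expression are assumptions made for this sketch; nothing like this exists in PySpark today, and a real implementation would live wherever Spark resolves the `mergeInto` target.

```python
import re

# Hypothetical helper, not part of PySpark. It sketches the heuristic
# "starts with a protocol (scheme://) or a forward slash" from option 1.
_URI_SCHEME = re.compile(r"^[A-Za-z][A-Za-z0-9+.\-]*://")


def looks_like_path(target: str) -> bool:
    """Return True if `target` looks like a storage path, not a table name."""
    return target.startswith("/") or bool(_URI_SCHEME.match(target))


if __name__ == "__main__":
    candidates = [
        "abfss://container/layer/table_name",  # made-up URI for the demo
        "s3://bucket/layer/table_name",
        "/mnt/lake/layer/table_name",
        "prod.db.target_table",
    ]
    for candidate in candidates:
        print(candidate, "->", looks_like_path(candidate))
```

One limitation of this heuristic: relative paths (for example `data/table_name`) would still be read as table names, which may be an argument for the explicit parameter in option 2.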
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
