keen85 opened a new issue, #54418:
URL: https://github.com/apache/spark/issues/54418

   ### Description
   
   Currently, the 
[`pyspark.sql.DataFrame.mergeInto`](https://spark.apache.org/docs/4.0.1/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.mergeInto.html)
 API (introduced in Spark 4.0) only accepts a table name that must be 
registered in the Spark Catalog.
   
   While Delta Lake's standalone Python API allows `DeltaTable.forPath()`, the 
native PySpark `mergeInto` method lacks a direct way to target a Delta table 
(or any supported provider) via a URI/path without first registering it as a 
table in the catalog.
   
   ### Motivation
   
   In many modern Data Lake architectures, data engineers often interact with 
tables directly via their storage paths to:
   
   1. Avoid catalog overhead for transient or landing data.
   2. Operate in environments where a shared Hive Metastore or Unity Catalog 
might not be the primary source of truth for every directory.
   3. Simplify CI/CD pipelines where physical paths are parameterized.
   
   Adding support for paths would bring `mergeInto` in line with other PySpark 
APIs like `spark.read.load(path)` or `df.write.save(path)`.
   
   ### Proposed Change
   
   Modify `pyspark.sql.DataFrame.mergeInto(table, condition)` to either:
   
   1. **Automatically detect paths:** If the string starts with a URI scheme 
(e.g., `abfss://`, `s3://`) or a forward slash, treat it as a path.
   2. **Add an optional parameter:** Add an explicit `path` parameter, a 
boolean flag, or a dedicated method to distinguish between a catalog table and 
a path.
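
To make option 1 concrete, here is a minimal sketch of the kind of check it implies. The helper name `looks_like_path` is hypothetical and not part of PySpark; it only encodes the rule stated above (a URI scheme prefix or a leading forward slash) and makes no claim about how Spark would actually implement detection.

```python
import re

# A URI scheme per RFC 3986: a letter, then letters, digits, "+", "-" or ".",
# followed here by "://" (as in abfss:// or s3://).
_SCHEME_RE = re.compile(r"^[A-Za-z][A-Za-z0-9+.\-]*://")


def looks_like_path(target: str) -> bool:
    """Hypothetical heuristic: True if `target` looks like a storage path.

    A string counts as a path when it starts with a URI scheme or with a
    forward slash; anything else is assumed to be a catalog table name.
    """
    return target.startswith("/") or bool(_SCHEME_RE.match(target))


if __name__ == "__main__":
    for candidate in ["s3://bucket/layer/table_name", "/mnt/landing/t", "prod.db.target_table"]:
        print(candidate, looks_like_path(candidate))
```

One trade-off of this approach is ambiguity: a relative path without a scheme or leading slash would still be read as a table name, which is part of why an explicit parameter (option 2) may be easier to reason about.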
   
   **Proposed Syntax Example:**
   
   ```python
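   from pyspark.sql.functions import expr

   # Join condition required by the current signature
   # (illustrative; the column names are assumptions)
   cond = expr("target.id = source.id")

   # Current limitation: requires catalog registration
   # df.mergeInto("prod.db.target_table", cond).whenMatched().updateAll().merge()
   
   # Proposed: Reference via path directly
   path = "abfss://[email protected]/layer/table_name"
   
   # Option A: String detection (similar to SQL: MERGE INTO delta.`path`)
   df.mergeInto(table=f"delta.`{path}`", condition=cond).whenMatched().updateAll().merge()
   
   # Option B: Explicit parameters (hypothetical `path` and `format` keywords)
   df.mergeInto(path=path, format="delta", condition=cond).whenMatched().updateAll().merge()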
   ```
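
Until something like Option A exists, the SQL form mentioned in the comment above (`MERGE INTO delta.`path``) can be assembled as a string and passed to `spark.sql`. The sketch below is only an illustration: the helper names `quote_path_identifier` and `build_merge_sql`, the `target`/`source` aliases, and the temporary view name are all assumptions, and the statement relies on Delta Lake's SQL `MERGE` support rather than on `mergeInto`.

```python
def quote_path_identifier(fmt: str, path: str) -> str:
    """Build a format-qualified path identifier such as delta.`/data/t`."""
    # Inside a backtick-quoted identifier, a literal backtick is doubled.
    escaped = path.replace("`", "``")
    return f"{fmt}.`{escaped}`"


def build_merge_sql(target_path: str, source_view: str, condition: str, fmt: str = "delta") -> str:
    """Return an upsert-style MERGE statement that targets a path (hypothetical helper)."""
    target = quote_path_identifier(fmt, target_path)
    return (
        f"MERGE INTO {target} AS target "
        f"USING {source_view} AS source "
        f"ON {condition} "
        "WHEN MATCHED THEN UPDATE SET * "
        "WHEN NOT MATCHED THEN INSERT *"
    )


if __name__ == "__main__":
    # With a live SparkSession and Delta Lake configured, one might run:
    #   df.createOrReplaceTempView("source_updates")
    #   spark.sql(build_merge_sql(path, "source_updates", "target.id = source.id"))
    print(build_merge_sql("/data/layer/table_name", "source_updates", "target.id = source.id"))
```

This keeps the path out of the catalog, but it gives up the DataFrame-level builder, which is the gap the proposal is meant to close.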



