[ 
https://issues.apache.org/jira/browse/SPARK-41666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk updated SPARK-41666:
-----------------------------
    Description: 
Enhance the PySpark SQL API with support for parameterized SQL statements to 
improve security and reusability. Application developers will be able to write 
SQL with parameter markers whose values will be passed separately from the SQL 
code and interpreted as literals. This will help prevent SQL injection attacks 
for applications that generate SQL based on a user’s selections, which is often 
done via a user interface.
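
To illustrate the difference, compare naive string interpolation, where crafted input can change the query's structure, with passing the value separately. This is a hedged sketch that uses the parameter-marker syntax from this ticket's example below:

{code:python}
# Unsafe: user input is spliced into the SQL text, so a value such as
# "0 OR 1 = 1" rewrites the predicate and returns every row.
user_input = "0 OR 1 = 1"
spark.sql(f"SELECT id FROM range(10) WHERE id = {user_input}").show()

# Safe: the value travels outside the SQL text and is interpreted as a
# literal, so it can only ever act as a value, never as SQL syntax.
spark.sql("SELECT id FROM range(10) WHERE id = @param1", param1='0').show()
{code}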

PySpark already supports formatting of sqlText using the {...} syntax. The API 
should stay the same:

{code:python}
def sql(self, sqlQuery: str, **kwargs: Any) -> DataFrame:
{code}

and support the new parameters through the same API.
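
For reference, a minimal illustration of the existing {...} formatting, which 
substitutes Python values (here a DataFrame, exposed as a temporary view) into 
the SQL text on the Python side before the query reaches the JVM:

{code:python}
>>> df = spark.range(3)
>>> spark.sql("SELECT * FROM {df}", df=df).show()
+---+
| id|
+---+
|  0|
|  1|
|  2|
+---+
{code}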

PySpark *sql()* should pass any unused parameters to the JVM side, where the 
Java sql() method handles them (see the split sketch after the example). For 
example:

{code:python}
>>> mydf = spark.range(10)
>>> spark.sql("SELECT id FROM {mydf} WHERE id % @param1 = 0",
...           mydf=mydf, param1='3').show()
+---+
| id|
+---+
|  0|
|  3|
|  6|
|  9|
+---+
{code}
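
One way the split could work on the Python side is to treat kwargs referenced 
as {name} in the query as formatting arguments and forward the rest to the JVM 
as named parameters. The following is a self-contained sketch of that idea, 
not the actual PySpark implementation; *split_sql_args* is a hypothetical 
helper:

{code:python}
import string
from typing import Any, Dict, Tuple

def split_sql_args(
    sqlQuery: str, kwargs: Dict[str, Any]
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Partition kwargs into {...} formatting arguments (consumed on the
    Python side) and leftover named parameters (forwarded to the JVM)."""
    # string.Formatter().parse yields (literal_text, field_name, spec, conv)
    # for each replacement field in the template.
    referenced = {
        field.split(".")[0].split("[")[0]
        for _, field, _, _ in string.Formatter().parse(sqlQuery)
        if field
    }
    fmt_args = {k: v for k, v in kwargs.items() if k in referenced}
    params = {k: v for k, v in kwargs.items() if k not in referenced}
    return fmt_args, params

# 'mydf' appears as {mydf}, so it is a formatting argument; 'param1' is
# only used as @param1, so it is forwarded as a SQL parameter.
fmt_args, params = split_sql_args(
    "SELECT id FROM {mydf} WHERE id % @param1 = 0",
    {"mydf": "a_view", "param1": "3"},
)
print(fmt_args)  # {'mydf': 'a_view'}
print(params)    # {'param1': '3'}
{code}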


  was:Enhance the Spark SQL API with support for parameterized SQL statements 
to improve security and reusability. Application developers will be able to 
write SQL with parameter markers whose values will be passed separately from 
the SQL code and interpreted as literals. This will help prevent SQL injection 
attacks for applications that generate SQL based on a user’s selections, which 
is often done via a user interface.


> Support parameterized SQL in PySpark
> ------------------------------------
>
>                 Key: SPARK-41666
>                 URL: https://issues.apache.org/jira/browse/SPARK-41666
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 3.4.0
>            Reporter: Max Gekk
>            Assignee: Max Gekk
>            Priority: Major
>             Fix For: 3.4.0
>
>
> Enhance the PySpark SQL API with support for parameterized SQL statements to 
> improve security and reusability. Application developers will be able to 
> write SQL with parameter markers whose values will be passed separately from 
> the SQL code and interpreted as literals. This will help prevent SQL 
> injection attacks for applications that generate SQL based on a user’s 
> selections, which is often done via a user interface.
> PySpark already supports formatting of sqlText using the {...} syntax. The 
> API should stay the same:
> {code:python}
> def sql(self, sqlQuery: str, **kwargs: Any) -> DataFrame:
> {code}
> and support the new parameters through the same API.
> PySpark *sql()* should pass any unused parameters to the JVM side, where the 
> Java sql() method handles them. For example:
> {code:python}
> >>> mydf = spark.range(10)
> >>> spark.sql("SELECT id FROM {mydf} WHERE id % @param1 = 0",
> ...           mydf=mydf, param1='3').show()
> +---+
> | id|
> +---+
> |  0|
> |  3|
> |  6|
> |  9|
> +---+
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
