Herman van Hovell commented on SPARK-23945:

[~nchammas] we didn't add explicit dataset support because no-one asked for it, 
until now :)....

What do you want to support here? {{(NOT) IN}} and {{EXISTS}}? Or do you also 
want to add support for scalar subqueries, and subqueries in filters?

> Column.isin() should accept a single-column DataFrame as input
> --------------------------------------------------------------
>                 Key: SPARK-23945
>                 URL: https://issues.apache.org/jira/browse/SPARK-23945
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Nicholas Chammas
>            Priority: Minor
> In SQL you can filter rows based on the result of a subquery:
> {code:java}
> FROM table1
> WHERE name NOT IN (
>     SELECT name
>     FROM table2
> );{code}
> In the Spark DataFrame API, the equivalent would probably look like this:
> {code:java}
> (table1
>     .where(
>         ~col('name').isin(
>             table2.select('name')
>         )
>     )
> ){code}
> However, .isin() currently [only accepts a local list of 
> values|http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column.isin].
> I imagine making this enhancement would happen as part of a larger effort to 
> support correlated subqueries in the DataFrame API.
> Or perhaps there is no plan to support this style of query in the DataFrame 
> API, and queries like this should instead be written in a different way? How 
> would we write a query like the one I have above in the DataFrame API, 
> without needing to collect values locally for the NOT IN filter?

This message was sent by Atlassian JIRA

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to