[jira] [Commented] (SPARK-35803) Spark SQL does not support creating views using DataSource v2 based data sources

David Rabinowitz (Jira) Tue, 29 Jun 2021 10:59:07 -0700


    [ https://issues.apache.org/jira/browse/SPARK-35803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17371548#comment-17371548 ]


David Rabinowitz commented on SPARK-35803:
------------------------------------------

Using regular spark-shell:
{code:java}
spark-shell --packages com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.21.1
scala> val df1 = spark.read.format("bigquery").load("bigquery-public-data.samples.shakespeare")
df1: org.apache.spark.sql.DataFrame = [word: string, word_count: bigint ... 2 more fields]

scala> df1.count
res0: Long = 164656

scala> val df2 = spark.read.format("com.google.cloud.spark.bigquery.v2.BigQueryDataSourceV2").load("bigquery-public-data.samples.shakespeare")
df2: org.apache.spark.sql.DataFrame = [word: string, word_count: bigint ... 2 more fields]

scala> df2.count
res1: Long = 164656
{code}
Using spark-sql:


{code:java}
spark-sql --packages com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.21.1
spark-sql> CREATE or REPLACE GLOBAL TEMPORARY VIEW s1 USING bigquery options (table 'bigquery-public-data.samples.shakespeare');
Time taken: 2.143 seconds
spark-sql> select count(*) from global_temp.s1;
21/06/29 17:51:34 INFO com.google.cloud.spark.bigquery.direct.DirectBigQueryRelation: Querying table bigquery-public-data.samples.shakespeare, parameters sent from Spark: requiredColumns=[], filters=[]
21/06/29 17:51:34 INFO com.google.cloud.spark.bigquery.direct.DirectBigQueryRelation: Going to read from bigquery-public-data.samples.shakespeare columns=[], filter=''
21/06/29 17:51:34 INFO com.google.cloud.spark.bigquery.direct.DirectBigQueryRelation: Used optimized BQ count(*) path. Count: 164656
164656
Time taken: 3.767 seconds, Fetched 1 row(s)
spark-sql> CREATE or REPLACE GLOBAL TEMPORARY VIEW s2 USING com.google.cloud.spark.bigquery.v2.BigQueryDataSourceV2 options (table 'bigquery-public-data.samples.shakespeare');
Error in query: com.google.cloud.spark.bigquery.v2.BigQueryDataSourceV2 is not a valid Spark SQL Data Source.;
{code}

Both runs used Spark 2.4.8 with Scala 2.12 (Dataproc image 1.5). The same code 
path exists in Spark 3 as well. The code for the connector is at 
https://github.com/GoogleCloudDataproc/spark-bigquery-connector

 

> Spark SQL does not support creating views using DataSource v2 based data 
> sources
> --------------------------------------------------------------------------------
>
>                 Key: SPARK-35803
>                 URL: https://issues.apache.org/jira/browse/SPARK-35803
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 2.4.8, 3.1.2
>            Reporter: David Rabinowitz
>            Priority: Major
>
> When a temporary view is created in Spark SQL using an external data source, 
> Spark tries to create the relevant relation using the 
> DataSource.resolveRelation() method. Unlike DataFrameReader.load(), 
> resolveRelation() does not check whether the provided DataSource implements 
> the DataSourceV2 interface; instead, it tries to use the RelationProvider 
> trait to generate the Relation.
> Furthermore, DataSourceV2Relation is not a subclass of BaseRelation, so it 
> cannot be used in resolveRelation().
> Finally, I tried to implement the RelationProvider trait in my Java 
> implementation of DataSourceV2, but the match inside resolveRelation() did 
> not detect it as a RelationProvider.
>  
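To make the difference between the two paths concrete, here is a minimal sketch in Scala. It is a simplified model under stated assumptions, not Spark's actual code: the traits BaseRelation, RelationProvider and DataSourceV2 are local stand-ins declared inside the sketch, and the names ResolveRelationSketch, V1Source, V2Source, load and resolveRelation (with its className parameter) are hypothetical. It only shows why a source that implements nothing but the V2 interface is accepted by a load()-style check yet falls through a match that looks only for V1 traits.

```scala
// Simplified model of the two dispatch paths. Every type below is a
// hypothetical stand-in defined here, not one of Spark's real classes.
object ResolveRelationSketch {
  trait BaseRelation
  trait RelationProvider {
    def createRelation(options: Map[String, String]): BaseRelation
  }
  trait DataSourceV2

  // A V1-style source that can build a BaseRelation.
  class V1Source extends RelationProvider {
    def createRelation(options: Map[String, String]): BaseRelation =
      new BaseRelation {}
  }

  // A source that implements only the V2 marker interface.
  class V2Source extends DataSourceV2

  // Models the load() path: the V2 interface is checked first.
  def load(source: AnyRef, options: Map[String, String]): String =
    source match {
      case _: DataSourceV2 => "v2"
      case p: RelationProvider =>
        p.createRelation(options)
        "v1"
      case _ => "unsupported"
    }

  // Models the resolveRelation() path: only the V1 trait is matched, so a
  // V2-only source falls through to the error case.
  def resolveRelation(
      className: String,
      source: AnyRef,
      options: Map[String, String]): Either[String, BaseRelation] =
    source match {
      case p: RelationProvider => Right(p.createRelation(options))
      case _ => Left(s"$className is not a valid Spark SQL Data Source.")
    }

  def main(args: Array[String]): Unit = {
    val opts = Map("table" -> "bigquery-public-data.samples.shakespeare")
    println(load(new V2Source, opts))
    println(resolveRelation("V2Source", new V2Source, opts))
  }
}
```

In this model the same V2Source instance is handled by load() but rejected by resolveRelation(), which mirrors the spark-shell versus spark-sql behaviour shown in the comment above. The sketch does not attempt to explain why a Java class that also implements RelationProvider was not matched.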



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
