[
https://issues.apache.org/jira/browse/SPARK-35803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17371548#comment-17371548
]
David Rabinowitz commented on SPARK-35803:
------------------------------------------
Using regular spark-shell:
{code:java}
spark-shell --packages com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.21.1
scala> val df1 = spark.read.format("bigquery").load("bigquery-public-data.samples.shakespeare")
df1: org.apache.spark.sql.DataFrame = [word: string, word_count: bigint ... 2 more fields]
scala> df1.count
res0: Long = 164656
scala> val df2 = spark.read.format("com.google.cloud.spark.bigquery.v2.BigQueryDataSourceV2").load("bigquery-public-data.samples.shakespeare")
df2: org.apache.spark.sql.DataFrame = [word: string, word_count: bigint ... 2 more fields]
scala> df2.count
res1: Long = 164656
{code}
Using spark-sql:
{code:java}
spark-sql --packages com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.21.1
spark-sql> CREATE or REPLACE GLOBAL TEMPORARY VIEW s1 USING bigquery options (table 'bigquery-public-data.samples.shakespeare');
Time taken: 2.143 seconds
spark-sql> select count(*) from global_temp.s1;
21/06/29 17:51:34 INFO com.google.cloud.spark.bigquery.direct.DirectBigQueryRelation: Querying table bigquery-public-data.samples.shakespeare, parameters sent from Spark: requiredColumns=[], filters=[]
21/06/29 17:51:34 INFO com.google.cloud.spark.bigquery.direct.DirectBigQueryRelation: Going to read from bigquery-public-data.samples.shakespeare columns=[], filter=''
21/06/29 17:51:34 INFO com.google.cloud.spark.bigquery.direct.DirectBigQueryRelation: Used optimized BQ count(*) path. Count: 164656
164656
Time taken: 3.767 seconds, Fetched 1 row(s)
spark-sql> CREATE or REPLACE GLOBAL TEMPORARY VIEW s2 USING com.google.cloud.spark.bigquery.v2.BigQueryDataSourceV2 options (table 'bigquery-public-data.samples.shakespeare');
Error in query: com.google.cloud.spark.bigquery.v2.BigQueryDataSourceV2 is not a valid Spark SQL Data Source.;
{code}
Both runs used Spark 2.4.8 with Scala 2.12 (Dataproc image 1.5). The same code
path exists in Spark 3 as well. The code for the connector is at
https://github.com/GoogleCloudDataproc/spark-bigquery-connector
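To illustrate the dispatch described in the issue text quoted below, here is a minimal, Spark-free sketch in Scala. Everything in it is a simplified stand-in and not Spark's actual code: the traits only borrow the names BaseRelation, RelationProvider and DataSourceV2, createRelation takes a reduced parameter list, and V1Source, V2OnlySource and ResolveRelationSketch are hypothetical names introduced for the example. It only shows why a match that looks for RelationProvider alone rejects a source that implements nothing but the V2 interface.

```scala
// Toy model only: simplified stand-ins that borrow Spark's names, not Spark's real classes.
trait BaseRelation
trait RelationProvider {
  // Assumption: reduced signature for illustration; the real trait takes more arguments.
  def createRelation(options: Map[String, String]): BaseRelation
}
trait DataSourceV2 // stand-in for the V2 interface

// Hypothetical V1-style source: implements RelationProvider, so the match below finds it.
class V1Source extends RelationProvider {
  def createRelation(options: Map[String, String]): BaseRelation = new BaseRelation {}
}

// Hypothetical V2-only source: implements nothing the V1-style match looks for.
class V2OnlySource extends DataSourceV2

object ResolveRelationSketch {
  // Models the shape of the behaviour reported in the issue: only the V1 trait is matched,
  // and there is no branch for DataSourceV2.
  def resolveRelation(source: AnyRef, options: Map[String, String]): Either[String, BaseRelation] =
    source match {
      case provider: RelationProvider => Right(provider.createRelation(options))
      case other => Left(s"${other.getClass.getName} is not a valid Spark SQL Data Source.")
    }

  def main(args: Array[String]): Unit = {
    val options = Map("table" -> "bigquery-public-data.samples.shakespeare")
    println(resolveRelation(new V1Source, options).isRight)     // V1 source resolves
    println(resolveRelation(new V2OnlySource, options).isLeft)  // V2-only source is rejected
  }
}
```

In this model, supporting V2 sources would mean adding a branch for DataSourceV2 to the match, which is essentially what the issue asks Spark SQL to do.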
> Spark SQL does not support creating views using DataSource v2 based data
> sources
> --------------------------------------------------------------------------------
>
> Key: SPARK-35803
> URL: https://issues.apache.org/jira/browse/SPARK-35803
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Affects Versions: 2.4.8, 3.1.2
> Reporter: David Rabinowitz
> Priority: Major
>
> When a temporary view is created in Spark SQL using an external data source,
> Spark tries to create the relevant relation using the
> DataSource.resolveRelation() method. Unlike DataFrameReader.load(),
> resolveRelation() does not check whether the provided DataSource implements
> the DataSourceV2 interface and instead tries to use the RelationProvider
> trait to generate the Relation.
> Furthermore, DataSourceV2Relation is not a subclass of BaseRelation, so it
> cannot be used in resolveRelation().
> Finally, I tried to implement the RelationProvider trait in my Java
> implementation of DataSourceV2, but the match inside resolveRelation() did
> not detect it as a RelationProvider.
>