Well, the Spark to BigQuery API is very efficient at doing what it needs to
do. Personally, I have never found a JDBC connection to BigQuery that works
under all circumstances.

In a typical environment you need to set up your connection variables to
BigQuery from Spark.

These are my recommended settings:

def setSparkConfBQ(spark):
    try:
        spark.conf.set("GcpJsonKeyFile", config['GCPVariables']['jsonKeyFile'])
        spark.conf.set("BigQueryProjectId", config['GCPVariables']['projectId'])
        spark.conf.set("BigQueryDatasetLocation", config['GCPVariables']['datasetLocation'])
        spark.conf.set("google.cloud.auth.service.account.enable", "true")
        spark.conf.set("fs.gs.project.id", config['GCPVariables']['projectId'])
        spark.conf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
        spark.conf.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
        spark.conf.set("temporaryGcsBucket", config['GCPVariables']['tmp_bucket'])
        spark.conf.set("spark.sql.streaming.checkpointLocation", config['GCPVariables']['tmp_bucket'])
        return spark
    except Exception as e:
        print(f"""{e}, quitting""")
        sys.exit(1)

Note the setting for the GCP temporary bucket (temporaryGcsBucket), which is
used to stage Spark writes before the data is pushed into the BigQuery table.
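
For completeness, the write in the other direction follows the same pattern.
This is only a sketch (writeTableToBQ and result_df are illustrative names,
not from the code above); the connector stages the rows in that temporary
bucket and then loads them into the BigQuery table:

def writeTableToBQ(df, mode, dataset, tableName):
    # sketch only: stages data in temporaryGcsBucket, then loads into BigQuery
    try:
        df.write. \
            format("bigquery"). \
            option("credentialsFile", config['GCPVariables']['jsonKeyFile']). \
            option("dataset", dataset). \
            option("table", tableName). \
            mode(mode). \
            save()
    except Exception as e:
        print(f"""{e}, quitting""")
        sys.exit(1)

called as, for example

      writeTableToBQ(result_df, "overwrite",
                     config['GCPVariables']['targetDataset'],
                     config['GCPVariables']['ATable'])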

The connection from Spark to BigQuery itself is pretty simple. For example,
to read from a BQ table you can do:

def loadTableFromBQ(spark, dataset, tableName):
    try:
        read_df = spark.read. \
            format("bigquery"). \
            option("credentialsFile", config['GCPVariables']['jsonKeyFile']). \
            option("dataset", dataset). \
            option("table", tableName). \
            load()
        return read_df
    except Exception as e:
        print(f"""{e}, quitting""")
        sys.exit(1)

and this is how to call it:

      read_df = s.loadTableFromBQ(self.spark,
                                  config['GCPVariables']['targetDataset'],
                                  config['GCPVariables']['ATable'])

OK, each read will be lazily evaluated, bar checking that the underlying
table exists.

The next stage is to create a read_df DataFrame for each table and do the
joins etc. in Spark itself, for example by registering temporary views and
using Spark SQL, as sketched below.
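
As a rough sketch, borrowing the table and column names from your example
below and reusing loadTableFromBQ above:

      # hypothetical table and column names, taken from your SQL script
      read_df_B = s.loadTableFromBQ(self.spark,
                                    config['GCPVariables']['targetDataset'], "tableB")
      read_df_C = s.loadTableFromBQ(self.spark,
                                    config['GCPVariables']['targetDataset'], "tableC")
      read_df_B.createOrReplaceTempView("tableB")
      read_df_C.createOrReplaceTempView("tableC")
      result_df = self.spark.sql("""
          SELECT b.columnB1, c.columnC2
          FROM tableB b, tableC c
      """)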

At times it is more efficient to let BigQuery do the join itself and create
a result set table in the BigQuery dataset that you then import into Spark.
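
If your version of the spark-bigquery connector supports the "query" option
(it needs viewsEnabled and a materialization dataset; check your connector's
documentation), you can push the SQL down directly. A sketch, with the
dataset and column names assumed:

      sqltext = """SELECT b.columnB1, c.columnC2
                   FROM myDataset.tableB b, myDataset.tableC c"""
      result_df = self.spark.read. \
          format("bigquery"). \
          option("credentialsFile", config['GCPVariables']['jsonKeyFile']). \
          option("viewsEnabled", "true"). \
          option("materializationDataset", config['GCPVariables']['targetDataset']). \
          option("query", sqltext). \
          load()

Otherwise you can run the join through the BigQuery API itself and read the
resulting table back with loadTableFromBQ.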

Whichever approach you take there is a solution and, as usual, your mileage
may vary so to speak.

HTH



On Fri, 14 May 2021 at 01:50, bo zhao <zhaobo20082...@gmail.com> wrote:

> Hi Team,
>
> I've followed the Spark community for several years. This is my first time
> asking for help. I hope you can share some experience.
>
> I want to develop a Spark application that processes a SQL script file.
> The data is on BigQuery.
> For example, the sqlscript is:
>
> delete from tableA;
> insert into tableA select b.columnB1, c.columnC2 from tableB b, tableC c;
>
>
> I can parse this file. In my opinion, after parsing the file, the steps
> should be as follows:
>
> #step1. read tableB, tableC into memory (Spark)
> #step2. register views for tableB's dataframe and tableC's dataframe
> #step3. use spark.sql("select b.columnB1, c.columnC2 from tableB b, tableC
> c") to get a new dataframe
> #step4. write the new dataframe to tableA using the "OVERWRITE" mode
>
> My question:
> #1 If there are 10 or more tables, do I need to read each table into
> memory, given that Spark is based on in-memory computation?
> #2 Is there an easier way to deal with my scenario, for example where I
> just define the data source (BigQuery) and parse the SQL script file, and
> the rest is run by Spark?
>
> Please share your experience or ideas.
>
