Hey Mich,
We use Spark 3.2 now. We are using BigQuery (BQ) but migrating away from it because:

  *   It is not reflective of our current lake structure, with all the deltas, history 
tables, model outputs, etc.
  *   It is pretty expensive to load everything into BQ, and it would essentially be 
a copy of all the data in GCS. External tables in BQ didn't work for us, so currently 
we store only the latest snapshots in BQ. This breaks idempotency for models which 
need to time travel and run in the past.
  *   We might move to a different cloud provider in the future, so we want to stay 
cloud agnostic.

So we need an execution engine which has the same overview of the data as 
we have in GCS.
We tried Presto, but its performance was similar and Presto didn't support 
auto-scaling.

TIA
Saurabh
________________________________
From: Mich Talebzadeh <mich.talebza...@gmail.com>
Sent: 22 February 2022 16:49
To: Kidong Lee <mykid...@gmail.com>; Saurabh Gulati <saurabh.gul...@fedex.com>
Cc: user@spark.apache.org
Subject: Re: [EXTERNAL] Re: Need to make WHERE clause compulsory in Spark SQL

Ok interesting.

I am surprised that you are using Hive rather than BigQuery. My assumption is 
that your Spark is version 3.1.1, running on standard GKE with the auto-scaler. What 
benefits are you getting from using Hive here? Since your Hive tables are on 
GCS buckets, you could easily load them into BigQuery and run Spark against 
BigQuery.

HTH

On Tue, 22 Feb 2022 at 15:34, Saurabh Gulati 
<saurabh.gul...@fedex.com> wrote:
Thanks Sean for your response.

@Mich Talebzadeh We run all workloads on GKE 
as Docker containers. So to answer your questions: Hive is running in a 
container as a K8s service, the Spark thrift server in another container as a 
service, and Superset in a third container.

We use the Spark-on-GKE setup to run the thrift server, which spawns workers 
depending on the load. For buckets we use GCS.


TIA
Saurabh
________________________________
From: Mich Talebzadeh <mich.talebza...@gmail.com>
Sent: 22 February 2022 16:05
To: Saurabh Gulati <saurabh.gul...@fedex.com.invalid>
Cc: user@spark.apache.org
Subject: [EXTERNAL] Re: Need to make WHERE clause compulsory in Spark SQL


Is your Hive on-prem, with the external tables in cloud storage?

Where is your Spark running from, and which cloud buckets are you using?

HTH

On Tue, 22 Feb 2022 at 12:36, Saurabh Gulati <saurabh.gul...@fedex.com.invalid> 
wrote:
Hello,
We are trying to set up Spark as the execution engine for exposing our data 
stored in the lake. We have the Hive metastore running along with the Spark thrift 
server, and we are using Superset as the UI.

We save all tables as external tables in the Hive metastore, with storage being on 
the cloud.

We see that right now, when users run a query in Superset SQL Lab, it scans the 
whole table. What we want is to limit the data scanned by setting something like 
hive.mapred.mode=strict in Spark, so that users get an exception if they don't 
specify a partition column in the WHERE clause.

We tried setting spark.hadoop.hive.mapred.mode=strict in spark-defaults.conf 
on the thrift server, but it still scans the whole table.
We also tried setting hive.mapred.mode=strict in hive-defaults.conf for the 
metastore container.
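For clarity, the first attempt looks roughly like this (a sketch only; spark-defaults.conf takes whitespace-separated key/value pairs, and this setting did not have the desired effect in our setup):

```properties
# spark-defaults.conf in the thrift-server container
# spark.hadoop.* properties are passed through to the Hadoop/Hive configuration
spark.hadoop.hive.mapred.mode  strict
```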

We use Spark 3.2 with Hive metastore version 3.1.2.

Is there a way to enforce this via Spark settings?
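Failing a Spark setting, one workaround we could imagine (sketched below; this is a hypothetical query gate in front of the thrift server, not a Spark feature, and the column name `event_date` is made up) is to reject queries that don't filter on a partition column before they are submitted:

```python
import re

def requires_partition_filter(sql: str, partition_cols: list) -> bool:
    """Return True if the query has a WHERE clause that references at least
    one of the given partition columns.

    This is a crude textual check, shown only to illustrate the idea of
    gating queries; a real implementation would inspect the parsed plan
    instead of the raw SQL string."""
    match = re.search(r"\bwhere\b(.*)", sql, re.IGNORECASE | re.DOTALL)
    if not match:
        return False  # no WHERE clause at all
    where_clause = match.group(1).lower()
    return any(col.lower() in where_clause for col in partition_cols)

# Hypothetical partitioned table with partition column 'event_date':
print(requires_partition_filter(
    "SELECT * FROM sales WHERE event_date = '2022-02-22'", ["event_date"]))  # True
print(requires_partition_filter(
    "SELECT * FROM sales", ["event_date"]))  # False
```

A gate like this could run wherever queries enter the system (for example in a small proxy), raising an error instead of forwarding full-scan queries.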


TIA
Saurabh
--

view my LinkedIn profile: https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/

https://en.everybodywiki.com/Mich_Talebzadeh

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.


