Re: Resolves too old JIRAs as incomplete

2021-05-24 Thread Takeshi Yamamuro
šŸ˜Š

On Tue, May 25, 2021 at 11:00 AM Hyukjin Kwon  wrote:

> Awesome, thanks Takeshi!
>
> On Tue, May 25, 2021 at 10:59 AM, Takeshi Yamamuro wrote:
>
>> FYI:
>>
>> Thank you for all the comments.
>> I closed 754 tickets in bulk a few minutes ago.
>> Please let me know if there is any problem.
>>
>> Bests,
>> Takeshi
>>
>> On Fri, May 21, 2021 at 10:29 AM Kent Yao  wrote:
>>
>>> +1, thanks Takeshi
>>>
>>> Kent Yao
>>> @ Data Science Center, Hangzhou Research Institute, NetEase Corp.
>>> a spark enthusiast
>>> kyuubi: a unified multi-tenant JDBC interface for large-scale data
>>> processing and analytics, built on top of Apache Spark.
>>> spark-authorizer: a Spark SQL extension which provides SQL Standard
>>> Authorization for Apache Spark.
>>> spark-postgres: a library for reading data from and transferring data to
>>> Postgres / Greenplum with Spark SQL and DataFrames, 10~100x faster.
>>> itatchi: a library that brings useful functions from various modern
>>> database management systems to Apache Spark.
>>>
>>>
>>> On 05/21/2021 07:12, Takeshi Yamamuro  wrote:
>>> Thank you, all~
>>>
>>> okay, so I will close them in bulk next week.
>>> If you have more comments, please let me know here.
>>>
>>> Bests,
>>> Takeshi
>>>
>>> On Fri, May 21, 2021 at 5:05 AM Mridul Muralidharan 
>>> wrote:
>>>
 +1, thanks Takeshi !

 Regards,
 Mridul

 On Wed, May 19, 2021 at 8:48 PM Takeshi Yamamuro 
 wrote:

> Hi, dev,
>
> As you know, we have too many open JIRAs now:
> # of open JIRAs=2698: JQL='project = SPARK AND status in (Open, "In
> Progress", Reopened)'
>
> We've recently released v2.4.8 (EOL), so I'd like to bulk-close too-old
> JIRAs to keep the backlog manageable.
>
> As Hyukjin did the same action two years ago (for details, see:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Resolving-all-JIRAs-affecting-EOL-releases-td27838.html),
> I'm planning to use a similar JQL below to close them:
>
> project = SPARK AND status in (Open, "In Progress", Reopened) AND
> (affectedVersion = EMPTY OR NOT (affectedVersion in versionMatch("^3.*")))
> AND updated <= -52w
>
> The total number of matched JIRAs is 741.
> Or, we might be able to close them more aggressively by removing the
> version condition:
>
> project = SPARK AND status in (Open, "In Progress", Reopened) AND
> updated <= -52w
>
> The matched number is 1484 (almost half of the current open JIRAs).
>
> If there is no objection, I'd like to do it next week or later.
> Any thoughts?
>
> Bests,
> Takeshi
> --
> ---
> Takeshi Yamamuro
>

>>>
>>> --
>>> ---
>>> Takeshi Yamamuro
>>>
>>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>

-- 
---
Takeshi Yamamuro
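
For anyone repeating this clean-up later: the bulk close can also be scripted
against the JIRA REST API rather than the web UI's bulk-change tool. The
sketch below is only an illustration of the idea, not the exact procedure used
here; the JQL is the one quoted above, but the transition id ("51") and the
credentials are placeholders, and the real values have to be looked up in the
SPARK project's workflow.

    import requests

    JIRA = "https://issues.apache.org/jira"
    AUTH = ("user", "api-token")  # placeholder credentials
    JQL = ('project = SPARK AND status in (Open, "In Progress", Reopened) AND '
           '(affectedVersion = EMPTY OR NOT (affectedVersion in versionMatch("^3.*"))) '
           'AND updated <= -52w')

    def matching_issue_keys():
        """Page through the JIRA search results and yield matching issue keys."""
        start = 0
        while True:
            resp = requests.get(f"{JIRA}/rest/api/2/search", auth=AUTH,
                                params={"jql": JQL, "fields": "key",
                                        "startAt": start, "maxResults": 100})
            resp.raise_for_status()
            issues = resp.json()["issues"]
            if not issues:
                return
            for issue in issues:
                yield issue["key"]
            start += len(issues)

    def close_as_incomplete(key, transition_id="51"):
        """Apply the workflow transition that resolves a ticket as Incomplete.

        The transition id is an assumption; check
        GET /rest/api/2/issue/{key}/transitions for the real value.
        """
        body = {"transition": {"id": transition_id},
                "fields": {"resolution": {"name": "Incomplete"}}}
        requests.post(f"{JIRA}/rest/api/2/issue/{key}/transitions",
                      auth=AUTH, json=body).raise_for_status()

    if __name__ == "__main__":
        for key in matching_issue_keys():
            close_as_incomplete(key)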


Re: Resolves too old JIRAs as incomplete

2021-05-24 Thread Hyukjin Kwon
Awesome, thanks Takeshi!

On Tue, May 25, 2021 at 10:59 AM, Takeshi Yamamuro wrote:

> FYI:
>
> Thank you for all the comments.
> I closed 754 tickets in bulk a few minutes ago.
> Please let me know if there is any problem.
>
> Bests,
> Takeshi
>
> On Fri, May 21, 2021 at 10:29 AM Kent Yao  wrote:
>
>> +1, thanks Takeshi
>>
>> Kent Yao
>> @ Data Science Center, Hangzhou Research Institute, NetEase Corp.
>> a spark enthusiast
>> kyuubi: a unified multi-tenant JDBC interface for large-scale data
>> processing and analytics, built on top of Apache Spark.
>> spark-authorizer: a Spark SQL extension which provides SQL Standard
>> Authorization for Apache Spark.
>> spark-postgres: a library for reading data from and transferring data to
>> Postgres / Greenplum with Spark SQL and DataFrames, 10~100x faster.
>> itatchi: a library that brings useful functions from various modern
>> database management systems to Apache Spark.
>>
>>
>> On 05/21/2021 07:12, Takeshi Yamamuro  wrote:
>> Thank you, all~
>>
>> okay, so I will close them in bulk next week.
>> If you have more comments, please let me know here.
>>
>> Bests,
>> Takeshi
>>
>> On Fri, May 21, 2021 at 5:05 AM Mridul Muralidharan 
>> wrote:
>>
>>> +1, thanks Takeshi !
>>>
>>> Regards,
>>> Mridul
>>>
>>> On Wed, May 19, 2021 at 8:48 PM Takeshi Yamamuro 
>>> wrote:
>>>
 Hi, dev,

 As you know, we have too many open JIRAs now:
 # of open JIRAs=2698: JQL='project = SPARK AND status in (Open, "In
 Progress", Reopened)'

 We've recently released v2.4.8 (EOL), so I'd like to bulk-close too-old
 JIRAs to keep the backlog manageable.

 As Hyukjin did the same action two years ago (for details, see:
 http://apache-spark-developers-list.1001551.n3.nabble.com/Resolving-all-JIRAs-affecting-EOL-releases-td27838.html),
 I'm planning to use a similar JQL below to close them:

 project = SPARK AND status in (Open, "In Progress", Reopened) AND
 (affectedVersion = EMPTY OR NOT (affectedVersion in versionMatch("^3.*")))
 AND updated <= -52w

 The total number of matched JIRAs is 741.
 Or, we might be able to close them more aggressively by removing the
 version condition:

 project = SPARK AND status in (Open, "In Progress", Reopened) AND
 updated <= -52w

 The matched number is 1484 (almost half of the current open JIRAs).

 If there is no objection, I'd like to do it next week or later.
 Any thoughts?

 Bests,
 Takeshi
 --
 ---
 Takeshi Yamamuro

>>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>>
>
> --
> ---
> Takeshi Yamamuro
>


Re: Resolves too old JIRAs as incomplete

2021-05-24 Thread Takeshi Yamamuro
FYI:

Thank you for all the comments.
I closed 754 tickets in bulk a few minutes ago.
Please let me know if there is any problem.

Bests,
Takeshi

On Fri, May 21, 2021 at 10:29 AM Kent Yao  wrote:

> +1, thanks Takeshi
>
> Kent Yao
> @ Data Science Center, Hangzhou Research Institute, NetEase Corp.
> a spark enthusiast
> kyuubi: a unified multi-tenant JDBC interface for large-scale data
> processing and analytics, built on top of Apache Spark.
> spark-authorizer: a Spark SQL extension which provides SQL Standard
> Authorization for Apache Spark.
> spark-postgres: a library for reading data from and transferring data to
> Postgres / Greenplum with Spark SQL and DataFrames, 10~100x faster.
> itatchi: a library that brings useful functions from various modern
> database management systems to Apache Spark.
>
>
> On 05/21/2021 07:12, Takeshi Yamamuro  wrote:
> Thank you, all~
>
> okay, so I will close them in bulk next week.
> If you have more comments, please let me know here.
>
> Bests,
> Takeshi
>
> On Fri, May 21, 2021 at 5:05 AM Mridul Muralidharan 
> wrote:
>
>> +1, thanks Takeshi !
>>
>> Regards,
>> Mridul
>>
>> On Wed, May 19, 2021 at 8:48 PM Takeshi Yamamuro 
>> wrote:
>>
>>> Hi, dev,
>>>
>>> As you know, we have too many open JIRAs now:
>>> # of open JIRAs=2698: JQL='project = SPARK AND status in (Open, "In
>>> Progress", Reopened)'
>>>
>>> We've recently released v2.4.8 (EOL), so I'd like to bulk-close too-old
>>> JIRAs to keep the backlog manageable.
>>>
>>> As Hyukjin did the same action two years ago (for details, see:
>>> http://apache-spark-developers-list.1001551.n3.nabble.com/Resolving-all-JIRAs-affecting-EOL-releases-td27838.html),
>>> I'm planning to use a similar JQL below to close them:
>>>
>>> project = SPARK AND status in (Open, "In Progress", Reopened) AND
>>> (affectedVersion = EMPTY OR NOT (affectedVersion in versionMatch("^3.*")))
>>> AND updated <= -52w
>>>
>>> The total number of matched JIRAs is 741.
>>> Or, we might be able to close them more aggressively by removing the
>>> version condition:
>>>
>>> project = SPARK AND status in (Open, "In Progress", Reopened) AND
>>> updated <= -52w
>>>
>>> The matched number is 1484 (almost half of the current open JIRAs).
>>>
>>> If there is no objection, I'd like to do it next week or later.
>>> Any thoughts?
>>>
>>> Bests,
>>> Takeshi
>>> --
>>> ---
>>> Takeshi Yamamuro
>>>
>>
>
> --
> ---
> Takeshi Yamamuro
>
>

-- 
---
Takeshi Yamamuro


Re: About Spark executs sqlscript

2021-05-24 Thread Mich Talebzadeh
Apologies, I missed your two points.

My question:

#1 If there are 10 or more tables, do I need to read each table into
memory, even though Spark is based on in-memory computation?

Every table will be read as I described above; the read is lazy in Spark.
The computation happens only when there is an action on the underlying read.

Regardless of the database (BigQuery, Hive, etc.), the example below applies:

rows = spark.sql(f"""SELECT COUNT(1) FROM {fullyQualifiedTableName}""").collect()[0][0]


rows returns the row count. You can see what is happening in the Spark UI
under the Storage tab. If you don't have enough memory, the cached data will
be spilled to disk.

(Spark UI screenshot omitted: the Storage tab showing a large in-memory table.)
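
A small illustration of that point (a sketch only; the table name is a
placeholder): persisting the DataFrame with a memory-and-disk storage level
makes the spill behaviour explicit, and the entry then appears under the
Storage tab once an action runs.

    from pyspark.sql import SparkSession
    from pyspark import StorageLevel

    spark = SparkSession.builder.appName("storage-demo").getOrCreate()

    # hypothetical fully qualified table name
    fullyQualifiedTableName = "test.housePrices"

    df = spark.sql(f"SELECT * FROM {fullyQualifiedTableName}")

    # cache what fits in memory, spill the rest to local disk
    df.persist(StorageLevel.MEMORY_AND_DISK)

    # an action materialises the cache; check the Spark UI Storage tab afterwards
    rows = df.count()
    print(f"{rows} rows cached (memory + disk)")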



#2 Is there a much easier way to deal with my scenario? For example, I just
define the data source (BigQuery) and parse the SQL script file, and the rest
is run by Spark.


What do you mean by parsing the SQL script file? Do you mean running a SQL
file against spark-sql?


That will not work, because you need to use the Spark-BigQuery API to access
BigQuery from Spark,


for example using the following jar file, spark-bigquery-latest_2.12.jar:



spark-submit --master local[4] \
  --jars $HOME/jars/spark-bigquery-latest_2.12.jar


And you read each table the way I described before:


# write to the BigQuery table
s.writeTableToBQ(df2, "overwrite",
                 config['GCPVariables']['targetDataset'],
                 config['GCPVariables']['yearlyAveragePricesAllTable'])
print(f"""created {config['GCPVariables']['yearlyAveragePricesAllTable']}""")

# read the data back to ensure it all loaded OK
read_df = s.loadTableFromBQ(self.spark,
                            config['GCPVariables']['targetDataset'],
                            config['GCPVariables']['yearlyAveragePricesAllTable'])

# check that all rows are there
if df2.subtract(read_df).count() == 0:
    print("Data has been loaded OK to BQ table")
else:
    print("Data could not be loaded to BQ table, quitting")
    sys.exit(1)


HTH




   view my Linkedin profile




*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Mon, 24 May 2021 at 20:51, Mich Talebzadeh 
wrote:

> Well, the Spark to BigQuery API is very efficient at doing what it needs to
> do. Personally, I have never found a JDBC connection to BigQuery that works
> under all circumstances.
>
> In a typical environment you need to set up your connection variables to
> BigQuery from Spark.
>
> These are my recommended settings:
>
> def setSparkConfBQ(spark):
> try:
> spark.conf.set("GcpJsonKeyFile",
> config['GCPVariables']['jsonKeyFile'])
> spark.conf.set("BigQueryProjectId",
> config['GCPVariables']['projectId'])
> spark.conf.set("BigQueryDatasetLocation",
> config['GCPVariables']['datasetLocation'])
> spark.conf.set("google.cloud.auth.service.account.enable", "true")
> spark.conf.set("fs.gs.project.id",
> config['GCPVariables']['projectId'])
> spark.conf.set("fs.gs.impl",
> "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
> spark.conf.set("fs.AbstractFileSystem.gs.impl",
> "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
> spark.conf.set("temporaryGcsBucket",
> config['GCPVariables']['tmp_bucket'])
> spark.conf.set("spark.sql.streaming.checkpointLocation",
> config['GCPVariables']['tmp_bucket'])
> return spark
> except Exception as e:
> print(f"""{e}, quitting""")
> sys.exit(1)
>
> Note the setting for GCP temporary bucket for staging Spark writes before
> pushing data into bigQuery table.
>
> The connection from Spark to BigQuery itself is pretty simple. For
> example, to read from a BQ table you can do:
>
> def loadTableFromBQ(spark,dataset,tableName):
> try:
> read_df = spark.read. \
> format("bigquery"). \
> option("credentialsFile",
> config['GCPVariables']['jsonKeyFile']). \
> option("dataset", dataset). \
> option("table", tableName). \
> load()
> return read_df
> except Exception as e:
> print(f"""{e}, quitting""")
> sys.exit(1)
>
> and how to read it
>
>   read_df = s.loadTableFromBQ(self.spark,
> config['GCPVariables']['targetDataset'], config['GCPVariables']['ATable'])
>
> OK each connection will be lazily evaluated bar checking that the
> underlying table exists
>
> The next stage is to create a read_df DataFrame for each table; you then do
> joins etc. in Spark itself.
>
> At times it is more efficient for BigQuery to do the join itself and
> create a result set table in BigQuery dataset that you can import into
> Spark.
>
> Whichever approach you take, there is a solution, and as usual your mileage may vary

Re: Bridging gap between Spark UI and Code

2021-05-24 Thread mhawes
@Wenchen Fan, understood that the mapping of query plan to application code
is very hard. I was wondering if we might be able to instead just handle the
mapping from the final physical plan to the stage graph. So for example
you'd be able to tell what part of the plan generated which stages. I feel
this would provide the most benefit without having to worry about several
optimisation steps.

The main issue as I see it is that currently, if there's a failing stage,
it's almost impossible to track down the part of the plan that generated the
stage. Would this be possible? If not, do you have any other suggestions for
this kind of debugging?

Best,
Matt



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: About Spark executs sqlscript

2021-05-24 Thread Mich Talebzadeh
Well, the Spark to BigQuery API is very efficient at doing what it needs to do.
Personally, I have never found a JDBC connection to BigQuery that works
under all circumstances.

In a typical environment you need to set up your connection variables to
BigQuery from Spark.

These are my recommended settings:

def setSparkConfBQ(spark):
    try:
        spark.conf.set("GcpJsonKeyFile", config['GCPVariables']['jsonKeyFile'])
        spark.conf.set("BigQueryProjectId", config['GCPVariables']['projectId'])
        spark.conf.set("BigQueryDatasetLocation", config['GCPVariables']['datasetLocation'])
        spark.conf.set("google.cloud.auth.service.account.enable", "true")
        spark.conf.set("fs.gs.project.id", config['GCPVariables']['projectId'])
        spark.conf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
        spark.conf.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
        spark.conf.set("temporaryGcsBucket", config['GCPVariables']['tmp_bucket'])
        spark.conf.set("spark.sql.streaming.checkpointLocation", config['GCPVariables']['tmp_bucket'])
        return spark
    except Exception as e:
        print(f"""{e}, quitting""")
        sys.exit(1)

Note the setting of the GCP temporary bucket used for staging Spark writes
before pushing data into the BigQuery table.

The connection from Spark to BigQuery itself is pretty simple. For example,
to read from a BQ table you can do:

def loadTableFromBQ(spark, dataset, tableName):
    try:
        read_df = spark.read. \
            format("bigquery"). \
            option("credentialsFile", config['GCPVariables']['jsonKeyFile']). \
            option("dataset", dataset). \
            option("table", tableName). \
            load()
        return read_df
    except Exception as e:
        print(f"""{e}, quitting""")
        sys.exit(1)

and here is how you call it:

read_df = s.loadTableFromBQ(self.spark,
                            config['GCPVariables']['targetDataset'],
                            config['GCPVariables']['ATable'])

OK, each connection will be lazily evaluated, bar checking that the
underlying table exists.

The next stage is to create a read_df DataFrame for each table; you then do
joins etc. in Spark itself, as sketched below.
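
As a rough sketch of that stage (assuming the loadTableFromBQ helper above;
the table names and the join key are purely illustrative), the per-table
DataFrames can be registered as temp views and joined with plain Spark SQL:

    # purely illustrative table names
    tableB_df = s.loadTableFromBQ(self.spark, config['GCPVariables']['targetDataset'], "tableB")
    tableC_df = s.loadTableFromBQ(self.spark, config['GCPVariables']['targetDataset'], "tableC")

    # expose them to Spark SQL
    tableB_df.createOrReplaceTempView("tableB")
    tableC_df.createOrReplaceTempView("tableC")

    # the join itself runs in Spark when an action is triggered
    joined_df = self.spark.sql("""
        SELECT b.columnB1, c.columnC2
        FROM tableB b
        JOIN tableC c ON b.id = c.id   -- hypothetical join key
    """)
    joined_df.show(10)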

At times it is more efficient for BigQuery to do the join itself and create
a result set table in BigQuery dataset that you can import into Spark.

Whichever approach you take, there is a solution, and as usual your mileage
may vary, so to speak.

HTH



   view my Linkedin profile




*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Fri, 14 May 2021 at 01:50, bo zhao  wrote:

> Hi Team,
>
> I've followed the Spark community for several years. This is my first time
> asking for help. I hope you can share some experience.
>
> I want to develop a Spark application that processes a SQL script file.
> The data is on BigQuery.
> For example, the sqlscript is:
>
> delete from tableA;
> insert into tableA select b.columnB1, c.columnC2 from tableB b, tableC c;
>
>
> I can parse this file. In my opinion, after parsing the file, the steps
> should be as follows:
>
> #step1: read tableB, tableC into memory(Spark)
> #step2. register views for tableB's dataframe and tableC's dataframe
> #step3. use spark.sql("select b.columnB1, c.columnC2 from tableB b, tableC
> c") to get a new dataframe
> #step4. new dataframe.write().() to tableA using mode of "OVERWRITE"
>
> My question:
> #1 If there are 10 or more tables, do I need to read each table into
> memory, even though Spark is based on in-memory computation?
> #2 Is there a much easier way to deal with my scenario? For example, I just
> define the data source (BigQuery) and parse the SQL script file, and the
> rest is run by Spark.
>
> Please share your experience or idea.
>


Re: Bridging gap between Spark UI and Code

2021-05-24 Thread Mich Talebzadeh
Plus, some operators can be repeated: if a node dies, Spark would need to
rebuild that state again from the RDD lineage.

HTH

Mich


   view my Linkedin profile




*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Mon, 24 May 2021 at 18:22, Wenchen Fan  wrote:

> I believe you can already see each plan change Spark did to your query
> plan in the debug-level logs. I think it's hard to do in the web UI as
> keeping all these historical query plans is expensive.
>
> Mapping the query plan to your application code is nearly impossible, as
> so many optimizations can happen (some operators can be removed, some
> operators can be replaced by different ones, some operators can be added by
> Spark).
>
> On Mon, May 24, 2021 at 10:30 PM Will Raschkowski
>  wrote:
>
>> This would be great.
>>
>>
>>
>> At least for logical nodes, would it be possible to re-use the existing
>> Utils.getCallSite
>> 
>> to populate a field when nodes are created? I suppose most value would come
>> from eventually passing the call-sites along to physical nodes. But maybe
>> just as starting point Spark could display the call-site only with
>> unoptimized logical plans? Users would still get a better sense for how the
>> planā€™s structure relates to their code.
>>
>>
>>
>> *From: *mhawes 
>> *Date: *Friday, 21 May 2021 at 22:36
>> *To: *dev@spark.apache.org 
>> *Subject: *Re: Bridging gap between Spark UI and Code
>>
>>
>>
>> Reviving this thread to ask whether any of the Spark maintainers would
>> consider helping to scope a solution for this. Michal outlines the problem
>> in this thread, but to clarify. The issue is that for very complex spark
>> application where the Logical Plans often span many pages, it is extremely
>> hard to figure out how the stages in the Spark UI/RDD operations link to
>> the
>> Logical Plan that generated them.
>>
>> Now, obviously this is a hard problem to solve given the various
>> optimisations and transformations that go on in between these two stages.
>> However I wanted to raise it as a potential option as I think it would be
>> /extremely/ valuable for Spark users.
>>
>> My two main ideas are either:
>>  - To carry a reference to the original plan around when
>> planning/optimising.
>>  - To maintain a separate mapping for each planning/optimisation step that
>> maps from source to target. I'm thinking along the lines of JavaScript
>> sourcemaps.
>>
>> It would be great to get the opinion of an experienced Spark maintainer on
>> this, given the complexity.
>>
>>
>>
>> --
>> Sent from:
>> http://apache-spark-developers-list.1001551.n3.nabble.com/
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>


Re: [VOTE] SPIP: Catalog API for view metadata

2021-05-24 Thread Ryan Blue
I don't think that it makes sense to discuss a different approach in the PR
rather than in the vote. Let's discuss this now since that's the purpose of
an SPIP.

On Mon, May 24, 2021 at 11:22 AM John Zhuge  wrote:

> Hi everyone, I'd like to start a vote for the ViewCatalog design proposal
> (SPIP).
>
> The proposal is to add a ViewCatalog interface that can be used to load,
> create, alter, and drop views in DataSourceV2.
>
> The full SPIP doc is here:
> https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing
>
> Please vote on the SPIP in the next 72 hours. Once it is approved, I'll
> update the PR for review.
>
> [ ] +1: Accept the proposal as an official SPIP
> [ ] +0
> [ ] -1: I don't think this is a good idea because …
>


-- 
Ryan Blue
Software Engineer
Netflix


Re: [Spark Core]: Adding support for size based partition coalescing

2021-05-24 Thread Tom Graves
So repartition() would look at some other config
(spark.sql.adaptive.advisoryPartitionSizeInBytes) to decide the size to
partition on, then? Does it require AQE? If so, what does a repartition()
call do if AQE is not enabled? This is essentially a new API, so would
repartitionBySize or something similar be less confusing to users who already
use repartition(num_partitions)?
Tom
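
For reference, the configuration knobs under discussion look like this in
practice. This is only a sketch: the advisory size and data volume are
illustrative, it does not answer the open question of what a no-arg
repartition() should do when AQE is off, and since PySpark's repartition()
currently requires an argument, it uses a repartition-by-expression, which
AQE coalesces in the same way (SPARK-32056, linked below).

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("aqe-repartition-demo").getOrCreate()

    # AQE must be on for post-shuffle partition coalescing to kick in
    spark.conf.set("spark.sql.adaptive.enabled", "true")
    spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
    # target size AQE aims for when coalescing shuffle partitions (value is illustrative)
    spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "64m")

    df = spark.range(0, 1_000_000, 1, 400)  # deliberately over-partitioned input

    # repartition by expression: with AQE on, the final partition count is driven
    # by the advisory size rather than spark.sql.shuffle.partitions
    repartitioned = df.repartition("id")

    # run an action; the number of output files (and the Spark UI) shows the
    # post-coalescing partition count
    repartitioned.write.mode("overwrite").parquet("/tmp/aqe_repartition_demo")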
On Monday, May 24, 2021, 12:30:20 PM CDT, Wenchen Fan  
wrote:  
 
Ideally this should be handled by the underlying data source to produce a
reasonably partitioned RDD as the input data. However, if we already have a
poorly partitioned RDD at hand and want to repartition it properly, I think an
extra shuffle is required so that we can know the partition size first.
That said, I think calling `.repartition()` with no args is indeed a good
solution for this problem.
On Sat, May 22, 2021 at 1:12 AM mhawes  wrote:

Adding /another/ update to say that I'm currently planning on using a
recently introduced feature whereby calling `.repartition()` with no args
will cause the dataset to be optimised by AQE. This actually suits our
use-case perfectly!

Example:

sparkSession.conf().set("spark.sql.adaptive.enabled", "true");
Dataset<Long> dataset = sparkSession.range(1, 4, 1, 4).repartition();

assertThat(dataset.rdd().collectPartitions().length).isEqualTo(1); // true


Relevant PRs/Issues:
[SPARK-31220][SQL] repartition obeys initialPartitionNum when
adaptiveExecutionEnabled https://github.com/apache/spark/pull/27986
[SPARK-32056][SQL] Coalesce partitions for repartition by expressions when
AQE is enabled https://github.com/apache/spark/pull/28900
[SPARK-32056][SQL][Follow-up] Coalesce partitions for repartition hint and
sql when AQE is enabled https://github.com/apache/spark/pull/28952



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org


  

[VOTE] SPIP: Catalog API for view metadata

2021-05-24 Thread John Zhuge
Hi everyone, I'd like to start a vote for the ViewCatalog design proposal
(SPIP).

The proposal is to add a ViewCatalog interface that can be used to load,
create, alter, and drop views in DataSourceV2.

The full SPIP doc is here:
https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing

Please vote on the SPIP in the next 72 hours. Once it is approved, I'll
update the PR for review.

[ ] +1: Accept the proposal as an official SPIP
[ ] +0
[ ] -1: I don't think this is a good idea because …


Re: SPIP: Catalog API for view metadata

2021-05-24 Thread John Zhuge
Great! I will start a vote thread.

On Mon, May 24, 2021 at 10:54 AM Wenchen Fan  wrote:

> Yea let's move forward first. We can discuss the caching approach
> and TableViewCatalog approach during the PR review.
>
> On Tue, May 25, 2021 at 1:48 AM John Zhuge  wrote:
>
>> Hi everyone,
>>
>> Is there any more discussion before we start a vote on ViewCatalog? With
>> FunctionCatalog merged, I hope this feature can complete the offerings of
>> catalog plugins in 3.2.
>>
>> Once approved, I will refresh the WIP PR. Implementation details can be
>> ironed out during review.
>>
>> Thanks,
>>
>> On Tue, Nov 10, 2020 at 5:23 PM Ryan Blue 
>> wrote:
>>
>>> An extra RPC call is a concern for the catalog implementation. It is
>>> simple to cache the result of a call to avoid a second one if the catalog
>>> chooses.
>>>
>>> I don't think that an extra RPC that can be easily avoided is a
>>> reasonable justification to add caches in Spark. For one thing, it doesn't
>>> solve the problem because the proposed API still requires separate lookups
>>> for tables and views.
>>>
>>> The only solution that would help is to use a combined trait, but that
>>> has issues. For one, view substitution is much cleaner when it happens well
>>> before table resolution. And, View and Table are very different objects;
>>> returning Object from this API doesn't make much sense.
>>>
>>> One extra RPC is not unreasonable, and the choice should be left to
>>> sources. That's the easiest place to cache results from the underlying
>>> store.
>>>
>>> On Mon, Nov 9, 2020 at 8:18 PM Wenchen Fan  wrote:
>>>
 Moving back the discussion to this thread. The current argument is how
 to avoid extra RPC calls for catalogs supporting both table and view. There
 are several options:
 1. ignore it as extra RPC calls are cheap compared to the query
 execution
 2. have a per session cache for loaded table/view
 3. have a per query cache for loaded table/view
 4. add a new trait TableViewCatalog

 I think it's important to avoid perf regression with new APIs. RPC
 calls can be significant for short queries. We may also double the RPC
 traffic which is bad for the metastore service. Normally I would not
 recommend caching as cache invalidation is a hard problem. Personally I
 prefer option 4 as it only affects catalogs that support both table and
 view, and it fits the hive catalog very well.

 On Fri, Sep 4, 2020 at 4:21 PM John Zhuge  wrote:

> SPIP
> 
> has been updated. Please review.
>
> On Thu, Sep 3, 2020 at 9:22 AM John Zhuge  wrote:
>
>> Wenchen, sorry for the delay, I will post an update shortly.
>>
>> On Thu, Sep 3, 2020 at 2:00 AM Wenchen Fan 
>> wrote:
>>
>>> Any updates here? I agree that a new View API is better, but we need
>>> a solution to avoid performance regression. We need to elaborate on the
>>> cache idea.
>>>
>>> On Thu, Aug 20, 2020 at 7:43 AM Ryan Blue  wrote:
>>>
 I think it is a good idea to keep tables and views separate.

 The main two arguments I've heard for combining lookup into a
 single function are the ones brought up in this thread. First, an
 identifier in a catalog must be either a view or a table and should not
 collide. Second, a single lookup is more likely to require a single 
 RPC. I
 think the RPC concern is well addressed by caching, which we already 
 do in
 the Spark catalog, so I'll primarily focus on the first.

 Table/view name collision is unlikely to be a problem. Metastores
 that support both today store them in a single namespace, so this is 
 not a
 concern for even a naive implementation that talks to the Hive 
 MetaStore. I
 know that a new metastore catalog could choose to implement both
 ViewCatalog and TableCatalog and store the two sets separately, but 
 that
 would be a very strange choice: if the metastore itself has different
 namespaces for tables and views, then it makes much more sense to 
 expose
 them through separate catalogs because Spark will always prefer one 
 over
 the other.

 In a similar line of reasoning, catalogs that expose both views and
 tables are much more rare than catalogs that only expose one. For 
 example,
 v2 catalogs for JDBC and Cassandra expose data through the Table 
 interface
 and implementing ViewCatalog would make little sense. Exposing new data
 sources to Spark requires TableCatalog, not ViewCatalog. View catalogs 
 are
 likely to be the same. Say I have a way to convert Pig statements or 
 some
 other representation into a SQL view. It

Re: SPIP: Catalog API for view metadata

2021-05-24 Thread Wenchen Fan
Yea let's move forward first. We can discuss the caching approach
and TableViewCatalog approach during the PR review.

On Tue, May 25, 2021 at 1:48 AM John Zhuge  wrote:

> Hi everyone,
>
> Is there any more discussion before we start a vote on ViewCatalog? With
> FunctionCatalog merged, I hope this feature can complete the offerings of
> catalog plugins in 3.2.
>
> Once approved, I will refresh the WIP PR. Implementation details can be
> ironed out during review.
>
> Thanks,
>
> On Tue, Nov 10, 2020 at 5:23 PM Ryan Blue 
> wrote:
>
>> An extra RPC call is a concern for the catalog implementation. It is
>> simple to cache the result of a call to avoid a second one if the catalog
>> chooses.
>>
>> I don't think that an extra RPC that can be easily avoided is a
>> reasonable justification to add caches in Spark. For one thing, it doesn't
>> solve the problem because the proposed API still requires separate lookups
>> for tables and views.
>>
>> The only solution that would help is to use a combined trait, but that
>> has issues. For one, view substitution is much cleaner when it happens well
>> before table resolution. And, View and Table are very different objects;
>> returning Object from this API doesn't make much sense.
>>
>> One extra RPC is not unreasonable, and the choice should be left to
>> sources. That's the easiest place to cache results from the underlying
>> store.
>>
>> On Mon, Nov 9, 2020 at 8:18 PM Wenchen Fan  wrote:
>>
>>> Moving back the discussion to this thread. The current argument is how
>>> to avoid extra RPC calls for catalogs supporting both table and view. There
>>> are several options:
>>> 1. ignore it as extra RPC calls are cheap compared to the query execution
>>> 2. have a per session cache for loaded table/view
>>> 3. have a per query cache for loaded table/view
>>> 4. add a new trait TableViewCatalog
>>>
>>> I think it's important to avoid perf regression with new APIs. RPC calls
>>> can be significant for short queries. We may also double the RPC
>>> traffic which is bad for the metastore service. Normally I would not
>>> recommend caching as cache invalidation is a hard problem. Personally I
>>> prefer option 4 as it only affects catalogs that support both table and
>>> view, and it fits the hive catalog very well.
>>>
>>> On Fri, Sep 4, 2020 at 4:21 PM John Zhuge  wrote:
>>>
 SPIP
 
 has been updated. Please review.

 On Thu, Sep 3, 2020 at 9:22 AM John Zhuge  wrote:

> Wenchen, sorry for the delay, I will post an update shortly.
>
> On Thu, Sep 3, 2020 at 2:00 AM Wenchen Fan 
> wrote:
>
>> Any updates here? I agree that a new View API is better, but we need
>> a solution to avoid performance regression. We need to elaborate on the
>> cache idea.
>>
>> On Thu, Aug 20, 2020 at 7:43 AM Ryan Blue  wrote:
>>
>>> I think it is a good idea to keep tables and views separate.
>>>
>>> The main two arguments I've heard for combining lookup into a single
>>> function are the ones brought up in this thread. First, an identifier 
>>> in a
>>> catalog must be either a view or a table and should not collide. 
>>> Second, a
>>> single lookup is more likely to require a single RPC. I think the RPC
>>> concern is well addressed by caching, which we already do in the Spark
>>> catalog, so I'll primarily focus on the first.
>>>
>>> Table/view name collision is unlikely to be a problem. Metastores
>>> that support both today store them in a single namespace, so this is 
>>> not a
>>> concern for even a naive implementation that talks to the Hive 
>>> MetaStore. I
>>> know that a new metastore catalog could choose to implement both
>>> ViewCatalog and TableCatalog and store the two sets separately, but that
>>> would be a very strange choice: if the metastore itself has different
>>> namespaces for tables and views, then it makes much more sense to expose
>>> them through separate catalogs because Spark will always prefer one over
>>> the other.
>>>
>>> In a similar line of reasoning, catalogs that expose both views and
>>> tables are much more rare than catalogs that only expose one. For 
>>> example,
>>> v2 catalogs for JDBC and Cassandra expose data through the Table 
>>> interface
>>> and implementing ViewCatalog would make little sense. Exposing new data
>>> sources to Spark requires TableCatalog, not ViewCatalog. View catalogs 
>>> are
>>> likely to be the same. Say I have a way to convert Pig statements or 
>>> some
>>> other representation into a SQL view. It would make little sense to 
>>> combine
>>> that with some other TableCatalog.
>>>
>>> I also don't think there is benefit from an API perspective to
>>> justify combining the Table and View interfaces. The two sha

Re: SPIP: Catalog API for view metadata

2021-05-24 Thread John Zhuge
Hi everyone,

Is there any more discussion before we start a vote on ViewCatalog? With
FunctionCatalog merged, I hope this feature can complete the offerings of
catalog plugins in 3.2.

Once approved, I will refresh the WIP PR. Implementation details can be
ironed out during review.

Thanks,

On Tue, Nov 10, 2020 at 5:23 PM Ryan Blue  wrote:

> An extra RPC call is a concern for the catalog implementation. It is
> simple to cache the result of a call to avoid a second one if the catalog
> chooses.
>
> I don't think that an extra RPC that can be easily avoided is a reasonable
> justification to add caches in Spark. For one thing, it doesn't solve the
> problem because the proposed API still requires separate lookups for tables
> and views.
>
> The only solution that would help is to use a combined trait, but that has
> issues. For one, view substitution is much cleaner when it happens well
> before table resolution. And, View and Table are very different objects;
> returning Object from this API doesn't make much sense.
>
> One extra RPC is not unreasonable, and the choice should be left to
> sources. That's the easiest place to cache results from the underlying
> store.
>
> On Mon, Nov 9, 2020 at 8:18 PM Wenchen Fan  wrote:
>
>> Moving back the discussion to this thread. The current argument is how to
>> avoid extra RPC calls for catalogs supporting both table and view. There
>> are several options:
>> 1. ignore it as extra RPC calls are cheap compared to the query execution
>> 2. have a per session cache for loaded table/view
>> 3. have a per query cache for loaded table/view
>> 4. add a new trait TableViewCatalog
>>
>> I think it's important to avoid perf regression with new APIs. RPC calls
>> can be significant for short queries. We may also double the RPC
>> traffic which is bad for the metastore service. Normally I would not
>> recommend caching as cache invalidation is a hard problem. Personally I
>> prefer option 4 as it only affects catalogs that support both table and
>> view, and it fits the hive catalog very well.
>>
>> On Fri, Sep 4, 2020 at 4:21 PM John Zhuge  wrote:
>>
>>> SPIP
>>> 
>>> has been updated. Please review.
>>>
>>> On Thu, Sep 3, 2020 at 9:22 AM John Zhuge  wrote:
>>>
 Wenchen, sorry for the delay, I will post an update shortly.

 On Thu, Sep 3, 2020 at 2:00 AM Wenchen Fan  wrote:

> Any updates here? I agree that a new View API is better, but we need a
> solution to avoid performance regression. We need to elaborate on the 
> cache
> idea.
>
> On Thu, Aug 20, 2020 at 7:43 AM Ryan Blue  wrote:
>
>> I think it is a good idea to keep tables and views separate.
>>
>> The main two arguments I've heard for combining lookup into a single
>> function are the ones brought up in this thread. First, an identifier in 
>> a
>> catalog must be either a view or a table and should not collide. Second, 
>> a
>> single lookup is more likely to require a single RPC. I think the RPC
>> concern is well addressed by caching, which we already do in the Spark
>> catalog, so I'll primarily focus on the first.
>>
>> Table/view name collision is unlikely to be a problem. Metastores
>> that support both today store them in a single namespace, so this is not 
>> a
>> concern for even a naive implementation that talks to the Hive 
>> MetaStore. I
>> know that a new metastore catalog could choose to implement both
>> ViewCatalog and TableCatalog and store the two sets separately, but that
>> would be a very strange choice: if the metastore itself has different
>> namespaces for tables and views, then it makes much more sense to expose
>> them through separate catalogs because Spark will always prefer one over
>> the other.
>>
>> In a similar line of reasoning, catalogs that expose both views and
>> tables are much more rare than catalogs that only expose one. For 
>> example,
>> v2 catalogs for JDBC and Cassandra expose data through the Table 
>> interface
>> and implementing ViewCatalog would make little sense. Exposing new data
>> sources to Spark requires TableCatalog, not ViewCatalog. View catalogs 
>> are
>> likely to be the same. Say I have a way to convert Pig statements or some
>> other representation into a SQL view. It would make little sense to 
>> combine
>> that with some other TableCatalog.
>>
>> I also donā€™t think there is benefit from an API perspective to
>> justify combining the Table and View interfaces. The two share only 
>> schema
>> and properties, and are handled very differently internally ā€” a Viewā€™s 
>> SQL
>> query is parsed and substituted into the plan, while a Table is wrapped 
>> in
>> a relation that eventually becomes a Scan node using SupportsRead. A 
>> viewā€™s

Re: About Spark executs sqlscript

2021-05-24 Thread Wenchen Fan
It's not possible to load everything into memory. We should use a BigQuery
connector (one should already exist) and register tables B and C as temp
views in Spark, e.g. as sketched below.
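
A minimal sketch of that approach, assuming the open-source spark-bigquery
connector is on the classpath; the dataset/table names, staging bucket, and
credentials setup are illustrative only and omitted or hypothetical here:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("sqlscript-on-bigquery").getOrCreate()

    def register_bq_view(dataset, table):
        """Register a BigQuery table as a Spark temp view (lazy; nothing is read yet)."""
        spark.read.format("bigquery").load(f"{dataset}.{table}").createOrReplaceTempView(table)

    for table in ["tableB", "tableC"]:
        register_bq_view("my_dataset", table)

    # the SELECT parsed out of the SQL script runs in Spark
    result = spark.sql("SELECT b.columnB1, c.columnC2 FROM tableB b, tableC c")

    # overwrite tableA in BigQuery, which replaces the script's DELETE + INSERT
    (result.write.format("bigquery")
        .option("temporaryGcsBucket", "my-staging-bucket")
        .mode("overwrite")
        .save("my_dataset.tableA"))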

On Fri, May 14, 2021 at 8:50 AM bo zhao  wrote:

> Hi Team,
>
> I've followed the Spark community for several years. This is my first time
> asking for help. I hope you can share some experience.
>
> I want to develop a Spark application that processes a SQL script file.
> The data is on BigQuery.
> For example, the sqlscript is:
>
> delete from tableA;
> insert into tableA select b.columnB1, c.columnC2 from tableB b, tableC c;
>
>
> I can parse this file. In my opinion, after parsing the file, the steps
> should be as follows:
>
> #step1: read tableB, tableC into memory(Spark)
> #step2. register views for tableB's dataframe and tableC's dataframe
> #step3. use spark.sql("select b.columnB1, c.columnC2 from tableB b, tableC
> c") to get a new dataframe
> #step4. new dataframe.write().() to tableA using mode of "OVERWRITE"
>
> My question:
> #1 If there are 10 or more tables, do I need to read each table into
> memory, even though Spark is based on in-memory computation?
> #2 Is there a much easier way to deal with my scenario? For example, I just
> define the data source (BigQuery) and parse the SQL script file, and the
> rest is run by Spark.
>
> Please share your experience or idea.
>


Re: Purpose of OffsetHolder as a LeafNode?

2021-05-24 Thread Wenchen Fan
It's just an intermediate placeholder used to update the query plan in each
micro-batch.

On Sat, May 15, 2021 at 10:29 PM Jacek Laskowski  wrote:

> Hi,
>
> Just stumbled upon OffsetHolder [1] and am curious why it's a LeafNode?
> What logical plan could it be part of?
>
> [1]
> https://github.com/apache/spark/blob/1a6708918b32e821bff26a00d2d8b7236b29515f/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala#L633
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://about.me/JacekLaskowski
> "The Internals Of" Online Books 
> Follow me on https://twitter.com/jaceklaskowski
>
> 
>


Re: Secrets store for DSv2

2021-05-24 Thread Wenchen Fan
You can take a look at PartitionReaderFactory.

It's created at the driver side, serialized and sent to the executor side.
For the write side, there is a similar channel: DataWriterFactory
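
The DSv2 factories mentioned above are JVM-side APIs, but the underlying
mechanism is just ordinary task serialization: whatever fields the factory
object holds on the driver travel with it to the executors. A rough PySpark
analogue of that pattern (explicitly not the DSv2 API itself; all names and
values are illustrative, and a real token would come from a secret manager):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("secret-shipping-demo").getOrCreate()
    sc = spark.sparkContext

    class ReaderConfig:
        """Plain serializable holder created on the driver and shipped inside task
        closures, much like a PartitionReaderFactory carrying credentials as fields."""
        def __init__(self, endpoint, token):
            self.endpoint = endpoint
            self.token = token

    config = ReaderConfig("https://example.invalid/api", "s3cr3t-token")

    def read_partition(partition_id):
        # runs on an executor; `config` was serialized along with the closure
        return [f"partition {partition_id} would read {config.endpoint} "
                f"with a token of length {len(config.token)}"]

    print(sc.parallelize(range(4), 4).flatMap(read_partition).collect())

Note that anything shipped this way ends up in serialized task binaries, so it
should be treated as sensitive data in transit and at rest.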

On Wed, May 19, 2021 at 4:37 AM Andrew Melo  wrote:

> Hello,
>
> When implementing a DSv2 datasource, where is an appropriate place to
> store/transmit secrets from the driver to the executors? Is there
> built-in spark functionality for that, or is my best bet to stash it
> as a member variable in one of the classes that gets sent to the
> executors?
>
> Thanks!
> Andrew
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [Spark Core]: Adding support for size based partition coalescing

2021-05-24 Thread Wenchen Fan
Ideally this should be handled by the underlying data source to produce a
reasonably partitioned RDD as the input data. However if we already have a
poorly partitioned RDD at hand and want to repartition it properly, I think
an extra shuffle is required so that we can know the partition size first.

That said, I think calling `.repartition()` with no args is indeed a good
solution for this problem.

On Sat, May 22, 2021 at 1:12 AM mhawes  wrote:

> Adding /another/ update to say that I'm currently planning on using a
> recently introduced feature whereby calling `.repartition()` with no args
> will cause the dataset to be optimised by AQE. This actually suits our
> use-case perfectly!
>
> Example:
>
> sparkSession.conf().set("spark.sql.adaptive.enabled", "true");
> Dataset<Long> dataset = sparkSession.range(1, 4, 1, 4).repartition();
>
> assertThat(dataset.rdd().collectPartitions().length).isEqualTo(1); // true
>
>
> Relevant PRs/Issues:
> [SPARK-31220][SQL] repartition obeys initialPartitionNum when
> adaptiveExecutionEnabled https://github.com/apache/spark/pull/27986
> [SPARK-32056][SQL] Coalesce partitions for repartition by expressions when
> AQE is enabled https://github.com/apache/spark/pull/28900
> [SPARK-32056][SQL][Follow-up] Coalesce partitions for repartition hint and
> sql when AQE is enabled https://github.com/apache/spark/pull/28952
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Bridging gap between Spark UI and Code

2021-05-24 Thread Wenchen Fan
I believe you can already see each plan change Spark did to your query plan
in the debug-level logs. I think it's hard to do in the web UI as keeping
all these historical query plans is expensive.
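
For reference, one way to surface those plan-change logs without lowering the
global log level is the spark.sql.planChangeLog.* confs. This is a sketch and
assumes a Spark version recent enough to have them (check your release); the
query and the rule name in the comment are illustrative.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("plan-change-log-demo").getOrCreate()

    # log every rule that changed the plan at WARN, so it shows up with default log4j settings
    spark.conf.set("spark.sql.planChangeLog.level", "warn")
    # optionally restrict the logging to specific rules (comma-separated rule names), e.g.:
    # spark.conf.set("spark.sql.planChangeLog.rules",
    #                "org.apache.spark.sql.catalyst.optimizer.PushDownPredicates")

    df = (spark.range(1000)
          .selectExpr("id", "id % 10 AS k")
          .filter("k = 3")
          .groupBy("k").count())
    df.collect()  # executing the query triggers analysis/optimization and the logging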

Mapping the query plan to your application code is nearly impossible, as so
many optimizations can happen (some operators can be removed, some
operators can be replaced by different ones, some operators can be added by
Spark).

On Mon, May 24, 2021 at 10:30 PM Will Raschkowski
 wrote:

> This would be great.
>
>
>
> At least for logical nodes, would it be possible to re-use the existing
> Utils.getCallSite
> 
> to populate a field when nodes are created? I suppose most value would come
> from eventually passing the call-sites along to physical nodes. But maybe
> just as starting point Spark could display the call-site only with
> unoptimized logical plans? Users would still get a better sense for how the
> plan's structure relates to their code.
>
>
>
> *From: *mhawes 
> *Date: *Friday, 21 May 2021 at 22:36
> *To: *dev@spark.apache.org 
> *Subject: *Re: Bridging gap between Spark UI and Code
>
>
>
> Reviving this thread to ask whether any of the Spark maintainers would
> consider helping to scope a solution for this. Michal outlines the problem
> in this thread, but to clarify. The issue is that for very complex spark
> application where the Logical Plans often span many pages, it is extremely
> hard to figure out how the stages in the Spark UI/RDD operations link to
> the
> Logical Plan that generated them.
>
> Now, obviously this is a hard problem to solve given the various
> optimisations and transformations that go on in between these two stages.
> However I wanted to raise it as a potential option as I think it would be
> /extremely/ valuable for Spark users.
>
> My two main ideas are either:
>  - To carry a reference to the original plan around when
> planning/optimising.
>  - To maintain a separate mapping for each planning/optimisation step that
> maps from source to target. I'm thinking along the lines of JavaScript
> sourcemaps.
>
> It would be great to get the opinion of an experienced Spark maintainer on
> this, given the complexity.
>
>
>
> --
> Sent from:
> http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>


Re: Bridging gap between Spark UI and Code

2021-05-24 Thread Will Raschkowski
This would be great.

At least for logical nodes, would it be possible to re-use the existing 
Utils.getCallSite
 to populate a field when nodes are created? I suppose most value would come 
from eventually passing the call-sites along to physical nodes. But maybe just 
as starting point Spark could display the call-site only with unoptimized 
logical plans? Users would still get a better sense for how the plan's
structure relates to their code.
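
Until something like that exists, a workable stop-gap at the application level
is to tag jobs yourself, so the Spark UI's Jobs/Stages pages at least carry a
human-readable pointer back to the code path. A sketch (the description
strings and the toy query are whatever makes sense in your codebase):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("ui-call-site-tags").getOrCreate()
    sc = spark.sparkContext

    def run_tagged(description, action):
        """Run an action with a job description that shows up in the Spark UI."""
        sc.setJobDescription(description)
        try:
            return action()
        finally:
            sc.setJobDescription(None)

    orders = spark.range(1_000_000).selectExpr("id", "id % 100 AS customer_id")
    per_customer = orders.groupBy("customer_id").count()

    # the jobs triggered by this action are labelled in the UI's Jobs and Stages pages
    result = run_tagged("aggregate orders per customer (pipeline step 3)",
                        lambda: per_customer.collect())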

From: mhawes 
Date: Friday, 21 May 2021 at 22:36
To: dev@spark.apache.org 
Subject: Re: Bridging gap between Spark UI and Code


Reviving this thread to ask whether any of the Spark maintainers would
consider helping to scope a solution for this. Michal outlines the problem
in this thread, but to clarify. The issue is that for very complex spark
application where the Logical Plans often span many pages, it is extremely
hard to figure out how the stages in the Spark UI/RDD operations link to the
Logical Plan that generated them.

Now, obviously this is a hard problem to solve given the various
optimisations and transformations that go on in between these two stages.
However I wanted to raise it as a potential option as I think it would be
/extremely/ valuable for Spark users.

My two main ideas are either:
 - To carry a reference to the original plan around when
planning/optimising.
 - To maintain a separate mapping for each planning/optimisation step that
maps from source to target. I'm thinking along the lines of JavaScript
sourcemaps.

It would be great to get the opinion of an experienced Spark maintainer on
this, given the complexity.



--
Sent from: 
http://apache-spark-developers-list.1001551.n3.nabble.com/

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org