Re: [ANNOUNCE] Apache Spark 3.0.3 released

2021-06-25 Thread L. C. Hsieh
Thanks Yi for the work!

On 2021/06/25 05:51:38, Yi Wu  wrote: 
> We are happy to announce the availability of Spark 3.0.3!
> 
> Spark 3.0.3 is a maintenance release containing stability fixes. This
> release is based on the branch-3.0 maintenance branch of Spark. We strongly
> recommend that all 3.0 users upgrade to this stable release.
> 
> To download Spark 3.0.3, head over to the download page:
> https://spark.apache.org/downloads.html
> 
> To view the release notes:
> https://spark.apache.org/releases/spark-release-3-0-3.html
> 
> We would like to acknowledge all community members for contributing to this
> release. This release would not have been possible without you.
> 
> Yi
> 

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [DISCUSS] Rename hadoop-3.2/hadoop-2.7 profile to hadoop-3/hadoop-2?

2021-06-25 Thread Chao Sun
Thanks all for the feedback! Yes I agree that we should target this for
Apache Spark 3.3 release. I'll put this aside for now and pick it up again
after the 3.2 release is finished.

> And maybe the current naming leaves the possibility for a "hadoop-3.5" or
something if that needed to be different.

Yes, that's a good point, although I was under the impression that the
Spark community aims to only support a single Hadoop 3.x profile, in which
case we won't have `hadoop-3` and `hadoop-3.5` in parallel.

Chao


On Thu, Jun 24, 2021 at 10:25 PM Gengliang Wang  wrote:

> +1 for targeting the renaming for Apache Spark 3.3 at the current phase.
>
> On Fri, Jun 25, 2021 at 6:55 AM DB Tsai  wrote:
>
>> +1 on renaming.
>>
>> DB Tsai  |  https://www.dbtsai.com/  |  PGP 42E5B25A8F7A82C1
>>
>> On Jun 24, 2021, at 11:41 AM, Chao Sun  wrote:
>>
>> Hi,
>>
>> As Spark master has upgraded to Hadoop-3.3.1, the current Maven profile
>> name hadoop-3.2 is no longer accurate, and it may confuse Spark users when
>> they realize the actual version is not Hadoop 3.2.x. Therefore, I created
>> https://issues.apache.org/jira/browse/SPARK-33880 to change the profile
>> names to hadoop-3 and hadoop-2, respectively. What do you think? Is this
>> something worth doing as part of Spark 3.2.0 release?
>>
>> Best,
>> Chao
>>
>>
>>


Re: [DISCUSS] SPIP: Row-level operations in Data Source V2

2021-06-25 Thread huaxin gao
I took a quick look at the PR and it looks like a great feature to have. It
provides unified APIs for data sources to perform the commonly used
operations easily and efficiently, so users don't have to implement
custom extensions on their own. Thanks Anton for the work!
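
For reference, a minimal PySpark sketch of the kind of row-level statements
in scope (the catalog, table and column names below are hypothetical, and
today these commands fail at execution time unless the data source ships its
own extensions, which is exactly what the SPIP addresses):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Row-level commands that Spark can already parse for v2 tables
spark.sql("DELETE FROM my_catalog.db.target WHERE id < 100")

spark.sql("UPDATE my_catalog.db.target SET status = 'inactive' WHERE id < 100")

spark.sql("""
    MERGE INTO my_catalog.db.target AS t
    USING my_catalog.db.updates AS u
    ON t.id = u.id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")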

On Thu, Jun 24, 2021 at 9:42 PM L. C. Hsieh  wrote:

> Thanks Anton. I volunteered to be the shepherd of the SPIP. This is also
> my first time shepherding a SPIP, so please let me know if there is
> anything I can improve.
>
> This looks like a great set of features, and the rationale given in the
> proposal makes sense. These operations are getting more common and more
> important in big data workloads. Instead of individual data sources
> building custom extensions, it makes more sense to support the API from
> Spark.
>
> Please provide your thoughts about the proposal and the design. Appreciate
> your feedback. Thank you!
>
> On 2021/06/24 23:53:32, Anton Okolnychyi  wrote:
> > Hey everyone,
> >
> > I'd like to start a discussion on adding support for executing row-level
> > operations such as DELETE, UPDATE, MERGE for v2 tables (SPARK-35801). The
> > execution should be the same across data sources and the best way to do
> > that is to implement it in Spark.
> >
> > Right now, Spark can only parse and to some extent analyze DELETE,
> UPDATE,
> > MERGE commands. Data sources that support row-level changes have to build
> > custom Spark extensions to execute such statements. The goal of this
> effort
> > is to come up with a flexible and easy-to-use API that will work across
> > data sources.
> >
> > Design doc:
> >
> https://docs.google.com/document/d/12Ywmc47j3l2WF4anG5vL4qlrhT2OKigb7_EbIKhxg60/
> >
> > PR for handling DELETE statements:
> > https://github.com/apache/spark/pull/33008
> >
> > Any feedback is more than welcome.
> >
> > Liang-Chi was kind enough to shepherd this effort. Thanks!
> >
> > - Anton
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [ANNOUNCE] Apache Spark 3.0.3 released

2021-06-25 Thread Dongjoon Hyun
Thank you, Yi!



On Thu, Jun 24, 2021 at 10:52 PM Yi Wu  wrote:

> We are happy to announce the availability of Spark 3.0.3!
>
> Spark 3.0.3 is a maintenance release containing stability fixes. This
> release is based on the branch-3.0 maintenance branch of Spark. We strongly
> recommend that all 3.0 users upgrade to this stable release.
>
> To download Spark 3.0.3, head over to the download page:
> https://spark.apache.org/downloads.html
>
> To view the release notes:
> https://spark.apache.org/releases/spark-release-3-0-3.html
>
> We would like to acknowledge all community members for contributing to this
> release. This release would not have been possible without you.
>
> Yi
>
>


Fail to run benchmark in Github Action

2021-06-25 Thread Kevin Su
Hi all,

I tried to run a benchmark test via GitHub Actions in my fork, and I hit the
error below:
https://github.com/pingsutw/spark/runs/2867617238?check_suite_focus=true
java.lang.AssertionError: assertion failed: spark.test.home is not set!
    at scala.Predef$.assert(Predef.scala:223)
    at org.apache.spark.deploy.worker.Worker.<init>(Worker.scala:148)
    at org.apache.spark.deploy.worker.Worker$.startRpcEnvAndEndpoint(Worker.scala:954)
    at org.apache.spark.deploy.LocalSparkCluster.$anonfun$start$2(LocalSparkCluster.scala:68)
    at org.apache.spark.deploy.LocalSparkCluster.$anonfun$start$2$adapted(LocalSparkCluster.scala:65)
    at scala.collection.immutable.Range.foreach(Range.scala:158)

After I added --driver-java-options "-Dspark.test.home=$GITHUB_WORKSPACE" \
to benchmark.yml, I still got the error below:
https://github.com/pingsutw/spark/runs/2911027350?check_suite_focus=true

Do I need to set something up in my fork?
after 1900, vec on, rebase EXCEPTION      7474   7511   58  13.4   74.7  2.7X
after 1900, vec on, rebase LEGACY         9228   9296   60  10.8   92.3  2.2X
after 1900, vec on, rebase CORRECTED      7553   7678  128  13.2   75.5  2.7X
before 1900, vec off, rebase LEGACY      23280  23362   71   4.3  232.8  0.9X
before 1900, vec off, rebase CORRECTED   20548  20630  119   4.9  205.5  1.0X
before 1900, vec on, rebase LEGACY       12210  12239   37   8.2  122.1  1.7X
before 1900, vec on, rebase CORRECTED     7486   7489    2  13.4   74.9  2.7X

Running benchmark: Save TIMESTAMP_MICROS to parquet
  Running case: after 1900, noop
  Stopped after 1 iterations, 4003 ms
  Running case: before 1900, noop
  Stopped after 1 iterations, 3965 ms
  Running case: after 1900, rebase EXCEPTION
  Stopped after 1 iterations, 18339 ms
  Running case: after 1900, rebase LEGACY
  Stopped after 1 iterations, 18375 ms
  Running case: after 1900, rebase CORRECTED
  Stopped after 1 iterations, 18716 ms
  Running case: before 1900, rebase LEGACY
Error: The operation was canceled.


Re: Spark on Kubernetes scheduler variety

2021-06-25 Thread Yikun Jiang
Oops, sorry for the error link, it should be:

We are also preparing to propose an initial design and POC[3] on a shared
branch (based on the Spark master branch) where we can collaborate, so I
created the spark-volcano[1] org on GitHub to make it happen.

[3]
https://github.com/huawei-cloudnative/spark/commit/6c1f37525f026353eaead34216d47dad653f13a4


Regards,
Yikun


Yikun Jiang wrote on Fri, Jun 25, 2021 at 11:53 AM:

> Hi, folks.
>
> As @Klaus mentioned, we have some work on Spark on K8s with Volcano native
> support. There has also been some production deployment validation from
> our partners in China, such as JingDong, XiaoHongShu and VIPShop.
>
> We are also preparing to propose an initial design and POC[3] on a shared
> branch (based on the Spark master branch) where we can collaborate, so I
> created the spark-volcano[1] org on GitHub to make it happen.
>
> Please feel free to comment on it [2] if you have any questions or
> concerns.
>
> [1] https://github.com/spark-volcano
> [2] https://github.com/spark-volcano/spark/issues/1
> [3]
> https://github.com/huawei-cloudnative/spark/commit/6c1f37525f026353eaead34216d47dad653f13a4
>
>


> Regards,
> Yikun
>
> Holden Karau wrote on Fri, Jun 25, 2021 at 12:00 AM:
>
>> Hi Mich,
>>
>> I certainly think making Spark on Kubernetes run well is going to be a
>> challenge. However I think, and I could be wrong about this as well, that
>> in terms of cluster managers Kubernetes is likely to be our future. Talking
>> with people I don't hear about new standalone, YARN or mesos deployments of
>> Spark, but I do hear about people trying to migrate to Kubernetes.
>>
>> To be clear, I certainly agree that we need more work on structured
>> streaming, but it's important to remember that the Spark developers are not
>> all fully interchangeable; we work on the things that we're interested in
>> pursuing, so even if structured streaming needs more love, if I'm not super
>> interested in structured streaming I'm less likely to work on it. That
>> being said, I am certainly spinning up a bit more in the Spark SQL area,
>> especially around our data sources/connectors, because I can see the need
>> there too.
>>
>> On Wed, Jun 23, 2021 at 8:26 AM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>>
>>>
>>> Please allow me to diverge and express a different point of view on
>>> this roadmap.
>>>
>>>
>>> I believe that, from a technical point of view, spending time, effort and
>>> talent on batch scheduling on Kubernetes could be rewarding. However, if I
>>> may say so, I doubt whether such an approach, and the so-called
>>> democratization of Spark on whatever platform, really should be the main focus.
>>>
>>> Having worked on Google Dataproc (a fully managed and highly scalable
>>> service for running Apache Spark, Hadoop and, more recently, other
>>> artefacts) for the past two years, and on Spark on Kubernetes on-premise, I
>>> have come to the conclusion that Spark is not a beast that one can fully
>>> commoditize, much like one can do with ZooKeeper, Kafka etc. There is
>>> always a struggle to make some niche areas of Spark, like Spark Structured
>>> Streaming (SSS), work seamlessly and effortlessly on these commercial
>>> whatever-as-a-Service platforms.
>>>
>>>
>>> Moreover, Spark (and I stand to be corrected) already has a lot of
>>> resiliency and redundancy built in from the ground up. It is truly an
>>> enterprise-class product (requiring enterprise-class support) that will be
>>> difficult to commoditize with Kubernetes while expecting the same
>>> performance. After all, Kubernetes is aimed at efficient resource sharing
>>> and potential cost savings for the mass market. In short, I can see
>>> commercial enterprises working on these platforms, but maybe the great
>>> talent on the dev team should focus on things like the perceived limitation
>>> of SSS in dealing with chains of aggregation (if I am correct, it is not
>>> yet supported on streaming datasets).
>>>
>>>
>>> These are my opinions and they are not facts, just opinions so to speak
>>> :)
>>>
>>>
>>>view my Linkedin profile
>>> 
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Fri, 18 Jun 2021 at 23:18, Holden Karau  wrote:
>>>
 I think these approaches are good, but there are limitations (eg
 dynamic scaling) without us making changes inside of the Spark Kube
 scheduler.

 Certainly whichever scheduler extensions we add support for we should
 collaborate with the people developing those extensions insofar as they are
 interested. My first place that I checked was #sig-scheduling 

Lift the limitation of Spark JDBC handling of individual rows with DML

2021-06-25 Thread Mich Talebzadeh
*Challenge*

Insert data from a Spark dataframe when one or more columns in the Oracle
table rely on derived columns that depend on data in one or more dataframe
columns.

Standard JDBC from Spark to Oracle does a batch insert of the dataframe into
Oracle, *so it cannot handle these derived columns*. Refer below:

dataFrame. \
    write. \
    format("jdbc"). \
    option("url", url of Oracle). \
    option("dbtable", schema.tableName). \
    option("user", user). \
    option("password", password). \
    option("driver", Oracle driver). \
    mode(mode). \
    save()

This writes the whole content of the dataframe to the Oracle table. We cannot
replace schema.tableName with an INSERT statement.
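
One common workaround, before turning to the cursor based solution proposed
below, is to land the dataframe unchanged in a staging table with the standard
JDBC writer and then compute the derived column in a single set-based
INSERT ... SELECT. A minimal sketch (the connection values and staging table
name are placeholders, not part of the original example):

import cx_Oracle

# Placeholder connection details -- substitute real values
oracle_url = "jdbc:oracle:thin:@//oracle_host:1521/service_name"
user = "scratchpad_user"
password = "scratchpad_password"

# 1) Bulk-load the dataframe as-is into a staging table
df.write. \
    format("jdbc"). \
    option("url", oracle_url). \
    option("dbtable", "SCRATCHPAD.RANDOMDATA_STG"). \
    option("user", user). \
    option("password", password). \
    option("driver", "oracle.jdbc.OracleDriver"). \
    mode("overwrite"). \
    save()

# 2) Populate the target table, deriving derived_col = cos(id) in one statement
dsn = cx_Oracle.makedsn("oracle_host", 1521, service_name="service_name")
conn = cx_Oracle.connect(user, password, dsn)
cursor = conn.cursor()
cursor.execute("""
    insert into SCRATCHPAD.RANDOMDATA
      (id, clustered, scattered, randomised, random_string, small_vc, padding, derived_col)
    select id, clustered, scattered, randomised, random_string, small_vc, padding, cos(id)
    from SCRATCHPAD.RANDOMDATA_STG""")
conn.commit()
conn.close()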

*Possible solution*


   1. We need a cursor-based solution: create a cursor from the Spark
   dataframe so we can walk through every row and get the value of each
   column from the dataframe.
   2. Oracle provides the cx_Oracle package. cx_Oracle is a Python extension
   module that enables access to Oracle Database. It conforms to the Python
   database API 2.0 specification with a considerable number of additions
   and a couple of exclusions. It is maintained by Oracle.
   3. Using cx_Oracle we should be able to create a Connection to Oracle and
   use Connection.cursor() to deal with rows. See below.


This is an example

Create a connection to Oracle. The cx_Oracle package needs to be installed for PySpark.


import cx_Oracle

def loadIntoOracleTableWithCursor(self, df):
    # set Oracle details
    tableName = "randomdata"
    fullyQualifiedTableName = self.config['OracleVariables']['dbschema'] + '.' + tableName
    user = self.config['OracleVariables']['oracle_user']
    password = self.config['OracleVariables']['oracle_password']
    serverName = self.config['OracleVariables']['oracleHost']
    port = self.config['OracleVariables']['oraclePort']
    serviceName = self.config['OracleVariables']['serviceName']
    dsn_tns = cx_Oracle.makedsn(serverName, port, service_name=serviceName)
    # create connection conn
    conn = cx_Oracle.connect(user, password, dsn_tns)
    cursor = conn.cursor()
    # df is the dataframe containing the data. Walk through it row by row.
    for row in df.rdd.collect():
        # get individual column values from the dataframe
        id = row[0]
        clustered = row[1]
        scattered = row[2]
        randomised = row[3]
        random_string = row[4]
        small_vc = row[5]
        padding = row[6]
        # Build the INSERT statement sent to Oracle for every row. The Oracle
        # table has a column called derived_col that the dataframe does not
        # have; it is derived from values in the dataframe column(s). Here we
        # assign derived_col = cos(id) and pass it in sqlText. Use {} to pass
        # each value, enclosed in single quotes if the column is character type.
        sqlText = f"""insert into {fullyQualifiedTableName}
        (id,clustered,scattered,randomised,random_string,small_vc,padding,derived_col)
        values
        ({id},{clustered},{scattered},{randomised},'{random_string}','{small_vc}','{padding}',cos({id}))"""
        print(sqlText)
        cursor.execute(sqlText)
    conn.commit()

Our dataframe has 10 rows, and id in the Oracle table has been made the
primary key:


scratch...@orasource.mich.LOCAL> CREATE TABLE scratchpad.randomdata
  2  (
  3  "ID" NUMBER(*,0),
  4  "CLUSTERED" NUMBER(*,0),
  5  "SCATTERED" NUMBER(*,0),
  6  "RANDOMISED" NUMBER(*,0),
  7  "RANDOM_STRING" VARCHAR2(50 BYTE),
  8  "SMALL_VC" VARCHAR2(50 BYTE),
  9  "PADDING" VARCHAR2(4000 BYTE),
 10  "DERIVED_COL" FLOAT(126)
 11  );

Table created.
scratch...@orasource.mich.LOCAL> ALTER TABLE scratchpad.randomdata ADD
CONSTRAINT randomdata_PK PRIMARY KEY (ID);
Table altered.

Run it and see the output of  print(sqlText)

insert into SCRATCHPAD.randomdata
(id,clustered,scattered,randomised,random_string,small_vc,padding,derived_col)
values
(1,0.0,0.0,2.0,'KZWeqhFWCEPyYngFbyBMWXaSCrUZoLgubbbPIayRnBUbHoWCFJ','         1','xxx',cos(1))

This works fine. It creates the rows and does a commit
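
As a side note, df.rdd.collect() pulls every row back to the driver. A
possible refinement, sketched below with placeholder connection values (it is
not part of the original example and assumes cx_Oracle is installed on the
executors), is to push the per-row inserts into the executors with
foreachPartition:

import cx_Oracle

# Placeholder connection settings -- substitute the values read from
# self.config['OracleVariables'] in the example above
user = "scratchpad_user"
password = "scratchpad_password"
serverName = "oracle_host"
port = 1521
serviceName = "service_name"
fullyQualifiedTableName = "SCRATCHPAD.randomdata"

def insert_partition(rows):
    # one Oracle connection per partition; rows is an iterator of Row objects
    dsn_tns = cx_Oracle.makedsn(serverName, port, service_name=serviceName)
    conn = cx_Oracle.connect(user, password, dsn_tns)
    cursor = conn.cursor()
    for row in rows:
        sqlText = f"""insert into {fullyQualifiedTableName}
        (id,clustered,scattered,randomised,random_string,small_vc,padding,derived_col)
        values
        ({row[0]},{row[1]},{row[2]},{row[3]},'{row[4]}','{row[5]}','{row[6]}',cos({row[0]}))"""
        cursor.execute(sqlText)
    conn.commit()
    cursor.close()
    conn.close()

# df is the same dataframe as in the example above
df.foreachPartition(insert_partition)

Bind variables (cursor.execute with parameters) would be safer than string
interpolation, but the sketch mirrors the original for clarity.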


*What is needed*


We need to implement a JDBC connection in Spark such that it handles DML in
addition to queries (DQL).


JDBC option option("dbtable", schema.tableName) should be enhanced to
replace