Re: [VOTE] Release Spark 3.3.0 (RC5)

2022-06-07 Thread Cheng Pan
+1 (non-binding)

* Verified SPARK-39313 has been addressed [1]
* Passed integration test w/ Apache Kyuubi (Incubating) [2]

[1] https://github.com/housepower/spark-clickhouse-connector/pull/123
[2] https://github.com/apache/incubator-kyuubi/pull/2817

Thanks,
Cheng Pan

On Wed, Jun 8, 2022 at 7:04 AM Chris Nauroth  wrote:
>
> +1 (non-binding)
> [...]

Re: [VOTE] Release Spark 3.3.0 (RC5)

2022-06-07 Thread Chris Nauroth
+1 (non-binding)

* Verified all checksums.
* Verified all signatures.
* Built from source, with multiple profiles, to full success, for Java 11
and Scala 2.13:
* build/mvn -Phadoop-3 -Phadoop-cloud -Phive-thriftserver -Pkubernetes -Pscala-2.13 -Psparkr -Pyarn -DskipTests clean package
* Tests passed.
* Ran several examples successfully:
* bin/spark-submit --class org.apache.spark.examples.SparkPi examples/jars/spark-examples_2.12-3.3.0.jar
* bin/spark-submit --class org.apache.spark.examples.sql.hive.SparkHiveExample examples/jars/spark-examples_2.12-3.3.0.jar
* bin/spark-submit examples/src/main/python/streaming/network_wordcount.py localhost 
* Tested some of the issues that blocked prior release candidates:
* bin/spark-sql -e 'SELECT (SELECT IF(x, 1, 0)) AS a FROM (SELECT true) t(x) UNION SELECT 1 AS a;'
* bin/spark-sql -e "select date '2018-11-17' > 1"
* SPARK-39293 ArrayAggregate fix

Chris Nauroth


On Tue, Jun 7, 2022 at 1:30 PM Cheng Su  wrote:

> +1 (non-binding). Built and ran some internal tests for Spark SQL.
> [...]

Re: [VOTE] Release Spark 3.3.0 (RC5)

2022-06-07 Thread Cheng Su
+1 (non-binding). Built and ran some internal tests for Spark SQL.

Thanks,
Cheng Su

From: L. C. Hsieh 
Date: Tuesday, June 7, 2022 at 1:23 PM
To: dev 
Subject: Re: [VOTE] Release Spark 3.3.0 (RC5)
+1

Liang-Chi

On Tue, Jun 7, 2022 at 1:03 PM Gengliang Wang  wrote:
>
> +1 (non-binding)
> [...]


Re: [VOTE] Release Spark 3.3.0 (RC5)

2022-06-07 Thread L. C. Hsieh
+1

Liang-Chi

On Tue, Jun 7, 2022 at 1:03 PM Gengliang Wang  wrote:
>
> +1 (non-binding)
> [...]



Re: [VOTE] Release Spark 3.3.0 (RC5)

2022-06-07 Thread Thomas Graves
+1

Tom Graves

On Sat, Jun 4, 2022 at 9:50 AM Maxim Gekk
 wrote:
>
> Please vote on releasing the following candidate as Apache Spark version 
> 3.3.0.
>
> The vote is open until 11:59pm Pacific time June 8th and passes if a
> majority of +1 PMC votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.3.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v3.3.0-rc5 (commit 
> 7cf29705272ab8e8c70e8885a3664ad8ae3cd5e9):
> https://github.com/apache/spark/tree/v3.3.0-rc5
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc5-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1406
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc5-docs/
>
> The list of bug fixes going into 3.3.0 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12350369
>
> This release is using the release script of the tag v3.3.0-rc5.
>
>
> FAQ
>
> =
> How can I help test this release?
> =
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks; in Java/Scala
> you can add the staging repository to your project's resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out-of-date RC going forward).
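
For example, a minimal sbt fragment for that staging-repository test might
look like this (a sketch; the spark-sql module and Provided scope are
placeholders for whatever your project actually depends on):

// build.sbt -- sketch: resolve the RC artifacts from the staging repository
resolvers += "Spark 3.3.0 RC5 staging" at
  "https://repository.apache.org/content/repositories/orgapachespark-1406/"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.3.0" % Provided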
>
> ===
> What should happen to JIRA tickets still targeting 3.3.0?
> ===
> The current list of open tickets targeted at 3.3.0 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target 
> Version/s" = 3.3.0
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>
> Maxim Gekk
>
> Software Engineer
>
> Databricks, Inc.




Re: [Spark] [SQL] Updating Spark from version 3.0.1 to 3.2.1 reduced functionality for working with parquet files

2022-06-07 Thread Amin Borjian
Thanks for answering. I found the problem and created an issue for Spark
with a related pull request:

https://issues.apache.org/jira/browse/SPARK-39393

Thanks again for your help.


From: Enrico Minack 
Sent: Tuesday, June 7, 2022, 9:49 PM
To: Amin Borjian
Subject: Re: [Spark] [SQL] Updating Spark from version 3.0.1 to 3.2.1 reduced 
functionality for working with parquet files

Hi,

even though the config option has been around since 1.2.0, it might be that 
more filters are being pushed into Parquet after 3.0.1 under the same option.

Are you sure the filter had been pushed into Parquet in 3.0.1? Did you run
df.explain(true) for both versions? Can you share the plans?
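
A quick way to check that, for instance (a sketch for spark-shell; the path
and column are placeholders matching your example):

// sketch: run against both Spark versions and compare the physical plans;
// a predicate pushed into Parquet shows up as "PushedFilters" in the scan node
import org.apache.spark.sql.functions.{array_contains, col}
val df = spark.read.parquet("/path/to/parquet")
  .where(array_contains(col("keywords"), "XXX"))
df.explain(true)  // prints parsed, analyzed, optimized, and physical plans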

Enrico


On 05.06.22 at 12:34, Amin Borjian wrote:
Thanks for the answer.

- It looks like the error comes from the Parquet library; has the library
version changed moving to 3.2.1? What are the Parquet versions used in 3.0.1
and 3.2.1? Can you read that Parquet file with the newer Parquet library
version natively (without Spark)? Then this might be a Parquet issue, not a
Spark issue.

At first I thought the problem was in the Parquet library. The Parquet
library was updated from version 1.10 to 1.12.2 in Spark 3.2.1.
I checked all classes in the exception stack trace in order to find any
suspect change in 2020-2022. In more detail, I looked at the following classes:


  *   SchemaCompatibilityValidator
  *   Operators
  *   RowGroupFilter
  *   FilterCompat
  *   ParquetFileReader
  *   ParquetRecordReader

I did not find any big change related to the problem. In fact, the problem
we have is caused by this line, which has existed since 2014 (based on Git
history):

https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/filter2/predicate/SchemaCompatibilityValidator.java#L194

- Unless Spark 3.2.1 does predicate filter pushdown while 3.0.1 did not, and
it has never been supported by Parquet. Then disabling the filter pushdown
feature should help: config("spark.sql.parquet.filterPushdown", false).

So I think it is somehow related to Spark (for example, Spark calling new
functions of the Parquet library). I also checked the option you mentioned
in the Spark code base:

val PARQUET_FILTER_PUSHDOWN_ENABLED =
  buildConf("spark.sql.parquet.filterPushdown")
    .doc("Enables Parquet filter push-down optimization when set to true.")
    .version("1.2.0")
    .booleanConf
    .createWithDefault(true)

This option has been around for a long time and we even had it enabled in the 
previous version (Spark 3.0.1).
Of course, disabling it solved the problem, but I wonder why we did not have a 
problem before.
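
For reference, a minimal sketch of that workaround (Scala; the app name is a
placeholder):

// sketch: build the session with Parquet filter pushdown disabled,
// so Spark evaluates the predicate itself instead of handing it to Parquet
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder()
  .appName("parquet-pushdown-workaround")
  .config("spark.sql.parquet.filterPushdown", "false")
  .getOrCreate()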

It was difficult to check the Spark stack trace, so I checked only the
following two classes (after these classes, the Parquet library functions
are called). I did not see any significant change related to the problem:


  *   ParquetFileFormat
  *   FileScanRDD

Do you think the Parquet library could still be the problem? Have I
forgotten a place in the review? Or does Spark have an issue and is not yet
using the "spark.sql.parquet.filterPushdown" setting correctly?

From: Enrico Minack
Sent: Sunday, June 5, 2022 1:32 PM
To: Amin Borjian; dev@spark.apache.org
Subject: Re: [Spark] [SQL] Updating Spark from version 3.0.1 to 3.2.1 reduced 
functionality for working with parquet files

Hi,

It looks like the error comes from the Parquet library; has the library
version changed moving to 3.2.1? What are the Parquet versions used in 3.0.1
and 3.2.1? Can you read that Parquet file with the newer Parquet library
version natively (without Spark)? Then this might be a Parquet issue, not a
Spark issue.

Unless Spark 3.2.1 does predicate filter pushdown while 3.0.1 did not, and
it has never been supported by Parquet. Then disabling the filter pushdown
feature should help: config("spark.sql.parquet.filterPushdown", false).

Enrico


On 05.06.22 at 10:37, Amin Borjian wrote:

Hi.

We are updating our Spark cluster from version 3.0.1 to 3.2.1 in order to
benefit from many improvements. Everything was good until we saw strange
behavior. Assume the following protobuf structure:

message Model {
  string name = 1;
  repeated string keywords = 2;
}

We store protobuf messages in Parquet files in HDFS, written with the
Parquet library. Before Spark 3.2.1 we could run the query below in Spark:

val df = spark.read.parquet("/path/to/parquet")
df.registerTempTable("models")
spark.sql("select * from models where array_contains(keywords, 'XXX')").show(false)
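
The equivalent DataFrame-API filter behaves the same way, since it goes
through the same pushdown path (a sketch):

// sketch: the same predicate expressed with the DataFrame API
import org.apache.spark.sql.functions.{array_contains, col}
df.filter(array_contains(col("keywords"), "XXX")).show(false)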

However, after updating Spark to version 3.2.1, we receive the following
error (at the end of this email). I think we lost a good feature! Is it by
mistake or on purpose? Can we somehow fix the problem without reverting?
Should we wait for a new release? Thank you in advance for your help.

Caused by: java.lang.IllegalArgumentException: FilterPredicates do not 
currently support 

Re: [DISCUSS] SPIP: Spark Connect - A client and server interface for Apache Spark.

2022-06-07 Thread Martin Grund
On Tue, Jun 7, 2022 at 3:54 PM Steve Loughran 
wrote:

>
>
> On Fri, 3 Jun 2022 at 18:46, Martin Grund
>  wrote:
>
>> [...]
>>
>
> one key finding in distributed systems, going back to Nelson's first RPC
> work in 1981, is that "seamless upgradability" is usually an unrealised
> vision, especially if things like serialized java/spark objects are part
> of the payload.
>
> if it is a goal, then the tests to validate the versioning would have to
> be a key deliverable. examples: test modules using old versions,
>
> This is particularly a risk with a design which proposes serialising
> logical plans; it may be hard to change planning in future.
>
> Will the protocol include something similar to the DXL plan language
> implemented in Greenplum's orca query optimizer? That's an
> under-appreciated piece of work. If the goal of the protocol is to be long
> lived, it is a design worth considering, not just for its portability but
> because it lets people work on query optimisation as a service.
>
>
In the prototype I've built I'm not actually using the fully specified
logical plans that Spark is using for the query execution before
optimization, but rather something that is closer to the parse plans of a
SQL query. The parse plans follow more closely the relational algebra and
are much less likely to change compared to the actual underlying logical
plan operator. The goal is not to build an endpoint that can receive
optimized plans and directly execute these plans.

For example, all attributes in the plans are referenced as unresolved
attributes, and the same is true for functions. This delegates the
responsibility for name resolution etc. to the existing implementation that
we're not going to touch, instead of trying to replicate it. It is still
possible to provide early feedback to the user, because one can always
analyze the specific sub-plan.
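
As an illustration of that level of abstraction (a sketch using existing
Spark internals in spark-shell, not the proposed wire format; the table and
predicate are placeholders):

// sketch: the pre-analysis logical plan of a DataFrame query still carries
// unresolved function and attribute references -- roughly the level this
// proposal would serialize
val df = spark.table("models").filter("array_contains(keywords, 'XXX')")
println(df.queryExecution.logical)  // the filter condition prints unresolved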

Please let me know what you think.


> [...]


Re: [DISCUSS] SPIP: Spark Connect - A client and server interface for Apache Spark.

2022-06-07 Thread Steve Loughran
On Fri, 3 Jun 2022 at 18:46, Martin Grund
 wrote:

> Hi Everyone,
>
> We would like to start a discussion on the "Spark Connect" proposal.
> Please find the links below:
>
> *JIRA* - https://issues.apache.org/jira/browse/SPARK-39375
> *SPIP Document* -
> https://docs.google.com/document/d/1Mnl6jmGszixLW4KcJU5j9IgpG9-UabS0dcM6PM2XGDc/edit#heading=h.wmsrrfealhrj
>
> *Excerpt from the document: *
>
> We propose to extend Apache Spark by building on the DataFrame API and the
> underlying unresolved logical plans. The DataFrame API is widely used and
> makes it very easy to iteratively express complex logic. We will introduce
> Spark Connect, a remote option of the DataFrame API that separates the
> client from the Spark server. With Spark Connect, Spark will become
> decoupled, allowing for built-in remote connectivity: The decoupled client
> SDK can be used to run interactive data exploration and connect to the
> server for DataFrame operations.
>
> Spark Connect will benefit Spark developers in different ways: The
> decoupled architecture will result in improved stability, as clients are
> separated from the driver. From the Spark Connect client perspective, Spark
> will be (almost) versionless, and thus enable seamless upgradability, as
> server APIs can evolve without affecting the client API. The decoupled
> client-server architecture can be leveraged to build close integrations
> with local developer tooling. Finally, separating the client process from
> the Spark server process will improve Spark’s overall security posture by
> avoiding the tight coupling of the client inside the Spark runtime
> environment.
>

one key finding in distributed systems, going back to Nelson's first RPC
work in 1981, is that "seamless upgradability" is usually an unrealised
vision, especially if things like serialized java/spark objects are part of
the payload.

if it is a goal, then the tests to validate the versioning would have to be
a key deliverable. examples: test modules using old versions,

This is particularly a risk with a design which proposes serialising
logical plans; it may be hard to change planning in future.

Will the protocol include something similar to the DXL plan language
implemented in Greenplum's orca query optimizer? That's an
under-appreciated piece of work. If the goal of the protocol is to be long
lived, it is a design worth considering, not just for its portability but
because it lets people work on query optimisation as a service.


[1]. Orca: A Modular Query Optimizer Architecture for Big Data
 
https://15721.courses.cs.cmu.edu/spring2017/papers/15-optimizer2/p337-soliman.pdf



> Spark Connect will strengthen Spark’s position as the modern unified
> engine for large-scale data analytics and expand applicability to use cases
> and developers we could not reach with the current setup: Spark will become
> ubiquitously usable as the DataFrame API can be used with (almost) any
> programming language.
>
> That's a marketing comment, not a technical one. Best left out of ASF docs.


Re: [VOTE] Release Spark 3.3.0 (RC5)

2022-06-07 Thread Martin Grigorov
Hi,

[X] +1 Release this package as Apache Spark 3.3.0

Tested:
- make local distribution from sources (with ./dev/make-distribution.sh
--tgz --name with-volcano -Pkubernetes,volcano,hadoop-3)
- create a Docker image (with JDK 11)
- run Pi example on
-- local
-- Kubernetes with default scheduler
-- Kubernetes with Volcano scheduler

On both x86_64 and aarch64!

Regards,
Martin

On Sat, Jun 4, 2022 at 5:50 PM Maxim Gekk 
wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 3.3.0.
> [...]
>