Re: [VOTE] Release Spark 3.2.1 (RC1)

2022-01-15 Thread Dongjoon Hyun
Hi, Bjørn.

It seems that you are confused about my announcement. The test coverage
announcement is about the `master` branch, which is for the upcoming Apache
Spark 3.3.0. Apache Spark 3.3 will start to support Java 17; old release
branches like Apache Spark 3.2.x/3.1.x/3.0.x will not.

> 1. If I change the Java version to 17, I get an error, which I did not
> copy. But have you built this with Java 11 or Java 17? I have noticed that
> we test using Java 17, so I was hoping to update Java to version 17.

The Apache Spark community is still actively developing, stabilizing, and
optimizing Spark on Java 17. For details, please see the following:

SPARK-33772: Build and Run Spark on Java 17
SPARK-35781: Support Spark on Apple Silicon on macOS natively on Java 17
SPARK-37593: Optimize HeapMemoryAllocator to avoid memory waste when using G1GC

In short, please don't expect Java 17 with Spark 3.2.x and older versions.
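
For anyone who wants to double-check which Java their Spark runtime is
actually running on, here is a minimal sketch from PySpark. It assumes an
active `spark` session and goes through the private `_jvm` py4j handle, so
treat it as a debugging trick rather than a supported API:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Ask the driver JVM (via py4j) which Java version it is running on.
print(spark.sparkContext._jvm.java.lang.System.getProperty("java.version"))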

Thanks,
Dongjoon.




Re: [VOTE] Release Spark 3.2.1 (RC1)

2022-01-15 Thread Bjørn Jørgensen
Two things:

I changed the Dockerfile from jupyter/docker-stacks to
https://github.com/bjornjorgensen/docker-stacks/blob/master/pyspark-notebook/Dockerfile
and then built, tagged, and pushed it.
I start it with docker-compose like this:

version: '2.1'
services:
  jupyter:
    image: bjornjorgensen/spark-notebook:spark-3.2.1RC-1
    restart: 'no'
    volumes:
      - ./notebooks:/home/jovyan/notebooks
    ports:
      - "8881:"
      - "8181:8080"
      - "7077:7077"
      - "4040:4040"
    environment:
      NB_UID: ${UID}
      NB_GID: ${GID}


1. If I change the Java version to 17, I get an error, which I did not
copy. But have you built this with Java 11 or Java 17? I have noticed that
we test using Java 17, so I was hoping to update Java to version 17.

2.

In a notebook I start Spark with:

from pyspark import pandas as ps
import re
import numpy as np
import os
#import pandas as pd

from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
from pyspark.sql.functions import concat, concat_ws, lit, col, trim, expr
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

os.environ["PYARROW_IGNORE_TIMEZONE"]="1"

def get_spark_session(app_name: str, conf: SparkConf):
    conf.setMaster('local[*]')
    conf \
        .set('spark.driver.memory', '64g') \
        .set("fs.s3a.access.key", "minio") \
        .set("fs.s3a.secret.key", "KEY") \
        .set("fs.s3a.endpoint", "http://192.168.1.127:9000") \
        .set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
        .set("spark.hadoop.fs.s3a.path.style.access", "true") \
        .set("spark.sql.repl.eagerEval.enabled", "True") \
        .set("spark.sql.adaptive.enabled", "True") \
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
        .set("spark.sql.repl.eagerEval.maxNumRows", "1")

    return SparkSession.builder.appName(app_name).config(conf=conf).getOrCreate()

spark = get_spark_session("Falk", SparkConf())
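
One caveat worth flagging in the snippet above: bare fs.s3a.* keys set on
SparkConf may never reach the Hadoop configuration, while the spark.hadoop.
prefix (already used there for s3a.impl and path.style.access) is the
documented way to forward such settings. A hedged sketch of that variant:

conf \
    .set("spark.hadoop.fs.s3a.access.key", "minio") \
    .set("spark.hadoop.fs.s3a.secret.key", "KEY") \
    .set("spark.hadoop.fs.s3a.endpoint", "http://192.168.1.127:9000")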

Then I run this code:

f06 = spark.read.option("multiline", "true").json("/home/jovyan/notebooks/falk/data/norm_test/f06.json")

pf06 = f06.to_pandas_on_spark()

pf06.info()



And I did not get any errors or warnings. But according to
https://github.com/apache/spark/commit/bc7d55fc1046a55df61fdb380629699e9959fcc6,

(Spark)DataFrame.to_pandas_on_spark is deprecated.

So I was expecting a deprecation warning telling me to change to pandas_api,
which I did not get.
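
Assuming the deprecation from that commit is in the build under test, the
replacement call, plus one way to keep such warnings visible in a notebook,
would look roughly like this (how the warning is categorized is my
assumption, not something I checked in the source):

import warnings
warnings.simplefilter("always")  # notebooks often hide repeated warnings

pf06 = f06.pandas_api()  # the replacement named in the linked commit
pf06.info()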





On Fri, Jan 14, 2022 at 07:04, huaxin gao  wrote:

> The two regressions have been fixed. I will cut RC2 tomorrow late
> afternoon.
>
> Thanks,
> Huaxin
>
> On Wed, Jan 12, 2022 at 9:11 AM huaxin gao  wrote:
>
>> Thank you all for testing and voting!
>>
>> I will -1 this RC because
>> https://issues.apache.org/jira/browse/SPARK-37855 and
>> https://issues.apache.org/jira/browse/SPARK-37859 are regressions. These
>> are not blockers but I think it's better to fix them in 3.2.1. I will
>> prepare for RC2.
>>
>> Thanks,
>> Huaxin
>>
>> On Wed, Jan 12, 2022 at 2:03 AM Kent Yao  wrote:
>>
>>> +1 (non-binding).
>>>
>>> Chao Sun  wrote on Wed, Jan 12, 2022 at 16:10:
>>>
 +1 (non-binding). Thanks Huaxin for driving the release!

 On Tue, Jan 11, 2022 at 11:56 PM Ruifeng Zheng 
 wrote:

> +1 (non-binding)
>
> Thanks, ruifeng zheng
>
> -- Original --
> *From:* "Cheng Su" ;
> *Date:* Wed, Jan 12, 2022 02:54 PM
> *To:* "Qian Sun";"huaxin gao"<
> huaxin.ga...@gmail.com>;
> *Cc:* "dev";
> *Subject:* Re: [VOTE] Release Spark 3.2.1 (RC1)
>
> +1 (non-binding). Checked commit history and ran some local tests.
>
>
>
> Thanks,
>
> Cheng Su
>
>
>
> *From: *Qian Sun 
> *Date: *Tuesday, January 11, 2022 at 7:55 PM
> *To: *huaxin gao 
> *Cc: *dev 
> *Subject: *Re: [VOTE] Release Spark 3.2.1 (RC1)
>
> +1
>
>
>
> Looks good. All integration tests passed.
>
>
>
> Qian
>
>
>
> On Jan 11, 2022 at 2:09 AM, huaxin gao  wrote:
>
>
>
> Please vote on releasing the following candidate as Apache Spark
> version 3.2.1.
>
>
> The vote is open until Jan. 13th at 12 PM PST (8 PM UTC) and passes if a
> majority of +1 PMC votes are cast, with a minimum of 3 +1 votes.
>
>
> [ ] +1 Release this package as Apache Spark 3.2.1
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> There are currently no issues targeting 3.2.1 (try project = SPARK AND
> "Target Version/s" = "3.2.1" AND status in (Open, Reopened, "In
> Progress"))
>
> The tag to be voted on is v3.2.1-rc1 (commit
> 2b0ee226f8dd17b278ad11139e62464433191653):
>
> https://github.com/apache/spark/tree/v3.2.1-rc1
>
> The release files, including signatures, digests, etc. can be found