Re: ASF board report draft for August

2021-08-10 Thread Matei Zaharia
Good point, I’ll make sure to include that.

> On Aug 9, 2021, at 9:20 PM, Mridul Muralidharan wrote:
> 
> Hi Matei,
> 
>   3.2 will also include support for push-based shuffle (SPIP SPARK-30602).
> 
> Regards,
> Mridul
> 
> On Mon, Aug 9, 2021 at 9:26 PM Hyukjin Kwon wrote:
> > Which version of the Koalas project are you referring to? 1.8.1?
> 
> Yes, the latest version, 1.8.1.
> 
> On Tue, Aug 10, 2021 at 11:07 AM, Igor Costa wrote:
> Hi Matei, nice update
> 
> 
> Just one question: when you mention “We are working on Spark 3.2.0 as our
> next release, with a release candidate likely to come soon. Spark 3.2
> includes a new Pandas API for Apache Spark based on the Koalas project”:
> 
> 
> Which version of the Koalas project are you referring to? 1.8.1?
> 
> 
> 
> Cheers
> Igor 
> 
> On Tue, 10 Aug 2021 at 13:31, Matei Zaharia wrote:
> It’s time for our quarterly report to the ASF board, which we need to send 
> out this Wednesday. I wrote the draft below based on community activity — let 
> me know if you’d like to add or change anything:
> 
> ==
> 
> Description:
> 
> Apache Spark is a fast and general engine for large-scale data processing. It 
> offers high-level APIs in Java, Scala, Python, R and SQL as well as a rich 
> set of libraries including stream processing, machine learning, and graph 
> analytics.
> 
> Issues for the board:
> 
> - None
> 
> Project status:
> 
> - We made a number of maintenance releases in the past three months. We 
> released Apache Spark 3.1.2 and 3.0.3 in June as maintenance releases for the 
> 3.x branches. We also released Apache Spark 2.4.8 on May 17 as a bug fix 
> release for the Spark 2.x line. This may be the last release on 2.x unless 
> major new bugs are found.
> 
> - We added three PMC members: Liang-Chi Hsieh, Kousuke Saruta and Takeshi 
> Yamamuro.
> 
> - We are working on Spark 3.2.0 as our next release, with a release candidate 
> likely to come soon. Spark 3.2 includes a new Pandas API for Apache Spark 
> based on the Koalas project, a RocksDB state store for Structured Streaming, 
> native support for session windows, error message standardization, and 
> significant improvements to Spark SQL, such as the use of adaptive query 
> execution by default.
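
(As an illustration of the pandas API and adaptive-execution items above: a
minimal sketch, assuming the pyspark.pandas package that the Koalas codebase
was merged into for 3.2. The adaptive flag is set explicitly here only to make
the new default visible.)

  import pyspark.pandas as ps
  from pyspark.sql import SparkSession

  spark = SparkSession.builder.getOrCreate()

  # Adaptive query execution becomes the default in 3.2; shown explicitly.
  spark.conf.set("spark.sql.adaptive.enabled", "true")

  # pandas-style API backed by Spark DataFrames (formerly the Koalas project).
  psdf = ps.DataFrame({"x": [1, 2, 3], "y": [10, 20, 30]})
  print(psdf.describe())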
> 
> Trademarks:
> 
> - No changes since the last report.
> 
> Latest releases:
> 
> - Spark 3.1.2 was released on June 23rd, 2021.
> - Spark 3.0.3 was released on June 1st, 2021.
> - Spark 2.4.8 was released on May 17th, 2021.
> 
> Committers and PMC:
> 
> - The latest committers were added on March 11th, 2021 (Attila Zsolt Piros, 
> Gabor Somogyi, Kent Yao, Maciej Szymkiewicz, Max Gekk, and Yi Wu).
> - The latest PMC member was added on June 20th, 2021 (Kousuke Saruta).
> 



Re: Spark 3.2.0 first RC next week

2021-08-10 Thread Min Shen
Hi Gengliang,

SPARK-36378 (Switch to using RPCResponse to communicate common block push
failures to the client) should be another one.
This introduces a slight protocol change to push-based shuffle to improve
code robustness and performance, and is almost ready to be committed.
Because of the protocol change, it’s best to include it in the 3.2.0 release.

Best,
Min
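
(For context: push-based shuffle is the SPIP from SPARK-30602 that also comes
up in the board-report thread. Below is a minimal sketch of opting in from the
client side, assuming the Spark 3.2 configuration keys; the feature targets
YARN deployments, and the external shuffle service needs matching server-side
setup that is not shown here.)

  from pyspark.sql import SparkSession

  # Client-side opt-in to push-based shuffle (Spark 3.2, YARN deployments).
  spark = (SparkSession.builder
           .appName("push-based-shuffle-sketch")
           .config("spark.shuffle.service.enabled", "true")
           .config("spark.shuffle.push.enabled", "true")
           .getOrCreate())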

On Tue, Aug 10, 2021 at 01:13, Gengliang Wang wrote:

> Hi all,
>
> As of now, there are still some open/in-progress blockers for Spark 3.2.0
> release:
>
>    - Prohibit update mode in native support of session window (SPARK-36463)
>    - Avoid inlining non-deterministic With-CTEs (SPARK-36447)
>    - Data Source V2: Remove read specific distributions (SPARK-33807)
>    - Support fetching shuffle blocks in batch with I/O encryption (SPARK-34827)
>    - Add a new Maven profile "no-shaded-client" for older Hadoop 3.x versions (SPARK-35959)
>    - Review and fix issues in API docs (SPARK-34185)
>    - Introduce the RocksDBStateStoreProvider in the programming guide (SPARK-36041)
>    - Push-based shuffle documentation (SPARK-36374)
>
> Thus, I propose to cut RC1 next week after all the blockers are resolved.
> If there are any other blockers, please reply to this email.
>
> Thanks
> Gengliang
>


Re: Performance of PySpark jobs on the Kubernetes cluster

2021-08-10 Thread Khalid Mammadov
Hi Mich

I think you need to check your code.
If the code does not use the PySpark API effectively you can get this, i.e. if
you do the work in pure Python/pandas on the driver rather than as PySpark
transformations that run on the executors: transform -> transform -> action,
e.g. df.select(...).withColumn(...)...count().
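
(A rough contrast of the two styles, as a minimal sketch with made-up column
names; only the second keeps the computation distributed on the executors.)

  from pyspark.sql import SparkSession
  from pyspark.sql import functions as F

  spark = SparkSession.builder.appName("pyspark-style-sketch").getOrCreate()
  df = spark.range(1_000_000).withColumn("amount", F.rand())

  # Anti-pattern: collect() ships every row to the driver (the launch node),
  # which then does all the work on a single machine.
  total = sum(row["amount"] for row in df.collect())

  # PySpark style: transform -> transform -> action, run on the executors.
  total = (df.filter(F.col("amount") > 0.5)
             .withColumn("doubled", F.col("amount") * 2)
             .agg(F.sum("doubled"))
             .first()[0])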

Hope this helps to put you in the right direction.

Cheers
Khalid




On Mon, 9 Aug 2021, 20:20 Mich Talebzadeh wrote:

> Hi,
>
> I have a basic question to ask.
>
> I am running a Google k8s cluster (aka GKE) with three nodes, each having
> the configuration below:
>
> e2-standard-2 (2 vCPUs, 8 GB memory)
>
>
> spark-submit is launched from another node (actually a Dataproc single
> node that I have just upgraded to e2-custom, 4 vCPUs, 8 GB memory). We call
> this the launch node.
>
> OK, I know that the cluster is not much, but Google was complaining about
> the launch node hitting 100% CPU, so I added two more vCPUs to it.
>
> It appears that despite using k8s as the computational cluster, the burden
> falls upon the launch node!
>
> The CPU utilisation for the launch node is shown below:
>
> [image: CPU utilisation chart for the launch node]
> The dip is when the 2 extra vCPUs were added, which required a reboot;
> usage sits at around 70%.
>
> The combined CPU usage for the GKE nodes is shown below:
>
> [image: combined CPU usage chart for the GKE nodes]
>
> Never goes above 20%!
>
> I can see the driver and executors as below:
>
> k get pods -n spark
> NAME                                         READY   STATUS    RESTARTS   AGE
> pytest-c958c97b2c52b6ed-driver               1/1     Running   0          69s
> randomdatabigquery-e68a8a7b2c52f468-exec-1   1/1     Running   0          51s
> randomdatabigquery-e68a8a7b2c52f468-exec-2   1/1     Running   0          51s
> randomdatabigquery-e68a8a7b2c52f468-exec-3   0/1     Pending   0          51s
>
> It is a PySpark 3.1.1 image using Java 8, pushing randomly generated data
> into the Google BigQuery data warehouse. The last executor (exec-3) seems
> to be stuck in Pending. The spark-submit is as below:
>
> spark-submit --verbose \
>    --properties-file ${property_file} \
>    --master k8s://https://$KUBERNETES_MASTER_IP:443 \
>    --deploy-mode cluster \
>    --name pytest \
>    --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./pyspark_venv/bin/python \
>    --py-files $CODE_DIRECTORY/DSBQ.zip \
>    --conf spark.kubernetes.namespace=$NAMESPACE \
>    --conf spark.executor.memory=5000m \
>    --conf spark.network.timeout=300 \
>    --conf spark.executor.instances=3 \
>    --conf spark.kubernetes.driver.limit.cores=1 \
>    --conf spark.driver.cores=1 \
>    --conf spark.executor.cores=1 \
>    --conf spark.executor.memory=2000m \
>    --conf spark.kubernetes.driver.docker.image=${IMAGEGCP} \
>    --conf spark.kubernetes.executor.docker.image=${IMAGEGCP} \
>    --conf spark.kubernetes.container.image=${IMAGEGCP} \
>    --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-bq \
>    --conf spark.driver.extraJavaOptions="-Dio.netty.tryReflectionSetAccessible=true" \
>    --conf spark.executor.extraJavaOptions="-Dio.netty.tryReflectionSetAccessible=true" \
>    --conf spark.sql.execution.arrow.pyspark.enabled="true" \
>    $CODE_DIRECTORY/${APPLICATION}
>
> Aren't the driver and executors running on the K8s cluster? So why is the
> launch node heavily used while the K8s cluster is underutilized?
>
> Thanks
>


Spark 3.2.0 first RC next week

2021-08-10 Thread Gengliang Wang
Hi all,

As of now, there are still some open/in-progress blockers for Spark 3.2.0
release:

   - Prohibit update mode in native support of session window (SPARK-36463)
   - Avoid inlining non-deterministic With-CTEs (SPARK-36447)
   - Data Source V2: Remove read specific distributions (SPARK-33807)
   - Support fetching shuffle blocks in batch with I/O encryption (SPARK-34827)
   - Add a new Maven profile "no-shaded-client" for older Hadoop 3.x versions (SPARK-35959)
   - Review and fix issues in API docs (SPARK-34185)
   - Introduce the RocksDBStateStoreProvider in the programming guide (SPARK-36041)
   - Push-based shuffle documentation (SPARK-36374)
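
(To illustrate the session-window and RocksDB state store items above: a
minimal Structured Streaming sketch. The session_window function and the
RocksDBStateStoreProvider class are the 3.2 additions being tracked; the rate
source and column names are just placeholders. Append output mode is used,
since update mode is what SPARK-36463 proposes to prohibit for session
windows.)

  from pyspark.sql import SparkSession
  from pyspark.sql.functions import session_window, col

  spark = (SparkSession.builder
           .appName("session-window-sketch")
           # RocksDB-backed streaming state store, new in Spark 3.2.
           .config("spark.sql.streaming.stateStore.providerClass",
                   "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider")
           .getOrCreate())

  # Placeholder event stream; the rate source emits (timestamp, value) rows.
  events = (spark.readStream.format("rate").load()
            .withColumnRenamed("value", "userId"))

  # A session closes after a 5-minute gap in a user's events; the watermark
  # is required for append output mode.
  sessions = (events
              .withWatermark("timestamp", "10 minutes")
              .groupBy(session_window(col("timestamp"), "5 minutes"),
                       col("userId"))
              .count())

  query = sessions.writeStream.outputMode("append").format("console").start()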

Thus, I propose to cut RC1 next week after all the blockers are resolved.
If there are any other blockers, please reply to this email.

Thanks
Gengliang