Re: The performance difference when running Apache Spark on K8s and traditional server

2023-07-27 Thread Mich Talebzadeh
Spark on tin boxes, such as those behind Google Dataproc or AWS EC2, often
utilises the YARN resource manager. YARN is the most widely used resource
manager, not just for Spark but for other artefacts as well. On-premise,
YARN is used extensively. In the cloud it is also widely used in
Infrastructure as a Service offerings such as Google Dataproc, which I mentioned.

With regard to your questions:

Q1: What are the causes and reasons for Spark on K8s to be slower than
Serverful?
--> It should be noted that Spark on Kubernetes is still a work in progress,
and there is outstanding future work. As of now it is not at parity with
Spark on YARN.

Q2: Is there a scenario that shows the most apparent difference in
performance and cost between these two environments (Serverless (K8s) and
Serverful (traditional server))?
--> Simple. One experiment is worth ten hypotheses. Install Spark on a
serverful cluster and on K8s, run the same workload on both, and observe
the performance through the Spark GUI.
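
For illustration, a minimal sketch of such a like-for-like JOIN workload (the
synthetic tables, their sizes and the noop sink are my own assumptions, not
from this thread). Run the identical script on the serverful cluster and on
K8s, then compare the jobs, stages and task times in the Spark UI:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("join-benchmark").getOrCreate()

# Synthetic tables so the workload is identical in both environments
orders = (spark.range(0, 50_000_000)
               .withColumn("customer_id", F.col("id") % 1_000_000))
customers = (spark.range(0, 1_000_000)
                  .withColumnRenamed("id", "customer_id")
                  .withColumn("segment", F.col("customer_id") % 10))

# Shuffle-heavy join followed by an aggregation
result = (orders.join(customers, "customer_id")
                .groupBy("segment")
                .count())

# The "noop" sink forces full execution without paying for output I/O
result.write.mode("overwrite").format("noop").save()

spark.stop()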

See this article of mine to help you with some of the features. It is a bit
dated but still covers the concepts:

Spark on Kubernetes, A Practitioner’s Guide


HTH

Mich Talebzadeh,
Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 27 Jul 2023 at 18:20, Trường Trần Phan An 
wrote:

> Hi all,
>
> I am learning about the performance difference of Spark when running a
> JOIN workload on Serverless (K8s) and Serverful (traditional server)
> environments.
>
> Through experiment, Spark on K8s tends to run slower than Serverful.
> Having studied the architecture, I know that Spark on K8s runs as
> containers (pods), so it takes a certain time to initialize, but even when
> I look at each job, stage, and task, Spark on K8s still tends to be slower
> than Serverful.
>
> *I have some questions:*
> Q1: What are the causes and reasons for Spark on K8s to be slower than
> Serverful?
> Q2: Is there a scenario that shows the most apparent difference in
> performance and cost between these two environments (Serverless (K8s) and
> Serverful (traditional server))?
>
> Thank you so much!
>
> Best regards,
> Truong
>
>
>




The performance difference when running Apache Spark on K8s and traditional server

2023-07-27 Thread Trường Trần Phan An
Hi all,

I am learning about the performance difference of Spark when running a
JOIN workload on Serverless (K8s) and Serverful (traditional server)
environments.

Through experiment, Spark on K8s tends to run slower than Serverful.
Having studied the architecture, I know that Spark on K8s runs as
containers (pods), so it takes a certain time to initialize, but even when
I look at each job, stage, and task, Spark on K8s still tends to be slower
than Serverful.

*I have some questions:*
Q1: What are the causes and reasons for Spark on K8s to be slower than
Serverful?
Q2: Is there a scenario that shows the most apparent difference in
performance and cost between these two environments (Serverless (K8s) and
Serverful (traditional server))?

Thank you so much!

Best regards,
Truong




Dynamic allocation does not deallocate executors

2023-07-27 Thread Sergei Zhgirovski
Hi everyone

I'm trying to use pyspark 3.3.2.
I have these relevant options set:


spark.dynamicAllocation.enabled=true
spark.dynamicAllocation.shuffleTracking.enabled=true
spark.dynamicAllocation.shuffleTracking.timeout=20s
spark.dynamicAllocation.executorIdleTimeout=30s
spark.dynamicAllocation.cachedExecutorIdleTimeout=40s
spark.executor.instances=0
spark.dynamicAllocation.minExecutors=0
spark.dynamicAllocation.maxExecutors=20
spark.master=k8s://https://k8s-api.<>:6443
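
For reference, a minimal sketch (not the poster's actual launch code) of how
these options could be wired into a SparkSession from PySpark; the API-server
address and container image below are placeholders, since the real values are
elided in the post:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("k8s://https://<api-server>:6443")                      # placeholder
    .config("spark.kubernetes.container.image", "<spark-image>")    # placeholder
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
    .config("spark.dynamicAllocation.shuffleTracking.timeout", "20s")
    .config("spark.dynamicAllocation.executorIdleTimeout", "30s")
    .config("spark.dynamicAllocation.cachedExecutorIdleTimeout", "40s")
    .config("spark.executor.instances", "0")
    .config("spark.dynamicAllocation.minExecutors", "0")
    .config("spark.dynamicAllocation.maxExecutors", "20")
    .getOrCreate()
)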


So I'm using Kubernetes to deploy up to 20 executors.

then I run this piece of code:

df = spark.read.parquet("s3a://")
print(df.count())
time.sleep(999)


This works fine and as expected: during the execution ~1600 tasks are
completed, 20 executors get deployed, and they are quickly removed after
the calculation completes.

Next, I add these to the config:

spark.decommission.enabled=true
spark.storage.decommission.shuffleBlocks.enabled=true
spark.storage.decommission.enabled=true
spark.storage.decommission.rddBlocks.enabled=true


I repeat the experiment on an empty Kubernetes cluster, so that no actual
pod eviction is occurring.

This time executor deallocation is not working as expected: depending on
the run, after the job completes, 0-3 of the 20 executors remain present
forever and never seem to get removed.

I tried to debug the code and found that inside the
'ExecutorMonitor.timedOutExecutors' function, the executors that never get
removed do not make it into the 'timedOutExecs' variable, because their
'hasActiveShuffle' property remains 'true'.

I'm a little stuck here trying to understand how pod management, shuffle
tracking and decommissioning are supposed to work together, how to debug
this, and whether this is expected behaviour at all (to me it is not).
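
For anyone wanting to poke at the same state from the outside, here is a
minimal sketch that lists the executors the driver still considers alive via
the standard monitoring REST API. The driver UI address (localhost:4040) and
the use of the 'requests' library are assumptions for illustration; the
printed fields come from the ExecutorSummary JSON:

import requests

base = "http://localhost:4040/api/v1"
app_id = requests.get(f"{base}/applications").json()[0]["id"]

# /executors returns the currently registered executors (plus a driver entry)
for ex in requests.get(f"{base}/applications/{app_id}/executors").json():
    print(ex["id"], "isActive:", ex["isActive"],
          "activeTasks:", ex["activeTasks"],
          "totalShuffleWrite:", ex["totalShuffleWrite"])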

Thank you for any hints!


[ANNOUNCE] Apache Celeborn(incubating) 0.3.0 available

2023-07-27 Thread zhongqiang chen
Hi all,

The Apache Celeborn (Incubating) community is glad to announce the new
release of Apache Celeborn (Incubating) 0.3.0.

Celeborn is dedicated to improving the efficiency and elasticity of
different map-reduce engines and provides an elastic, highly efficient
service for intermediate data, including shuffle data, spilled data,
result data, etc.
Currently Celeborn supports both Spark and Flink with a unified service.

Flink 1.14, 1.15 and 1.17 are now all supported.

Download Link: https://celeborn.apache.org/download/

GitHub Release Tag:
- https://github.com/apache/incubator-celeborn/releases/tag/v0.3.0-incubating

Release Notes:
- https://celeborn.apache.org/community/release_notes/release_note_0.3.0

Website: https://celeborn.apache.org/

Celeborn Resources:
- Issue: https://issues.apache.org/jira/projects/CELEBORN
- Mailing list: d...@celeborn.apache.org

Zhongqiang Chen
On behalf of the Apache Celeborn(incubating) community




Re: conver panda image column to spark dataframe

2023-07-27 Thread Adrian Pop-Tifrea
Hello,

when you said your pandas DataFrame has 10 rows, does that mean it contains
10 images? If that's the case, then you'd want to use only 3 layers of
ArrayType when you define the schema, since the 10 rows already account for
the first dimension.
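
For example, a minimal sketch along those lines (the dummy images, the
SparkSession and the .tolist() conversion are illustrative assumptions, not
code from this thread):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, ArrayType, IntegerType
import numpy as np
import pandas as pd

spark = SparkSession.builder.getOrCreate()

# 10 dummy images of shape (500, 333, 3), standing in for the real data
pdf = pd.DataFrame({"image": [np.zeros((500, 333, 3), dtype=np.uint8) for _ in range(10)]})

# Three ArrayType levels, one per image dimension (height, width, channel);
# the row dimension (10) is the DataFrame itself, so it needs no extra level
schema = StructType([
    StructField("image", ArrayType(ArrayType(ArrayType(IntegerType()))), nullable=False)
])

# Spark cannot accept numpy.ndarray values directly; convert each image to nested lists
pdf["image"] = pdf["image"].apply(lambda a: a.tolist())

sdf = spark.createDataFrame(pdf, schema)
sdf.printSchema()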

Best regards,
Adrian



On Thu, Jul 27, 2023, 11:04 second_co...@yahoo.com.INVALID
 wrote:

> I have a pandas DataFrame with a column 'image' holding numpy.ndarray
> values; the shape is (500, 333, 3) per image. My pandas DataFrame has 10
> rows, so the overall shape is (10, 500, 333, 3).
>
> When using spark.createDataFrame(panda_dataframe, schema), I need to
> specify the schema:
>
> schema = StructType([
>     StructField("image",
>                 ArrayType(ArrayType(ArrayType(ArrayType(IntegerType())))),
>                 nullable=False)
> ])
>
> I get the error:
>
> TypeError: field image:
> ArrayType(ArrayType(ArrayType(ArrayType(IntegerType(), True), True), True),
> True) can not accept object array([[[14, 14, 14],
>
> ...
>
> Can you advise how to set the schema for an image column with numpy.ndarray?
>
>
>
>


conver panda image column to spark dataframe

2023-07-27 Thread second_co...@yahoo.com.INVALID
I have a pandas DataFrame with a column 'image' holding numpy.ndarray values;
the shape is (500, 333, 3) per image. My pandas DataFrame has 10 rows, so the
overall shape is (10, 500, 333, 3).

When using spark.createDataFrame(panda_dataframe, schema), I need to specify
the schema:

schema = StructType([
    StructField("image",
                ArrayType(ArrayType(ArrayType(ArrayType(IntegerType())))),
                nullable=False)
])

I get the error:

TypeError: field image:
ArrayType(ArrayType(ArrayType(ArrayType(IntegerType(), True), True), True),
True) can not accept object array([[[14, 14, 14],...

Can you advise how to set the schema for an image column with numpy.ndarray?




Re: spark context list_packages()

2023-07-27 Thread Sean Owen
There is no such method in Spark. I think that's some EMR-specific
modification.
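
For what it's worth, a minimal sketch of a vanilla alternative for listing
what is installed in the driver's Python environment (plain standard library,
not a Spark API; whether this matches what EMR's list_packages() reports is
an assumption on my part):

from importlib import metadata  # standard library in Python 3.8+

# List the distributions visible to the driver's Python interpreter
for dist in metadata.distributions():
    print(dist.metadata["Name"], dist.version)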

On Wed, Jul 26, 2023 at 11:06 PM second_co...@yahoo.com.INVALID
 wrote:

> I ran the following code
>
> spark.sparkContext.list_packages()
>
> on Spark 3.4.1 and I get the error below:
>
> An error was encountered:
> AttributeError
> Traceback (most recent call last):
>   File "/tmp/spark-3d66c08a-08a3-4d4e-9fdf-45853f65e03d/shell_wrapper.py", line 113, in exec
>     self._exec_then_eval(code)
>   File "/tmp/spark-3d66c08a-08a3-4d4e-9fdf-45853f65e03d/shell_wrapper.py", line 106, in _exec_then_eval
>     exec(compile(last, '', 'single'), self.globals)
>   File "", line 1, in 
> AttributeError: 'SparkContext' object has no attribute 'list_packages'
>
>
> Are list_packages and install_pypi_package available in vanilla Spark, or
> only in AWS services?
>
>
> Thank you
>