Re: [spark-graphframes]: Generating incorrect edges

2024-04-30 Thread Stephen Coy
Hi Mich,

I was just reading random questions on the user list when I noticed that you 
said:

On 25 Apr 2024, at 2:12 AM, Mich Talebzadeh wrote:

1) You are using monotonically_increasing_id(), which is not
collision-resistant in distributed environments like Spark. Multiple hosts
can generate the same ID. I suggest switching to UUIDs (e.g., uuid.uuid4())
for guaranteed uniqueness.


It’s my understanding that the *Spark* `monotonically_increasing_id()` function 
exists for the exact purpose of generating a collision-resistant unique id 
across nodes on different hosts.
We use it extensively for this purpose and have never encountered an issue.
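
For what it's worth, the Spark docs describe the implementation as putting
the partition id in the upper 31 bits and the per-partition record number
in the lower 33 bits, so ids are unique across executors (though not
consecutive). A quick PySpark sketch of that behaviour, for illustration:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import monotonically_increasing_id

    spark = SparkSession.builder.appName("mono-id-sketch").getOrCreate()

    # Ten rows across four partitions: each partition gets its own disjoint
    # id range (partition id in the upper 31 bits, record number in the
    # lower 33), so rows cannot collide even on different hosts.
    df = (
        spark.range(0, 10, numPartitions=4)
        .withColumn("mono_id", monotonically_increasing_id())
    )
    df.show()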

Are we wrong or are you thinking of a different (not Spark) function?

Cheers,

Steve C






unsubscribe

2024-04-30 Thread Wood Super
unsubscribe


unsubscribe

2024-04-30 Thread junhua . xie
unsubscribe


unsubscribe

2024-04-30 Thread Yoel Benharrous



Re: spark.sql.shuffle.partitions=auto

2024-04-30 Thread Mich Talebzadeh
spark.sql.shuffle.partitions=auto

No, vanilla Apache Spark does not accept this value. The auto setting is
specific to Databricks and its managed Spark offering, where it allows the
platform to automatically determine an optimal number of shuffle
partitions for your workload.
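
On vanilla Spark, the closest equivalent I know of is Adaptive Query
Execution (AQE), which coalesces shuffle partitions at runtime instead of
honouring a fixed integer value. A minimal PySpark sketch:

    from pyspark.sql import SparkSession

    # Enable AQE so Spark coalesces small shuffle partitions at runtime,
    # rather than using a fixed integer spark.sql.shuffle.partitions.
    spark = (
        SparkSession.builder
        .appName("aqe-sketch")
        .config("spark.sql.adaptive.enabled", "true")
        .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
        .getOrCreate()
    )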

HTH

Mich Talebzadeh,
Technologist | Architect | Data Engineer | Generative AI | FinCrime
London
United Kingdom

https://en.everybodywiki.com/Mich_Talebzadeh

*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed. It is essential to note
that, as with any advice: "one test result is worth one-thousand expert
opinions" (Wernher von Braun).


On Tue, 30 Apr 2024 at 11:51, second_co...@yahoo.com.INVALID wrote:

> May I know, is
>
> spark.sql.shuffle.partitions=auto
>
> only available on Databricks? What about on vanilla Spark? When I set
> this, it gives an error saying it needs an int. Is there any open source
> library that automatically finds the best partition and block size for a
> dataframe?


Re: Spark on Kubernetes

2024-04-30 Thread Mich Talebzadeh
Hi,
In k8s the driver is responsible for executor creation. The likely cause of
your problem is insufficient memory allocated for executors in the K8s
cluster. Even with dynamic allocation, k8s won't schedule executor pods if
there is not enough free memory to fulfil their resource requests.

My suggestions (a configuration sketch follows the list):

   - Increase executor memory: allocate more memory per executor (e.g., 2GB
   or 3GB) so that multiple executors fit within the available cluster memory.
   - Adjust driver pod resources: ensure the driver pod has enough memory
   to run Spark and manage executors.
   - Optimize resource management: explore on-demand allocation or adjust
   allocation granularity for better resource utilization. For example, see
   the documentation for spark.dynamicAllocation.minExecutors and
   spark.dynamicAllocation.maxExecutors.
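
As a minimal sketch of these settings (the master URL, image name and
executor counts below are placeholders, not a tested recipe):

    from pyspark.sql import SparkSession

    # Placeholder master URL and container image; substitute your own.
    spark = (
        SparkSession.builder
        .master("k8s://https://<k8s-apiserver>:443")
        .appName("k8s-dynamic-allocation-sketch")
        .config("spark.kubernetes.container.image", "<your-spark-image>")
        .config("spark.executor.memory", "2g")  # more headroom per executor
        .config("spark.driver.memory", "2g")    # the driver manages executor pods
        .config("spark.dynamicAllocation.enabled", "true")
        # There is no external shuffle service on k8s, so dynamic allocation
        # needs shuffle tracking:
        .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
        .config("spark.dynamicAllocation.minExecutors", "2")
        .config("spark.dynamicAllocation.maxExecutors", "6")
        .getOrCreate()
    )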

HTH

Mich Talebzadeh,
Technologist | Architect | Data Engineer | Generative AI | FinCrime
London
United Kingdom

https://en.everybodywiki.com/Mich_Talebzadeh

*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed. It is essential to note
that, as with any advice: "one test result is worth one-thousand expert
opinions" (Wernher von Braun).


On Tue, 30 Apr 2024 at 04:29, Tarun raghav wrote:

> Respected Sir/Madam,
> I am Tarunraghav. I have a query regarding Spark on Kubernetes.
>
> We have an EKS cluster, within which we have Spark installed in the pods.
> We set the executor memory to 1GB and the executor instances to 2, and I
> have also set dynamic allocation to true. So when I try to read a 3 GB CSV
> or parquet file, it is supposed to increase the number of pods by 2.
> But the number of executor pods is zero.
> I don't know why executor pods aren't being created, even though I set
> executor instances to 2. Please suggest a solution for this.
>
> Thanks & Regards,
> Tarunraghav


spark.sql.shuffle.partitions=auto

2024-04-30 Thread second_co...@yahoo.com.INVALID
May I know, is spark.sql.shuffle.partitions=auto only available on Databricks?
What about on vanilla Spark? When I set this, it gives an error saying it
needs an int. Is there any open source library that automatically finds the
best partition and block size for a dataframe?