Re: caching a dataframe in Spark takes a lot of time

2024-05-08 Thread Prem Sahoo
Very helpful!

On Wed, May 8, 2024 at 9:07 AM Mich Talebzadeh wrote:

> *Potential reasons*
>
>
>    - Data Serialization: Spark needs to serialize the DataFrame into an
>    in-memory format suitable for storage. This process can be time-consuming,
>    especially for large datasets like 3.2 GB with complex schemas.
>    - Shuffle Operations: If your transformations involve shuffles, Spark
>    might need to move data across the cluster before it can be stored
>    efficiently. Shuffling can be slow, especially on large datasets or on
>    clusters with limited network bandwidth or few nodes. Check the Stages
>    and Executors tabs in the Spark UI for shuffle read and write figures.
>    - Memory Allocation: Spark allocates memory for the cached DataFrame.
>    Depending on the cluster configuration and available memory, this
>    allocation can take some time.
>
> HTH
>
> Mich Talebzadeh,
> Technologist | Architect | Data Engineer | Generative AI | FinCrime
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed. It is essential to note
> that, as with any advice, to quote Wernher von Braun, "one test result
> is worth one thousand expert opinions".
>
>
> On Wed, 8 May 2024 at 13:41, Prem Sahoo  wrote:
>
>> Could anyone help me here?
>> Sent from my iPhone
>>
>> > On May 7, 2024, at 4:30 PM, Prem Sahoo  wrote:
>> >
>> > 
>> > Hello Folks,
>> > In Spark I have read a file, done some transformations, and finally
>> written the result to HDFS.
>> >
>> > Now I want to write the same dataframe to MapRFS, but for this Spark
>> will execute the full DAG again (recompute all the previous steps: the
>> read plus the transformations).
>> >
>> > I don't want this recomputation, so I decided to cache() the dataframe
>> so that the 2nd/nth write won't recompute all the steps.
>> >
>> > But here is the catch: the cache() itself takes a long time to persist
>> the data in memory.
>> >
>> > My question: when the dataframe is already in memory, why does saving
>> it to another space in memory take so long (3.2 GB of data, 6 minutes)?
>> >
>> > May I know which operations inside cache() are taking such a long time?
>> >
>> > I would appreciate it if someone could share this information.
>>
>>
>>


Re: caching a dataframe in Spark takes a lot of time

2024-05-08 Thread Mich Talebzadeh
*Potential reasons*


   - Data Serialization: Spark needs to serialize the DataFrame into an
   in-memory format suitable for storage. This process can be time-consuming,
   especially for large datasets like 3.2 GB with complex schemas. Choosing
   the storage level explicitly can reduce this cost; see the sketch after
   this list.
   - Shuffle Operations: If your transformations involve shuffles, Spark
   might need to move data across the cluster before it can be stored
   efficiently. Shuffling can be slow, especially on large datasets or on
   clusters with limited network bandwidth or few nodes. Check the Stages
   and Executors tabs in the Spark UI for shuffle read and write figures.
   - Memory Allocation: Spark allocates memory for the cached DataFrame.
   Depending on the cluster configuration and available memory, this
   allocation can take some time.
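
If serialization is the dominant cost, it can help to choose the storage
level explicitly rather than relying on cache(), which for DataFrames
defaults to StorageLevel.MEMORY_AND_DISK. A minimal Scala sketch, assuming
a SparkSession named spark and a purely illustrative input path:

import org.apache.spark.storage.StorageLevel

// Hypothetical input; substitute your own source and transformations.
val df = spark.read.parquet("hdfs:///data/input")

// A serialized level stores compact bytes: a bit more CPU on access,
// but a smaller memory footprint for the cached blocks.
val cached = df.persist(StorageLevel.MEMORY_AND_DISK_SER)

// persist() is lazy; the expensive work happens on the first action,
// so materialise the cache once, up front.
cached.count()

// Confirm what was actually stored (also visible in the UI Storage tab).
println(cached.storageLevel)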

HTH

Mich Talebzadeh,
Technologist | Architect | Data Engineer | Generative AI | FinCrime
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed. It is essential to note
that, as with any advice, to quote Wernher von Braun, "one test result
is worth one thousand expert opinions".


On Wed, 8 May 2024 at 13:41, Prem Sahoo  wrote:

> Could anyone help me here?
> Sent from my iPhone
>
> > On May 7, 2024, at 4:30 PM, Prem Sahoo  wrote:
> >
> > 
> > Hello Folks,
> > In Spark I have read a file, done some transformations, and finally
> written the result to HDFS.
> >
> > Now I want to write the same dataframe to MapRFS, but for this Spark
> will execute the full DAG again (recompute all the previous steps: the
> read plus the transformations).
> >
> > I don't want this recomputation, so I decided to cache() the dataframe
> so that the 2nd/nth write won't recompute all the steps.
> >
> > But here is the catch: the cache() itself takes a long time to persist
> the data in memory.
> >
> > My question: when the dataframe is already in memory, why does saving
> it to another space in memory take so long (3.2 GB of data, 6 minutes)?
> >
> > May I know which operations inside cache() are taking such a long time?
> >
> > I would appreciate it if someone could share this information.
>
>
>


Re: caching a dataframe in Spark takes a lot of time

2024-05-08 Thread Prem Sahoo
Could anyone help me here?
Sent from my iPhone

> On May 7, 2024, at 4:30 PM, Prem Sahoo  wrote:
> 
> 
> Hello Folks,
> In Spark I have read a file, done some transformations, and finally written
> the result to HDFS.
>
> Now I want to write the same dataframe to MapRFS, but for this Spark will
> execute the full DAG again (recompute all the previous steps: the read plus
> the transformations).
>
> I don't want this recomputation, so I decided to cache() the dataframe so
> that the 2nd/nth write won't recompute all the steps.
>
> But here is the catch: the cache() itself takes a long time to persist the
> data in memory.
>
> My question: when the dataframe is already in memory, why does saving it to
> another space in memory take so long (3.2 GB of data, 6 minutes)?
>
> May I know which operations inside cache() are taking such a long time?
>
> I would appreciate it if someone could share this information.




caching a dataframe in Spark takes a lot of time

2024-05-07 Thread Prem Sahoo
Hello Folks,
In Spark I have read a file, done some transformations, and finally
written the result to HDFS.

Now I want to write the same dataframe to MapRFS, but for this Spark
will execute the full DAG again (recompute all the previous steps: the
read plus the transformations).

I don't want this recomputation, so I decided to cache() the dataframe
so that the 2nd/nth write won't recompute all the steps.
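
Roughly, the pattern looks like this (the paths and the filter are
illustrative stand-ins, not my actual code):

import org.apache.spark.sql.functions.col

// Cache before the first write so later writes can reuse the
// materialised data instead of re-running the whole DAG.
val result = spark.read.parquet("hdfs:///data/input")   // hypothetical path
  .filter(col("amount") > 0)                            // stand-in transformation

result.cache()   // lazy: nothing is stored yet

// 1st write runs the full DAG and populates the cache as a side effect.
result.write.mode("overwrite").parquet("hdfs:///data/out")

// 2nd write is served from the cached blocks instead of recomputing.
result.write.mode("overwrite").parquet("maprfs:///data/out")

result.unpersist()   // release executor memory when finished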

But here is the catch: the cache() itself takes a long time to persist
the data in memory.

My question: when the dataframe is already in memory, why does saving it
to another space in memory take so long (3.2 GB of data, 6 minutes)?

May I know which operations inside cache() are taking such a long time?

I would appreciate it if someone could share this information.