Re: caching a dataframe in Spark takes lot of time

2024-05-08 Thread Prem Sahoo
Very helpful!

On Wed, May 8, 2024 at 9:07 AM Mich Talebzadeh 
wrote:

> *Potential reasons*
>
>
>- Data Serialization: Spark needs to serialize the DataFrame into an
>in-memory format suitable for storage. This process can be time-consuming,
>especially for large datasets like 3.2 GB with complex schemas.
>- Shuffle Operations: if the transformations behind the DataFrame
>involve shuffles, the first action that materializes the cache has to
>execute them. Shuffling can be slow, especially on large datasets or
>with limited network bandwidth between nodes. Check the Stages and
>Executors tabs in the Spark UI for shuffle read/write metrics.
>- Memory Allocation: Spark allocates memory for the cached DataFrame.
>Depending on the cluster configuration and available memory, this
>allocation can take some time.
>
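One detail worth adding to the list above: `cache()` is lazy, so the observed 6 minutes is mostly the cost of executing the whole DAG (read + transformations) plus serialization on the first action, not a memory-to-memory copy. A plain-Python analogy (not the Spark API) of that first-action-pays-everything behavior:

```python
# Analogy only (plain Python, not Spark): cache() marks a result for reuse,
# but the expensive work still runs once, on the first materialization.
class LazyPipeline:
    def __init__(self, compute):
        self._compute = compute      # the full "DAG": read + transforms
        self._cached = None
        self._runs = 0               # how many times the DAG executed

    def materialize(self):
        if self._cached is None:     # first action: pay full compute cost
            self._runs += 1
            self._cached = self._compute()
        return self._cached          # later actions: cheap cache read

pipeline = LazyPipeline(lambda: [x * 2 for x in range(5)])
first = pipeline.materialize()   # slow path: runs the whole pipeline
second = pipeline.materialize()  # fast path: served from cache
print(pipeline._runs)  # -> 1
```

The same shape applies in Spark: the call to `cache()` itself returns immediately, and the cost shows up in whichever action first materializes the data.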
> HTH
>
> Mich Talebzadeh,
> Technologist | Architect | Data Engineer  | Generative AI | FinCrime
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed. It is essential to note
> that, as with any advice: "one test result is worth one thousand
> expert opinions" (Wernher von Braun).
>
>
> On Wed, 8 May 2024 at 13:41, Prem Sahoo  wrote:
>
>> Could anyone help me here?
>> Sent from my iPhone
>>
>> > On May 7, 2024, at 4:30 PM, Prem Sahoo  wrote:
>> >
>> > 
>> > Hello Folks,
>> > In Spark I have read a file, done some transformations, and finally
>> > written the result to HDFS.
>> >
>> > Now I want to write the same DataFrame to MapRFS, but for this Spark
>> > will execute the full DAG again (recompute all the previous steps:
>> > the read plus all the transformations).
>> >
>> > I don't want to recompute, so I decided to cache() the DataFrame so
>> > that the 2nd/nth write won't redo all the steps.
>> >
>> > But here is the catch: cache() itself takes a long time to persist
>> > the data in memory.
>> >
>> > My question: if the DataFrame is already in memory, why does merely
>> > saving it to another space in memory take so long (3.2 GB of data,
>> > 6 minutes)?
>> >
>> > May I know which operations in cache() take so much time?
>> >
>> > I would appreciate it if someone could share the information.
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: [DISCUSS] Spark 4.0.0 release

2024-05-08 Thread Holden Karau
That looks cool, maybe let’s split off a thread on how to improve our
release processes?

Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


On Wed, May 8, 2024 at 9:31 AM Erik Krogen  wrote:

> On that note, GitHub recently released (public preview) a new feature
> called Artifact Attestations, which may be relevant/useful here: Introducing
> Artifact Attestations–now in public beta - The GitHub Blog
> 
>
> On Wed, May 8, 2024 at 9:06 AM Nimrod Ofek  wrote:
>
>> I don't have permissions so I can't do it myself, but I'm happy to help
>> (although I am more familiar with GitLab CI/CD than GitHub Actions).
>> Is there a point of contact who can provide me with the needed context
>> and permissions?
>> I'd also love to see why the costs are high and see how we can reduce
>> them...
>>
>> Thanks,
>> Nimrod
>>
>> On Wed, May 8, 2024 at 8:26 AM Holden Karau 
>> wrote:
>>
>>> I think signing the artifacts produced from a secure CI sounds like a
>>> good idea. I know we’ve been asked to reduce our GitHub action usage but
>>> perhaps someone interested could volunteer to set that up.
>>>
>>> Twitter: https://twitter.com/holdenkarau
>>> Books (Learning Spark, High Performance Spark, etc.):
>>> https://amzn.to/2MaRAG9  
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>
>>>
>>> On Tue, May 7, 2024 at 9:43 PM Nimrod Ofek 
>>> wrote:
>>>
 Hi,
 Thanks for the reply.

 From my experience, a build on a build server would be much more
 predictable and less error-prone than building on someone's laptop, and
 of course much faster for producing builds: snapshots, early preview
 releases, release candidates, or final releases.
 It would let us have a preview version with the current changes (a
 snapshot version), either automatically every day or, if we need to
 save costs (although a build is really not expensive), at the click of
 a button.

 Regarding keys for signing: that's what vaults are for; across the
 industry we use vaults (such as HashiCorp Vault). But if the build is
 automated and the only manual step is signing the release for security
 reasons, that would be reasonable.

 Thanks,
 Nimrod


 On Wed, May 8, 2024 at 00:54, Holden Karau <
 holden.ka...@gmail.com> wrote:

> Indeed. We could conceivably build the release in CI/CD but the final
> verification / signing should be done locally to keep the keys safe (there
> was some concern from earlier release processes).
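For context on that local verification step: consumers typically check a release artifact against its published `.sha512` checksum file in addition to the GPG signature. A minimal sketch of the checksum half in Python (the file names below are demo placeholders created on the spot, not real Spark artifacts):

```python
# Sketch: verify a release artifact against its .sha512 checksum file.
# File names are placeholders created for the demo, not real Spark artifacts.
import hashlib

def sha512_of(path: str) -> str:
    h = hashlib.sha512()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # stream in 1 MiB chunks
            h.update(chunk)
    return h.hexdigest()

# Simulate a downloaded artifact and its published checksum file.
with open("artifact.tgz", "wb") as f:
    f.write(b"demo release payload")
with open("artifact.tgz.sha512", "w") as f:
    f.write(sha512_of("artifact.tgz") + "  artifact.tgz\n")

expected = open("artifact.tgz.sha512").read().split()[0]
assert sha512_of("artifact.tgz") == expected, "checksum mismatch"
print("checksum OK")  # the signature itself is checked separately, e.g. with gpg --verify
```

The signature check (against the project's published KEYS file) is the part that actually requires the release manager's private key to have stayed safe, which is the concern raised above.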
>
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>
>
> On Tue, May 7, 2024 at 10:55 AM Nimrod Ofek 
> wrote:
>
>> Hi,
>>
>> Sorry for the novice question, Wenchen - the release is done manually
>> from a laptop? Not using a CI CD process on a build server?
>>
>> Thanks,
>> Nimrod
>>
>> On Tue, May 7, 2024 at 8:50 PM Wenchen Fan 
>> wrote:
>>
>>> UPDATE:
>>>
>>> Unfortunately, it took me quite some time to set up my laptop and
>>> get it ready for the release process (Docker Desktop doesn't work
>>> anymore, my PGP key is lost, etc.). I'll start the RC process
>>> tomorrow my time. Thanks for your patience!
>>>
>>> Wenchen
>>>
>>> On Fri, May 3, 2024 at 7:47 AM yangjie01 
>>> wrote:
>>>
 +1



 *From:* Jungtaek Lim 
 *Date:* Thursday, May 2, 2024, 10:21
 *To:* Holden Karau 
 *Cc:* Chao Sun , Xiao Li <
 gatorsm...@gmail.com>, Tathagata Das ,
 Wenchen Fan , Cheng Pan ,
 Nicholas Chammas , Dongjoon Hyun <
 dongjoon.h...@gmail.com>, Cheng Pan , Spark
 dev list , Anish Shrigondekar <
 anish.shrigonde...@databricks.com>
 *Subject:* Re: [DISCUSS] Spark 4.0.0 release



 +1 love to see it!



 On Thu, May 2, 2024 at 10:08 AM Holden Karau <
 holden.ka...@gmail.com> wrote:

 +1 :) yay previews



 On Wed, May 1, 2024 at 5:36 PM Chao Sun  wrote:

 +1



 On Wed, May 1, 2024 at 5:23 PM Xiao Li 
 wrote:

 +1 for next Monday.



 We can do more previews when the other features are ready for
 preview.



 Tathagata Das  wrote on Wednesday, May 1, 2024 at 08:46:

 Next week sounds great! Thank you Wenchen!





Re: [DISCUSS] Spark 4.0.0 release

2024-05-08 Thread Nimrod Ofek
I don't have permissions so I can't do it myself, but I'm happy to help
(although I am more familiar with GitLab CI/CD than GitHub Actions).
Is there a point of contact who can provide me with the needed context
and permissions?
I'd also love to see why the costs are high and see how we can reduce
them...

Thanks,
Nimrod

On Wed, May 8, 2024 at 8:26 AM Holden Karau  wrote:

> I think signing the artifacts produced from a secure CI sounds like a good
> idea. I know we’ve been asked to reduce our GitHub action usage but perhaps
> someone interested could volunteer to set that up.
>
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>
>
> On Tue, May 7, 2024 at 9:43 PM Nimrod Ofek  wrote:
>
>> Hi,
>> Thanks for the reply.
>>
>> From my experience, a build on a build server would be much more
>> predictable and less error-prone than building on someone's laptop, and
>> of course much faster for producing builds: snapshots, early preview
>> releases, release candidates, or final releases.
>> It would let us have a preview version with the current changes (a
>> snapshot version), either automatically every day or, if we need to save
>> costs (although a build is really not expensive), at the click of a button.
>>
>> Regarding keys for signing: that's what vaults are for; across the
>> industry we use vaults (such as HashiCorp Vault). But if the build is
>> automated and the only manual step is signing the release for security
>> reasons, that would be reasonable.
>>
>> Thanks,
>> Nimrod
>>
>>
>> On Wed, May 8, 2024 at 00:54, Holden Karau <
>> holden.ka...@gmail.com> wrote:
>>
>>> Indeed. We could conceivably build the release in CI/CD but the final
>>> verification / signing should be done locally to keep the keys safe (there
>>> was some concern from earlier release processes).
>>>
>>> Twitter: https://twitter.com/holdenkarau
>>> Books (Learning Spark, High Performance Spark, etc.):
>>> https://amzn.to/2MaRAG9  
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>
>>>
>>> On Tue, May 7, 2024 at 10:55 AM Nimrod Ofek 
>>> wrote:
>>>
 Hi,

 Sorry for the novice question, Wenchen - the release is done manually
 from a laptop? Not using a CI CD process on a build server?

 Thanks,
 Nimrod

 On Tue, May 7, 2024 at 8:50 PM Wenchen Fan  wrote:

> UPDATE:
>
> Unfortunately, it took me quite some time to set up my laptop and get
> it ready for the release process (Docker Desktop doesn't work anymore, my
> PGP key is lost, etc.). I'll start the RC process tomorrow my time. Thanks
> for your patience!
>
> Wenchen
>
> On Fri, May 3, 2024 at 7:47 AM yangjie01  wrote:
>
>> +1
>>
>>
>>
>> *From:* Jungtaek Lim 
>> *Date:* Thursday, May 2, 2024, 10:21
>> *To:* Holden Karau 
>> *Cc:* Chao Sun , Xiao Li ,
>> Tathagata Das , Wenchen Fan <
>> cloud0...@gmail.com>, Cheng Pan , Nicholas
>> Chammas , Dongjoon Hyun <
>> dongjoon.h...@gmail.com>, Cheng Pan , Spark dev
>> list , Anish Shrigondekar <
>> anish.shrigonde...@databricks.com>
>> *Subject:* Re: [DISCUSS] Spark 4.0.0 release
>>
>>
>>
>> +1 love to see it!
>>
>>
>>
>> On Thu, May 2, 2024 at 10:08 AM Holden Karau 
>> wrote:
>>
>> +1 :) yay previews
>>
>>
>>
>> On Wed, May 1, 2024 at 5:36 PM Chao Sun  wrote:
>>
>> +1
>>
>>
>>
>> On Wed, May 1, 2024 at 5:23 PM Xiao Li  wrote:
>>
>> +1 for next Monday.
>>
>>
>>
>> We can do more previews when the other features are ready for preview.
>>
>>
>>
>> Tathagata Das  wrote on Wednesday, May 1, 2024 at 08:46:
>>
>> Next week sounds great! Thank you Wenchen!
>>
>>
>>
>> On Wed, May 1, 2024 at 11:16 AM Wenchen Fan 
>> wrote:
>>
>> Yea I think a preview release won't hurt (without a branch cut). We
>> don't need to wait for all the ongoing projects to be ready. How about we
>> do a 4.0 preview release based on the current master branch next Monday?
>>
>>
>>
>> On Wed, May 1, 2024 at 11:06 PM Tathagata Das <
>> tathagata.das1...@gmail.com> wrote:
>>
>> Hey all,
>>
>>
>>
>> Reviving this thread: Spark master has already accumulated a huge
>> number of changes. As a downstream project maintainer, I really want to
>> start testing the new features and other breaking changes, and it's hard
>> to do that without a Preview release. So the sooner we make a Preview
>> release, the faster we can start getting feedback and fixing things for
>> a great Spark 4.0 final release.
>>
>>
>>
>> So I urge the community to produce a Spark 4.0 Preview soon