In addition to that:

For now, some stateful operations in Structured Streaming have no equivalent Python API, e.g. flatMapGroupsWithState. However, the Spark engineers are adding that support in an upcoming version. See more:
https://www.databricks.com/blog/2022/10/18/python-arbitrary-stateful-processing-structured-streaming.html
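For reference, the blog post describes the new applyInPandasWithState API. Below is a minimal sketch of what that looks like, assuming the version described there (targeted for Spark 3.4) and a made-up streaming DataFrame called events with a user column; exact signatures may change before release:

from typing import Iterator, Tuple
import pandas as pd
from pyspark.sql.streaming.state import GroupState, GroupStateTimeout

def count_events(key: Tuple[str],
                 pdfs: Iterator[pd.DataFrame],
                 state: GroupState) -> Iterator[pd.DataFrame]:
    # Read the running count kept for this key, defaulting to zero.
    (count,) = state.get if state.exists else (0,)
    for pdf in pdfs:
        count += len(pdf)
    state.update((count,))  # persist the count for the next micro-batch
    yield pd.DataFrame({"user": [key[0]], "count": [count]})

# 'events' is a hypothetical streaming DataFrame with a 'user' column.
counts = events.groupBy("user").applyInPandasWithState(
    count_events,
    outputStructType="user STRING, count LONG",
    stateStructType="count LONG",
    outputMode="update",
    timeoutConf=GroupStateTimeout.NoTimeout,
)

The function receives the grouping key, an iterator of pandas DataFrames for that key, and a state handle, which is roughly the shape flatMapGroupsWithState gives you in Scala.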



Best Regards!
...........................................................................
Lingzhe Sun 
Hirain Technology / APIC
 
From: Mich Talebzadeh
Date: 2022-11-03 19:15
To: Joris Billen
CC: User
Subject: Re: should one ever make a spark streaming job in pyspark
Well, your mileage varies, so to speak.

Spark itself is written in Scala. However, that does not imply you should stick with Scala.
I have used both for Spark Streaming and Spark Structured Streaming; they both work fine.
PySpark has become popular with the widespread use of Data Science projects.
What matters normally is the skill set you already have in-house. The likelihood is that there are more Python developers than Scala developers, and the learning curve for Scala has to be taken into account.
The question of performance etc. is tangential.
With regard to the Spark code itself, there should be little effort in converting from Scala to PySpark or vice versa.
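To illustrate that point, here is a minimal Structured Streaming job sketched in PySpark, with a made-up Kafka broker address and topic name; it reads almost line for line like its Scala equivalent:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window

spark = SparkSession.builder.appName("streaming-demo").getOrCreate()

# Read a stream from Kafka; broker and topic names are placeholders.
events = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "events")
          .load())

# Count records per one-minute event-time window.
counts = (events.select(col("timestamp"))
          .groupBy(window(col("timestamp"), "1 minute"))
          .count())

# Write the running counts to the console.
query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()

The Scala version differs mainly in val declarations and lambda syntax; the DataFrame calls themselves are identical, which is why porting between the two is cheap as long as you stay on the DataFrame API.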
HTH

view my LinkedIn profile

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction. 
 


On Wed, 2 Nov 2022 at 08:54, Joris Billen <joris.bil...@bigindustries.be> wrote:
Dear community, 
I had a general question about the use of Scala vs. PySpark for Spark streaming.
I believe Spark streaming will work most efficiently when written in Scala; I believe, however, that things can also be implemented in PySpark. My questions:
1) Is it completely dumb to make a streaming job in PySpark?
2) What are the technical reasons that it is done best in Scala (is it easy to understand why)?
3) Any good links anyone has seen with numbers on the difference in performance, and under what circumstances + explanation?
4) Are there certain scenarios where the use of PySpark can be motivated (maybe when someone doesn't feel comfortable writing a job in Scala and the number of messages/minute isn't gigantic, so performance isn't that crucial)?

Thanks for any input!
---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
