Re: Scala vs Python for ETL with Spark

2020-10-10 Thread ayan guha
I have one observation: is "Python UDFs are slow due to the deserialization
penalty" still relevant? Arrow is now used for in-memory data management,
and there has been heavy investment from the Spark dev community in making
pandas a first-class citizen, including UDFs.

As I work with multiple clients, my experience is that org culture and the
people available are the most important drivers for this choice, regardless
of the use case. The use case is relevant only when there is a feature
disparity.
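For what it's worth, the difference Arrow made can be sketched outside Spark. The snippet below is plain Python with a hypothetical `plus_one` transform, not the PySpark API: row-at-a-time execution (the old Python UDF path) pays a Python call and a value conversion per row, while a vectorized path (the spirit of pandas UDFs over Arrow batches) runs once per batch.

```python
# Two execution models for the same UDF, simulated outside Spark.

def plus_one(x):
    # hypothetical scalar UDF: called once per row (the pre-Arrow path,
    # which paid Python-call and conversion overhead for every row)
    return x + 1

def plus_one_batch(batch):
    # hypothetical vectorized UDF: with Arrow, rows arrive as columnar
    # batches and the Python function runs once per batch
    return [x + 1 for x in batch]

rows = list(range(10))

# row-at-a-time: one Python call (and one value conversion) per row
out_rows = [plus_one(x) for x in rows]

# batched: one Python call per batch of rows
batch_size = 4
out_batches = []
for i in range(0, len(rows), batch_size):
    out_batches.extend(plus_one_batch(rows[i:i + batch_size]))

# same answer either way; the batched path just amortizes the overhead
assert out_rows == out_batches == list(range(1, 11))
```

Whether the remaining per-batch overhead still matters is exactly the empirical question being asked here.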

On Sun, 11 Oct 2020 at 7:39 am, Gourav Sengupta 
wrote:

Best Regards,
Ayan Guha


Re: Scala vs Python for ETL with Spark

2020-10-10 Thread Gourav Sengupta
Not quite sure how meaningful this discussion is, but in case someone is
really faced with this query, the question still is: what is the use case?
I am just a bit confused by the one-size-fits-all deterministic approach
here; I thought those days were over almost 10 years ago.
Regards
Gourav

On Sat, 10 Oct 2020, 21:24 Stephen Boesch,  wrote:



Re: Scala vs Python for ETL with Spark

2020-10-10 Thread Stephen Boesch
I agree with Wim's assessment of data engineering/ETL vs data science. I
wrote pipelines/frameworks for large companies and Scala was a much better
choice. But for ad-hoc work interfacing directly with data science
experiments, PySpark presents less friction.

On Sat, 10 Oct 2020 at 13:03, Mich Talebzadeh 
wrote:



Re: Scala vs Python for ETL with Spark

2020-10-10 Thread Mich Talebzadeh
Many thanks everyone for your valuable contributions.

We all started with Spark a few years ago, when Scala was the talk of the
town. I agree with the note that as long as Spark stayed niche and elite,
someone with Scala knowledge attracted a premium. In fairness, in 2014-2015
there was not much talk of data science input (I may be wrong). But the
world has moved on, so to speak. Python itself has been around a long time
(long being relative here). Most people knew UNIX shell, C, Python or Perl,
or a combination of these. I recall a director a few years ago who asked
our Hadoop admin for the root password to log in to the edge node. He later
became head of machine learning somewhere else, and he loved C and Python.
So Python was a blessing in disguise. I think Python appeals to those who
are very familiar with the CLI and shell programming (not GUI fans). As
some members alluded to, there are more people around with Python
knowledge. Many managers choose Python as the unifying development tool
because they feel comfortable with it; frankly, I have not seen a manager
who feels at home with Scala. So, in summary, it is a bit disappointing to
abandon Scala and switch to Python just for the sake of it.

Disclaimer: These are opinions and not facts so to speak :)

Cheers,


Mich






On Fri, 9 Oct 2020 at 21:56, Mich Talebzadeh 
wrote:

> I have come across occasions when teams use Python with Spark for ETL,
> for example processing data from S3 buckets into Snowflake with Spark.
>
> The only reason I think they are choosing Python as opposed to Scala is
> that they are more familiar with Python. The fact that Spark itself is
> written in Scala is one indication of why I think Scala has an edge.
>
> I have not done a one-to-one comparison of Spark with Scala vs Spark with
> Python. I understand that for data science purposes most libraries, like
> TensorFlow, are written in Python, but I am at a loss to understand the
> validity of using Python with Spark for ETL purposes.
>
> This is my understanding, not fact, so I would like to get some informed
> views on it if I can.
>
> Many thanks,
>
> Mich
>
>
>
>
> LinkedIn:
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>


Re: Scala vs Python for ETL with Spark

2020-10-10 Thread Jacek Pliszka
I would not leave it to data scientists unless they will maintain it.

The key driver in the cases I've seen was usually people cost/availability,
with ETL operations cost taken into account.

Often the ETL cloud cost is small and you will not save much, so it comes
down to skills cost and availability. For Python skills you pay less, you
can pick people with other useful skills, and you can more easily train the
people you already have internally.

Often you have some simple ETL scripts before moving to Spark, and these
scripts are usually written in Python.
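Such pre-Spark scripts are often just a few lines. A minimal, self-contained sketch using only the stdlib `csv` module (file names, columns and the filter rule are hypothetical):

```python
import csv

def etl(src_path, dst_path):
    """Toy extract-transform-load step: read a CSV, keep rows with a
    positive amount, add a derived column, and write the result."""
    with open(src_path, newline="") as src, \
         open(dst_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=["id", "amount", "amount_x2"])
        writer.writeheader()
        for row in reader:
            amount = float(row["amount"])
            if amount > 0:  # transform: drop non-positive amounts
                writer.writerow({"id": row["id"],
                                 "amount": amount,
                                 "amount_x2": amount * 2})

# Example usage (hypothetical files):
# etl("raw_orders.csv", "clean_orders.csv")
```

When a script like this outgrows one machine, the same read-filter-derive-write shape maps naturally onto PySpark DataFrame operations, which is part of why these teams stay in Python.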

Best Regards,

Jacek


On Sat, 10 Oct 2020 at 12:32, Jörn Franke wrote:

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Meaning of terms spark as computing engine and spark cluster

2020-10-10 Thread Santosh74
Is Spark only a compute engine, or is it also a cluster that comes with a
set of hardware/nodes? What exactly is a Spark cluster?

Dear experts, please help.



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org





Re: Scala vs Python for ETL with Spark

2020-10-10 Thread Jörn Franke
It really depends on what languages your data scientists speak. I don't
think it makes sense to impose a language on them for ad hoc data science
work; let them choose. For more complex AI engineering work, though, you
can apply different standards and criteria. And then it really depends on
architecture aspects etc.

> On 09.10.2020 at 22:57, Mich Talebzadeh wrote:


Re: [SparkR] gapply with strings with arrow

2020-10-10 Thread Hyukjin Kwon
If it works without Arrow optimization, it's likely a bug. Please feel free
to file a JIRA for that.

On Wed, 7 Oct 2020, 22:44 Jacek Pliszka,  wrote:

> Hi!
>
> Is there any place I can find information on how to use gapply with Arrow?
>
> I've tried something very simple
>
> collect(gapply(
>   df,
>   c("ColumnA"),
>   function(key, x) {
>     # one output row per key; note stringsAsFactors (the misspelling
>     # stringAsFactors would silently add an extra column instead)
>     data.frame(out = c("dfs"), stringsAsFactors = FALSE)
>   },
>   "out string"
> ))
>
> But it fails, while similar code with integers or doubles works fine.
>
> [Fetched stdout timeout] Error in readBin(con, raw(),
> as.integer(dataLen), endian = "big") : invalid 'n' argument
>
> java.lang.UnsupportedOperationException at
>
> org.apache.spark.sql.vectorized.ArrowColumnVector$ArrowVectorAccessor.getUTF8String(ArrowColumnVector.java:233)
> at
> org.apache.spark.sql.vectorized.ArrowColumnVector.getUTF8String(ArrowColumnVector.java:109)
> at
> org.apache.spark.sql.vectorized.ColumnarBatchRow.getUTF8String(ColumnarBatch.java:220)
> at
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
> Source)
>  ...
>
> When I looked at the source code there - it is all stubs.
>
> Is there a proper way to use Arrow in gapply in SparkR?
>
> BR,
>
> Jacek
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Scala vs Python for ETL with Spark

2020-10-10 Thread Wim Van Leuven
Hey Mich,

This is a very fair question. I've seen many data engineering teams start
out with Scala because technically it is the best choice for many good
reasons, and basically it is what Spark itself is written in.

On the other hand, almost all the use cases we see these days are data
science use cases where people mostly do Python. So, if you need those two
worlds to collaborate and even hand over code, you don't want the
ideological battle of Scala vs Python. We chose Python for the sake of
everybody speaking the same language.

But that only holds if you stick to Spark DataFrames, because then PySpark
is a thin layer around everything on the JVM. Even the argument against
Python UDFs doesn't hold up: if it works as a Python function (and most of
the time it does), why do Scala? If, however, performance characteristics
show you otherwise, implement those UDFs on the JVM.

The problem with Python? Good engineering practices translated into tools
are much rarer... a build tool like Maven for Java or SBT for Scala doesn't
exist... yet? You can look at PyBuilder for this.

So, referring to the website you mention... in practice, because of the
many data science use cases out there, I see many Spark shops prefer Python
over Scala, because Spark gravitates towards DataFrames, where the
downsides of Python do not stack up. The performance of Python as a driver
program, which is just the glue code, becomes irrelevant compared to the
processing you are doing on the JVM. We even notice that Python is much
easier, and we hear echoes that finding (good?) Scala engineers is
hard(er).

So, in conclusion: Python brings data engineers and data scientists
together. If you only do data engineering, Scala can be the better choice.
It depends on the context.

Hope this helps
-wim

On Fri, 9 Oct 2020 at 23:27, Mich Talebzadeh 
wrote:

> Thanks.
>
> So, ignoring Python lambdas, is it a matter of individuals' familiarity
> with the language that is the most important factor? Also, I have noticed
> that the Spark documentation's examples have switched from Scala-first to
> Python-first. However, some code, for example JDBC calls, is the same for
> Scala and Python.
>
> Some examples, like this website,
> claim that Scala's performance is an order of magnitude better than
> Python's, and also that when it comes to concurrency Scala is a better
> choice. Maybe it is pretty old (2018)?
>
> Also (and maybe it is my ignorance, I have not researched it), does Spark
> offer a REPL in the form of spark-shell with Python?
>
> Regards,
>
> Mich
>
>
>
>
> On Fri, 9 Oct 2020 at 21:59, Russell Spitzer 
> wrote:
>
>> As long as you don't use Python lambdas in your Spark job, there should
>> be almost no difference between the Scala and Python DataFrame code. Once
>> you introduce Python lambdas you will hit some significant serialization
>> penalties, as well as having to run actual work code in Python. As long
>> as no lambdas are used, everything will operate with Catalyst-compiled
>> Java code, so there won't be a big difference between Python and Scala.
>>
>> On Fri, Oct 9, 2020 at 3:57 PM Mich Talebzadeh 
>> wrote:
>>

Re: Scala vs Python for ETL with Spark

2020-10-10 Thread Gourav Sengupta
What is the use case?
Unless you have unlimited funding and time to waste, you would usually
start with that question.

Regards,
Gourav

On Fri, Oct 9, 2020 at 10:29 PM Russell Spitzer 
wrote:

> Spark in Scala (or Java) is much more performant if you are using RDDs:
> those operations basically force you to pass lambdas, hit serialization
> between Java and Python types, and, yes, hit the Global Interpreter Lock.
> But none of those things apply to DataFrames, which will generate Java
> code regardless of what language you use to describe the DataFrame
> operations, as long as you don't use Python lambdas. A DataFrame operation
> without Python lambdas should not require any remote Python code
> execution.
>
> TL;DR: if you are using DataFrames it doesn't matter whether you use
> Scala, Java, Python, R or SQL; the planning and work will all happen in
> the JVM.
>
> As for a REPL, you can run PySpark, which will start up a REPL. There is
> also a slew of notebooks which provide interactive Python environments as
> well.
>
>
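The boundary cost described above can be sketched outside Spark. The snippet below uses `pickle` purely as a stand-in for the real JVM-to-Python wire format (PySpark uses its own serializers, and Arrow for batched columnar exchange), so the byte counts are illustrative only:

```python
import pickle

# Hypothetical dataset crossing the JVM<->Python boundary.
rows = list(range(1000))

# Row-at-a-time: each value is framed, serialized and deserialized on its
# own, as when a Python lambda is applied per record.
per_row_payloads = [pickle.dumps(x) for x in rows]
per_row_total = sum(len(p) for p in per_row_payloads)
round_tripped = [pickle.loads(p) for p in per_row_payloads]

# Batched: the whole batch crosses the boundary as one payload, so the
# per-message framing overhead is paid once instead of once per row.
batch_payload = pickle.dumps(rows)
batch_total = len(batch_payload)

assert round_tripped == rows        # the round trip is lossless
assert batch_total < per_row_total  # batching shrinks the total bytes moved
```

A pure-DataFrame job avoids this boundary entirely, which is the heart of Russell's argument.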