Awesome, good points everyone. The ranking of the issues is super useful,
and I'd also completely forgotten about the lack of built-in UDAF support,
which is rather important. There is a PR to make it easier to call/register
JVM UDFs from Python, which will hopefully help a bit there too. I'm getting
on a flight to London for OSCON, but I want to continue to encourage users to
chime in with their experiences (to that end I'm trying to re-include user@,
since it doesn't seem to have been posted there despite my initial attempt
to do so).
On Thursday, October 13, 2016, assaf.mendelson <assaf.mendel...@rsa.com> wrote:
> We are actually using pyspark heavily.
> I agree with all of your points; for me, the following are the main issues:
> 1. Pyspark does not have support for UDAF. We have had multiple
> needs for UDAF and needed to go to java/scala to support these. Having
> python UDAF would have made life much easier (especially at earlier stages
> when we prototype).
> 2. Performance. I cannot stress this enough. Currently we have
> engineers who take python UDFs and convert them to scala UDFs for
> performance. We are currently even looking at writing UDFs and UDAFs in a
> more native way (e.g. using expressions) to improve performance but working
> with pyspark can be really problematic.
> BTW, other than using jython or arrow, I believe there are a couple of
> other ways to improve performance:
> 1. Python provides a tool to generate an AST for python code (
> https://docs.python.org/2/library/ast.html). This means we can use the
> AST to construct scala code very similar to how expressions are built for
> native spark functions in scala. Of course doing a full conversion is very
> hard but at least handling simple cases should be simple.
> 2. The above would of course be limited if we use python packages,
> but over time it is possible to add some “translation” tools (i.e. take
> python packages and find the appropriate scala equivalent). We can even
> let users supply their own conversions, so that what they write looks like
> regular python code but is converted to scala code behind the scenes.
> 3. In scala, it is possible to use codegen to actually generate
> code from a string. There is no reason why we can’t write the expression in
> python and provide a scala string. This would mean learning some scala but
> would mean we do not have to create a separate code tree.
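
As a rough illustration of the AST idea in point 1 above, here is a minimal
sketch that walks a simple arithmetic Python expression and emits Scala Column
source as a string. The function name to_scala_expr, the operator table, and
the choice of col(...)/lit(...) as the target are my own assumptions for
illustration, not an existing PySpark API, and it uses Python 3's ast module
rather than the Python 2 one linked above. Anything beyond names, numbers and
+ - * / is rejected rather than guessed at.

```python
import ast

# Python operator node -> Scala Column operator (assumed mapping for this sketch).
_BIN_OPS = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*", ast.Div: "/"}


def _emit(node):
    """Recursively turn one AST node into a Scala source fragment."""
    if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
        return "({} {} {})".format(
            _emit(node.left), _BIN_OPS[type(node.op)], _emit(node.right)
        )
    if isinstance(node, ast.Name):
        # A bare name is treated as a column reference.
        return 'col("{}")'.format(node.id)
    if (
        isinstance(node, ast.Constant)
        and isinstance(node.value, (int, float))
        and not isinstance(node.value, bool)
    ):
        return "lit({})".format(node.value)
    # Calls, attribute access, imports etc. are out of scope here.
    raise ValueError("unsupported construct: " + type(node).__name__)


def to_scala_expr(source):
    """Translate a simple arithmetic Python expression into Scala Column source."""
    tree = ast.parse(source, mode="eval")
    return _emit(tree.body)
```

With this, to_scala_expr("x * 2 + y") returns ((col("x") * lit(2)) + col("y")).
How that string would be compiled or evaluated on the JVM side is left open;
the point is only that the simple cases really are simple.
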
> BTW, the fact that all of the tools to access java are marked as private
> has me a little worried. Nearly all of our UDFs (and all of our UDAFs) are
> written in scala for performance. The wrapping to provide them in python
> uses way too many private elements for my taste.
> *From:* msukmanowsky [via Apache Spark Developers List] [mailto:ml-node+
> *Sent:* Thursday, October 13, 2016 3:51 AM
> *To:* Mendelson, Assaf
> *Subject:* Re: Python Spark Improvements (forked from Spark Improvement
> As very heavy Spark users at Parse.ly, I just wanted to give a +1 to all
> of the issues raised by Holden and Ricardo. I'm also giving a talk at PyCon
> Canada on PySpark https://2016.pycon.ca/en/schedule/096-mike-sukmanowsky/.
> Being a Python shop, we were extremely pleased to learn about PySpark a
> few years ago as our main ETL pipeline used Apache Pig at the time. I was
> one of the only folks who understood Pig and Java so collaborating on this
> as a team was difficult.
> Spark provided a means for the entire team to collaborate, but we've hit
> our fair share of issues all of which are enumerated in this thread.
> Besides giving a +1 here, I think if I were to force rank these items for
> us, it'd be:
> 1. Configuration difficulties: we've lost literally weeks to
> troubleshooting memory issues for larger jobs. It took a long time to even
> understand *why* certain jobs were failing since Spark would just report
> executors being lost. Finally we tracked things down to understanding that
> spark.yarn.executor.memoryOverhead controls the portion of memory
> reserved for Python processes, but none of this is documented anywhere as
> far as I can tell. We discovered this via trial and error. Both
> documentation and better defaults for this setting when running a PySpark
> application are probably sufficient. We've also had a number of troubles
> with saving Parquet output as part of an ETL flow, but perhaps we'll save
> that for a blog post of its own.
> 2. Dependency management: I've tried to help move the conversation on
> https://issues.apache.org/jira/browse/SPARK-13587 but it seems we're a
> bit stalled. Installing the required dependencies for a PySpark application
> is a really messy ordeal right now.
> 3. Development workflow: I'd combine both "incomprehensible error
> messages" and "difficulty using PySpark from outside of spark-submit /
> pyspark shell" here. When teaching PySpark to new users, I'm reminded of how
> much inside knowledge is needed to overcome esoteric errors. One example is
> hitting "PicklingError: Could not pickle object as excessively deep recursion
> required." errors. New users often do something innocent like try to pickle
> a global logging object and hit this and begin the Google -> stackoverflow
> search to try to comprehend what's going on. You can lose days to errors
> like these and they completely kill the productivity flow and send you
> hunting for alternatives.
> 4. Speed/performance: we are trying to use DataFrames/Datasets where we can
> and do as much in Java as possible but when we do move to Python, we're
> well aware that we're about to take a hit on performance. We're very keen
> to see what Apache Arrow does for things here.
> 5. API difficulties: I agree that when coming from Python, you'd expect
> that you can do the same kinds of operations on DataFrames in Spark that
> you can with Pandas, but I personally haven't been too bothered by this.
> Maybe I'm more used to this situation from using other frameworks that have
> similar concepts but incompatible implementations.
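
To make the arithmetic behind item 1 (configuration difficulties) concrete,
here is a small sketch of how the YARN container request grows with
spark.yarn.executor.memoryOverhead. It assumes the default documented for
Spark releases of this era (10% of executor memory, with a floor of 384 MB)
and that the value is given in megabytes. The helper names are hypothetical,
so please check the numbers against the docs for your Spark version.

```python
# Assumed defaults for spark.yarn.executor.memoryOverhead (values in MB).
MIN_OVERHEAD_MB = 384
OVERHEAD_FRACTION = 0.10


def default_overhead_mb(executor_memory_mb):
    """Overhead used when the setting is left unset (assumed default rule)."""
    return max(MIN_OVERHEAD_MB, int(executor_memory_mb * OVERHEAD_FRACTION))


def container_request_mb(executor_memory_mb, overhead_mb=None):
    """Total memory requested per executor: JVM heap plus overhead."""
    if overhead_mb is None:
        overhead_mb = default_overhead_mb(executor_memory_mb)
    return executor_memory_mb + overhead_mb


def submit_flags(executor_memory_mb, overhead_mb):
    """spark-submit arguments that set the heap and an explicit overhead."""
    return [
        "--executor-memory", "{}m".format(executor_memory_mb),
        "--conf", "spark.yarn.executor.memoryOverhead={}".format(overhead_mb),
    ]
```

For an 8192 MB executor the assumed default leaves only 819 MB outside the
heap, which the Python workers have to share. Raising the overhead to 2048 MB
with submit_flags(8192, 2048) makes the container request 10240 MB instead of
9011 MB.
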
> We're big fans of PySpark and are happy to provide feedback and contribute
> wherever we can.
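
On the pickling errors in item 3: the usual pattern is that task state drags
along a live, process-local object. Below is a minimal sketch using only the
standard pickle module, with no Spark involved. A threading.Lock stands in for
the global logging object mentioned above, since recent Python versions can
pickle a bare logger by name; the dictionaries and the process function are
hypothetical. The workaround shown is to ship only the logger's name and look
the logger up again on the receiving side.

```python
import logging
import pickle
import threading


def is_picklable(obj):
    """Return True if the standard pickle module can serialize obj."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, TypeError, AttributeError):
        return False
    return True


# Problem: the state carries a live lock, which cannot be serialized.
bad_state = {"logger_name": "etl", "lock": threading.Lock()}

# Workaround: carry only plain data and rebuild resources where they are used.
good_state = {"logger_name": "etl"}


def process(record, state):
    """Normalize one record, fetching the logger by name inside the worker."""
    log = logging.getLogger(state["logger_name"])
    log.debug("processing %r", record)
    return record.strip().lower()
```

Here is_picklable(bad_state) is False and is_picklable(good_state) is True, and
process still works on a copy of good_state that has been through
pickle.dumps/pickle.loads.
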