We're very heavy Spark users, so I just wanted to give a +1 to all of
the issues raised by Holden and Ricardo. I'm also giving a talk at PyCon
Canada on PySpark.

Being a Python shop, we were extremely pleased to learn about PySpark a few
years ago as our main ETL pipeline used Apache Pig at the time. I was one of
the only folks who understood Pig and Java so collaborating on this as a
team was difficult.

Spark provided a means for the entire team to collaborate, but we've hit our
fair share of issues, all of which are enumerated in this thread.

Besides giving a +1 here, I think if I were to force rank these items for
us, it'd be:

1. Configuration difficulties: we've lost literally weeks to troubleshooting
memory issues for larger jobs. It took a long time to even understand *why*
certain jobs were failing since Spark would just report executors being
lost. We finally tracked it down to
spark.yarn.executor.memoryOverhead, which controls the portion of memory
reserved for Python processes, but none of this is documented anywhere as far
as I can tell. We discovered this via trial and error. Documentation and
better defaults for this setting when running a PySpark application would
probably be sufficient. We've also had a number of troubles with saving Parquet
output as part of an ETL flow, but perhaps we'll save that for a blog post
of its own.

2. Dependency management: I've tried to help move the conversation on but it seems we're a bit
stalled. Installing the required dependencies for a PySpark application is a
really messy ordeal right now.

3. Development workflow: I'd combine both "incomprehensible error messages"
and "difficulty using PySpark from outside of spark-submit / pyspark shell"
here.
When teaching PySpark to new users, I'm reminded of how much inside
knowledge is needed to overcome esoteric errors. One example is hitting
"PicklingError: Could not pickle object as excessively deep recursion
required." errors. New users often do something innocent like try to pickle
a global logging object, hit this, and begin the Google -> Stack Overflow
search to try to comprehend what's going on. You can lose days to errors
like these and they completely kill the productivity flow and send you
hunting for alternatives.

4. Speed/performance: we are trying to use DataFrames/Datasets where we can
and do as much in Java as possible, but when we do move to Python, we're well
aware that we're about to take a hit on performance. We're very keen to see
what Apache Arrow does for things here.

5. API difficulties: I agree that when coming from Python, you'd expect that
you can do the same kinds of operations on DataFrames in Spark that you can
with Pandas, but I personally haven't been too bothered by this. Maybe I'm
more used to this situation from using other frameworks that have similar
concepts but incompatible implementations.
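
To make item 1 a bit more concrete, here is a rough sketch of how one might
derive a value for spark.yarn.executor.memoryOverhead and pass it to
spark-submit. The helper names are made up for illustration, and the 0.10
fraction and 384 MiB floor are assumptions that should be checked against the
documentation for your Spark version; PySpark jobs may well need a larger
fraction than JVM-only jobs.

```python
# Sketch only: overhead_mib and submit_args are hypothetical helpers, and the
# default fraction and floor are assumptions to verify for your Spark version.

def overhead_mib(executor_memory_mib, fraction=0.10, floor_mib=384):
    """Return a candidate value (in MiB) for spark.yarn.executor.memoryOverhead."""
    return max(floor_mib, int(executor_memory_mib * fraction))


def submit_args(executor_memory_mib, fraction=0.10):
    """Build spark-submit --conf flags for executor memory and its overhead."""
    overhead = overhead_mib(executor_memory_mib, fraction)
    return [
        "--conf", "spark.executor.memory={}m".format(executor_memory_mib),
        "--conf", "spark.yarn.executor.memoryOverhead={}".format(overhead),
    ]


if __name__ == "__main__":
    # An 8 GiB executor with a quarter of that again reserved as overhead.
    print(" ".join(submit_args(8192, fraction=0.25)))
```

The point is only that the overhead is requested on top of the executor
memory, so the Python workers have room outside the JVM heap.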
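
For the pickling error in item 3, one workaround that tends to help is to
look the logger up by name inside the function that runs on the executors,
rather than referencing a module-level logger object that then has to be
serialized with the closure. This is a minimal sketch assuming the error
really does come from a captured logger; the logger name and the parse_line
function are hypothetical.

```python
import logging


def get_logger():
    """Look the logger up by name at call time instead of capturing an object."""
    return logging.getLogger("etl.parse")  # hypothetical logger name


def parse_line(line):
    """Split a comma-separated line into three fields, or return None."""
    log = get_logger()  # resolved where the function runs, not captured globally
    fields = line.split(",")
    if len(fields) != 3:
        log.warning("unexpected field count: %d", len(fields))
        return None
    return tuple(f.strip() for f in fields)


if __name__ == "__main__":
    # With Spark this would be something like: sc.textFile(path).map(parse_line)
    print([parse_line(row) for row in ["a, b, c", "broken"]])
```

Because parse_line only refers to a plain function and a string name, no
logger object ends up in what gets shipped to the workers.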

We're big fans of PySpark and are happy to provide feedback and contribute
wherever we can.

Sent from the Apache Spark Developers List mailing list archive at
