hi everyone,

I've recently started working on some of the low-level data
interoperability problems between Python's NumPy-focused scientific
computing and data libraries (pandas among them) and the rest of the
big data ecosystem, with Spark being a very important part of that.

One of the major efforts here is creating a unified data access layer
for pandas users using Apache Arrow as the structured data exchange
medium (read more here:
http://wesmckinney.com/blog/pandas-and-apache-arrow/). I created
https://issues.apache.org/jira/browse/SPARK-13534 to add an Arrow
"thunderbolt port"  (to make an analogy) to Spark for moving data from
Spark SQL to pandas much more efficiently than the current
serialization scheme. If anyone wants to be a partner in crime on
this, feel free to reach out! I'll be dropping the Arrow
memory<->pandas conversion code in the next couple weeks.
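
To make the intent concrete, here is a minimal sketch of the
conversion path I have in mind. Note that collect_as_arrow() is a
hypothetical method (that is what SPARK-13534 is about), and I'm
assuming a pyarrow Table API along the lines sketched here:

    # Sketch only: collect_as_arrow() does not exist yet; the pyarrow
    # calls assume a Table-style API in the Arrow Python bindings.
    import pyarrow as pa

    def arrow_batches_to_pandas(record_batches):
        """Assemble Arrow record batches into a single pandas
        DataFrame without a per-value Python (un)pickling step."""
        table = pa.Table.from_batches(record_batches)
        return table.to_pandas()

    # Intended usage once the Spark side exists:
    #   batches = spark_df.collect_as_arrow()   # JVM -> Arrow buffers
    #   pdf = arrow_batches_to_pandas(batches)  # Arrow -> pandas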

As I'm looking more at the implementation details and API design of
PySpark, I note that it has been intended to have near 1-1 parity with
the Scala API, enabling developers to jump between APIs without a lot
of cognitive dissonance (you lose type information in Python, but
c'est la vie). Much of PySpark appears to wrap Scala / Java API calls
with py4j, much as many Python libraries wrap C/C++ libraries.
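
For context, the wrapping pattern is roughly the following
(paraphrased from pyspark/sql/dataframe.py, not a verbatim copy):

    class DataFrame(object):
        def __init__(self, jdf, sql_ctx):
            self._jdf = jdf          # py4j handle to the JVM DataFrame
            self.sql_ctx = sql_ctx

        def count(self):
            # delegates straight to the Scala implementation via py4j
            return int(self._jdf.count())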

In the long run, I'm concerned this may become problematic as users'
expectations about the semantics of interacting with the data may not
be compatible with the behavior of the Spark Scala API (particularly
the API design and semantics of Spark SQL and Datasets). As the Spark
user base grows, so, too, will the user needs, particularly in the
more accessible APIs (Python / R). I expect that Scala users tend to
be a more sophisticated audience, with a more software engineering /
computer science tilt.

With a "big picture" goal of bringing about a semantic convergence
between big data and small data in a certain subset of scalable
computations, I am curious what the Spark development community's
attitude is towards efforts to achieve 1-1 PySpark API parity (with a
slight API lag as new features show up in Scala before Python),
particularly in the strictly semantic realm of data interactions (at
the end of the day, code has to move bits around someplace). Here is
an illustrative, albeit somewhat trivial, example of what I'm talking
about:

https://issues.apache.org/jira/browse/SPARK-13943
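
To give a flavor of the kind of divergence I mean (a generic
illustration off the top of my head, not necessarily what the JIRA
above covers): the same indexing expression means different things in
pandas and PySpark.

    import pandas as pd

    pdf = pd.DataFrame({'x': [1, 2, 3]})
    pdf['x'].sum()     # pandas: a Series of data, summed eagerly -> 6

    # PySpark (assuming an existing SQLContext named sqlContext):
    # sdf = sqlContext.createDataFrame(pdf)
    # sdf['x']                          # a lazy Column expression, no data
    # sdf.agg({'x': 'sum'}).collect()   # -> [Row(sum(x)=6)]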

If closer semantic compatibility with existing software in R and
Python is not a high priority, that is a completely reasonable answer.

Another thought is treating PySpark as the place where the "rubber
meets the road" -- the point of contact for any Python developers
building applications with Spark. This would leave it to library
developers to create higher-level user experiences (e.g. emulating
pandas more closely), using PySpark as an implementation tool that
users otherwise do not directly interact with. But this is seemingly
at odds with the efforts to make Spark DataFrames behave in a
pandas/R-like fashion.
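
As a toy sketch of that layering (entirely hypothetical names, just to
show the shape of it), a higher-level library might expose something
like:

    class PandasLikeFrame(object):
        """Hypothetical pandas-flavored wrapper; PySpark stays an
        implementation detail underneath."""
        def __init__(self, spark_df):
            self._sdf = spark_df   # the underlying PySpark DataFrame

        def __getitem__(self, col):
            # eager, pandas-style column access returning a pandas Series
            return self._sdf.select(col).toPandas()[col]

        def head(self, n=5):
            # materialize the first n rows locally, like pandas head()
            return self._sdf.limit(n).toPandas()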

The nearest analogue I can offer is the relationship between pandas
and NumPy in the earlier days of pandas (version 0.7 and earlier).
pandas relies on NumPy data structures and many of its array
algorithms. Early on, I was lightly criticized in the community for
creating pandas as a separate project rather than contributing patches
to NumPy, but over time it has proven to be the right decision, as
domain-specific needs can evolve in a decoupled way without onerous
API design compromises.

very best,
Wes
