Hi Wes,

Thanks for the email. It is difficult to generalize without seeing a lot more cases, but the boolean issue is simply a query analysis rule.

I could see us adding a config option that switches analysis to a more Python/R-like mode, changing the implicit type coercion rules so that booleans are coerced to integral types automatically.
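To make the coercion difference concrete, here is a toy sketch of the kind of gap being discussed (my own example, not the one from SPARK-13943; the column and app names are made up):

    # Toy illustration of the boolean -> integral coercion gap; column and
    # app names are made up, and this is not the exact case from SPARK-13943.
    import pandas as pd
    from pyspark import SparkContext
    from pyspark.sql import SQLContext, functions as F

    sc = SparkContext(appName="bool-coercion-sketch")
    sqlContext = SQLContext(sc)

    pdf = pd.DataFrame({"flag": [True, False, True]})

    # pandas/NumPy semantics: booleans coerce to integers implicitly.
    print(pdf["flag"].sum())  # 2

    sdf = sqlContext.createDataFrame(pdf)

    # Spark SQL's analyzer does not coerce boolean to integral here, so the
    # user has to cast explicitly today; a Python/R-style analysis mode
    # could insert this cast automatically.
    print(sdf.agg(F.sum(F.col("flag").cast("int"))).first()[0])  # 2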
On Thursday, March 17, 2016, Wes McKinney <w...@cloudera.com> wrote:
> hi everyone,
>
> I've recently gotten moving on solving some of the low-level data
> interoperability problems between Python's NumPy-focused scientific
> computing and data libraries like pandas and the rest of the big data
> ecosystem, Spark being a very important part of that.
>
> One of the major efforts here is creating a unified data access layer
> for pandas users using Apache Arrow as the structured data exchange
> medium (read more here:
> http://wesmckinney.com/blog/pandas-and-apache-arrow/). I created
> https://issues.apache.org/jira/browse/SPARK-13534 to add an Arrow
> "thunderbolt port" (to make an analogy) to Spark for moving data from
> Spark SQL to pandas much more efficiently than the current
> serialization scheme. If anyone wants to be a partner in crime on
> this, feel free to reach out! I'll be dropping the Arrow
> memory<->pandas conversion code in the next couple weeks.
>
> As I'm looking more at the implementation details and API design of
> PySpark, I note that it has been intended to have near 1-1 parity with
> the Scala API, enabling developers to jump between APIs without a lot
> of cognitive dissonance (you lose type information in Python, but
> c'est la vie). Much of PySpark appears to be wrapping Scala / Java API
> calls with py4j (much as many Python libraries wrap C/C++ libraries in
> an analogous fashion).
>
> In the long run, I'm concerned this may become problematic as users'
> expectations about the semantics of interacting with the data may not
> be compatible with the behavior of the Spark Scala API (particularly
> the API design and semantics of Spark SQL and Datasets). As the Spark
> user base grows, so, too, will the user needs, particularly in the
> more accessible APIs (Python / R). I expect the Scala users tend to be
> a more sophisticated audience with a more software engineering /
> computer science tilt.
>
> With a "big picture" goal of bringing about a semantic convergence
> between big data and small data in a certain subset of scalable
> computations, I am curious what is the Spark development community's
> attitude towards efforts to achieve 1-1 PySpark API parity (with a
> slight API lag as new features show up strictly in Scala before in
> Python), particularly in the strictly semantic realm of data
> interactions (at the end of the day, code has to move around bits
> someplace). Here is an illustrative, albeit somewhat trivial example
> of what I'm talking about:
>
> https://issues.apache.org/jira/browse/SPARK-13943
>
> If closer semantic compatibility with existing software in R and
> Python is not a high priority, that is a completely reasonable answer.
>
> Another thought is treating PySpark as the place where the "rubber
> meets the road" -- the point of contact for any Python developers
> building applications with Spark. This would leave library developers
> aiming to create higher level user experiences (e.g. emulating pandas
> more closely) and thus use PySpark as an implementation tool that
> users otherwise do not directly interact with. But this is seemingly
> at odds with the efforts to make Spark DataFrames behave in a
> pandas/R-like fashion.
>
> The nearest analogue to this I would give is the relationship between
> pandas and NumPy in the earlier days of pandas (version 0.7 and
> earlier). pandas relies on NumPy data structures and many of its array
> algorithms. Early on I was lightly criticized in the community for
> creating pandas as a separate project rather than contributing patches
> to NumPy, but over time it has proven to have been the right decision,
> as domain specific needs can evolve in a decoupled way without onerous
> API design compromises.
>
> very best,
> Wes
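On the Arrow JIRA: the path that work would speed up is today's toPandas() call, which pulls serialized rows through the driver before pandas ever sees the data. A minimal sketch of that existing path (my own illustration, not code from SPARK-13534):

    # Rough sketch of the current Spark SQL -> pandas path; names here are
    # illustrative only.
    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="to-pandas-sketch")
    sqlContext = SQLContext(sc)

    sdf = sqlContext.range(1000000).selectExpr("id", "id % 7 AS bucket")

    # toPandas() collects Row objects through the driver and reassembles a
    # pandas DataFrame in Python; an Arrow-based exchange would ship the
    # same data as columnar buffers instead of serialized rows.
    pdf = sdf.toPandas()
    print(pdf.dtypes)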