hi everyone, I've recently gotten moving on solving some of the low-level data interoperability problems between Python's NumPy-focused scientific computing and data libraries (pandas in particular) and the rest of the big data ecosystem, with Spark being a very important part of that.
One of the major efforts here is creating a unified data access layer for pandas users, using Apache Arrow as the structured data exchange medium (read more here: http://wesmckinney.com/blog/pandas-and-apache-arrow/). I created https://issues.apache.org/jira/browse/SPARK-13534 to add an Arrow "thunderbolt port" (to make an analogy) to Spark for moving data from Spark SQL to pandas much more efficiently than the current serialization scheme. If anyone wants to be a partner in crime on this, feel free to reach out! I'll be dropping the Arrow memory <-> pandas conversion code in the next couple of weeks (a rough sketch of the kind of exchange I have in mind is in the postscript below).

As I look more at the implementation details and API design of PySpark, I note that it has been intended to have near 1-1 parity with the Scala API, enabling developers to jump between the APIs without a lot of cognitive dissonance (you lose type information in Python, but c'est la vie). Much of PySpark appears to be wrapping Scala / Java API calls with py4j, much as many Python libraries wrap C/C++ libraries in an analogous fashion.

In the long run, I'm concerned this may become problematic, as users' expectations about the semantics of interacting with the data may not be compatible with the behavior of the Spark Scala API (particularly the API design and semantics of Spark SQL and Datasets). As the Spark user base grows, so, too, will user needs, particularly in the more accessible APIs (Python / R). I expect Scala users tend to be a more sophisticated audience with a more software engineering / computer science tilt.

With the "big picture" goal of bringing about a semantic convergence between big data and small data in a certain subset of scalable computations, I am curious what the Spark development community's attitude is towards efforts to achieve 1-1 PySpark API parity (with a slight API lag, as new features show up in Scala strictly before Python), particularly in the strictly semantic realm of data interactions (at the end of the day, code has to move around bits someplace). Here is an illustrative, albeit somewhat trivial, example of what I'm talking about: https://issues.apache.org/jira/browse/SPARK-13943 (there is also a small code contrast in the second postscript below).

If closer semantic compatibility with existing software in R and Python is not a high priority, that is a completely reasonable answer.

Another thought is treating PySpark as the place where the "rubber meets the road" -- the point of contact for any Python developers building applications with Spark. This would leave it to library developers to create higher-level user experiences (e.g. emulating pandas more closely), using PySpark as an implementation tool that users otherwise do not directly interact with. But this seems at odds with the efforts to make Spark DataFrames behave in a pandas/R-like fashion.

The nearest analogue I would give is the relationship between pandas and NumPy in the earlier days of pandas (version 0.7 and earlier). pandas relies on NumPy data structures and many of its array algorithms. Early on I was lightly criticized in the community for creating pandas as a separate project rather than contributing patches to NumPy, but over time it has proven to have been the right decision, as domain-specific needs can evolve in a decoupled way without onerous API design compromises.

very best,
Wes
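
P.S. To make the "thunderbolt port" idea a bit more concrete, here is a rough sketch of the kind of exchange I have in mind: a pandas DataFrame round-tripped through the Arrow streaming format entirely in memory, with no per-row pickling. It uses the pyarrow bindings, which are still very much in flux, so treat the exact function and parameter names as provisional rather than as a proposed Spark-facing API.

    import pandas as pd
    import pyarrow as pa

    df = pd.DataFrame({"ints": [1, 2, 3], "strs": ["a", "b", "c"]})

    # Serialize the DataFrame as an Arrow record batch stream into an
    # in-memory buffer. In the Spark case, the JVM side would produce
    # these same columnar bytes directly from Spark SQL.
    batch = pa.RecordBatch.from_pandas(df, preserve_index=False)
    sink = pa.BufferOutputStream()
    writer = pa.ipc.new_stream(sink, batch.schema)
    writer.write_batch(batch)
    writer.close()
    buf = sink.getvalue()

    # Reconstruct a pandas DataFrame from the columnar bytes, without
    # deserializing Python objects row by row.
    reader = pa.ipc.open_stream(buf)
    roundtripped = reader.read_all().to_pandas()
    print(roundtripped)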
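
P.P.S. To give a flavor of the semantic gap I mean (separate from the SPARK-13943 example above), here is a small, admittedly simplistic contrast between pandas' eager, in-place column assignment and the lazy, immutable expression building of Spark DataFrames. It is only meant to illustrate the difference in user expectations, not to suggest that either behavior is wrong.

    import pandas as pd
    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext("local[1]", "semantics-example")
    sqlContext = SQLContext(sc)

    pdf = pd.DataFrame({"a": [1, 2, 3]})
    sdf = sqlContext.createDataFrame(pdf)

    # pandas: eager and mutable -- this adds a column to pdf in place
    pdf["b"] = pdf["a"] * 2

    # Spark SQL: lazy and immutable -- withColumn returns a new DataFrame,
    # and nothing is computed until an action like collect() or toPandas()
    sdf2 = sdf.withColumn("b", sdf["a"] * 2)
    print(sdf2.toPandas())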