Hi Wes,

Thanks for the email. It is difficult to generalize without seeing a lot more cases, but the boolean issue is simply a query analysis rule.

I could see us adding a config option that switches analysis to a more Python/R-like mode, changing the implicit type coercion rules so that booleans are coerced to integral types automatically.
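To make the coercion difference concrete, here is a toy sketch of the kind of gap being discussed (my own example, not the one from SPARK-13943; the column and app names are made up):

    # Toy illustration of the boolean -> integral coercion gap; column and
    # app names are made up, and this is not the exact case from SPARK-13943.
    import pandas as pd
    from pyspark import SparkContext
    from pyspark.sql import SQLContext, functions as F

    sc = SparkContext(appName="bool-coercion-sketch")
    sqlContext = SQLContext(sc)

    pdf = pd.DataFrame({"flag": [True, False, True]})

    # pandas/NumPy semantics: booleans coerce to integers implicitly.
    print(pdf["flag"].sum())  # 2

    sdf = sqlContext.createDataFrame(pdf)

    # Spark SQL's analyzer does not coerce boolean to integral here, so the
    # user has to cast explicitly today; a Python/R-style analysis mode
    # could insert this cast automatically.
    print(sdf.agg(F.sum(F.col("flag").cast("int"))).first()[0])  # 2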
On Thursday, March 17, 2016, Wes McKinney <w...@cloudera.com> wrote:
> hi everyone,
>
> I've recently gotten moving on solving some of the low-level data
> interoperability problems between Python's NumPy-focused scientific
> computing and data libraries like pandas and the rest of the big data
> ecosystem, Spark being a very important part of that.
>
> One of the major efforts here is creating a unified data access layer
> for pandas users using Apache Arrow as the structured data exchange
> medium (read more here:
> http://wesmckinney.com/blog/pandas-and-apache-arrow/). I created
> https://issues.apache.org/jira/browse/SPARK-13534 to add an Arrow
> "thunderbolt port" (to make an analogy) to Spark for moving data from
> Spark SQL to pandas much more efficiently than the current
> serialization scheme. If anyone wants to be a partner in crime on
> this, feel free to reach out! I'll be dropping the Arrow
> memory<->pandas conversion code in the next couple weeks.
>
> As I'm looking more at the implementation details and API design of
> PySpark, I note that it has been intended to have near 1-1 parity with
> the Scala API, enabling developers to jump between APIs without a lot
> of cognitive dissonance (you lose type information in Python, but
> c'est la vie). Much of PySpark appears to be wrapping Scala / Java API
> calls with py4j (much as many Python libraries wrap C/C++ libraries in
> an analogous fashion).
>
> In the long run, I'm concerned this may become problematic as users'
> expectations about the semantics of interacting with the data may not
> be compatible with the behavior of the Spark Scala API (particularly
> the API design and semantics of Spark SQL and Datasets). As the Spark
> user base grows, so, too, will the user needs, particularly in the
> more accessible APIs (Python / R). I expect the Scala users tend to be
> a more sophisticated audience with a more software engineering /
> computer science tilt.
>
> With a "big picture" goal of bringing about a semantic convergence
> between big data and small data in a certain subset of scalable
> computations, I am curious what is the Spark development community's
> attitude towards efforts to achieve 1-1 PySpark API parity (with a
> slight API lag as new features show up strictly in Scala before in
> Python), particularly in the strictly semantic realm of data
> interactions (at the end of the day, code has to move around bits
> someplace). Here is an illustrative, albeit somewhat trivial example
> of what I'm talking about:
>
> https://issues.apache.org/jira/browse/SPARK-13943
>
> If closer semantic compatibility with existing software in R and
> Python is not a high priority, that is a completely reasonable answer.
>
> Another thought is treating PySpark as the place where the "rubber
> meets the road" -- the point of contact for any Python developers
> building applications with Spark. This would leave library developers
> aiming to create higher level user experiences (e.g. emulating pandas
> more closely) and thus use PySpark as an implementation tool that
> users otherwise do not directly interact with. But this is seemingly
> at odds with the efforts to make Spark DataFrames behave in a
> pandas/R-like fashion.
>
> The nearest analogue to this I would give is the relationship between
> pandas and NumPy in the earlier days of pandas (version 0.7 and
> earlier). pandas relies on NumPy data structures and many of its array
> algorithms. Early on I was lightly criticized in the community for
> creating pandas as a separate project rather than contributing patches
> to NumPy, but over time it has proven to have been the right decision,
> as domain specific needs can evolve in a decoupled way without onerous
> API design compromises.
>
> very best,
> Wes
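On the Arrow JIRA: the path that work would speed up is today's toPandas() call, which pulls serialized rows through the driver before pandas ever sees the data. A minimal sketch of that existing path (my own illustration, not code from SPARK-13534):

    # Rough sketch of the current Spark SQL -> pandas path; names here are
    # illustrative only.
    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="to-pandas-sketch")
    sqlContext = SQLContext(sc)

    sdf = sqlContext.range(1000000).selectExpr("id", "id % 7 AS bucket")

    # toPandas() collects Row objects through the driver and reassembles a
    # pandas DataFrame in Python; an Arrow-based exchange would ship the
    # same data as columnar buffers instead of serialized rows.
    pdf = sdf.toPandas()
    print(pdf.dtypes)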