[ https://issues.apache.org/jira/browse/SPARK-9999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15022735#comment-15022735 ]
Nicholas Chammas commented on SPARK-9999:
-----------------------------------------

[~sandyr] - Hmm, so are you saying that, generally speaking, Datasets will provide no performance advantages over DataFrames, and that they will just help in terms of catching type errors early?

{quote}
Python and R are dynamically typed so can't take advantage of these.
{quote}

I can't speak for R, but Python has supported type hints since 3.0. More recently, Python 3.5 introduced a [typing module|https://docs.python.org/3/library/typing.html#module-typing] to standardize how type hints are specified, which facilitates the use of static type checkers like [mypy|http://mypy-lang.org/].

PySpark could definitely offer a statically type-checked API, but practically speaking it would have to be limited to Python 3+.

I suppose people don't generally expect static type checking when they use Python, so perhaps it makes sense not to support Datasets in PySpark.

> Dataset API on top of Catalyst/DataFrame
> ----------------------------------------
>
>                 Key: SPARK-9999
>                 URL: https://issues.apache.org/jira/browse/SPARK-9999
>             Project: Spark
>          Issue Type: Story
>          Components: SQL
>            Reporter: Reynold Xin
>            Assignee: Michael Armbrust
>
> The RDD API is very flexible, and as a result its execution is harder to
> optimize in some cases. The DataFrame API, on the other hand, is much easier
> to optimize, but lacks some of the nice perks of the RDD API (e.g. harder to
> use UDFs, lack of strong types in Scala/Java).
> The goal of Spark Datasets is to provide an API that allows users to easily
> express transformations on domain objects, while also providing the
> performance and robustness advantages of the Spark SQL execution engine.
> h2. Requirements
>  - *Fast* - In most cases, the performance of Datasets should be equal to or
> better than working with RDDs. Encoders should be as fast as or faster than
> Kryo and Java serialization, and unnecessary conversion should be avoided.
>  - *Typesafe* - Similar to RDDs, objects and functions that operate on those
> objects should provide compile-time safety where possible. When converting
> from data where the schema is not known at compile-time (for example data
> read from an external source such as JSON), the conversion function should
> fail-fast if there is a schema mismatch.
>  - *Support for a variety of object models* - Default encoders should be
> provided for a variety of object models: primitive types, case classes,
> tuples, POJOs, JavaBeans, etc. Ideally, objects that follow standard
> conventions, such as Avro SpecificRecords, should also work out of the box.
>  - *Java Compatible* - Datasets should provide a single API that works in
> both Scala and Java. Where possible, shared types like Array will be used in
> the API. Where not possible, overloaded functions should be provided for
> both languages. Scala concepts, such as ClassTags, should not be required in
> the user-facing API.
>  - *Interoperates with DataFrames* - Users should be able to seamlessly
> transition between Datasets and DataFrames, without specifying conversion
> boilerplate. When names used in the input schema line up with fields in the
> given class, no extra mapping should be necessary. Libraries like MLlib
> should not need to provide different interfaces for accepting DataFrames and
> Datasets as input.
> For a detailed outline of the complete proposed API:
> [marmbrus/dataset-api|https://github.com/marmbrus/spark/pull/18/files]
> For an initial discussion of the design considerations in this API:
> [design doc|https://docs.google.com/document/d/1ZVaDqOcLm2-NcS0TElmslHLsEIEwqzt0vBvzpLrV6Ik/edit#]
> The initial version of the Dataset API has been merged in Spark 1.6. However,
> it will take a few more releases to flesh everything out.
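
To make the type-hint point in the comment concrete, here is a minimal sketch of annotated Python that a static checker such as mypy can analyze. It is plain Python and not a PySpark API: `Person` and `adults` are hypothetical names chosen for illustration, and the class-style `NamedTuple` syntax assumes Python 3.6 or later.

```python
from typing import List, NamedTuple


class Person(NamedTuple):
    # Hypothetical domain object; not part of PySpark.
    name: str
    age: int


def adults(people: List[Person], min_age: int = 18) -> List[Person]:
    """Return the people whose age is at least min_age."""
    return [p for p in people if p.age >= min_age]


people = [Person("Ann", 34), Person("Bob", 12)]
print(adults(people))

# A static checker such as mypy would be expected to flag the call below,
# because a str is passed where List[Person] is annotated. Plain Python
# would only notice at run time, when p.age is looked up on a character.
# adults("not a list")
```

The annotations are ignored by the interpreter at run time, so any early error detection here comes from running the checker as a separate step, which is the sense in which a type-checked PySpark API would be opt-in.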