Github user yhuai commented on a diff in the pull request:
https://github.com/apache/spark/pull/11444#discussion_r55290303
--- Diff: python/pyspark/sql/context.py ---
@@ -50,18 +50,60 @@
def _monkey_patch_RDD(sqlContext):
def toDF(self, schema=None, sampleRatio=None):
"""
- Converts current :class:`RDD` into a :class:`DataFrame`
+ Converts current :class:`RDD` into a :class:`DataFrame` according
to the given schema.
+ If the given schema is None or just column names, the schema will
be inferred by scanning
+ the data in this RDD according to the sampleRatio.
- This is a shorthand for ``sqlContext.createDataFrame(rdd, schema,
sampleRatio)``
+ Note that, the given schema must match the real data, or exception
will be thrown at
+ runtime. If the given schema is not StructType, it will be wrapped
into a StructType as
+ its only field, and the field name will be "value", each record
will also be wrapped into a
+ tuple, which can be converted to row later.
- :param schema: a StructType or list of names of columns
+ :param schema: a :class:`DataType` or a datatype string or list of
names of columns.
+ The data type string format equals to
`DataType.simpleString`, except that
+ top level struct type can omit the `struct<>` and
numeric types use
+ `typeName()` as their format, e.g. use `byte`
instead of `tinyint` for
+ ByteType. We can also use `int` as a short name for
IntegerType.
:param samplingRatio: the sample ratio of rows used for inferring
:return: a DataFrame
- >>> rdd.toDF().collect()
+ .. versionchanged:: 2.0
+ The schema parameter can be a DataType or a datatype string
after 2.0. If it's not a
+ StructType, it will be wrapped into a StructType and each
record will also be wrapped
+ into a tuple.
+
+ >>> rdd.toDF().collect() # will scan the data and infer the
real schema.
+ [Row(name=u'Alice', age=1)]
+ >>> rdd.toDF(["name", "age"]) # will scan the data and infer the
real schema.
[Row(name=u'Alice', age=1)]
+ >>> rdd.toDF(["a", "b"]) # will scan the data and infer the
real schema.
+ [Row(a=u'Alice', b=1)]
+ >>> rdd.toDF("a: string, b: int").collect()
+ [Row(a=u'Alice', b=1)]
+ >>> rdd.map(lambda row: row.age).toDF("int").collect()
+ [Row(value=1)]
"""
- return sqlContext.createDataFrame(self, schema, sampleRatio)
+ if schema is None or isinstance(schema, (list, tuple)):
+ # if schema is going to be inferred, no need to verify it.
+ return sqlContext.createDataFrame(self, schema, sampleRatio)
+
+ if isinstance(schema, basestring):
+ schema = _parse_datatype_string(schema)
+
+ if not isinstance(schema, DataType):
+ raise TypeError("schema should be DataType or string or list
or tuple, " +
+ "but got: %s" % schema)
+ else:
+ def verify(obj):
+ _verify_type(obj, schema)
+ return obj
+ rdd = self.map(verify)
+
+ if isinstance(schema, StructType):
+ return sqlContext.createDataFrame(rdd, schema)
+ else:
+ return sqlContext.createDataFrame(rdd.map(lambda obj:
(obj, )),
--- End diff --
Do we support arbitrary Python objects? I don't think we still need to
limit the object types to Row, tuple, dict, and list, right? We can extract
values as long as the object has the requested fields.
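
To make the suggestion concrete, here is a minimal sketch of what "extract
values as long as the object has the requested fields" could look like. It is
plain Python and does not use any PySpark API; `extract_values` and `Person`
are hypothetical names introduced only for illustration, and this is not how
`_verify_type` or `createDataFrame` are implemented in the PR.

```python
def extract_values(obj, field_names):
    """Hypothetical helper: return the requested fields of one record as a tuple.

    Assumption for this sketch: dicts are looked up by key, plain lists and
    tuples are treated positionally, and any other object is read by attribute.
    """
    if isinstance(obj, dict):
        return tuple(obj[name] for name in field_names)

    # Plain sequences carry no field names, so match them by position.
    # Named tuples expose `_fields` and fall through to attribute lookup.
    if isinstance(obj, (list, tuple)) and not hasattr(obj, "_fields"):
        if len(obj) != len(field_names):
            raise ValueError(
                "expected %d values, got %d" % (len(field_names), len(obj)))
        return tuple(obj)

    # Arbitrary objects: accept them as long as every requested field exists.
    missing = [name for name in field_names if not hasattr(obj, name)]
    if missing:
        raise TypeError(
            "object of type %s lacks fields: %s"
            % (type(obj).__name__, ", ".join(missing)))
    return tuple(getattr(obj, name) for name in field_names)


class Person(object):
    """Hypothetical user-defined record type, not a Row, tuple, dict, or list."""

    def __init__(self, name, age):
        self.name = name
        self.age = age


print(extract_values(Person("Alice", 1), ["name", "age"]))   # ('Alice', 1)
print(extract_values({"name": "Alice", "age": 1}, ["name", "age"]))
```

With something along these lines, the type check would only need to confirm
that each requested field is present and has the declared type, rather than
restricting the container type up front.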