[GitHub] spark pull request: [SPARK-13593][SQL] improve the `toDF()` method...

yhuai Mon, 07 Mar 2016 15:11:47 -0800

Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11444#discussion_r55290303
  
    --- Diff: python/pyspark/sql/context.py ---
    @@ -50,18 +50,60 @@
     def _monkey_patch_RDD(sqlContext):
         def toDF(self, schema=None, sampleRatio=None):
             """
    -        Converts current :class:`RDD` into a :class:`DataFrame`
    +        Converts current :class:`RDD` into a :class:`DataFrame` according to the given schema.
    +        If the given schema is None or just column names, the schema will be inferred by scanning
    +        the data in this RDD according to the sampleRatio.
     
    -        This is a shorthand for ``sqlContext.createDataFrame(rdd, schema, sampleRatio)``
    +        Note that, the given schema must match the real data, or exception will be thrown at
    +        runtime. If the given schema is not StructType, it will be wrapped into a StructType as
    +        its only field, and the field name will be "value", each record will also be wrapped into a
    +        tuple, which can be converted to row later.
     
    -        :param schema: a StructType or list of names of columns
    +        :param schema: a :class:`DataType` or a datatype string or list of names of columns.
    +                       The data type string format equals to `DataType.simpleString`, except that
    +                       top level struct type can omit the `struct<>` and numeric types use
    +                       `typeName()` as their format, e.g. use `byte` instead of `tinyint` for
    +                       ByteType. We can also use `int` as a short name for IntegerType.
             :param samplingRatio: the sample ratio of rows used for inferring
             :return: a DataFrame
     
    -        >>> rdd.toDF().collect()
    +        .. versionchanged:: 2.0
    +           The schema parameter can be a DataType or a datatype string after 2.0. If it's not a
    +           StructType, it will be wrapped into a StructType and each record will also be wrapped
    +           into a tuple.
    +
    +        >>> rdd.toDF().collect()      # will scan the data and infer the real schema.
    +        [Row(name=u'Alice', age=1)]
    +        >>> rdd.toDF(["name", "age"]) # will scan the data and infer the real schema.
             [Row(name=u'Alice', age=1)]
    +        >>> rdd.toDF(["a", "b"])      # will scan the data and infer the real schema.
    +        [Row(a=u'Alice', b=1)]
    +        >>> rdd.toDF("a: string, b: int").collect()
    +        [Row(a=u'Alice', b=1)]
    +        >>> rdd.map(lambda row: row.age).toDF("int").collect()
    +        [Row(value=1)]
             """
    -        return sqlContext.createDataFrame(self, schema, sampleRatio)
    +        if schema is None or isinstance(schema, (list, tuple)):
    +            # if schema is going to be inferred, no need to verify it.
    +            return sqlContext.createDataFrame(self, schema, sampleRatio)
    +
    +        if isinstance(schema, basestring):
    +            schema = _parse_datatype_string(schema)
    +
    +        if not isinstance(schema, DataType):
    +            raise TypeError("schema should be DataType or string or list or tuple, " +
    +                            "but got: %s" % schema)
    +        else:
    +            def verify(obj):
    +                _verify_type(obj, schema)
    +                return obj
    +            rdd = self.map(verify)
    +
    +            if isinstance(schema, StructType):
    +                return sqlContext.createDataFrame(rdd, schema)
    +            else:
    +                return sqlContext.createDataFrame(rdd.map(lambda obj: (obj, )),
    --- End diff --
    
    Do we support arbitrary Python objects? I think we no longer need to
    limit the object types to Row, tuple, dict, and list, right? We can extract
    values as long as the object has the requested fields.


