Github user kanzhang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1041#discussion_r13721122
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
    @@ -296,16 +296,25 @@ class SQLContext(@transient val sparkContext: SparkContext)
        * TODO: We only support primitive types, add support for nested types.
        */
       private[sql] def inferSchema(rdd: RDD[Map[String, _]]): SchemaRDD = {
    +    import scala.collection.JavaConversions._
    +    def typeFor(obj: Any): DataType = obj match {
    +      case c: java.lang.String => StringType
    +      case c: java.lang.Integer => IntegerType
    +      case c: java.lang.Long => LongType
    +      case c: java.lang.Double => DoubleType
    +      case c: java.lang.Boolean => BooleanType
    +      case c: java.util.List[_] => ArrayType(typeFor(c.toList.head))
    +      case c: java.util.Set[_] => ArrayType(typeFor(c.toList.head))
    +      case c: java.util.Map[_, _] =>
    --- End diff --
    
    I haven't looked too deeply, but my initial understanding is that Structs
    are for user-defined objects (if we are going to support them in Python),
    while MapType should suffice if we access dictionaries only as maps.
    Specifically:
    
    1) Struct stores its fields in order, which makes it usable as the schema
    for the outermost dict, since we need to match the schema against the data
    stored in an array for each row. Speaking of ordering, how can we be sure
    that when we map dictionary values to an array, they all come out in the
    same order?
    ```
        val rowRdd = rdd.mapPartitions { iter =>
          iter.map { map =>
            new GenericRow(map.values.toArray.asInstanceOf[Array[Any]]): Row
          }
        }
    ```  
    
    2) Structs can only be used for dicts when the keys are Strings, right?
    
    3) I don't know enough about the access patterns to tell whether it is
    safe to label Set as ArrayType, given that Set is unordered.
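    
    To make the ordering concern in 1) concrete, here is a minimal sketch of
    one possible way to pin the column order. It is not what the PR does. It
    uses plain Scala collections in place of RDD and GenericRow, and the names
    FieldOrderSketch, fieldOrder and toRow are hypothetical. The idea is to fix
    a list of field names once and look each value up by name, rather than
    relying on the iteration order of map.values.
    ```
    // Sketch only: plain Scala collections stand in for RDD and GenericRow.
    object FieldOrderSketch {
      // Fix one field order up front (here: the sorted keys of the first record).
      def fieldOrder(first: Map[String, Any]): Seq[String] =
        first.keys.toSeq.sorted

      // Build each row by looking values up in that fixed order.
      // A key missing from a record becomes null (an assumption of this sketch).
      def toRow(fields: Seq[String], record: Map[String, Any]): Array[Any] =
        fields.map(name => record.getOrElse(name, null)).toArray
    }
    ```
    With this approach, two dicts holding the same keys in different insertion
    order would produce rows whose columns line up, at the cost of one lookup
    per field per row.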


