[GitHub] spark pull request: [SPARK-10863][SPARKR] Method coltypes() to get...

olarayej Thu, 15 Oct 2015 13:24:53 -0700

Github user olarayej commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8984#discussion_r42175264
  
    --- Diff: R/pkg/R/DataFrame.R ---
    @@ -1880,4 +1880,46 @@ setMethod("as.data.frame",
                   stop(paste("Unused argument(s): ", paste(list(...), collapse=", ")))
                 }
                 collect(x)
    +          }
    +)
    +
    +#' Returns the column types of a DataFrame.
    +#' 
    +#' @name coltypes
    +#' @title Get column types of a DataFrame
    +#' @param x (DataFrame)
    +#' @return value (character) A character vector with the column types of the given DataFrame
    +#' @rdname coltypes
    +setMethod("coltypes",
    +          signature(x = "DataFrame"),
    +          function(x) {
    +            # TODO: This may be moved as a global parameter
    +            # These are the supported data types and how they map to
    +            # R's data types
    +            DATA_TYPES <- c("string"="character",
    +                            "long"="integer",
    +                            "tinyint"="integer",
    +                            "short"="integer",
    +                            "integer"="integer",
    +                            "byte"="integer",
    +                            "double"="numeric",
    +                            "float"="numeric",
    +                            "decimal"="numeric",
    +                            "boolean"="logical"
    +            )
    --- End diff --
    
    @sun-rui @shivaram 
    The notion of coltypes is actually spread across three files: schema.R,
    serialize.R, and deserialize.R.
    
    In serialize.R, writeType (see below) reduces each data type to a
    one-character code. readTypedObject (see below) then uses that
    one-character code to decide how to read the value. I suspect this is
    because complex types can carry parameters, like map<String,String>?
    
    In my opinion, it would be better to use the full data type name rather
    than its first letter, which can be especially confusing because we support
    data types that start with the same letter (Date/Double, String/Struct).
    Having the full data type would also allow us to centralize the data types
    in one place, though this would require some major changes.
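    To illustrate the first-letter concern, here is a small sketch. The type
    names are taken from the writeType switch below; the variable names are
    made up for this example. It groups the names by their case-insensitive
    first letter and keeps only the groups with more than one member.

```r
# Type names that appear in the writeType switch (a subset, for illustration).
type_names <- c("double", "Date", "character", "struct", "integer")

# Group the names by their lower-cased first letter.
first_letters <- tolower(substr(type_names, 1, 1))
groups <- split(type_names, first_letters)

# Keep only the letters shared by more than one type name.
collisions <- groups[sapply(groups, length) > 1]
print(collisions)
```

    With this subset, only "double" and "Date" share a letter, which is why
    writeType has to tell them apart by case ("d" vs. "D").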
    
    We could have mapping arrays:
    
    PRIMITIVE_TYPES <- c("string"="character",
                         "long"="integer",
                         "tinyint"="integer",
                         "short"="integer",
                         "integer"="integer",
                         "byte"="integer",
                         "double"="numeric",
                         "float"="numeric",
                         "decimal"="numeric",
                         "boolean"="logical")
    
    COMPLEX_TYPES  <- c("map", "array", "struct", ...)
    
    DATA_TYPES <- c(PRIMITIVE_TYPES, COMPLEX_TYPES)
    
    
    And then we'd need to modify deserialize.R, serialize.R, and schema.R to
    use these mappings instead of their own type handling.
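
    As a rough sketch of what a centralized lookup could look like:
    sparkTypeToR is a hypothetical helper, not an existing SparkR function,
    and COMPLEX_TYPES here lists only the three names mentioned above. How
    complex types should map to R types is still open, so the sketch returns
    NA for them.

```r
# Proposed mapping from primitive data type names to R types (as above).
PRIMITIVE_TYPES <- c("string" = "character",
                     "long" = "integer",
                     "tinyint" = "integer",
                     "short" = "integer",
                     "integer" = "integer",
                     "byte" = "integer",
                     "double" = "numeric",
                     "float" = "numeric",
                     "decimal" = "numeric",
                     "boolean" = "logical")

# Only the complex types named in the discussion; the real list may be longer.
COMPLEX_TYPES <- c("map", "array", "struct")

# Hypothetical helper: map a full data type string to an R type name.
sparkTypeToR <- function(sparkType) {
  # Drop any parameters, e.g. "map<string,string>" -> "map",
  # "decimal(10,0)" -> "decimal".
  base <- sub("[<(].*$", "", sparkType)
  if (base %in% names(PRIMITIVE_TYPES)) {
    PRIMITIVE_TYPES[[base]]
  } else if (base %in% COMPLEX_TYPES) {
    # Assumption: no R type is assigned to complex types yet.
    NA_character_
  } else {
    stop(paste("Unsupported data type:", sparkType))
  }
}
```

    For example, sparkTypeToR("double") gives "numeric", and
    sparkTypeToR("map<string,string>") gives NA.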
    
    Thoughts?
    
    writeType <- function(con, class) {
      type <- switch(class,
                     NULL = "n",
                     integer = "i",
                     character = "c",
                     logical = "b",
                     double = "d",
                     numeric = "d",
                     raw = "r",
                     array = "a",
                     list = "l",
                     struct = "s",
                     jobj = "j",
                     environment = "e",
                     Date = "D",
                     POSIXlt = "t",
                     POSIXct = "t",
                     stop(paste("Unsupported type for serialization", class)))
      writeBin(charToRaw(type), con)
    }
    
    readTypedObject <- function(con, type) {
      switch (type,
        "i" = readInt(con),
        "c" = readString(con),
        "b" = readBoolean(con),
        "d" = readDouble(con),
        "r" = readRaw(con),
        "D" = readDate(con),
        "t" = readTime(con),
        "a" = readArray(con),
        "l" = readList(con),
        "e" = readEnv(con),
        "s" = readStruct(con),
        "n" = NULL,
        "j" = getJobj(readString(con)),
        stop(paste("Unsupported type for deserialization", type)))
    }


