Github user olarayej commented on a diff in the pull request:
https://github.com/apache/spark/pull/8984#discussion_r42175264
--- Diff: R/pkg/R/DataFrame.R ---
@@ -1880,4 +1880,46 @@ setMethod("as.data.frame",
stop(paste("Unused argument(s): ", paste(list(...),
collapse=", ")))
}
collect(x)
+ }
+)
+
+#' Returns the column types of a DataFrame.
+#'
+#' @name coltypes
+#' @title Get column types of a DataFrame
+#' @param x (DataFrame)
+#' @return value (character) A character vector with the column types of the given DataFrame
+#' @rdname coltypes
+setMethod("coltypes",
+          signature(x = "DataFrame"),
+          function(x) {
+            # TODO: This may be moved as a global parameter
+            # These are the supported data types and how they map to
+            # R's data types
+            DATA_TYPES <- c("string"="character",
+                            "long"="integer",
+                            "tinyint"="integer",
+                            "short"="integer",
+                            "integer"="integer",
+                            "byte"="integer",
+                            "double"="numeric",
+                            "float"="numeric",
+                            "decimal"="numeric",
+                            "boolean"="logical"
+                            )
--- End diff --
@sun-rui @shivaram
The notion of coltypes is actually spread across three files: schema.R,
serialize.R, and deserialize.R.
In serialize.R, the function writeType (see below) turns the full data type
into a one-character string. Then the function readTypedObject (see below) uses
this one-character type to decide how to read the value. I suspect this is
because complex types can carry parameters, as in map<String,String>?
In my opinion, it would be better to use the full data type rather than a
single letter, which can be especially confusing since we support data types
that start with the same letter (Date/Double, String/Struct). Also, having
the full data type would allow us to centralize the data types in one place,
though this would require some major changes.
We could have mapping arrays:
PRIMITIVE_TYPES <- c("string"="character",
                     "long"="integer",
                     "tinyint"="integer",
                     "short"="integer",
                     "integer"="integer",
                     "byte"="integer",
                     "double"="numeric",
                     "float"="numeric",
                     "decimal"="numeric",
                     "boolean"="logical")
COMPLEX_TYPES <- c("map", "array", "struct", ...)
DATA_TYPES <- c(PRIMITIVE_TYPES, COMPLEX_TYPES)
And then we'd need to modify deserialize.R, serialize.R, and schema.R to
use these mappings accordingly.
Thoughts?
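
To make the mapping idea concrete, here is a rough sketch of how a lookup against such vectors might work. It is not the code in this PR: the helper name sparkTypeToRType is hypothetical, COMPLEX_TYPES only lists the three types named above, and returning NA for complex types is an assumption made purely for illustration.

```r
# Hypothetical sketch; names and behavior are assumptions, not SparkR API.
PRIMITIVE_TYPES <- c("string" = "character",
                     "long" = "integer",
                     "tinyint" = "integer",
                     "short" = "integer",
                     "integer" = "integer",
                     "byte" = "integer",
                     "double" = "numeric",
                     "float" = "numeric",
                     "decimal" = "numeric",
                     "boolean" = "logical")
COMPLEX_TYPES <- c("map", "array", "struct")

# Map one full data type name to an R type name.
sparkTypeToRType <- function(sparkType) {
  # Drop any parameters: "map<string,string>" -> "map",
  # "decimal(10,2)" -> "decimal"
  base <- sub("[<(].*$", "", sparkType)
  if (base %in% names(PRIMITIVE_TYPES)) {
    PRIMITIVE_TYPES[[base]]
  } else if (base %in% COMPLEX_TYPES) {
    # Assumption: complex types have no single R atomic type, so report NA
    NA_character_
  } else {
    stop(paste("Unsupported data type:", sparkType))
  }
}

# Example: resolve several type names at once
types <- c("string", "decimal(10,2)", "map<string,string>")
rTypes <- vapply(types, sparkTypeToRType, character(1), USE.NAMES = FALSE)
```

Matching on the full type name keeps parameterized types such as map<String,String> distinguishable without relying on a one-character code.
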
writeType <- function(con, class) {
  type <- switch(class,
                 NULL = "n",
                 integer = "i",
                 character = "c",
                 logical = "b",
                 double = "d",
                 numeric = "d",
                 raw = "r",
                 array = "a",
                 list = "l",
                 struct = "s",
                 jobj = "j",
                 environment = "e",
                 Date = "D",
                 POSIXlt = "t",
                 POSIXct = "t",
                 stop(paste("Unsupported type for serialization", class)))
  writeBin(charToRaw(type), con)
}

readTypedObject <- function(con, type) {
  switch(type,
         "i" = readInt(con),
         "c" = readString(con),
         "b" = readBoolean(con),
         "d" = readDouble(con),
         "r" = readRaw(con),
         "D" = readDate(con),
         "t" = readTime(con),
         "a" = readArray(con),
         "l" = readList(con),
         "e" = readEnv(con),
         "s" = readStruct(con),
         "n" = NULL,
         "j" = getJobj(readString(con)),
         stop(paste("Unsupported type for deserialization", type)))
}
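
As one possible direction for centralizing these codes, the class-to-code pairs in writeType could live in a single named vector that both sides consult. This is only a sketch: TYPE_CODES, typeCode, and classesForCode are hypothetical names, and the table simply copies the pairs from writeType above.

```r
# Hypothetical central table: R class name -> one-character wire code.
# The pairs are copied from writeType; nothing here is existing SparkR API.
TYPE_CODES <- c("NULL" = "n", "integer" = "i", "character" = "c",
                "logical" = "b", "double" = "d", "numeric" = "d",
                "raw" = "r", "array" = "a", "list" = "l",
                "struct" = "s", "jobj" = "j", "environment" = "e",
                "Date" = "D", "POSIXlt" = "t", "POSIXct" = "t")

# Look up the wire code for a class, failing loudly on unknown classes.
typeCode <- function(class) {
  if (!(class %in% names(TYPE_CODES))) {
    stop(paste("Unsupported type for serialization", class))
  }
  TYPE_CODES[[class]]
}

# Reverse lookup: every R class that shares a given wire code.
classesForCode <- function(code) {
  names(TYPE_CODES)[TYPE_CODES == code]
}
```

With a table like this, the case-sensitive pair "d" (double/numeric) and "D" (Date) is visible in one place instead of being implied by two separate switch statements.
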