Github user shivaram commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9613#discussion_r45793856
  
    --- Diff: R/pkg/R/DataFrame.R ---
    @@ -2199,3 +2199,97 @@ setMethod("coltypes",
     
                 rTypes
               })
    +
    +#' Display the structure of a DataFrame, including column names, column types, as well as
    +#' a small sample of rows.
    +#' @name str
    +#' @title Compactly display the structure of a dataset
    +#' @rdname str
    +#' @family DataFrame functions
    +#' @param object a DataFrame
    +#' @examples \dontrun{
    +#' # Create a DataFrame from the Iris dataset
    +#' irisDF <- createDataFrame(sqlContext, iris)
    +#' 
    +#' # Show the structure of the DataFrame
    +#' str(irisDF)
    +#' }
    +setMethod("str",
    +          signature(object = "DataFrame"),
    +          function(object) {
    +
    +            # TODO: These could be made global parameters, though that is not the convention in R
    +            MAX_CHAR_PER_ROW <- 120
    +            MAX_COLS <- 100
    +
    +            # Get the column names and types of the DataFrame
    +            names <- names(object)
    +            types <- coltypes(object)
    +
    +            # Get the number of rows.
    +            # TODO: Ideally, this should be cached
    +            cachedCount <- nrow(object)
    +
    +            # Get the first elements of the dataset. Limit the number of columns accordingly
    +            dataFrame <- if (ncol(object) > MAX_COLS) {
    +                           head(object[, c(1:MAX_COLS)])
    +                         } else {
    +                           head(object)
    +                         }
    +
    +            # The number of observations will be displayed only if the number
    +            # of rows of the dataset has already been cached.
    +            if (!is.null(cachedCount)) {
    --- End diff --
    
    Can we add this logic at that point, then? It seems to be unnecessarily complicating the code here.
    cc @rxin @davies Does the Scala layer cache the number of rows somewhere after a query evaluation? In general, it would be good to know at some high level whether an operation will be expensive or cheap.
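    
    For illustration only, here is a rough sketch of what hoisting the caching into a small helper around `count` could look like on the R side. The `cachedNrow` name and the `cachedCount` field stored in the DataFrame's `env` slot are assumptions made for this sketch, not existing SparkR API:
    
    ```r
    # Hypothetical sketch (assumes SparkR is loaded and `object` is a SparkR DataFrame):
    # memoize the row count in the DataFrame's environment slot so later callers
    # such as str() can reuse it without triggering another Spark job.
    # The `cachedCount` field is an assumption for this sketch, not part of the PR.
    cachedNrow <- function(object) {
      if (is.null(object@env$cachedCount)) {
        # count() runs a (potentially expensive) Spark job the first time
        object@env$cachedCount <- count(object)
      }
      object@env$cachedCount
    }
    ```
    
    Because R environments have reference semantics, the cached value would persist across calls without copying the DataFrame, and str() could then read it cheaply and print the row count only when it is already available.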

