[GitHub] spark pull request: [SPARK-4987] [SQL] parquet timestamp type supp...
Github user felixcheung commented on the pull request: https://github.com/apache/spark/pull/3820#issuecomment-70450284

I've tested this PR, but the result seems to be off. The Parquet file was generated from Hive with timestamp values set by `from_utc_timestamp('1970-01-01 08:00:00','PST')`. What I see with this PR:

```
scala> t.take(10).foreach(println(_))
...
15/01/18 22:06:41 INFO NewHadoopRDD: Input split: ParquetInputSplit{part: file:/users/x/parquetwithtimestamp start: 0 end: 25448 length: 25448 hosts: [] requestedSchema: message root { optional binary code (UTF8); optional binary description (UTF8); optional int32 total_emp; optional int32 salary; optional int96 timestamp; } readSupportMetadata: {org.apache.spark.sql.parquet.row.metadata={type:struct,fields:[{name:code,type:string,nullable:true,metadata:{}},{name:description,type:string,nullable:true,metadata:{}},{name:total_emp,type:integer,nullable:true,metadata:{}},{name:salary,type:integer,nullable:true,metadata:{}},{name:timestamp,type:timestamp,nullable:true,metadata:{}}]}, org.apache.spark.sql.parquet.row.requested_schema={type:struct,fields:[{name:code,type:string,nullable:true,metadata:{}},{name:description,type:string,nullable:true,metadata:{}},{name:total_emp,type:integer,nullable:true,metadata:{}},{name:salary,type:integer,nullable:true,metadata:{}},{name:timestamp,type:timestamp,nullable:true,metadata:{}}]}}}
15/01/18 22:06:41 WARN ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
15/01/18 22:06:41 INFO InternalParquetRecordReader: RecordReader initialized will read a total of 823 records.
15/01/18 22:06:41 INFO InternalParquetRecordReader: at row 0. reading next block
15/01/18 22:06:41 INFO CodecPool: Got brand-new decompressor [.snappy]
15/01/18 22:06:41 INFO InternalParquetRecordReader: block read in memory in 27 ms.
row count = 823
[00-,All Occupations,134354250,40690,1974-01-07 17:58:00.08896]
[11-,Management occupations,6003930,96150,1974-01-07 17:58:00.08896]
```

Expected: 1970-01-01 08:00:00
Actual: 1974-01-07 17:58:00.08896

Any idea?

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.
---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
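For context on where a constant skew like this can come from, here is a small sketch (Python, purely illustrative and not part of the PR) of the INT96 timestamp layout Hive/Impala write to Parquet: 8 bytes of little-endian nanoseconds-of-day followed by a 4-byte little-endian Julian day. Misreading either field, or the Julian-day offset of the Unix epoch, shifts every decoded value by a constant amount, which matches the symptom above.

```python
import struct
from datetime import datetime, timezone

JULIAN_DAY_OF_EPOCH = 2440588  # Julian day number containing 1970-01-01T00:00:00Z
SECONDS_PER_DAY = 86400

def int96_to_unix_seconds(raw: bytes) -> float:
    """Decode a 12-byte Parquet INT96 timestamp: an 8-byte little-endian
    nanoseconds-of-day count followed by a 4-byte little-endian Julian day."""
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    return (julian_day - JULIAN_DAY_OF_EPOCH) * SECONDS_PER_DAY + nanos_of_day / 1e9

# 1970-01-01 08:00:00 UTC: Julian day of the epoch plus 8 hours of nanoseconds
raw = struct.pack("<qi", 8 * 3600 * 10**9, JULIAN_DAY_OF_EPOCH)
print(datetime.fromtimestamp(int96_to_unix_seconds(raw), tz=timezone.utc))
# → 1970-01-01 08:00:00+00:00
```

This only shows the expected on-disk layout; the PR under review implements the equivalent decoding in Scala.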
[GitHub] spark pull request: [SPARK-4987] [SQL] parquet timestamp type supp...
Github user felixcheung commented on the pull request: https://github.com/apache/spark/pull/3820#issuecomment-70904105

Good to hear. Here's how I create my test data: I run this in Hive, then take the data from HDFS directly, and Spark is able to read/parse the data file (with the issue above):

```sql
SET parquet.compression = SNAPPY;
DROP TABLE testdata;
CREATE TABLE testdata STORED AS PARQUET AS
SELECT a.*, from_utc_timestamp('1970-01-01 08:00:00','PST') AS timestamp
FROM sample_07 AS a;
```

I have looked into this a fair bit and attempted a fix. Thanks for working on this, and let me know if I can help in any way.
[GitHub] spark pull request: [SPARK-4987] [SQL] parquet timestamp type supp...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/3820#discussion_r23256480

--- Diff: docs/sql-programming-guide.md ---

```
@@ -581,6 +581,15 @@ Configuration of Parquet can be done using the `setConf` method on SQLContext or
   </td>
 </tr>
+<tr>
+  <td><code>spark.sql.parquet.int96AsTimestamp</code></td>
+  <td>true</td>
+  <td>
+    Some Parquet-producing systems, in particular Impala, store Timestamp into INT96. Spark would also
+    store Timestamp as INT96 because we need to avoid precision loss of the nanoseconds field. This
+    flag tells Spark SQL to interpret INT96 data as a timestamp to provide compatibility with these systems.
+  </td>
+</tr>
```

--- End diff --

From my digging, only Parquet-format 2.2 has the TIMESTAMP and TIMESTAMP_MILLIS types. Cloudera is still on 1.5.0. Hive/Impala have been writing this INT96 nanosecond format, which is different.

--- Original Message ---
From: Michael Armbrust notificati...@github.com
Sent: January 20, 2015 11:25 AM
To: apache/spark sp...@noreply.github.com
Cc: Felix Cheung felixcheun...@hotmail.com
Subject: Re: [spark] [SPARK-4987] [SQL] parquet timestamp type support (#3820)

[quoted diff omitted; same hunk as above]

Yeah, I agree that it's weird though. Perhaps we should ask the parquet list why they don't support the int 96 version.

On Jan 20, 2015 11:21 AM, Cheng Lian notificati...@github.com wrote:

In docs/sql-programming-guide.md https://github.com/apache/spark/pull/3820#discussion-diff-23247492:

[quoted diff omitted; same hunk as above]

Oh, I see the difference here. Double checked, Parquet only provides TIMESTAMP and TIMESTAMP_MILLIS.

Reply to this email directly or view it on GitHub https://github.com/apache/spark/pull/3820/files#r23247492.

---
Reply to this email directly or view it on GitHub: https://github.com/apache/spark/pull/3820/files#r23247815
[GitHub] spark pull request: [SPARK-5654] Integrate SparkR
Github user felixcheung commented on the pull request: https://github.com/apache/spark/pull/5096#issuecomment-85816756

@redbaron, @oscaroboto The same applies to memory consumption, I'm afraid. There isn't a way to constrain how much memory [R](https://stat.ethz.ch/R-manual/R-devel/library/base/html/Memory-limits.html) (same for Python?) allocates (on Unix, 64-bit). However, it looks like in [Mesos](http://mesos.apache.org/documentation/latest/configuration/) and [YARN](http://blog.cloudera.com/blog/2013/12/managing-multiple-resources-in-hadoop-2-with-yarn/) this can be controlled via cgroups.
[GitHub] spark pull request: [SPARK-8307] [SQL] improve timestamp from parq...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/6759#discussion_r32299296

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetConverter.scala ---

```
@@ -498,69 +493,21 @@ private[parquet] object CatalystArrayConverter {
 }
 
 private[parquet] object CatalystTimestampConverter {
-  // TODO most part of this comes from Hive-0.14
-  // Hive code might have some issues, so we need to keep an eye on it.
-  // Also we use NanoTime and Int96Values from parquet-examples.
-  // We utilize jodd to convert between NanoTime and Timestamp
-  val parquetTsCalendar = new ThreadLocal[Calendar]
-  def getCalendar: Calendar = {
-    // this is a cache for the calendar instance.
-    if (parquetTsCalendar.get == null) {
-      parquetTsCalendar.set(Calendar.getInstance(TimeZone.getTimeZone("GMT")))
-    }
-    parquetTsCalendar.get
-  }
-  val NANOS_PER_SECOND: Long = 1000000000
-  val SECONDS_PER_MINUTE: Long = 60
-  val MINUTES_PER_HOUR: Long = 60
-  val NANOS_PER_MILLI: Long = 1000000
+  // see http://stackoverflow.com/questions/466321/convert-unix-timestamp-to-julian
+  val JULIAN_DAY_OF_EPOCH = 2440587.5
```

--- End diff --

If we generate Parquet with a Hive query like this, could we compare the timestamp value in Spark?

```
USE default;
DROP TABLE timestamptable;
CREATE TABLE timestamptable STORED AS PARQUET AS
SELECT cast(from_unixtime(unix_timestamp()) as timestamp) as t, * FROM sample_07;
```
[GitHub] spark pull request: [SPARK-6797][SPARKR] Add support for YARN clus...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/6743#discussion_r33654496

--- Diff: core/src/main/scala/org/apache/spark/api/r/RUtils.scala ---

```
@@ -0,0 +1,55 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.api.r
+
+import java.io.File
+
+import org.apache.spark.SparkException
+
+private[spark] object RUtils {
+  /**
+   * Get the SparkR package path in the local spark distribution.
+   */
+  def localSparkRPackagePath: Option[String] = {
+    val sparkHome = sys.env.get("SPARK_HOME")
+    sparkHome.map(
+      Seq(_, "R", "lib").mkString(File.separator)
+    )
+  }
+
+  /**
+   * Get the SparkR package path in various deployment modes.
+   */
+  def sparkRPackagePath(driver: Boolean): String = {
+    val yarnMode = sys.env.get("SPARK_YARN_MODE")
+    if (!yarnMode.isEmpty && yarnMode.get == "true" &&
+        !(driver && System.getProperty("spark.master") == "yarn-client")) {
+      // For workers in YARN modes and driver in yarn cluster mode,
+      // the SparkR package distributed as an archive resource should be pointed to
+      // by a symbol link sparkr in the current directory.
+      new File("sparkr").getAbsolutePath
+    } else {
+      // TBD: add support for MESOS
+      val rPackagePath = localSparkRPackagePath
```

--- End diff --

doesn't seem like `localSparkRPackagePath` will ever return empty because of this `map` call?

```
sparkHome.map(
  Seq(_, "R", "lib").mkString(File.separator)
)
```
[GitHub] spark pull request: [SPARK-9317] [SPARKR] Change `show` to print D...
Github user felixcheung commented on the pull request: https://github.com/apache/spark/pull/8360#issuecomment-133523560

From what I can infer from the original JIRA, we are trying to match R data.frame behavior. I think it is handy, though it is easy to think of several alternative ways to do this (`head(df)`, `showDF(df)`), but those would need to be learned.
[GitHub] spark pull request: [SPARK-9317] [SPARKR] Change `show` to print D...
GitHub user felixcheung opened a pull request: https://github.com/apache/spark/pull/8360

[SPARK-9317] [SPARKR] Change `show` to print DataFrame entries

Small update to the DataFrame API in SparkR. @shivaram

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/felixcheung/spark rshow

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/8360.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #8360

commit e3ae104dc1ee47c359b25c21ba56c96022a17558
Author: felixcheung felixcheun...@hotmail.com
Date: 2015-08-21T17:10:14Z

    [SPARK-9317] [SPARKR] Change `show` to print DataFrame entries
[GitHub] spark pull request: [SPARK-9316] [SPARKR] Add support for filterin...
GitHub user felixcheung opened a pull request: https://github.com/apache/spark/pull/8394

[SPARK-9316] [SPARKR] Add support for filtering using `[` (synonym for filter / select)

Add support for

```
df[df$name == "Smith", c(1,2)]
df[df$age %in% c(19, 30), 1:2]
```

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/felixcheung/spark rsubset

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/8394.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #8394

commit 99109c4b592c57d1ba00a002b3d4b71ece10f954
Author: felixcheung felixcheun...@hotmail.com
Date: 2015-08-23T10:06:33Z

    R: Add support for subsetting + tests

commit 42e881a4e4c7d3e669400b44ec24c4af1e10f6da
Author: felixcheung felixcheun...@hotmail.com
Date: 2015-08-24T07:46:26Z

    add support for d[d$something > 0,], more tests

commit 16e0ba375ac12788478002ce96ca74206c2d437a
Author: felixcheung felixcheun...@hotmail.com
Date: 2015-08-24T07:47:15Z

    update example
[GitHub] spark pull request: [SPARK-8742][SPARKR] Improve SparkR error mess...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/7742#discussion_r35783192

--- Diff: core/src/main/scala/org/apache/spark/api/r/RBackendHandler.scala ---

```
@@ -69,8 +69,11 @@ private[r] class RBackendHandler(server: RBackend)
           case e: Exception =>
             logError(s"Removing $objId failed", e)
             writeInt(dos, -1)
+            writeString(dos, s"Removing $objId failed: ${e.getMessage}")
         }
-      case _ => dos.writeInt(-1)
+      case _ =>
+        dos.writeInt(-1)
+        writeString(dos, "Unknown error")
```

--- End diff --

Isn't this an unknown method call? Should this say "unknown method" or a similar error instead?
[GitHub] spark pull request: [SPARK-8742][SPARKR] Improve SparkR error mess...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/7742#discussion_r35783283

--- Diff: core/src/main/scala/org/apache/spark/api/r/RBackendHandler.scala ---

```
@@ -148,6 +151,9 @@ private[r] class RBackendHandler(server: RBackend)
       case e: Exception =>
         logError(s"$methodName on $objId failed", e)
         writeInt(dos, -1)
+        // Writing the error message of the cause for the exception. This will be returned
+        // to user in the R process.
+        writeString(dos, e.getCause.getMessage)
```

--- End diff --

Would it make sense to write the stack trace too? Often it is much more useful to have the call stack in addition to the exception message.
[GitHub] spark pull request: [SPARK-8742][SPARKR] Improve SparkR error mess...
Github user felixcheung commented on the pull request: https://github.com/apache/spark/pull/7742#issuecomment-126209438

looks good! thanks
[GitHub] spark pull request: [SPARK-8742][SPARKR] Improve SparkR error mess...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/7742#discussion_r35901689

--- Diff: core/src/main/scala/org/apache/spark/api/r/RBackendHandler.scala ---

```
@@ -148,6 +151,9 @@ private[r] class RBackendHandler(server: RBackend)
       case e: Exception =>
         logError(s"$methodName on $objId failed", e)
```

--- End diff --

Good point, this is logging the InvocationTargetException too.
[GitHub] spark pull request: [SPARK-10971][SPARKR] RRunner should allow set...
Github user felixcheung commented on the pull request: https://github.com/apache/spark/pull/9179#issuecomment-150311495

+1 on `spark.r.driver.command` and `spark.r.command`
[GitHub] spark pull request: [SPARKR] [SPARK-11199] Improve R context manag...
Github user felixcheung commented on the pull request: https://github.com/apache/spark/pull/9185#issuecomment-150318041

I vote for simplicity for SparkR and not having multiple sessions. In fact, I observe it is already messy to handle a DataFrame created by a different SparkContext (after stop() and init()). I would argue these concepts do not translate well to R, where for the most part 'session' == 'process'.
[GitHub] spark pull request: [SPARK-9319][SPARKR] Add support for setting c...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/9218#discussion_r42899259

--- Diff: R/pkg/R/DataFrame.R ---

```
@@ -276,6 +276,57 @@ setMethod("names<-",
             }
           })
 
+#' @rdname columns
+#' @name colnames
+setMethod("colnames",
+          signature(x = "DataFrame"),
+          function(x) {
+            columns(x)
+          })
+
+#' @rdname columns
+#' @name colnames<-
+setMethod("colnames<-",
+          signature(x = "DataFrame", value = "character"),
+          function(x, value) {
+            sdf <- callJMethod(x@sdf, "toDF", as.list(value))
+            dataFrame(sdf)
+          })
+
+#' coltypes
+#'
+#' Set the column types of a DataFrame.
+#'
+#' @name coltypes
+#' @param x (DataFrame)
+#' @return value (character) A character vector with the target column types for the given DataFrame
+#' @rdname coltypes
+#' @aliases coltypes
+#' @export
+#' @examples
+#'\dontrun{
+#' sc <- sparkR.init()
+#' sqlContext <- sparkRSQL.init(sc)
+#' path <- "path/to/file.json"
+#' df <- jsonFile(sqlContext, path)
+#' coltypes(df) <- c("string", "integer")
+#'}
+setMethod("coltypes<-",
```

--- End diff --

That's correct. I'm hoping #8984 can be merged soon so I can add a new reverse mapping in the same place. I could make this [WIP] if you'd like.
[GitHub] spark pull request: [SPARK-10903] [SPARKR] R - Simplify SQLContext...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/9192#discussion_r42716848

--- Diff: R/pkg/R/SQLContext.R ---

```
@@ -17,6 +17,34 @@
 
 # SQLcontext.R: SQLContext-driven functions
 
+#' Temporary function to reroute old S3 Method call to new
+#' We need to check the class of x to ensure it is SQLContext before dispatching
+dispatchFunc <- function(newFuncSig, x, ...) {
+  funcName <- as.character(sys.call(sys.parent())[[1]])
+  f <- get0(paste0(funcName, ".default"))
```

--- End diff --

changed. thanks
[GitHub] spark pull request: [SPARK-9319][SPARKR] Add support for setting c...
GitHub user felixcheung opened a pull request: https://github.com/apache/spark/pull/9218

[SPARK-9319][SPARKR] Add support for setting column names, types

Add support for colnames, colnames<-, coltypes<-. I will merge with PR 8984 (coltypes) once it is in, possibly looking into mapping R type names. @shivaram @sun-rui

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/felixcheung/spark colnamescoltypes

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/9218.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #9218

commit 071f29f998f86a5a05744c703bf6a9a2384c3805
Author: felixcheung <felixcheun...@hotmail.com>
Date: 2015-10-22T06:48:43Z

    Add support for colnames, colnames<-, coltypes<-
[GitHub] spark pull request: [SPARK-9319][SPARKR] Add support for setting c...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/9218#discussion_r42786136

--- Diff: R/pkg/R/DataFrame.R ---

```
@@ -276,6 +276,57 @@ setMethod("names<-",
             }
           })
 
+#' @rdname columns
+#' @name colnames
```

--- End diff --

In the R doc, it will be included under the `columns` page because of the `@rdname` notation. https://cran.r-project.org/web/packages/roxygen2/vignettes/rd.html
[GitHub] spark pull request: [SPARK-9319][SPARKR] Add support for setting c...
Github user felixcheung commented on the pull request: https://github.com/apache/spark/pull/9218#issuecomment-150311963

@sun-rui `names` and `names<-` are already there, this is to add `colnames`.
[GitHub] spark pull request: [SPARK-10903] [SPARKR] R - Simplify SQLContext...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/9192#discussion_r42664614

--- Diff: R/pkg/R/SQLContext.R ---

```
@@ -17,6 +17,34 @@
 
 # SQLcontext.R: SQLContext-driven functions
 
+#' Temporary function to reroute old S3 Method call to new
+#' We need to check the class of x to ensure it is SQLContext before dispatching
+dispatchFunc <- function(newFuncSig, x, ...) {
+  funcName <- as.character(sys.call(sys.parent())[[1]])
+  f <- get0(paste0(funcName, ".default"))
+  # Strip sqlContext from list of parameters and then pass the rest along.
+  # In the following, if '&' is used instead of '&&', it warns about
+  # "the condition has length > 1 and only the first element will be used"
+  if (class(x) == "jobj" &&
+      grepl("org.apache.spark.sql.SQLContext", capture.output(show(x)))) {
```

--- End diff --

updated. thanks
[GitHub] spark pull request: [SPARK-10903] [SPARKR] R - Simplify SQLContext...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/9192#discussion_r42662379

--- Diff: R/pkg/R/SQLContext.R ---

```
@@ -17,6 +17,34 @@
 
 # SQLcontext.R: SQLContext-driven functions
 
+#' Temporary function to reroute old S3 Method call to new
+#' We need to check the class of x to ensure it is SQLContext before dispatching
+dispatchFunc <- function(newFuncSig, x, ...) {
+  funcName <- as.character(sys.call(sys.parent())[[1]])
+  f <- get0(paste0(funcName, ".default"))
+  # Strip sqlContext from list of parameters and then pass the rest along.
+  # In the following, if '&' is used instead of '&&', it warns about
+  # "the condition has length > 1 and only the first element will be used"
+  if (class(x) == "jobj" &&
+      grepl("org.apache.spark.sql.SQLContext", capture.output(show(x)))) {
+    .Deprecated(newFuncSig, old = paste0(funcName, "(sqlContext...)"))
+    f(...)
+  } else {
+    f(x, ...)
+  }
+}
```

--- End diff --

The proposal here is to eliminate the sqlContext parameter from SQLContext-parity methods in R. Primarily this makes methods friendlier and more R-like (e.g. `read.df()`). The changed method signature would be the one we would like to keep in the next release. Reasons for this have been discussed in the JIRA, but to recap:

1. We only support one sqlContext in R, and having multiple at a time can be very confusing (e.g. a table not being accessible).
2. Between hiveCtx and sqlContext, hiveCtx is preferred.
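The rerouting discussed in this thread can be sketched in Python (the real code is R S3 dispatch; every name below is hypothetical and only illustrates the shape of the shim): a wrapper inspects the first argument, and if it is the old context object, it warns about the deprecated signature and drops that argument before calling the new-style function.

```python
import warnings

class SQLContext:
    """Hypothetical stand-in for the old context object (illustration only)."""
    pass

def dispatch_deprecated_context(func):
    """If the first argument is the old context, warn and strip it,
    then pass the remaining arguments to the new-style function."""
    def wrapper(x, *args, **kwargs):
        if isinstance(x, SQLContext):
            warnings.warn("passing sqlContext is deprecated", DeprecationWarning)
            return func(*args, **kwargs)  # strip the context, pass the rest along
        return func(x, *args, **kwargs)
    return wrapper

@dispatch_deprecated_context
def read_df(path):
    return "reading " + path

print(read_df("data.json"))                # new signature
print(read_df(SQLContext(), "data.json"))  # old signature still works, with a warning
```

Both calls produce the same result, which is the point of the shim: the old call sites keep working during the deprecation window while the documented signature changes.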
[GitHub] spark pull request: [SPARK-10903] [SPARKR] R - Simplify SQLContext...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/9192#discussion_r42659643

--- Diff: R/pkg/R/SQLContext.R ---

```
+#' Temporary function to reroute old S3 Method call to new
```

--- End diff --

"reroute" was the term corresponding to "dispatch". "temporary" was referring to the fact that we intend this to go away - please see my other answer regarding your question on this.
[GitHub] spark pull request: [SPARK-10903] [SPARKR] R - Simplify SQLContext...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/9192#discussion_r42659746

--- Diff: R/pkg/R/SQLContext.R ---
@@ -17,6 +17,34 @@
 # SQLcontext.R: SQLContext-driven functions

+#' Temporary function to reroute old S3 Method call to new
+#' We need to check the class of x to ensure it is SQLContext before dispatching
+dispatchFunc <- function(newFuncSig, x, ...) {
+  funcName <- as.character(sys.call(sys.parent())[[1]])
+  f <- get0(paste0(funcName, ".default"))
--- End diff --

get0 is in {base}, right? https://stat.ethz.ch/R-manual/R-devel/library/base/html/exists.html
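For reference, `get0` is indeed in base R (since 3.2.0); unlike `get`, it returns `NULL` (or a supplied `ifnotfound` value) rather than signaling an error when the name is not found, which is why it suits the dispatcher's lookup. A minimal sketch - `read.df.default` here is a made-up stand-in for the `.default` methods being looked up:

```r
# get0() returns the bound object if the name exists, and ifnotfound
# (NULL by default) if it does not -- get() would throw an error instead.
read.df.default <- function(...) "dispatched"  # hypothetical .default method

f <- get0(paste0("read.df", ".default"))  # the lookup dispatchFunc performs
g <- get0("no.such.function.default")     # missing name: NULL, no error

stopifnot(is.function(f), is.null(g))
```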
[GitHub] spark pull request: [SPARK-9319][SPARKR] Add support for setting c...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/9218#discussion_r42801664

--- Diff: R/pkg/R/DataFrame.R ---
@@ -276,6 +276,57 @@ setMethod("names<-",
   }
 })

+#' @rdname columns
+#' @name colnames
+setMethod("colnames",
+          signature(x = "DataFrame"),
+          function(x) {
+            columns(x)
+          })
+
+#' @rdname columns
+#' @name colnames<-
+setMethod("colnames<-",
+          signature(x = "DataFrame", value = "character"),
+          function(x, value) {
+            sdf <- callJMethod(x@sdf, "toDF", as.list(value))
+            dataFrame(sdf)
+          })
+
+#' coltypes
+#'
+#' Set the column types of a DataFrame.
+#'
+#' @name coltypes
+#' @param x (DataFrame)
+#' @return value (character) A character vector with the target column types for the given DataFrame
+#' @rdname coltypes
+#' @aliases coltypes
+#' @export
+#' @examples
+#'\dontrun{
+#' sc <- sparkR.init()
+#' sqlContext <- sparkRSQL.init(sc)
+#' path <- "path/to/file.json"
+#' df <- jsonFile(sqlContext, path)
+#' coltypes(df) <- c("string", "integer")
+#'}
+setMethod("coltypes<-",
--- End diff --

Certainly, it is in PR 8984 by @olarayej
[GitHub] spark pull request: SPARK-11258 Remove quadratic runtime complexit...
Github user felixcheung commented on the pull request: https://github.com/apache/spark/pull/9222#issuecomment-150382796 Do you have benchmark numbers for this change?
[GitHub] spark pull request: [SPARK-8277][SPARKR] Faster createDataFrame us...
Github user felixcheung commented on the pull request: https://github.com/apache/spark/pull/9234#issuecomment-150382185 Hi, thanks for the contribution - you might want to check out the ongoing work in https://github.com/apache/spark/pull/9099 and SPARK-11086
[GitHub] spark pull request: [SPARK-11340][SPARKR] Support setting driver p...
GitHub user felixcheung opened a pull request: https://github.com/apache/spark/pull/9290

[SPARK-11340][SPARKR] Support setting driver properties when starting Spark from R programmatically or from RStudio

Mapping spark.driver.memory from sparkEnvir to spark-submit command-line arguments.

@shivaram suggested that we possibly add other spark.driver.* properties - do we want to add all of those? I thought those could be set in SparkConf? @sun-rui

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/felixcheung/spark rdrivermem

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/9290.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #9290

commit d3f8d280098f42615c4d63d64d8797c8c76a8970
Author: felixcheung <felixcheun...@hotmail.com>
Date: 2015-10-27T05:07:49Z

    Support setting spark.driver.memory from sparkEnvir when launching JVM backend
[GitHub] spark pull request: [SPARK-11340][SPARKR] Support setting driver p...
Github user felixcheung commented on the pull request: https://github.com/apache/spark/pull/9290#issuecomment-151377437

Manual testing with:
```
library(SparkR, lib.loc='/opt/spark-1.6.0-bin-hadoop2.6/R/lib')
sc <- sparkR.init(master = "local[*]", sparkEnvir = list(spark.driver.memory = "2g"))
```
before
![image](https://cloud.githubusercontent.com/assets/8969467/10750094/904518ee-7c2e-11e5-8800-c67d45b13183.png)
after
![image](https://cloud.githubusercontent.com/assets/8969467/10750097/960b95be-7c2e-11e5-9669-53b6a3fc7665.png)
[GitHub] spark pull request: [SPARK-11210][SPARKR] Add window functions int...
Github user felixcheung commented on the pull request: https://github.com/apache/spark/pull/9196#issuecomment-151381136 looks good!
[GitHub] spark pull request: [SPARK-11340][SPARKR] Support setting driver p...
Github user felixcheung commented on the pull request: https://github.com/apache/spark/pull/9290#issuecomment-151378899 I checked: the user could also set SPARK_DRIVER_MEMORY before running `sparkR.init()` - https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala#L157
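A sketch of that alternative route, assuming (as the comment implies) that the JVM backend launched by `sparkR.init()` inherits the R process environment:

```r
# Setting SPARK_DRIVER_MEMORY in the R session before sparkR.init() lets
# SparkSubmitArguments pick it up when the JVM backend is launched.
Sys.setenv(SPARK_DRIVER_MEMORY = "2g")
stopifnot(Sys.getenv("SPARK_DRIVER_MEMORY") == "2g")

# then initialize as usual (requires a Spark installation):
# library(SparkR)
# sc <- sparkR.init(master = "local[*]")
```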
[GitHub] spark pull request: [SPARK-9319][SPARKR] Add support for setting c...
Github user felixcheung commented on the pull request: https://github.com/apache/spark/pull/9218#issuecomment-151379530 @sun-rui That's a great point - as its signature is defined, `coltypes()` would only return a list of simple types. But how would one create a DataFrame with a complex type from R? I tried a bit and couldn't get it to work: I get either `Unsupported type for DataFrame: factor` or `unexpected type: environment`
[GitHub] spark pull request: [SPARK-11340][SPARKR] Support setting driver p...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/9290#discussion_r43430720

--- Diff: R/pkg/R/sparkR.R ---
@@ -93,7 +93,7 @@ sparkR.stop <- function() {
 #' sc <- sparkR.init("local[2]", "SparkR", "/home/spark",
 #'                  list(spark.executor.memory="1g"))
 #' sc <- sparkR.init("yarn-client", "SparkR", "/home/spark",
-#'                  list(spark.executor.memory="1g"),
+#'                  list(spark.executor.memory="4g", spark.driver.memory="2g"),
--- End diff --

updated. created JIRA for the additional programming guide change: https://issues.apache.org/jira/browse/SPARK-11407 - I could take a shot at the doc change if you'd like
[GitHub] spark pull request: [SPARK-8019] [SPARKR] Support SparkR spawning ...
Github user felixcheung commented on the pull request: https://github.com/apache/spark/pull/6557#issuecomment-152292218 This is updated by #9179
[GitHub] spark pull request: [SPARK-11409][SPARKR] Enable url link in R doc...
GitHub user felixcheung opened a pull request: https://github.com/apache/spark/pull/9363

[SPARK-11409][SPARKR] Enable url link in R doc for Persist

Quick one-line doc fix - the link is not clickable
![image](https://cloud.githubusercontent.com/assets/8969467/10833041/4e91dd7c-7e4c-11e5-8905-713b986dbbde.png)

@shivaram

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/felixcheung/spark rpersistdoc

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/9363.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #9363

commit 808cbac97a66cbf8453b377baf26571a7ccbe707
Author: felixcheung <felixcheun...@hotmail.com>
Date: 2015-10-29T21:48:31Z

    Enable url link in R doc for Persist
[GitHub] spark pull request: [SPARK-11294][SPARKR] Improve R doc for read.d...
GitHub user felixcheung opened a pull request: https://github.com/apache/spark/pull/9261

[SPARK-11294][SPARKR] Improve R doc for read.df, write.df, saveAsTable

Add examples for read.df, write.df; fix grouping for read.df, loadDF; fix formatting and text truncation for write.df, saveAsTable.

Several text issues:
![image](https://cloud.githubusercontent.com/assets/8969467/10708590/1303a44e-79c3-11e5-854f-3a2e16854cd7.png)
- text collapsed into a single paragraph
- text truncated at 2 places, e.g. "overwrite: Existing data is expected to be overwritten by the contents of error:"

@shivaram

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/felixcheung/spark rdocreadwritedf

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/9261.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #9261

commit edd58ef8f2aa64dcb3e5c6a777bffcb74b255cec
Author: felixcheung <felixcheun...@hotmail.com>
Date: 2015-10-24T03:16:00Z

    Add example for R read.df, write.df; fix formatting and text truncation for write.df, saveAsTable
[GitHub] spark pull request: [SPARK-10863][SPARKR] Method coltypes() to get...
Github user felixcheung commented on the pull request: https://github.com/apache/spark/pull/8984#issuecomment-151611692

@shivaram as I was discussing with @sun-rui in #9218 - I think coltypes() could probably handle complex types (JVM -> R) by mapping "map<string,int>" -> "environment", "array" -> "list", but this conversion is not perfect, at least from R -> JVM. I can have this:
```
> e <- new.env()
> e[["abd"]] <- 1276
> e[["84798"]] <- "abc"
> l <- list(e)
> df <- createDataFrame(sqlContext, list(l))
> df
DataFrame[_1:map<string,string>]
```
So although `env` supports mixed-type values, they are mapped to `map<string, string>` on the JVM. In fact, this DataFrame doesn't seem to work properly:
```
> head(df)
15/10/27 19:02:02 ERROR Executor: Exception in task 0.0 in stage 6.0 (TID 6)
scala.MatchError: 1276.0 (of class java.lang.Double)
	at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:295)
```
[GitHub] spark pull request: [SPARK-11343] [ML] Allow float and double pred...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/9296#discussion_r43195310

--- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/RegressionEvaluator.scala ---
@@ -72,10 +73,13 @@ final class RegressionEvaluator @Since("1.4.0") (@Since("1.4.0") override val ui
   @Since("1.4.0")
   override def evaluate(dataset: DataFrame): Double = {
     val schema = dataset.schema
-    SchemaUtils.checkColumnType(schema, $(predictionCol), DoubleType)
-    SchemaUtils.checkColumnType(schema, $(labelCol), DoubleType)
+    val predictionType = schema($(predictionCol)).dataType
+    require(predictionType == FloatType || predictionType == DoubleType)
--- End diff --

should we add a message to require()?
[GitHub] spark pull request: [SPARK-10863][SPARKR] Method coltypes() to get...
Github user felixcheung commented on the pull request: https://github.com/apache/spark/pull/8984#issuecomment-151654117 The error looks different - possibly related, but not exactly the same cause
[GitHub] spark pull request: [SPARK-11215] [ML] Add multiple columns suppor...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/9183#discussion_r43221520

--- Diff: R/pkg/inst/tests/test_mllib.R ---
@@ -56,14 +56,3 @@ test_that("feature interaction vs native glm", {
   rVals <- predict(glm(Sepal.Width ~ Species:Sepal.Length, data = iris), iris)
   expect_true(all(abs(rVals - vals) < 1e-6), rVals - vals)
 })
-
-test_that("summary coefficients match with native glm", {
--- End diff --

why is this removed?
[GitHub] spark pull request: [SPARK-10863][SPARKR] Method coltypes() to get...
Github user felixcheung commented on the pull request: https://github.com/apache/spark/pull/8984#issuecomment-152069034 I'm concerned with the lack of "reversibility", as in `coltypes(x) <- coltypes(x)`. Could I propose expanding on your suggestion to return `NA` for non-atomic types? `coltypes<-()` can skip over `NA` entries, and `coltypes()` can take an optional argument, defaulting to atomic.only = TRUE, which can be set to FALSE to get `list`, `environment`, `struct` etc. for non-atomic vector types.
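A rough sketch of the semantics proposed above, using a plain named vector in place of a real DataFrame schema - the helper names and the `atomic.only` flag are part of the proposal, not an existing SparkR API:

```r
# Hypothetical illustration of the proposed behavior: coltypes() returns NA
# for columns whose Spark type has no atomic R equivalent, and coltypes<-()
# would skip NA entries so that coltypes(x) <- coltypes(x) is a no-op.
spark_types <- c(a = "string", b = "integer", c = "map<string,int>")

r_type_of <- function(t) {
  switch(t, string = "character", integer = "integer", NA_character_)
}

coltypes_sketch <- function(types, atomic.only = TRUE) {
  r <- vapply(types, r_type_of, character(1))
  if (!atomic.only) r[is.na(r)] <- "environment"  # e.g. map -> environment
  r
}

stopifnot(is.na(coltypes_sketch(spark_types)[["c"]]))
stopifnot(coltypes_sketch(spark_types, atomic.only = FALSE)[["c"]] == "environment")
```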
[GitHub] spark pull request: [SPARK-11329] [SQL] Support star expansion for...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/9343#discussion_r43342070

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala ---
@@ -146,7 +146,11 @@ case class Alias(child: Expression, name: String)(
   override def toAttribute: Attribute = {
     if (resolved) {
-      AttributeReference(name, child.dataType, child.nullable, metadata)(exprId, qualifiers)
+      // Append the name of the alias as a qualifier. This lets us resolve things like:
+      //   (SELECT struct(a,b) AS x FROM ...).SELECT x.*
+      // TODO: is this the best way to do this? Should Alias just have nameas the qualifier?
--- End diff --

'nameas' -> 'name as'?
[GitHub] spark pull request: [SPARK-11340][SPARKR] Support setting driver p...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/9290#discussion_r43346365

--- Diff: R/pkg/R/sparkR.R ---
@@ -93,7 +93,7 @@ sparkR.stop <- function() {
 #' sc <- sparkR.init("local[2]", "SparkR", "/home/spark",
 #'                  list(spark.executor.memory="1g"))
 #' sc <- sparkR.init("yarn-client", "SparkR", "/home/spark",
-#'                  list(spark.executor.memory="1g"),
+#'                  list(spark.executor.memory="4g", spark.driver.memory="2g"),
--- End diff --

I could take this out - this is more for the API doc/roxygen. If we move it out, it would be visible only to someone reading the code. Maybe pointing to the SparkR programming guide is better.
[GitHub] spark pull request: [SPARK-11210][SPARKR][WIP] Add window function...
Github user felixcheung commented on the pull request: https://github.com/apache/spark/pull/9196#issuecomment-150021663 This is merging 2 PR/JIRA?
[GitHub] spark pull request: [SPARK-11284] [ML] ALS produces float predicti...
Github user felixcheung commented on the pull request: https://github.com/apache/spark/pull/9252#issuecomment-150700335 Shouldn't this be fixed/cast in the `RegressionEvaluator` instead?
[GitHub] spark pull request: [SPARK-11263][SPARKR] lintr Throws Warnings on...
Github user felixcheung commented on the pull request: https://github.com/apache/spark/pull/9463#issuecomment-155598069 @shivaram ?
[GitHub] spark pull request: [SPARK-11468][SPARKR] add stddev/variance agg ...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/9489#discussion_r44482198

--- Diff: R/pkg/R/functions.R ---
@@ -974,6 +1006,54 @@ setMethod("soundex",
   column(jc)
 })

+#' stddev
+#'
+#' Aggregate function: alias for \link{stddev_samp}
+#'
+#' @rdname stddev
+#' @name stddev
+#' @family agg_funcs
+#' @export
+#' @examples \dontrun{stddev(df$c)}
+setMethod("stddev",
+          signature(x = "Column"),
+          function(x) {
--- End diff --

This is an alias on the Scala/Spark side
[GitHub] spark pull request: [SPARK-11468][SPARKR] add stddev/variance agg ...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/9489#discussion_r44482223

--- Diff: R/pkg/R/functions.R ---
@@ -1168,6 +1248,54 @@ setMethod("upper",
   column(jc)
 })

+#' variance
+#'
+#' Aggregate function: alias for \link{var_samp}.
+#'
+#' @rdname variance
+#' @name variance
+#' @family agg_funcs
+#' @export
+#' @examples \dontrun{variance(df$c)}
+setMethod("variance",
+          signature(x = "Column"),
+          function(x) {
--- End diff --

Same here.
[GitHub] spark pull request: [SPARK-11567][PYTHON] Add Python API for corr ...
Github user felixcheung commented on the pull request: https://github.com/apache/spark/pull/9536#issuecomment-155597890 @davis?
[GitHub] spark pull request: [SPARK-10863][SPARKR] Method coltypes() (New v...
Github user felixcheung commented on the pull request: https://github.com/apache/spark/pull/9579#issuecomment-155258819 doc comment, looks good to me otherwise. @sun-rui @shivaram
[GitHub] spark pull request: [SPARK-10863][SPARKR] Method coltypes() (New v...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/9579#discussion_r44361878

--- Diff: R/pkg/R/DataFrame.R ---
@@ -2152,3 +2152,47 @@ setMethod("with",
   newEnv <- assignNewEnv(data)
   eval(substitute(expr), envir = newEnv, enclos = newEnv)
 })

+#' Returns the column types of a DataFrame.
+#'
+#' @name coltypes
+#' @title Get column types of a DataFrame
+#' @param x (DataFrame)
+#' @return value (character) A character vector with the column types of the given DataFrame
+#' @rdname coltypes
--- End diff --

Please add a family
```
#' @family dataframe_funcs
```
and an example, like
```
#' @examples
#' \dontrun{
#' with(irisDf, nrow(Sepal_Width))
#' }
```
[GitHub] spark pull request: Flaky SparkR test: test_sparkSQL.R: sample on ...
Github user felixcheung commented on the pull request: https://github.com/apache/spark/pull/9549#issuecomment-154880937 @shivaram @adrian555
[GitHub] spark pull request: Flaky SparkR test: test_sparkSQL.R: sample on ...
GitHub user felixcheung opened a pull request: https://github.com/apache/spark/pull/9549

Flaky SparkR test: test_sparkSQL.R: sample on a DataFrame

Make the sample test less flaky by setting the seed

Tested with
```
repeat {
  if (count(sample(df, FALSE, 0.1)) == 3) {
    break
  }
}
```

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/felixcheung/spark rsample

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/9549.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #9549

commit 6869ba1acce68d92e0b517ef19e376f94f0d8d9a
Author: felixcheung <felixcheun...@hotmail.com>
Date: 2015-11-08T22:11:46Z

    Make sample test less flaky
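The underlying flakiness is generic to fractional sampling: without a fixed seed, the size of a 10% sample varies from run to run. A plain-R analogue (not using SparkR) of why pinning the seed stabilizes the test:

```r
# Each row is kept independently with probability 0.1, so the sample size
# is binomial and varies between runs -- unless the RNG seed is fixed.
set.seed(42)
n1 <- sum(runif(100) < 0.1)

set.seed(42)
n2 <- sum(runif(100) < 0.1)  # same seed, same draws, same size

stopifnot(n1 == n2)
```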
[GitHub] spark pull request: SPARK-11420 Updating Stddev support via Impera...
Github user felixcheung commented on the pull request: https://github.com/apache/spark/pull/9380#issuecomment-155970380

This SparkR support has just been added, so this change breaks tests:
```
1. Failure (at test_sparkSQL.R#1010): group by, agg functions --
0 not equal to df3_local[df3_local$name == "Andy", ][1, 2]
NaN - 0 == NaN

2. Failure (at test_sparkSQL.R#1041): group by, agg functions --
0 not equal to df7_local[df7_local$name == "ID2", ][1, 2]
NaN - 0 == NaN
```
[GitHub] spark pull request: [SPARK-9319][SPARKR] Add support for setting c...
Github user felixcheung closed the pull request at: https://github.com/apache/spark/pull/9218
[GitHub] spark pull request: [SPARK-9319][SPARKR] Add support for setting c...
GitHub user felixcheung opened a pull request: https://github.com/apache/spark/pull/9654 [SPARK-9319][SPARKR] Add support for setting column names, types Add support for colnames, colnames<-, coltypes<-. Also added tests for names and names<-, which had no tests previously. I merged with PR 8984 (coltypes). Clicked the wrong thing and screwed up the PR; recreated it here. Was #9218 @shivaram @sun-rui You can merge this pull request into a Git repository by running: $ git pull https://github.com/felixcheung/spark colnamescoltypes Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/9654.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #9654 commit 7eb4815ffaf62dd7678b81f7aaf1bb49878a9303 Author: felixcheung <felixcheun...@hotmail.com> Date: 2015-10-22T06:48:43Z Add support for colnames, colnames<-, coltypes<- commit 009d755c43357b8e7839783e5975fa40cc1b1b66 Author: felixcheung <felixcheun...@hotmail.com> Date: 2015-11-01T20:33:28Z Take R types instead to map to JVM types, add check for NA to keep column commit 44662e62be27b55c33c04fae7b08ad1dc52a7c48 Author: felixcheung <felixcheun...@hotmail.com> Date: 2015-11-01T21:27:10Z This seems to fix the Rd error - no idea why it worked before. commit e846f9df26f593f70a0b140cd8226dec950802ff Author: felixcheung <felixcheun...@hotmail.com> Date: 2015-11-02T00:33:40Z fix test broken from column name change from cast commit c102bc0f3436d5840f706f38bd1882fba382e088 Author: felixcheung <felixcheun...@hotmail.com> Date: 2015-11-12T06:40:08Z rebase, merge with coltypes change, fix generic, doc
[GitHub] spark pull request: [SPARK-10500][SPARKR] sparkr.zip cannot be cre...
Github user felixcheung commented on the pull request: https://github.com/apache/spark/pull/9390#issuecomment-156024874 I don't fully understand the issue, but why do we have to use `.libPaths` and not `library(..., lib.loc = )`?
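The distinction being asked about can be sketched in plain R; the directory path here is hypothetical:

```r
# Option A: prepend a directory to the library search path.
# This affects every later library()/require() call in the session.
.libPaths(c("/tmp/sparkr-lib", .libPaths()))
library(SparkR)

# Option B: point a single library() call at the directory.
# The global search path is left untouched.
library(SparkR, lib.loc = "/tmp/sparkr-lib")
```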
[GitHub] spark pull request: SPARK-11420 Updating Stddev support via Impera...
Github user felixcheung commented on the pull request: https://github.com/apache/spark/pull/9380#issuecomment-156024999 @JihongMA yep, that should fix them
[GitHub] spark pull request: [SPARK-11031][SPARKR] Method str() on a DataFr...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/9613#discussion_r44734240 --- Diff: R/pkg/R/DataFrame.R --- @@ -2200,4 +2200,107 @@ setMethod("coltypes", rTypes[naIndices] <- types[naIndices] rTypes - }) \ No newline at end of file + }) + +#' Display the structure of a DataFrame, including column names, column types, as well as a +#' a small sample of rows. +#' @name str +#' @title Compactly display the structure of a dataset +#' @rdname str_data_frame +#' @family dataframe_funcs +#' @param x a DataFrame +#' @examples \dontrun{ +#' +#' # Create a DataFrame from the Iris dataset +#' irisDF <- createDataFrame(sqlContext, iris) +#' +#' # Show the structure of the DataFrame +#' str(irisDF) +#' +#' } +setMethod("str", signature="DataFrame", definition= +function(object) { --- End diff -- This is a little odd, try with the same format of coltypes: ``` setMethod("coltypes", signature(x = "DataFrame"), function(x) { ```
[GitHub] spark pull request: [SPARK-11031][SPARKR] Method str() on a DataFr...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/9613#discussion_r44734873 --- Diff: R/pkg/inst/tests/test_sparkSQL.R --- @@ -1525,6 +1525,22 @@ test_that("Method coltypes() to get R's data types of a DataFrame", { expect_equal(coltypes(x), "map<string,string>") }) +test_that("Method str()", { + # Structure of Iiris + iris2 <- iris + iris2$col <- TRUE + irisDF2 <- createDataFrame(sqlContext, iris2) + out <- capture.output(str(irisDF2)) + expect_equal(length(out), 7) + + # A random dataset with many columns + x <- runif(200, 1, 10) + df <- data.frame(t(as.matrix(data.frame(x,x,x,x,x,x,x,x,x + DF <- createDataFrame(sqlContext, df) + out <- capture.output(str(DF)) + expect_equal(length(out), 103) --- End diff -- could you please check for some specific values/text?
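A hedged sketch of what checking specific text could look like, using base R's str() on the local iris data.frame rather than the SparkR method under review:

```r
# Capture str() output and assert on its content, not just its length.
out <- capture.output(str(iris))

# The first line describes the object and its dimensions.
stopifnot(grepl("^'data.frame'", out[1]))

# Each column name should appear somewhere in the output.
stopifnot(any(grepl("Sepal.Length", out, fixed = TRUE)))
```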
[GitHub] spark pull request: [SPARK-11031][SPARKR] Method str() on a DataFr...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/9613#discussion_r44734936 --- Diff: R/pkg/R/generics.R --- @@ -1050,4 +1049,7 @@ setGeneric("with") #' @rdname coltypes #' @export -setGeneric("coltypes", function(x) { standardGeneric("coltypes") }) \ No newline at end of file +setGeneric("coltypes", function(x) { standardGeneric("coltypes") }) + +#' @export +setGeneric("str") --- End diff -- this should be sorted
[GitHub] spark pull request: [SPARK-11031][SPARKR] Method str() on a DataFr...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/9613#discussion_r44735202 --- Diff: R/pkg/R/DataFrame.R --- @@ -2200,4 +2200,107 @@ setMethod("coltypes", rTypes[naIndices] <- types[naIndices] rTypes - }) \ No newline at end of file + }) + +#' Display the structure of a DataFrame, including column names, column types, as well as a +#' a small sample of rows. +#' @name str +#' @title Compactly display the structure of a dataset +#' @rdname str_data_frame +#' @family dataframe_funcs +#' @param x a DataFrame +#' @examples \dontrun{ +#' +#' # Create a DataFrame from the Iris dataset +#' irisDF <- createDataFrame(sqlContext, iris) +#' +#' # Show the structure of the DataFrame +#' str(irisDF) +#' +#' } +setMethod("str", signature="DataFrame", definition= +function(object) { + + # A synonym for easily concatenating strings + "%++%" <- function(x, y) { +paste(x, y, sep = "") + } + + # TODO: These could be made global parameters, though in R it's not the case + DEFAULT_HEAD_ROWS <- 6 + MAX_CHAR_PER_ROW <- 120 + MAX_COLS <- 100 + + # Get the column names and types of the DataFrame + names <- names(object) + types <- coltypes(object) + + # Get the number of rows. + # TODO: Ideally, this should be cached + cachedCount <- nrow(object) + + # Get the first elements of the dataset. Limit number of columns accordingly + dataFrame <- if (ncol(object) > MAX_COLS) { + head(object[, c(1:MAX_COLS)], DEFAULT_HEAD_ROWS) + } else { + head(object, DEFAULT_HEAD_ROWS) + } + + # The number of observations will be displayed only if the number + # of rows of the dataset has already been cached. + if (!is.null(cachedCount)) { +cat("'" %++% class(object) %++% "': " %++% cachedCount %++% " obs. of " %++% + length(names) %++% " variables:\n") + } else { +cat("'" %++% class(object) %++% "': " %++% length(names) %++% " variables:\n") + } + + # Whether the ... should be printed at the end of each row + ellipsis <- FALSE + + # Add ellipsis (i.e., "...") if there are more rows than shown + if (!is.null(cachedCount)) { +if (nrow(object) > DEFAULT_HEAD_ROWS) { + ellipsis <- TRUE +} + } + + if (nrow(dataFrame) > 0) { +for (i in 1 : ncol(dataFrame)) { + firstElements <- "" + + # Get the first elements for each column + if (types[i] == "chr") { +firstElements <- paste("\"" %++% dataFrame[,i] %++% "\"", collapse = " ") + } else { +firstElements <- paste(dataFrame[,i], collapse = " ") + } + + # Add the corresponding number of spaces for alignment + spaces <- paste(rep(" ", max(nchar(names) - nchar(names[i]))), collapse="") + + # Get the short type. For 'character', it would be 'chr'; + # 'for numeric', it's 'num', etc. + dataType <- SHORT_TYPES[[types[i]]] + if (is.null(dataType)) { +dataType <- substring(types[i], 1, 3) + } + + # Concatenate the colnames, coltypes, and first + # elements of each column + line <- " $ " %++% names[i] %++% spaces %++% ": " %++% +dataType %++% " " %++% firstElements + + # Chop off extra characters if this is too long + cat(substr(line, 1, MAX_CHAR_PER_ROW)) --- End diff -- do we need to chop off 4 extra characters so the trailing " ..." fits within the limit?
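The chopping step under discussion can be sketched in isolation; `truncateLine` is a hypothetical helper, not part of the PR:

```r
# Truncate a line to maxChars, reserving 4 characters for a " ..." marker.
truncateLine <- function(line, maxChars = 120) {
  if (nchar(line) <= maxChars) {
    line
  } else {
    paste0(substr(line, 1, maxChars - 4), " ...")
  }
}

long <- strrep("x", 130)
nchar(truncateLine(long))  # stays within the 120-character budget
```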
[GitHub] spark pull request: [SPARK-11263][SPARKR] lintr Throws Warnings on...
Github user felixcheung commented on the pull request: https://github.com/apache/spark/pull/9463#issuecomment-156289622 thanks for catching that. I did some tests; I suspect they are caused by generics.R, so removing those.
[GitHub] spark pull request: [SPARK-11715][SPARKR] Add R support corr for C...
GitHub user felixcheung opened a pull request: https://github.com/apache/spark/pull/9680 [SPARK-11715][SPARKR] Add R support corr for Column Aggregration Need to match existing method signature You can merge this pull request into a Git repository by running: $ git pull https://github.com/felixcheung/spark rcorr Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/9680.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #9680 commit 68f1254ba525297857b0fc6959ca5d54c1509af9 Author: felixcheung <felixcheun...@hotmail.com> Date: 2015-11-13T00:18:28Z Add R support corr for Column Aggregration commit 940164c32aa75f3c74bdb966ecac69df602b5aa2 Author: felixcheung <felixcheun...@hotmail.com> Date: 2015-11-13T00:20:22Z fix doc text
[GitHub] spark pull request: [SPARK-11031][SPARKR] Method str() on a DataFr...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/9613#discussion_r44734132 --- Diff: R/pkg/R/DataFrame.R --- @@ -2200,4 +2200,107 @@ setMethod("coltypes", rTypes[naIndices] <- types[naIndices] rTypes - }) \ No newline at end of file + }) + +#' Display the structure of a DataFrame, including column names, column types, as well as a +#' a small sample of rows. +#' @name str +#' @title Compactly display the structure of a dataset +#' @rdname str_data_frame +#' @family dataframe_funcs +#' @param x a DataFrame +#' @examples \dontrun{ +#' +#' # Create a DataFrame from the Iris dataset +#' irisDF <- createDataFrame(sqlContext, iris) +#' +#' # Show the structure of the DataFrame +#' str(irisDF) +#' +#' } +setMethod("str", signature="DataFrame", definition= +function(object) { + + # A synonym for easily concatenating strings + "%++%" <- function(x, y) { +paste(x, y, sep = "") + } + + # TODO: These could be made global parameters, though in R it's not the case + DEFAULT_HEAD_ROWS <- 6 + MAX_CHAR_PER_ROW <- 120 + MAX_COLS <- 100 + + # Get the column names and types of the DataFrame + names <- names(object) + types <- coltypes(object) + + # Get the number of rows. + # TODO: Ideally, this should be cached + cachedCount <- nrow(object) + + # Get the first elements of the dataset. Limit number of columns accordingly + dataFrame <- if (ncol(object) > MAX_COLS) { + head(object[, c(1:MAX_COLS)], DEFAULT_HEAD_ROWS) + } else { + head(object, DEFAULT_HEAD_ROWS) + } + + # The number of observations will be displayed only if the number + # of rows of the dataset has already been cached. + if (!is.null(cachedCount)) { +cat("'" %++% class(object) %++% "': " %++% cachedCount %++% " obs. of " %++% + length(names) %++% " variables:\n") + } else { +cat("'" %++% class(object) %++% "': " %++% length(names) %++% " variables:\n") + } + + # Whether the ... should be printed at the end of each row + ellipsis <- FALSE + + # Add ellipsis (i.e., "...") if there are more rows than shown + if (!is.null(cachedCount)) { +if (nrow(object) > DEFAULT_HEAD_ROWS) { --- End diff -- collapse to `if (!is.null(cachedCount) && nrow(object) > DEFAULT_HEAD_ROWS)`?
[GitHub] spark pull request: [SPARK-11031][SPARKR] Method str() on a DataFr...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/9613#discussion_r44734047 --- Diff: R/pkg/R/DataFrame.R --- @@ -2200,4 +2200,107 @@ setMethod("coltypes", rTypes[naIndices] <- types[naIndices] rTypes - }) \ No newline at end of file + }) + +#' Display the structure of a DataFrame, including column names, column types, as well as a +#' a small sample of rows. +#' @name str +#' @title Compactly display the structure of a dataset +#' @rdname str_data_frame +#' @family dataframe_funcs +#' @param x a DataFrame +#' @examples \dontrun{ +#' +#' # Create a DataFrame from the Iris dataset +#' irisDF <- createDataFrame(sqlContext, iris) +#' +#' # Show the structure of the DataFrame +#' str(irisDF) +#' +#' } +setMethod("str", signature="DataFrame", definition= +function(object) { + + # A synonym for easily concatenating strings + "%++%" <- function(x, y) { +paste(x, y, sep = "") --- End diff -- use `paste0`?
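For reference, `paste0(x, y)` is equivalent to `paste(x, y, sep = "")`, so the helper can be written either way; a minimal check:

```r
"%++%" <- function(x, y) paste0(x, y)

# Both spellings produce the same concatenation.
stopifnot(identical("foo" %++% "bar", paste("foo", "bar", sep = "")))
```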
[GitHub] spark pull request: [SPARK-11715][SPARKR] Add R support corr for C...
Github user felixcheung commented on the pull request: https://github.com/apache/spark/pull/9680#issuecomment-156331258 I think #9366 is about computing a corr or cov matrix, whereas this is computing corr between two columns. They seem to be useful in their own ways. Also, this is already supported in Scala and Python.
[GitHub] spark pull request: [SPARK-11715][SPARKR] Add R support corr for C...
Github user felixcheung commented on the pull request: https://github.com/apache/spark/pull/9680#issuecomment-156484659 So #9366 is for all columns in a DataFrame (x = y, or different x and y DataFrames), and this #9680 is for two columns in one DataFrame.
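The split mirrors base R, where cor() handles both shapes; a small sketch with made-up vectors:

```r
x <- c(1, 2, 3, 4)
y <- c(2, 4, 6, 8)

# Correlation between two columns (what #9680 adds for SparkR Columns).
cor(x, y)                 # perfectly linear, so this is 1

# Correlation matrix over all columns (closer to what #9366 targets).
cor(cbind(a = x, b = y))  # 2 x 2 matrix
```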
[GitHub] spark pull request: [SPARK-11031][SPARKR] Method str() on a DataFr...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/9613#discussion_r44867038 --- Diff: R/pkg/R/generics.R --- @@ -971,6 +986,9 @@ setGeneric("size", function(x) { standardGeneric("size") }) #' @export setGeneric("soundex", function(x) { standardGeneric("soundex") }) +#' @export --- End diff -- add `#' @rdname str`
[GitHub] spark pull request: [SPARK-11031][SPARKR] Method str() on a DataFr...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/9613#discussion_r44867045 --- Diff: R/pkg/R/DataFrame.R --- @@ -2200,4 +2200,101 @@ setMethod("coltypes", rTypes[naIndices] <- types[naIndices] rTypes + }) + +#' Display the structure of a DataFrame, including column names, column types, as well as a +#' a small sample of rows. +#' @name str +#' @title Compactly display the structure of a dataset +#' @rdname str_data_frame +#' @family dataframe_funcs +#' @param x a DataFrame +#' @examples \dontrun{ +#' --- End diff -- please remove unneeded empty line after `\dontrun {` and before `}`
[GitHub] spark pull request: [SPARK-11031][SPARKR] Method str() on a DataFr...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/9613#discussion_r44867068 --- Diff: R/pkg/R/DataFrame.R --- @@ -2200,4 +2200,101 @@ setMethod("coltypes", rTypes[naIndices] <- types[naIndices] rTypes + }) + +#' Display the structure of a DataFrame, including column names, column types, as well as a +#' a small sample of rows. +#' @name str +#' @title Compactly display the structure of a dataset +#' @rdname str_data_frame +#' @family dataframe_funcs +#' @param x a DataFrame +#' @examples \dontrun{ +#' +#' # Create a DataFrame from the Iris dataset +#' irisDF <- createDataFrame(sqlContext, iris) +#' +#' # Show the structure of the DataFrame +#' str(irisDF) +#' +#' } +setMethod("str", + signature(object = "DataFrame"), + function(object) { + +# TODO: These could be made global parameters, though in R it's not the case +DEFAULT_HEAD_ROWS <- 6 +MAX_CHAR_PER_ROW <- 120 +MAX_COLS <- 100 + +# Get the column names and types of the DataFrame +names <- names(object) +types <- coltypes(object) + +# Get the number of rows. +# TODO: Ideally, this should be cached +cachedCount <- nrow(object) + +# Get the first elements of the dataset. Limit number of columns accordingly +dataFrame <- if (ncol(object) > MAX_COLS) { + head(object[, c(1:MAX_COLS)], DEFAULT_HEAD_ROWS) + } else { + head(object, DEFAULT_HEAD_ROWS) + } --- End diff -- if you call `head(object)` it would return the first 6 rows by default, perhaps leave it to the default behavior instead of passing in `DEFAULT_HEAD_ROWS` here?
[GitHub] spark pull request: [SPARK-11031][SPARKR] Method str() on a DataFr...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/9613#discussion_r44867029 --- Diff: R/pkg/R/DataFrame.R --- @@ -2200,4 +2200,101 @@ setMethod("coltypes", rTypes[naIndices] <- types[naIndices] rTypes + }) + +#' Display the structure of a DataFrame, including column names, column types, as well as a +#' a small sample of rows. +#' @name str +#' @title Compactly display the structure of a dataset +#' @rdname str_data_frame +#' @family dataframe_funcs +#' @param x a DataFrame +#' @examples \dontrun{ +#' +#' # Create a DataFrame from the Iris dataset +#' irisDF <- createDataFrame(sqlContext, iris) +#' +#' # Show the structure of the DataFrame +#' str(irisDF) +#' +#' } +setMethod("str", + signature(object = "DataFrame"), + function(object) { + +# TODO: These could be made global parameters, though in R it's not the case +DEFAULT_HEAD_ROWS <- 6 +MAX_CHAR_PER_ROW <- 120 +MAX_COLS <- 100 + +# Get the column names and types of the DataFrame +names <- names(object) +types <- coltypes(object) + +# Get the number of rows. +# TODO: Ideally, this should be cached +cachedCount <- nrow(object) + +# Get the first elements of the dataset. Limit number of columns accordingly +dataFrame <- if (ncol(object) > MAX_COLS) { + head(object[, c(1:MAX_COLS)], DEFAULT_HEAD_ROWS) + } else { + head(object, DEFAULT_HEAD_ROWS) + } + +# The number of observations will be displayed only if the number +# of rows of the dataset has already been cached. +if (!is.null(cachedCount)) { + cat(paste0("'", class(object), "': ", cachedCount, " obs. of ", +length(names), " variables:\n")) +} else { + cat(paste0("'", class(object), "': ", length(names), " variables:\n")) +} + +# Whether the ... should be printed at the end of each row +ellipsis <- FALSE + +# Add ellipsis (i.e., "...") if there are more rows than shown +if (!is.null(cachedCount) && (cachedCount > DEFAULT_HEAD_ROWS)) { + ellipsis <- TRUE +} + +if (nrow(dataFrame) > 0) { + for (i in 1 : ncol(dataFrame)) { +firstElements <- "" + +# Get the first elements for each column +if (types[i] == "chr") { --- End diff -- I understand that, the check on SHORT_TYPES is below in line 2278 though? here, `types` is still "character" right?
[GitHub] spark pull request: [SPARK-11031][SPARKR] Method str() on a DataFr...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/9613#discussion_r44867049 --- Diff: R/pkg/R/DataFrame.R --- @@ -2200,4 +2200,101 @@ setMethod("coltypes", rTypes[naIndices] <- types[naIndices] rTypes + }) + +#' Display the structure of a DataFrame, including column names, column types, as well as a +#' a small sample of rows. +#' @name str +#' @title Compactly display the structure of a dataset +#' @rdname str_data_frame --- End diff -- this should be `#' @rdname str`
[GitHub] spark pull request: [SPARK-11031][SPARKR] Method str() on a DataFr...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/9613#discussion_r44867056 --- Diff: R/pkg/R/DataFrame.R --- @@ -2200,4 +2200,101 @@ setMethod("coltypes", rTypes[naIndices] <- types[naIndices] rTypes + }) + +#' Display the structure of a DataFrame, including column names, column types, as well as a +#' a small sample of rows. +#' @name str +#' @title Compactly display the structure of a dataset +#' @rdname str_data_frame +#' @family dataframe_funcs +#' @param x a DataFrame --- End diff -- replace `x` with `object` to match the signature below
[GitHub] spark pull request: [SPARK-11031][SPARKR] Method str() on a DataFr...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/9613#discussion_r44867054 --- Diff: R/pkg/R/DataFrame.R --- @@ -2200,4 +2200,101 @@ setMethod("coltypes", rTypes[naIndices] <- types[naIndices] rTypes + }) + +#' Display the structure of a DataFrame, including column names, column types, as well as a +#' a small sample of rows. +#' @name str +#' @title Compactly display the structure of a dataset +#' @rdname str_data_frame +#' @family dataframe_funcs --- End diff -- This has been updated recently - you should see when you rebase the latest in master - it would be `#' @family DataFrame functions`
[GitHub] spark pull request: [SPARK-11684] [R] [ML] [Doc] Update SparkR glm...
Github user felixcheung commented on the pull request: https://github.com/apache/spark/pull/9727#issuecomment-157179595 looks good @shivaram
[GitHub] spark pull request: [SPARK-11715][SPARKR] Add R support corr for C...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/9680#discussion_r44974143 --- Diff: R/pkg/R/functions.R --- @@ -259,6 +259,20 @@ setMethod("column", function(x) { col(x) }) +#' corr +#' +#' Computes the Pearson Correlation Coefficient for two Columns. +#' +#' @rdname corr +#' @name corr +#' @family math_funcs +#' @export +#' @examples \dontrun{corr(df$c, df$d)} +setMethod("corr", signature(x = "Column", col1 = "Column", col2 = "missing", method = "missing"), --- End diff -- as for doc, the DataFrame corr in stats.R has `@rdname statfunctions` while this one has `@rdname corr`, so they go to different HTML pages generated by roxygen2
[GitHub] spark pull request: [SPARK-11715][SPARKR] Add R support corr for C...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/9680#discussion_r44973809 --- Diff: R/pkg/R/functions.R --- @@ -259,6 +259,20 @@ setMethod("column", function(x) { col(x) }) +#' corr +#' +#' Computes the Pearson Correlation Coefficient for two Columns. +#' +#' @rdname corr +#' @name corr +#' @family math_funcs +#' @export +#' @examples \dontrun{corr(df$c, df$d)} +setMethod("corr", signature(x = "Column", col1 = "Column", col2 = "missing", method = "missing"), --- End diff -- right, I like the approach of changing the existing generic definition. Perhaps we should align the method signature with stats::cor ``` cor(x, y = NULL, use = "everything", method = c("pearson", "kendall", "spearman")) ``` Do you know why we decided to name it `corr` (vs. `cor`) in other places?
[GitHub] spark pull request: [SPARK-11756][SPARKR] Fix use of aliases - Spa...
GitHub user felixcheung opened a pull request: https://github.com/apache/spark/pull/9750 [SPARK-11756][SPARKR] Fix use of aliases - SparkR can not output help information for SparkR:::summary correctly Fix use of aliases and change uses of @rdname and @seealso. `@aliases` is the hint for `?` - it should not be linked to some other name - those should be @seealso: https://cran.r-project.org/web/packages/roxygen2/vignettes/rd.html Clean up usage of @family, as multiple uses of @family with the same @rdname cause duplicated "See Also" HTML blocks. Also changing some @rdname to the dplyr-like variant for better R user visibility in R doc, e.g. rbind, summary, mutate, summarize. @shivaram @yanboliang You can merge this pull request into a Git repository by running: $ git pull https://github.com/felixcheung/spark rdocaliases Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/9750.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #9750 commit b2b0c3f7d290c9fa54b11fae2b5782172ddc8c78 Author: felixcheung <felixcheun...@hotmail.com> Date: 2015-11-17T00:22:32Z Fix use of aliases and changes uses of @rdname and @seealso commit 474d6af0533b56165a8189f16eda2bf6e1f21ee9 Author: felixcheung <felixcheun...@hotmail.com> Date: 2015-11-17T00:30:34Z minor typo
[GitHub] spark pull request: [SPARK-11715][SPARKR] Add R support corr for C...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/9680#discussion_r45019825 --- Diff: R/pkg/R/functions.R --- @@ -259,6 +259,20 @@ setMethod("column", function(x) { col(x) }) +#' corr +#' +#' Computes the Pearson Correlation Coefficient for two Columns. +#' +#' @rdname corr +#' @name corr +#' @family math_funcs +#' @export +#' @examples \dontrun{corr(df$c, df$d)} +setMethod("corr", signature(x = "Column", col1 = "Column", col2 = "missing", method = "missing"), --- End diff -- Thinking more about this, I think what's being added in #9366 matches https://stat.ethz.ch/R-manual/R-devel/library/stats/html/cor.html better. When we are adding that support in R we could add it as `cor` matching stats::cor. Meanwhile I'll change `corr` to what you suggested with `function(x, ...)`
[GitHub] spark pull request: [SPARK-11715][SPARKR] Add R support corr for C...
Github user felixcheung commented on the pull request: https://github.com/apache/spark/pull/9680#issuecomment-157268953 @sun-rui I updated it. It's not as strongly typed as I'd like, but if I add `col2 = "Column"` to the signature I get this error: ``` Error in match.call(definition, call, expand.dots, envir) : unused argument (col2 = c("Column", "")) ```
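The error quoted above stems from a general S4 rule: `setMethod` can only dispatch on arguments that appear in the generic's formals before `...`. A minimal, self-contained reproduction (the generic `g` here is invented for illustration):

```r
library(methods)

setGeneric("g", function(x, ...) standardGeneric("g"))

# Dispatching on x alone works; y is simply absorbed from `...`.
setMethod("g", signature(x = "numeric"), function(x, y) x + y)

# Trying to also dispatch on y fails, because y is not a formal
# argument of the generic -- only x and `...` are.
bad <- tryCatch(
  setMethod("g", signature(x = "numeric", y = "numeric"),
            function(x, y) x + y),
  error = function(e) "setMethod error"
)

g(1, 2)  # 3
```

So making `col2 = "Column"` part of the signature would require widening the generic itself to name `col2` explicitly, which is the trade-off being discussed in this thread.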
[GitHub] spark pull request: [SPARK-11468][SPARKR] add stddev/variance agg ...
Github user felixcheung commented on the pull request: https://github.com/apache/spark/pull/9489#issuecomment-155244366 @mengxr possibly.. though there are usage differences between SparkR DataFrame/Column and R data.frame (e.g. `agg` vs [`aggregate`](https://stat.ethz.ch/R-manual/R-devel/library/stats/html/aggregate.html); `var(faithful$eruptions)` would have been a Column with DataFrame). It might be more confusing.
[GitHub] spark pull request: [SPARK-10863][SPARKR] Method coltypes() (New v...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/9579#discussion_r44359782 --- Diff: R/pkg/R/schema.R --- @@ -115,20 +115,7 @@ structField.jobj <- function(x) { } checkType <- function(type) { - primtiveTypes <- c("byte", - "integer", - "float", - "double", - "numeric", - "character", - "string", - "binary", - "raw", - "logical", - "boolean", - "timestamp", - "date") - if (type %in% primtiveTypes) { + if (type %in% names(PRIMITIVE_TYPES)) { --- End diff -- To avoid search, `if (!is.null(PRIMITIVE_TYPES[[type]]))`
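The two checks suggested above behave the same when the table is a list or an environment, where `[[` on an absent key returns `NULL`; on an atomic named vector `[[` would instead raise "subscript out of bounds", so the `is.null` form assumes `PRIMITIVE_TYPES` is a list or environment. A small sketch with a made-up subset of the mapping:

```r
# Illustrative subset only -- not the full mapping from SparkR's types.R.
PRIMITIVE_TYPES <- list(
  byte    = "integer",
  integer = "integer",
  double  = "numeric",
  string  = "character",
  boolean = "logical"
)

# Linear scan over the names vector:
checkByNames <- function(type) type %in% names(PRIMITIVE_TYPES)

# Direct lookup; [[ on a list returns NULL for an absent key:
checkByLookup <- function(type) !is.null(PRIMITIVE_TYPES[[type]])

checkByNames("byte")     # TRUE
checkByLookup("byte")    # TRUE
checkByLookup("map")     # FALSE
```

For a handful of keys the difference is negligible; the lookup form mainly reads more directly as "is this key in the table".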
[GitHub] spark pull request: [DOC] Missing link to R DataFrame API doc
Github user felixcheung commented on the pull request: https://github.com/apache/spark/pull/9394#issuecomment-153241163 it should be `@seealso \link{createDataFrame}`
[GitHub] spark pull request: [SPARK-9319][SPARKR] Add support for setting c...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/9218#discussion_r43718136 --- Diff: R/pkg/R/DataFrame.R --- @@ -276,6 +276,75 @@ setMethod("names<-", } }) +#' @rdname columns +#' @name colnames +setMethod("colnames", + signature(x = "DataFrame"), + function(x) { +columns(x) + }) + +#' @rdname columns +#' @name colnames<- +setMethod("colnames<-", + signature(x = "DataFrame", value = "character"), + function(x, value) { +sdf <- callJMethod(x@sdf, "toDF", as.list(value)) +dataFrame(sdf) + }) + +rToScalaTypes <- new.env() +rToScalaTypes[["integer"]] <- "integer" # in R, integer is 32bit +rToScalaTypes[["numeric"]] <- "double" # in R, numeric == double which is 64bit +rToScalaTypes[["double"]]<- "double" +rToScalaTypes[["character"]] <- "string" +rToScalaTypes[["logical"]] <- "boolean" + +#' coltypes --- End diff -- that's the plan too. could we merge #8984 now? :)
[GitHub] spark pull request: [SPARK-9319][SPARKR] Add support for setting c...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/9218#discussion_r43718270 --- Diff: R/pkg/R/DataFrame.R --- @@ -276,6 +276,75 @@ setMethod("names<-", } }) +#' @rdname columns +#' @name colnames +setMethod("colnames", + signature(x = "DataFrame"), + function(x) { +columns(x) + }) + +#' @rdname columns +#' @name colnames<- +setMethod("colnames<-", + signature(x = "DataFrame", value = "character"), + function(x, value) { +sdf <- callJMethod(x@sdf, "toDF", as.list(value)) +dataFrame(sdf) + }) + +rToScalaTypes <- new.env() +rToScalaTypes[["integer"]] <- "integer" # in R, integer is 32bit +rToScalaTypes[["numeric"]] <- "double" # in R, numeric == double which is 64bit +rToScalaTypes[["double"]]<- "double" +rToScalaTypes[["character"]] <- "string" +rToScalaTypes[["logical"]] <- "boolean" + +#' coltypes +#' +#' Set the column types of a DataFrame. +#' +#' @name coltypes +#' @param x (DataFrame) +#' @return value (character) A character vector with the target column types for the given +#'DataFrame. Column types can be one of integer, numeric/double, character, logical, or NA +#'to keep that column as-is. +#' @rdname coltypes +#' @aliases coltypes +#' @export +#' @examples +#'\dontrun{ +#' sc <- sparkR.init() +#' sqlContext <- sparkRSQL.init(sc) +#' path <- "path/to/file.json" +#' df <- jsonFile(sqlContext, path) +#' coltypes(df) <- c("character", "integer") +#' coltypes(df) <- c(NA, "numeric") +#'} +setMethod("coltypes<-", + signature(x = "DataFrame", value = "character"), + function(x, value) { +cols <- columns(x) +ncols <- length(cols) +if (length(value) == 0 || length(value) != ncols) { --- End diff -- I agree, that's why we should check for length = 0. I am not sure about supporting it though since `emptyDataFrame` is not callable from R, AFAIK.
[GitHub] spark pull request: [SPARK-11407][SPARKR] Add doc for running from...
Github user felixcheung commented on the pull request: https://github.com/apache/spark/pull/9401#issuecomment-152870335 And generally people blog about using `.libPaths`, but this could cause all packages to be installed to the SparkR location as it becomes the default: ``` .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths())) > install.packages("lubridate") Installing package into ‘/opt/spark-1.6.0-bin-hadoop2.6/R/lib’ (as ‘lib’ is unspecified) ``` https://stat.ethz.ch/R-manual/R-devel/library/utils/html/install.packages.html
[GitHub] spark pull request: [DOC] Missing link to R DataFrame API doc
Github user felixcheung commented on the pull request: https://github.com/apache/spark/pull/9394#issuecomment-152870465 why don't we point to http://spark.apache.org/docs/latest/sparkr.html for now instead?
[GitHub] spark pull request: [SPARK-11407][SPARKR] Add doc for running from...
GitHub user felixcheung opened a pull request: https://github.com/apache/spark/pull/9401 [SPARK-11407][SPARKR] Add doc for running from RStudio ![image](https://cloud.githubusercontent.com/assets/8969467/10871746/612ba44a-80a4-11e5-99a0-40b9931dee52.png) (This is without css, but you get the idea) @shivaram You can merge this pull request into a Git repository by running: $ git pull https://github.com/felixcheung/spark rstudioprogrammingguide Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/9401.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #9401 commit a0485741f86656c0f4c5a588fd69598f04f49cd1 Author: felixcheung <felixcheun...@hotmail.com> Date: 2015-11-01T22:22:28Z Add doc for running from RStudio
[GitHub] spark pull request: [SPARK-9319][SPARKR] Add support for setting c...
Github user felixcheung commented on the pull request: https://github.com/apache/spark/pull/9218#issuecomment-152862908 ``` Error : /home/jenkins/workspace/SparkPullRequestBuilder/R/pkg/man/colnames.Rd: Sections \title, and \name must exist and be unique in Rd files ERROR: installing Rd objects failed for package 'SparkR' ``` This is odd; there shouldn't be an Rd file for colnames. I haven't changed this and it worked before: ``` #' @rdname columns #' @name colnames setMethod("colnames", ```
[GitHub] spark pull request: [SPARK-10863][SPARKR] Method coltypes() to get...
Github user felixcheung commented on the pull request: https://github.com/apache/spark/pull/8984#issuecomment-153870143 I'm a bit confused - I thought `map` `array` `struct` should return NA?
[GitHub] spark pull request: [SPARK-11260][SPARKR] with() function support
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/9443#discussion_r43940871 --- Diff: R/pkg/R/DataFrame.R --- @@ -2045,3 +2045,34 @@ setMethod("attach", } attach(newEnv, pos = pos, name = name, warn.conflicts = warn.conflicts) }) + +#' Evaluate a R expression in an environment constructed from a DataFrame +#' with() allows access to columns of a DataFrame by simply referring to +#' their name. It appends every column of a DataFrame into a new +#' environment. Then, the given expression is evaluated in this new +#' environment. +#' +#' @rdname with +#' @title Evaluate a R expression in an environment constructed from a DataFrame +#' @param data (DataFrame) DataFrame to use for constructing an environment. +#' @param expr (expression) Expression to evaluate. +#' @param ... arguments to be passed to future methods. +#' @examples +#' \dontrun{ +#' with(irisDf, nrow(Sepal_Width)) +#' } +#' @seealso \link{attach} +setMethod("with", + signature(data = "DataFrame"), + function(data, expr, ...) { +stopifnot(!missing(data)) +stopifnot(inherits(data, "DataFrame")) +cols <- columns(data) +stopifnot(length(cols) > 0) + +newEnv <- new.env() +for (i in 1:length(cols)) { + assign(x = cols[i], value = data[, cols[i]], envir = newEnv) +} --- End diff -- typically we would add an internal S3 method in this file for helper function
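The pattern in the patch under review — copy each column into a fresh environment, then evaluate the expression there — is the same idea behind base R's `with.default`. A base-R sketch of the mechanism (the `withCols` helper name is invented; SparkR's real method operates on a DataFrame, not a data.frame):

```r
# Evaluate expr with the columns of `data` visible as plain variables.
withCols <- function(data, expr) {
  # Parent is the caller's frame so functions like mean() still resolve.
  env <- new.env(parent = parent.frame())
  for (col in names(data)) {
    assign(x = col, value = data[[col]], envir = env)
  }
  # substitute() captures the unevaluated expression passed by the caller.
  eval(substitute(expr), env)
}

df <- data.frame(eruptions = c(3.6, 1.8, 3.3), waiting = c(79, 54, 74))
withCols(df, mean(eruptions))  # 2.9
withCols(df, max(waiting))     # 79
```

The review comment's point is that loop bodies like the `assign` one are usually factored into an internal helper rather than inlined in the S4 method.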
[GitHub] spark pull request: [SPARK-10863][SPARKR] Method coltypes() to get...
Github user felixcheung commented on the pull request: https://github.com/apache/spark/pull/8984#issuecomment-153904230 looks good. thanks
[GitHub] spark pull request: [SPARK-10863][SPARKR] Method coltypes() to get...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/8984#discussion_r43976192 --- Diff: R/pkg/R/DataFrame.R --- @@ -1914,3 +1914,46 @@ setMethod("attach", } attach(newEnv, pos = pos, name = name, warn.conflicts = warn.conflicts) }) + +#' Returns the column types of a DataFrame. +#' +#' @name coltypes +#' @title Get column types of a DataFrame +#' @param x (DataFrame) +#' @return value (character) A character vector with the column types of the given DataFrame +#' @rdname coltypes +setMethod("coltypes", + signature(x = "DataFrame"), + function(x) { +# Get the data types of the DataFrame by invoking dtypes() function +types <- sapply(dtypes(x), function(x) {x[[2]]}) + +# Map Spark data types into R's data types using DATA_TYPES environment +rTypes <- sapply(types, USE.NAMES=F, FUN=function(x) { + + # Check for primitive types + type <- PRIMITIVE_TYPES[[x]] + if (is.null(type)) { +# Check for complex types +for (t in names(COMPLEX_TYPES)) { --- End diff -- Or ``` Filter(function(t) { grep("start", txt) == 1 }, names(COMPLEX_TYPES)) ``` ? --
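The `Filter` idea suggested above amounts to a prefix match of the Spark type string against the complex-type names. A self-contained sketch of that shape (the `COMPLEX_TYPES` table and `sparkToRType` helper here are illustrative guesses, not SparkR's actual definitions):

```r
# Hypothetical complex-type table: Spark type prefix -> R type.
COMPLEX_TYPES <- list(map = "environment", array = "list", struct = "list")

sparkToRType <- function(sparkType) {
  # grepl with an anchored ^prefix pattern plays the role of the
  # grep(...) == 1 check in the review comment.
  hits <- Filter(function(prefix) grepl(paste0("^", prefix), sparkType),
                 names(COMPLEX_TYPES))
  if (length(hits) > 0) COMPLEX_TYPES[[hits[[1]]]] else NA_character_
}

sparkToRType("map<string,int>")  # "environment"
sparkToRType("array<double>")    # "list"
sparkToRType("decimal(10,0)")    # NA
```

Compared with an explicit `for` loop over `names(COMPLEX_TYPES)`, the `Filter` form is more compact but does the same linear scan.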
[GitHub] spark pull request: [SPARK-11086][SPARKR] Use dropFactors column-w...
Github user felixcheung commented on the pull request: https://github.com/apache/spark/pull/9099#issuecomment-153938620 @zero323 you could add the test code to SPARK-11283 so that they could be added back then.
[GitHub] spark pull request: [SPARK-11263][SPARKR] lintr Throws Warnings on...
Github user felixcheung commented on the pull request: https://github.com/apache/spark/pull/9463#issuecomment-153942273 @yu-iskw btw, these are two other false positives that you might want to follow up with lintr: ``` R/pkg/inst/tests/test_sparkSQL.R:907:53: style: Commented code should be removed. expect_equal(collect(select(df2, lpad(df2$a, 8, "#")))[1, 1], "###aaads") ^~~ R/pkg/R/RDD.R:228:63: style: Commented code should be removed. #' http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence. ^~~~ ```
[GitHub] spark pull request: [SPARK-11263][SPARKR] lintr Throws Warnings on...
Github user felixcheung commented on the pull request: https://github.com/apache/spark/pull/9463#issuecomment-153938433 Done. They won't do anything because of `@noRd`.
[GitHub] spark pull request: [SPARK-11468][SPARKR] add stddev/variance agg ...
GitHub user felixcheung opened a pull request: https://github.com/apache/spark/pull/9489 [SPARK-11468][SPARKR] add stddev/variance agg functions for Column Checked names, none of them should conflict with anything in base @shivaram @davies @rxin You can merge this pull request into a Git repository by running: $ git pull https://github.com/felixcheung/spark rstddev Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/9489.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #9489 commit 680b4759e04c7d371fc555e220730e0a4da251f5 Author: felixcheung <felixcheun...@hotmail.com> Date: 2015-11-05T02:37:04Z Add stddev and friends commit f63608e3d36bae0544281a94c732760bfebfcc6d Author: felixcheung <felixcheun...@hotmail.com> Date: 2015-11-05T05:05:57Z add aggFunction by name and on Column commit e0fda371d3691a21672a94e9e0b383890cc745a6 Author: felixcheung <felixcheun...@hotmail.com> Date: 2015-11-05T08:27:15Z Add tests, checked supported functions for GroupedData
[GitHub] spark pull request: [SPARK-11567][PYTHON] Add Python API for corr ...
GitHub user felixcheung opened a pull request: https://github.com/apache/spark/pull/9536 [SPARK-11567][PYTHON] Add Python API for corr in group like `df.agg(corr("col1", "col2"))` @davies You can merge this pull request into a Git repository by running: $ git pull https://github.com/felixcheung/spark pyfunc Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/9536.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #9536 commit 49773f3aa812c70049c66b3de48fc188be6d10be Author: felixcheung <felixcheun...@hotmail.com> Date: 2015-11-07T07:43:31Z Add corr that can be used in agg
[GitHub] spark pull request: [SPARK-10863][SPARKR] Method coltypes() to get...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/8984#discussion_r43797767 --- Diff: R/pkg/R/types.R --- @@ -0,0 +1,41 @@ +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# +# types.R. This file handles the data type mapping between Spark and R + +# The primitive data types, where names(PRIMITIVE_TYPES) are Scala types whereas +# values are equivalent R types. +PRIMITIVE_TYPES <- c( + "byte"="integer", + "tinyint"="integer", + "integer"="integer", + "float"="numeric", + "double"="numeric", + "numeric"="numeric", --- End diff -- I'm concerned about this: decimal does not map exactly to numeric, and it says it is not supported: http://spark.apache.org/docs/latest/sql-programming-guide.html#data-types