[GitHub] spark pull request: [SPARK-4987] [SQL] parquet timestamp type supp...
Github user felixcheung commented on the pull request: https://github.com/apache/spark/pull/3820#issuecomment-70450284

I've tested this PR, but the result seems to be off. The Parquet file was generated from Hive with timestamp values set by `from_utc_timestamp('1970-01-01 08:00:00','PST')`. What I see with this PR:

```
scala> t.take(10).foreach(println(_))
...
15/01/18 22:06:41 INFO NewHadoopRDD: Input split: ParquetInputSplit{part: file:/users/x/parquetwithtimestamp start: 0 end: 25448 length: 25448 hosts: [] requestedSchema: message root { optional binary code (UTF8); optional binary description (UTF8); optional int32 total_emp; optional int32 salary; optional int96 timestamp; } readSupportMetadata: {org.apache.spark.sql.parquet.row.metadata={type:struct,fields:[{name:code,type:string,nullable:true,metadata:{}},{name:description,type:string,nullable:true,metadata:{}},{name:total_emp,type:integer,nullable:true,metadata:{}},{name:salary,type:integer,nullable:true,metadata:{}},{name:timestamp,type:timestamp,nullable:true,metadata:{}}]}, org.apache.spark.sql.parquet.row.requested_schema={type:struct,fields:[{name:code,type:string,nullable:true,metadata:{}},{name:description,type:string,nullable:true,metadata:{}},{name:total_emp,type:integer,nullable:true,metadata:{}},{name:salary,type:integer,nullable:true,metadata:{}},{name:timestamp,type:timestamp,nullable:true,metadata:{}}]}}}
15/01/18 22:06:41 WARN ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
15/01/18 22:06:41 INFO InternalParquetRecordReader: RecordReader initialized will read a total of 823 records.
15/01/18 22:06:41 INFO InternalParquetRecordReader: at row 0. reading next block
15/01/18 22:06:41 INFO CodecPool: Got brand-new decompressor [.snappy]
15/01/18 22:06:41 INFO InternalParquetRecordReader: block read in memory in 27 ms.
row count = 823
[00-,All Occupations,134354250,40690,1974-01-07 17:58:00.08896]
[11-,Management occupations,6003930,96150,1974-01-07 17:58:00.08896]
```

Expected: 1970-01-01 08:00:00
Actual: 1974-01-07 17:58:00.08896

Any idea?

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.
---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
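For context on where a constant skew like this can come from, here is a small sketch (Python, purely illustrative and not part of the PR) of the INT96 timestamp layout Hive/Impala write to Parquet: 8 bytes of little-endian nanoseconds-of-day followed by a 4-byte little-endian Julian day. Misreading either field, or the Julian-day offset of the Unix epoch, shifts every decoded value by a constant amount, which matches the symptom above.

```python
import struct
from datetime import datetime, timezone

JULIAN_DAY_OF_EPOCH = 2440588  # Julian day number containing 1970-01-01T00:00:00Z
SECONDS_PER_DAY = 86400

def int96_to_unix_seconds(raw: bytes) -> float:
    """Decode a 12-byte Parquet INT96 timestamp: an 8-byte little-endian
    nanoseconds-of-day count followed by a 4-byte little-endian Julian day."""
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    return (julian_day - JULIAN_DAY_OF_EPOCH) * SECONDS_PER_DAY + nanos_of_day / 1e9

# 1970-01-01 08:00:00 UTC: Julian day of the epoch plus 8 hours of nanoseconds
raw = struct.pack("<qi", 8 * 3600 * 10**9, JULIAN_DAY_OF_EPOCH)
print(datetime.fromtimestamp(int96_to_unix_seconds(raw), tz=timezone.utc))
# → 1970-01-01 08:00:00+00:00
```

This only shows the expected on-disk layout; the PR under review implements the equivalent decoding in Scala.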
[GitHub] spark pull request: [SPARK-4987] [SQL] parquet timestamp type supp...
Github user felixcheung commented on the pull request: https://github.com/apache/spark/pull/3820#issuecomment-70904105

Good to hear. Here's how I create my test data: I run this in Hive, then take the data from HDFS directly, and Spark is able to read/parse the data file (with the issue above):

```sql
SET parquet.compression = SNAPPY;
DROP TABLE testdata;
CREATE TABLE testdata STORED AS PARQUET AS
SELECT a.*, from_utc_timestamp('1970-01-01 08:00:00','PST') AS timestamp
FROM sample_07 AS a;
```

I have looked into this a fair bit and attempted a fix. Thanks for working on this, and let me know if I can help in any way.
[GitHub] spark pull request: [SPARK-4987] [SQL] parquet timestamp type supp...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/3820#discussion_r23256480

--- Diff: docs/sql-programming-guide.md ---

```
@@ -581,6 +581,15 @@ Configuration of Parquet can be done using the `setConf` method on SQLContext or
   </td>
 </tr>
+<tr>
+  <td><code>spark.sql.parquet.int96AsTimestamp</code></td>
+  <td>true</td>
+  <td>
+    Some Parquet-producing systems, in particular Impala, store Timestamp into INT96. Spark would also
+    store Timestamp as INT96 because we need to avoid precision loss of the nanoseconds field. This
+    flag tells Spark SQL to interpret INT96 data as a timestamp to provide compatibility with these systems.
+  </td>
+</tr>
```

--- End diff --

From my digging, only Parquet-format 2.2 has the TIMESTAMP and TIMESTAMP_MILLIS types. Cloudera is still on 1.5.0. Hive/Impala have been writing this INT96 nanosecond format, which is different.

--- Original Message ---
From: Michael Armbrust notificati...@github.com
Sent: January 20, 2015 11:25 AM
To: apache/spark sp...@noreply.github.com
Cc: Felix Cheung felixcheun...@hotmail.com
Subject: Re: [spark] [SPARK-4987] [SQL] parquet timestamp type support (#3820)

[quoted diff omitted; same hunk as above]

Yeah, I agree that it's weird though. Perhaps we should ask the parquet list why they don't support the int 96 version.

On Jan 20, 2015 11:21 AM, Cheng Lian notificati...@github.com wrote:

In docs/sql-programming-guide.md https://github.com/apache/spark/pull/3820#discussion-diff-23247492:

[quoted diff omitted; same hunk as above]

Oh, I see the difference here. Double checked, Parquet only provides TIMESTAMP and TIMESTAMP_MILLIS.

Reply to this email directly or view it on GitHub https://github.com/apache/spark/pull/3820/files#r23247492.

---
Reply to this email directly or view it on GitHub: https://github.com/apache/spark/pull/3820/files#r23247815
[GitHub] spark pull request: [SPARK-5654] Integrate SparkR
Github user felixcheung commented on the pull request: https://github.com/apache/spark/pull/5096#issuecomment-85816756

@redbaron, @oscaroboto The same applies to memory consumption, I'm afraid. There isn't a way to constrain how much memory [R](https://stat.ethz.ch/R-manual/R-devel/library/base/html/Memory-limits.html) (same for Python?) allocates (on Unix, 64-bit). However, it looks like in [Mesos](http://mesos.apache.org/documentation/latest/configuration/) and [YARN](http://blog.cloudera.com/blog/2013/12/managing-multiple-resources-in-hadoop-2-with-yarn/) this can be controlled via cgroups.
[GitHub] spark pull request: [SPARK-8307] [SQL] improve timestamp from parq...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/6759#discussion_r32299296

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetConverter.scala ---

```
@@ -498,69 +493,21 @@ private[parquet] object CatalystArrayConverter {
 }
 
 private[parquet] object CatalystTimestampConverter {
-  // TODO most part of this comes from Hive-0.14
-  // Hive code might have some issues, so we need to keep an eye on it.
-  // Also we use NanoTime and Int96Values from parquet-examples.
-  // We utilize jodd to convert between NanoTime and Timestamp
-  val parquetTsCalendar = new ThreadLocal[Calendar]
-  def getCalendar: Calendar = {
-    // this is a cache for the calendar instance.
-    if (parquetTsCalendar.get == null) {
-      parquetTsCalendar.set(Calendar.getInstance(TimeZone.getTimeZone("GMT")))
-    }
-    parquetTsCalendar.get
-  }
-  val NANOS_PER_SECOND: Long = 1000000000
-  val SECONDS_PER_MINUTE: Long = 60
-  val MINUTES_PER_HOUR: Long = 60
-  val NANOS_PER_MILLI: Long = 1000000
+  // see http://stackoverflow.com/questions/466321/convert-unix-timestamp-to-julian
+  val JULIAN_DAY_OF_EPOCH = 2440587.5
```

--- End diff --

If we generate Parquet with a Hive query like this, could we compare the timestamp value in Spark?

```
USE default;
DROP TABLE timestamptable;
CREATE TABLE timestamptable STORED AS PARQUET AS
SELECT cast(from_unixtime(unix_timestamp()) as timestamp) as t, * FROM sample_07;
```
[GitHub] spark pull request: [SPARK-6797][SPARKR] Add support for YARN clus...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/6743#discussion_r33654496

--- Diff: core/src/main/scala/org/apache/spark/api/r/RUtils.scala ---

```
@@ -0,0 +1,55 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.api.r
+
+import java.io.File
+
+import org.apache.spark.SparkException
+
+private[spark] object RUtils {
+  /**
+   * Get the SparkR package path in the local spark distribution.
+   */
+  def localSparkRPackagePath: Option[String] = {
+    val sparkHome = sys.env.get("SPARK_HOME")
+    sparkHome.map(
+      Seq(_, "R", "lib").mkString(File.separator)
+    )
+  }
+
+  /**
+   * Get the SparkR package path in various deployment modes.
+   */
+  def sparkRPackagePath(driver: Boolean): String = {
+    val yarnMode = sys.env.get("SPARK_YARN_MODE")
+    if (!yarnMode.isEmpty && yarnMode.get == "true" &&
+        !(driver && System.getProperty("spark.master") == "yarn-client")) {
+      // For workers in YARN modes and driver in yarn cluster mode,
+      // the SparkR package distributed as an archive resource should be pointed to
+      // by a symbol link sparkr in the current directory.
+      new File("sparkr").getAbsolutePath
+    } else {
+      // TBD: add support for MESOS
+      val rPackagePath = localSparkRPackagePath
```

--- End diff --

doesn't seem like `localSparkRPackagePath` will ever return empty because of this `map` call?

```
sparkHome.map(
  Seq(_, "R", "lib").mkString(File.separator)
)
```
[GitHub] spark pull request: [SPARK-9317] [SPARKR] Change `show` to print D...
Github user felixcheung commented on the pull request: https://github.com/apache/spark/pull/8360#issuecomment-133523560

From what I can infer from the original JIRA, we are trying to match R data.frame behavior. I think it is handy, though it is easy to think of several alternative ways to do this (`head(df)`, `showDF(df)`), but those would need to be learned.
[GitHub] spark pull request: [SPARK-9317] [SPARKR] Change `show` to print D...
GitHub user felixcheung opened a pull request: https://github.com/apache/spark/pull/8360

[SPARK-9317] [SPARKR] Change `show` to print DataFrame entries

Small update to the DataFrame API in SparkR. @shivaram

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/felixcheung/spark rshow

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/8360.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #8360

commit e3ae104dc1ee47c359b25c21ba56c96022a17558
Author: felixcheung felixcheun...@hotmail.com
Date: 2015-08-21T17:10:14Z

    [SPARK-9317] [SPARKR] Change `show` to print DataFrame entries
[GitHub] spark pull request: [SPARK-9316] [SPARKR] Add support for filterin...
GitHub user felixcheung opened a pull request: https://github.com/apache/spark/pull/8394

[SPARK-9316] [SPARKR] Add support for filtering using `[` (synonym for filter / select)

Add support for

```
df[df$name == "Smith", c(1,2)]
df[df$age %in% c(19, 30), 1:2]
```

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/felixcheung/spark rsubset

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/8394.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #8394

commit 99109c4b592c57d1ba00a002b3d4b71ece10f954
Author: felixcheung felixcheun...@hotmail.com
Date: 2015-08-23T10:06:33Z

    R: Add support for subsetting + tests

commit 42e881a4e4c7d3e669400b44ec24c4af1e10f6da
Author: felixcheung felixcheun...@hotmail.com
Date: 2015-08-24T07:46:26Z

    add support for d[d$something > 0,], more tests

commit 16e0ba375ac12788478002ce96ca74206c2d437a
Author: felixcheung felixcheun...@hotmail.com
Date: 2015-08-24T07:47:15Z

    update example
[GitHub] spark pull request: [SPARK-8742][SPARKR] Improve SparkR error mess...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/7742#discussion_r35783192

--- Diff: core/src/main/scala/org/apache/spark/api/r/RBackendHandler.scala ---

```
@@ -69,8 +69,11 @@ private[r] class RBackendHandler(server: RBackend)
           case e: Exception =>
             logError(s"Removing $objId failed", e)
             writeInt(dos, -1)
+            writeString(dos, s"Removing $objId failed: ${e.getMessage}")
         }
-      case _ => dos.writeInt(-1)
+      case _ =>
+        dos.writeInt(-1)
+        writeString(dos, "Unknown error")
```

--- End diff --

Isn't this an unknown method call? Should this say "unknown method" or a similar error instead?
[GitHub] spark pull request: [SPARK-8742][SPARKR] Improve SparkR error mess...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/7742#discussion_r35783283

--- Diff: core/src/main/scala/org/apache/spark/api/r/RBackendHandler.scala ---

```
@@ -148,6 +151,9 @@ private[r] class RBackendHandler(server: RBackend)
       case e: Exception =>
         logError(s"$methodName on $objId failed", e)
         writeInt(dos, -1)
+        // Writing the error message of the cause for the exception. This will be returned
+        // to user in the R process.
+        writeString(dos, e.getCause.getMessage)
```

--- End diff --

Would it make sense to write the stack trace too? Often it is much more useful to have the call stack in addition to the exception message.
[GitHub] spark pull request: [SPARK-8742][SPARKR] Improve SparkR error mess...
Github user felixcheung commented on the pull request: https://github.com/apache/spark/pull/7742#issuecomment-126209438

looks good! thanks
[GitHub] spark pull request: [SPARK-8742][SPARKR] Improve SparkR error mess...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/7742#discussion_r35901689

--- Diff: core/src/main/scala/org/apache/spark/api/r/RBackendHandler.scala ---

```
@@ -148,6 +151,9 @@ private[r] class RBackendHandler(server: RBackend)
       case e: Exception =>
         logError(s"$methodName on $objId failed", e)
```

--- End diff --

Good point, this is logging the InvocationTargetException too.
[GitHub] spark pull request: [SPARK-10971][SPARKR] RRunner should allow set...
Github user felixcheung commented on the pull request: https://github.com/apache/spark/pull/9179#issuecomment-150311495

+1 on `spark.r.driver.command` and `spark.r.command`
[GitHub] spark pull request: [SPARKR] [SPARK-11199] Improve R context manag...
Github user felixcheung commented on the pull request: https://github.com/apache/spark/pull/9185#issuecomment-150318041

I vote for simplicity for SparkR and not having multiple sessions. In fact, I observe it is already messy to handle a DataFrame created by a different SparkContext (after stop() and init()). I would argue these concepts do not translate well to R, where for the most part 'session' == 'process'.
[GitHub] spark pull request: [SPARK-9319][SPARKR] Add support for setting c...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/9218#discussion_r42899259

--- Diff: R/pkg/R/DataFrame.R ---

```
@@ -276,6 +276,57 @@ setMethod("names<-",
             }
           })
 
+#' @rdname columns
+#' @name colnames
+setMethod("colnames",
+          signature(x = "DataFrame"),
+          function(x) {
+            columns(x)
+          })
+
+#' @rdname columns
+#' @name colnames<-
+setMethod("colnames<-",
+          signature(x = "DataFrame", value = "character"),
+          function(x, value) {
+            sdf <- callJMethod(x@sdf, "toDF", as.list(value))
+            dataFrame(sdf)
+          })
+
+#' coltypes
+#'
+#' Set the column types of a DataFrame.
+#'
+#' @name coltypes
+#' @param x (DataFrame)
+#' @return value (character) A character vector with the target column types for the given DataFrame
+#' @rdname coltypes
+#' @aliases coltypes
+#' @export
+#' @examples
+#'\dontrun{
+#' sc <- sparkR.init()
+#' sqlContext <- sparkRSQL.init(sc)
+#' path <- "path/to/file.json"
+#' df <- jsonFile(sqlContext, path)
+#' coltypes(df) <- c("string", "integer")
+#'}
+setMethod("coltypes<-",
```

--- End diff --

That's correct. I'm hoping #8984 can be merged soon so I can add a new reverse mapping in the same place. I could make this [WIP] if you'd like.
[GitHub] spark pull request: [SPARK-10903] [SPARKR] R - Simplify SQLContext...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/9192#discussion_r42716848

--- Diff: R/pkg/R/SQLContext.R ---

```
@@ -17,6 +17,34 @@
 
 # SQLcontext.R: SQLContext-driven functions
 
+#' Temporary function to reroute old S3 Method call to new
+#' We need to check the class of x to ensure it is SQLContext before dispatching
+dispatchFunc <- function(newFuncSig, x, ...) {
+  funcName <- as.character(sys.call(sys.parent())[[1]])
+  f <- get0(paste0(funcName, ".default"))
```

--- End diff --

changed. thanks
[GitHub] spark pull request: [SPARK-9319][SPARKR] Add support for setting c...
GitHub user felixcheung opened a pull request: https://github.com/apache/spark/pull/9218

[SPARK-9319][SPARKR] Add support for setting column names, types

Add support for colnames, colnames<-, coltypes<-. I will merge with PR 8984 (coltypes) once it is in, possibly looking into mapping R type names. @shivaram @sun-rui

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/felixcheung/spark colnamescoltypes

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/9218.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #9218

commit 071f29f998f86a5a05744c703bf6a9a2384c3805
Author: felixcheung <felixcheun...@hotmail.com>
Date: 2015-10-22T06:48:43Z

    Add support for colnames, colnames<-, coltypes<-
[GitHub] spark pull request: [SPARK-9319][SPARKR] Add support for setting c...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/9218#discussion_r42786136

--- Diff: R/pkg/R/DataFrame.R ---

```
@@ -276,6 +276,57 @@ setMethod("names<-",
             }
           })
 
+#' @rdname columns
+#' @name colnames
```

--- End diff --

In the R doc, it will be included under the `columns` page because of the `@rdname` notation. https://cran.r-project.org/web/packages/roxygen2/vignettes/rd.html
[GitHub] spark pull request: [SPARK-9319][SPARKR] Add support for setting c...
Github user felixcheung commented on the pull request: https://github.com/apache/spark/pull/9218#issuecomment-150311963

@sun-rui `names` and `names<-` are already there, this is to add `colnames`.
[GitHub] spark pull request: [SPARK-10903] [SPARKR] R - Simplify SQLContext...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/9192#discussion_r42664614

--- Diff: R/pkg/R/SQLContext.R ---

```
@@ -17,6 +17,34 @@
 
 # SQLcontext.R: SQLContext-driven functions
 
+#' Temporary function to reroute old S3 Method call to new
+#' We need to check the class of x to ensure it is SQLContext before dispatching
+dispatchFunc <- function(newFuncSig, x, ...) {
+  funcName <- as.character(sys.call(sys.parent())[[1]])
+  f <- get0(paste0(funcName, ".default"))
+  # Strip sqlContext from list of parameters and then pass the rest along.
+  # In the following, if '&' is used instead of '&&', it warns about
+  # "the condition has length > 1 and only the first element will be used"
+  if (class(x) == "jobj" &&
+      grepl("org.apache.spark.sql.SQLContext", capture.output(show(x)))) {
```

--- End diff --

updated. thanks
[GitHub] spark pull request: [SPARK-10903] [SPARKR] R - Simplify SQLContext...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/9192#discussion_r42662379

--- Diff: R/pkg/R/SQLContext.R ---

```
@@ -17,6 +17,34 @@
 
 # SQLcontext.R: SQLContext-driven functions
 
+#' Temporary function to reroute old S3 Method call to new
+#' We need to check the class of x to ensure it is SQLContext before dispatching
+dispatchFunc <- function(newFuncSig, x, ...) {
+  funcName <- as.character(sys.call(sys.parent())[[1]])
+  f <- get0(paste0(funcName, ".default"))
+  # Strip sqlContext from list of parameters and then pass the rest along.
+  # In the following, if '&' is used instead of '&&', it warns about
+  # "the condition has length > 1 and only the first element will be used"
+  if (class(x) == "jobj" &&
+      grepl("org.apache.spark.sql.SQLContext", capture.output(show(x)))) {
+    .Deprecated(newFuncSig, old = paste0(funcName, "(sqlContext...)"))
+    f(...)
+  } else {
+    f(x, ...)
+  }
+}
```

--- End diff --

The proposal here is to eliminate the sqlContext parameter from SQLContext-parity methods in R. Primarily this makes methods friendlier and more R-like (e.g. `read.df()`). The changed method signature would be the one we would like to keep in the next release. Reasons for this have been discussed in the JIRA, but to recap:

1. We only support one sqlContext in R, and having multiple at a time can be very confusing (e.g. a table not being accessible).
2. Between hiveCtx and sqlContext, hiveCtx is preferred.
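The rerouting discussed in this thread can be sketched in Python (the real code is R S3 dispatch; every name below is hypothetical and only illustrates the shape of the shim): a wrapper inspects the first argument, and if it is the old context object, it warns about the deprecated signature and drops that argument before calling the new-style function.

```python
import warnings

class SQLContext:
    """Hypothetical stand-in for the old context object (illustration only)."""
    pass

def dispatch_deprecated_context(func):
    """If the first argument is the old context, warn and strip it,
    then pass the remaining arguments to the new-style function."""
    def wrapper(x, *args, **kwargs):
        if isinstance(x, SQLContext):
            warnings.warn("passing sqlContext is deprecated", DeprecationWarning)
            return func(*args, **kwargs)  # strip the context, pass the rest along
        return func(x, *args, **kwargs)
    return wrapper

@dispatch_deprecated_context
def read_df(path):
    return "reading " + path

print(read_df("data.json"))                # new signature
print(read_df(SQLContext(), "data.json"))  # old signature still works, with a warning
```

Both calls produce the same result, which is the point of the shim: the old call sites keep working during the deprecation window while the documented signature changes.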
[GitHub] spark pull request: [SPARK-10903] [SPARKR] R - Simplify SQLContext...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/9192#discussion_r42659643

--- Diff: R/pkg/R/SQLContext.R ---

```
+#' Temporary function to reroute old S3 Method call to new
```

--- End diff --

"reroute" was the term corresponding to "dispatch". "temporary" was referring to the fact that we intend this to go away - please see my other answer regarding your question on this.
[GitHub] spark pull request: [SPARK-10903] [SPARKR] R - Simplify SQLContext...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/9192#discussion_r42659746

--- Diff: R/pkg/R/SQLContext.R ---
@@ -17,6 +17,34 @@
 # SQLcontext.R: SQLContext-driven functions

+#' Temporary function to reroute old S3 Method call to new
+#' We need to check the class of x to ensure it is SQLContext before dispatching
+dispatchFunc <- function(newFuncSig, x, ...) {
+  funcName <- as.character(sys.call(sys.parent())[[1]])
+  f <- get0(paste0(funcName, ".default"))
--- End diff --

get0 is in {base}, right? https://stat.ethz.ch/R-manual/R-devel/library/base/html/exists.html
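For reference, `get0` is indeed in base R (since 3.2.0); unlike `get`, it returns `NULL` (or a supplied `ifnotfound` value) rather than signaling an error when the name is not found, which is why it suits the dispatcher's lookup. A minimal sketch - `read.df.default` here is a made-up stand-in for the `.default` methods being looked up:

```r
# get0() returns the bound object if the name exists, and ifnotfound
# (NULL by default) if it does not -- get() would throw an error instead.
read.df.default <- function(...) "dispatched"  # hypothetical .default method

f <- get0(paste0("read.df", ".default"))  # the lookup dispatchFunc performs
g <- get0("no.such.function.default")     # missing name: NULL, no error

stopifnot(is.function(f), is.null(g))
```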
[GitHub] spark pull request: [SPARK-9319][SPARKR] Add support for setting c...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/9218#discussion_r42801664

--- Diff: R/pkg/R/DataFrame.R ---
@@ -276,6 +276,57 @@ setMethod("names<-",
   }
 })

+#' @rdname columns
+#' @name colnames
+setMethod("colnames",
+          signature(x = "DataFrame"),
+          function(x) {
+            columns(x)
+          })
+
+#' @rdname columns
+#' @name colnames<-
+setMethod("colnames<-",
+          signature(x = "DataFrame", value = "character"),
+          function(x, value) {
+            sdf <- callJMethod(x@sdf, "toDF", as.list(value))
+            dataFrame(sdf)
+          })
+
+#' coltypes
+#'
+#' Set the column types of a DataFrame.
+#'
+#' @name coltypes
+#' @param x (DataFrame)
+#' @return value (character) A character vector with the target column types for the given DataFrame
+#' @rdname coltypes
+#' @aliases coltypes
+#' @export
+#' @examples
+#'\dontrun{
+#' sc <- sparkR.init()
+#' sqlContext <- sparkRSQL.init(sc)
+#' path <- "path/to/file.json"
+#' df <- jsonFile(sqlContext, path)
+#' coltypes(df) <- c("string", "integer")
+#'}
+setMethod("coltypes<-",
--- End diff --

Certainly, it is in PR 8984 by @olarayej
[GitHub] spark pull request: SPARK-11258 Remove quadratic runtime complexit...
Github user felixcheung commented on the pull request: https://github.com/apache/spark/pull/9222#issuecomment-150382796 Do you have benchmark numbers for this change?
[GitHub] spark pull request: [SPARK-8277][SPARKR] Faster createDataFrame us...
Github user felixcheung commented on the pull request: https://github.com/apache/spark/pull/9234#issuecomment-150382185 Hi, thanks for the contribution - you might want to check out the ongoing work in https://github.com/apache/spark/pull/9099 and SPARK-11086
[GitHub] spark pull request: [SPARK-11340][SPARKR] Support setting driver p...
GitHub user felixcheung opened a pull request: https://github.com/apache/spark/pull/9290

[SPARK-11340][SPARKR] Support setting driver properties when starting Spark from R programmatically or from RStudio

Mapping spark.driver.memory from sparkEnvir to spark-submit command-line arguments.

@shivaram suggested that we possibly add other spark.driver.* properties - do we want to add all of those? I thought those could be set in SparkConf? @sun-rui

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/felixcheung/spark rdrivermem

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/9290.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #9290

commit d3f8d280098f42615c4d63d64d8797c8c76a8970
Author: felixcheung <felixcheun...@hotmail.com>
Date: 2015-10-27T05:07:49Z

    Support setting spark.driver.memory from sparkEnvir when launching JVM backend
[GitHub] spark pull request: [SPARK-11340][SPARKR] Support setting driver p...
Github user felixcheung commented on the pull request: https://github.com/apache/spark/pull/9290#issuecomment-151377437

Manual testing with:
```
library(SparkR, lib.loc='/opt/spark-1.6.0-bin-hadoop2.6/R/lib')
sc <- sparkR.init(master = "local[*]", sparkEnvir = list(spark.driver.memory = "2g"))
```
before
![image](https://cloud.githubusercontent.com/assets/8969467/10750094/904518ee-7c2e-11e5-8800-c67d45b13183.png)
after
![image](https://cloud.githubusercontent.com/assets/8969467/10750097/960b95be-7c2e-11e5-9669-53b6a3fc7665.png)
[GitHub] spark pull request: [SPARK-11210][SPARKR] Add window functions int...
Github user felixcheung commented on the pull request: https://github.com/apache/spark/pull/9196#issuecomment-151381136 looks good!
[GitHub] spark pull request: [SPARK-11340][SPARKR] Support setting driver p...
Github user felixcheung commented on the pull request: https://github.com/apache/spark/pull/9290#issuecomment-151378899 I checked: the user could also set SPARK_DRIVER_MEMORY before running `sparkR.init()` - https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala#L157
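A sketch of that alternative route, assuming (as the comment implies) that the JVM backend launched by `sparkR.init()` inherits the R process environment:

```r
# Setting SPARK_DRIVER_MEMORY in the R session before sparkR.init() lets
# SparkSubmitArguments pick it up when the JVM backend is launched.
Sys.setenv(SPARK_DRIVER_MEMORY = "2g")
stopifnot(Sys.getenv("SPARK_DRIVER_MEMORY") == "2g")

# then initialize as usual (requires a Spark installation):
# library(SparkR)
# sc <- sparkR.init(master = "local[*]")
```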
[GitHub] spark pull request: [SPARK-9319][SPARKR] Add support for setting c...
Github user felixcheung commented on the pull request: https://github.com/apache/spark/pull/9218#issuecomment-151379530 @sun-rui That's a great point - as its signature is defined, `coltypes()` would only return a list of simple types. But how would one create a DataFrame with a complex type from R? I tried a bit and couldn't get it to work: I get either `Unsupported type for DataFrame: factor` or `unexpected type: environment`
[GitHub] spark pull request: [SPARK-11340][SPARKR] Support setting driver p...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/9290#discussion_r43430720

--- Diff: R/pkg/R/sparkR.R ---
@@ -93,7 +93,7 @@ sparkR.stop <- function() {
 #' sc <- sparkR.init("local[2]", "SparkR", "/home/spark",
 #'                  list(spark.executor.memory="1g"))
 #' sc <- sparkR.init("yarn-client", "SparkR", "/home/spark",
-#'                  list(spark.executor.memory="1g"),
+#'                  list(spark.executor.memory="4g", spark.driver.memory="2g"),
--- End diff --

updated. created JIRA for the additional programming guide change: https://issues.apache.org/jira/browse/SPARK-11407 - I could take a shot at the doc change if you'd like
[GitHub] spark pull request: [SPARK-8019] [SPARKR] Support SparkR spawning ...
Github user felixcheung commented on the pull request: https://github.com/apache/spark/pull/6557#issuecomment-152292218 This is updated by #9179
[GitHub] spark pull request: [SPARK-11409][SPARKR] Enable url link in R doc...
GitHub user felixcheung opened a pull request: https://github.com/apache/spark/pull/9363

[SPARK-11409][SPARKR] Enable url link in R doc for Persist

Quick one-line doc fix - the link is not clickable
![image](https://cloud.githubusercontent.com/assets/8969467/10833041/4e91dd7c-7e4c-11e5-8905-713b986dbbde.png)

@shivaram

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/felixcheung/spark rpersistdoc

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/9363.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #9363

commit 808cbac97a66cbf8453b377baf26571a7ccbe707
Author: felixcheung <felixcheun...@hotmail.com>
Date: 2015-10-29T21:48:31Z

    Enable url link in R doc for Persist
[GitHub] spark pull request: [SPARK-11294][SPARKR] Improve R doc for read.d...
GitHub user felixcheung opened a pull request: https://github.com/apache/spark/pull/9261

[SPARK-11294][SPARKR] Improve R doc for read.df, write.df, saveAsTable

Add examples for read.df, write.df; fix grouping for read.df, loadDF; fix formatting and text truncation for write.df, saveAsTable.

Several text issues:
![image](https://cloud.githubusercontent.com/assets/8969467/10708590/1303a44e-79c3-11e5-854f-3a2e16854cd7.png)
- text collapsed into a single paragraph
- text truncated at 2 places, e.g. "overwrite: Existing data is expected to be overwritten by the contents of error:"

@shivaram

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/felixcheung/spark rdocreadwritedf

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/9261.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #9261

commit edd58ef8f2aa64dcb3e5c6a777bffcb74b255cec
Author: felixcheung <felixcheun...@hotmail.com>
Date: 2015-10-24T03:16:00Z

    Add example for R read.df, write.df; fix formatting and text truncation for write.df, saveAsTable
[GitHub] spark pull request: [SPARK-10863][SPARKR] Method coltypes() to get...
Github user felixcheung commented on the pull request: https://github.com/apache/spark/pull/8984#issuecomment-151611692

@shivaram as I was discussing with @sun-rui in #9218 - I think coltypes() could probably handle complex types (JVM -> R) by mapping "map<string,int>" -> "environment", "array" -> "list", but this conversion is not perfect, at least from R -> JVM. I can have this:
```
> e <- new.env()
> e[["abd"]] <- 1276
> e[["84798"]] <- "abc"
> l <- list(e)
> df <- createDataFrame(sqlContext, list(l))
> df
DataFrame[_1:map<string,string>]
```
So although `env` supports mixed-type values, they are mapped to `map<string, string>` on the JVM. In fact, this DataFrame doesn't seem to work properly:
```
> head(df)
15/10/27 19:02:02 ERROR Executor: Exception in task 0.0 in stage 6.0 (TID 6)
scala.MatchError: 1276.0 (of class java.lang.Double)
	at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:295)
```
[GitHub] spark pull request: [SPARK-11343] [ML] Allow float and double pred...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/9296#discussion_r43195310

--- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/RegressionEvaluator.scala ---
@@ -72,10 +73,13 @@ final class RegressionEvaluator @Since("1.4.0") (@Since("1.4.0") override val ui
   @Since("1.4.0")
   override def evaluate(dataset: DataFrame): Double = {
     val schema = dataset.schema
-    SchemaUtils.checkColumnType(schema, $(predictionCol), DoubleType)
-    SchemaUtils.checkColumnType(schema, $(labelCol), DoubleType)
+    val predictionType = schema($(predictionCol)).dataType
+    require(predictionType == FloatType || predictionType == DoubleType)
--- End diff --

should we add a message to require()?
[GitHub] spark pull request: [SPARK-10863][SPARKR] Method coltypes() to get...
Github user felixcheung commented on the pull request: https://github.com/apache/spark/pull/8984#issuecomment-151654117 The error looks different - possibly related, but not exactly the same cause
[GitHub] spark pull request: [SPARK-11215] [ML] Add multiple columns suppor...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/9183#discussion_r43221520

--- Diff: R/pkg/inst/tests/test_mllib.R ---
@@ -56,14 +56,3 @@ test_that("feature interaction vs native glm", {
   rVals <- predict(glm(Sepal.Width ~ Species:Sepal.Length, data = iris), iris)
   expect_true(all(abs(rVals - vals) < 1e-6), rVals - vals)
 })
-
-test_that("summary coefficients match with native glm", {
--- End diff --

why is this removed?
[GitHub] spark pull request: [SPARK-10863][SPARKR] Method coltypes() to get...
Github user felixcheung commented on the pull request: https://github.com/apache/spark/pull/8984#issuecomment-152069034 I'm concerned with the lack of "reversibility", as in `coltypes(x) <- coltypes(x)`. Could I propose expanding on your suggestion to return `NA` for non-atomic types? `coltypes<-()` can skip over `NA` entries, and `coltypes()` can take an optional argument, defaulting to atomic.only = TRUE, which can be set to FALSE to get `list`, `environment`, `struct` etc. for non-atomic vector types.
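A rough sketch of the semantics proposed above, using a plain named vector in place of a real DataFrame schema - the helper names and the `atomic.only` flag are part of the proposal, not an existing SparkR API:

```r
# Hypothetical illustration of the proposed behavior: coltypes() returns NA
# for columns whose Spark type has no atomic R equivalent, and coltypes<-()
# would skip NA entries so that coltypes(x) <- coltypes(x) is a no-op.
spark_types <- c(a = "string", b = "integer", c = "map<string,int>")

r_type_of <- function(t) {
  switch(t, string = "character", integer = "integer", NA_character_)
}

coltypes_sketch <- function(types, atomic.only = TRUE) {
  r <- vapply(types, r_type_of, character(1))
  if (!atomic.only) r[is.na(r)] <- "environment"  # e.g. map -> environment
  r
}

stopifnot(is.na(coltypes_sketch(spark_types)[["c"]]))
stopifnot(coltypes_sketch(spark_types, atomic.only = FALSE)[["c"]] == "environment")
```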
[GitHub] spark pull request: [SPARK-11329] [SQL] Support star expansion for...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/9343#discussion_r43342070

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala ---
@@ -146,7 +146,11 @@ case class Alias(child: Expression, name: String)(
   override def toAttribute: Attribute = {
     if (resolved) {
-      AttributeReference(name, child.dataType, child.nullable, metadata)(exprId, qualifiers)
+      // Append the name of the alias as a qualifier. This lets us resolve things like:
+      //   (SELECT struct(a,b) AS x FROM ...).SELECT x.*
+      // TODO: is this the best way to do this? Should Alias just have nameas the qualifier?
--- End diff --

'nameas' -> 'name as'?
[GitHub] spark pull request: [SPARK-11340][SPARKR] Support setting driver p...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/9290#discussion_r43346365

--- Diff: R/pkg/R/sparkR.R ---
@@ -93,7 +93,7 @@ sparkR.stop <- function() {
 #' sc <- sparkR.init("local[2]", "SparkR", "/home/spark",
 #'                  list(spark.executor.memory="1g"))
 #' sc <- sparkR.init("yarn-client", "SparkR", "/home/spark",
-#'                  list(spark.executor.memory="1g"),
+#'                  list(spark.executor.memory="4g", spark.driver.memory="2g"),
--- End diff --

I could take this out - this is more for the API doc/roxygen. If we move it out, it would be visible only to someone reading the code. Maybe pointing to the SparkR programming guide is better.
[GitHub] spark pull request: [SPARK-11210][SPARKR][WIP] Add window function...
Github user felixcheung commented on the pull request: https://github.com/apache/spark/pull/9196#issuecomment-150021663 This is merging 2 PR/JIRA?
[GitHub] spark pull request: [SPARK-11284] [ML] ALS produces float predicti...
Github user felixcheung commented on the pull request: https://github.com/apache/spark/pull/9252#issuecomment-150700335 Shouldn't this be fixed/cast in the `RegressionEvaluator` instead?
[GitHub] spark pull request: [SPARK-11263][SPARKR] lintr Throws Warnings on...
Github user felixcheung commented on the pull request: https://github.com/apache/spark/pull/9463#issuecomment-155598069 @shivaram ?
[GitHub] spark pull request: [SPARK-11468][SPARKR] add stddev/variance agg ...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/9489#discussion_r44482198

--- Diff: R/pkg/R/functions.R ---
@@ -974,6 +1006,54 @@ setMethod("soundex",
   column(jc)
 })

+#' stddev
+#'
+#' Aggregate function: alias for \link{stddev_samp}
+#'
+#' @rdname stddev
+#' @name stddev
+#' @family agg_funcs
+#' @export
+#' @examples \dontrun{stddev(df$c)}
+setMethod("stddev",
+          signature(x = "Column"),
+          function(x) {
--- End diff --

This is an alias on the Scala/Spark side
[GitHub] spark pull request: [SPARK-11468][SPARKR] add stddev/variance agg ...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/9489#discussion_r44482223

--- Diff: R/pkg/R/functions.R ---
@@ -1168,6 +1248,54 @@ setMethod("upper",
   column(jc)
 })

+#' variance
+#'
+#' Aggregate function: alias for \link{var_samp}.
+#'
+#' @rdname variance
+#' @name variance
+#' @family agg_funcs
+#' @export
+#' @examples \dontrun{variance(df$c)}
+setMethod("variance",
+          signature(x = "Column"),
+          function(x) {
--- End diff --

Same here.
[GitHub] spark pull request: [SPARK-11567][PYTHON] Add Python API for corr ...
Github user felixcheung commented on the pull request: https://github.com/apache/spark/pull/9536#issuecomment-155597890 @davis?
[GitHub] spark pull request: [SPARK-10863][SPARKR] Method coltypes() (New v...
Github user felixcheung commented on the pull request: https://github.com/apache/spark/pull/9579#issuecomment-155258819 doc comment, looks good to me otherwise. @sun-rui @shivaram
[GitHub] spark pull request: [SPARK-10863][SPARKR] Method coltypes() (New v...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/9579#discussion_r44361878

--- Diff: R/pkg/R/DataFrame.R ---
@@ -2152,3 +2152,47 @@ setMethod("with",
   newEnv <- assignNewEnv(data)
   eval(substitute(expr), envir = newEnv, enclos = newEnv)
 })

+#' Returns the column types of a DataFrame.
+#'
+#' @name coltypes
+#' @title Get column types of a DataFrame
+#' @param x (DataFrame)
+#' @return value (character) A character vector with the column types of the given DataFrame
+#' @rdname coltypes
--- End diff --

Please add a family
```
#' @family dataframe_funcs
```
and an example, like
```
#' @examples
#' \dontrun{
#' with(irisDf, nrow(Sepal_Width))
#' }
```
[GitHub] spark pull request: Flaky SparkR test: test_sparkSQL.R: sample on ...
Github user felixcheung commented on the pull request: https://github.com/apache/spark/pull/9549#issuecomment-154880937 @shivaram @adrian555
[GitHub] spark pull request: Flaky SparkR test: test_sparkSQL.R: sample on ...
GitHub user felixcheung opened a pull request: https://github.com/apache/spark/pull/9549

Flaky SparkR test: test_sparkSQL.R: sample on a DataFrame

Make the sample test less flaky by setting the seed

Tested with
```
repeat {
  if (count(sample(df, FALSE, 0.1)) == 3) {
    break
  }
}
```

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/felixcheung/spark rsample

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/9549.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #9549

commit 6869ba1acce68d92e0b517ef19e376f94f0d8d9a
Author: felixcheung <felixcheun...@hotmail.com>
Date: 2015-11-08T22:11:46Z

    Make sample test less flaky
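The underlying flakiness is generic to fractional sampling: without a fixed seed, the size of a 10% sample varies from run to run. A plain-R analogue (not using SparkR) of why pinning the seed stabilizes the test:

```r
# Each row is kept independently with probability 0.1, so the sample size
# is binomial and varies between runs -- unless the RNG seed is fixed.
set.seed(42)
n1 <- sum(runif(100) < 0.1)

set.seed(42)
n2 <- sum(runif(100) < 0.1)  # same seed, same draws, same size

stopifnot(n1 == n2)
```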
[GitHub] spark pull request: SPARK-11420 Updating Stddev support via Impera...
Github user felixcheung commented on the pull request: https://github.com/apache/spark/pull/9380#issuecomment-155970380

This SparkR support has just been added, so this change breaks tests:
```
1. Failure (at test_sparkSQL.R#1010): group by, agg functions --
0 not equal to df3_local[df3_local$name == "Andy", ][1, 2]
NaN - 0 == NaN

2. Failure (at test_sparkSQL.R#1041): group by, agg functions --
0 not equal to df7_local[df7_local$name == "ID2", ][1, 2]
NaN - 0 == NaN
```
[GitHub] spark pull request: [SPARK-9319][SPARKR] Add support for setting c...
Github user felixcheung closed the pull request at: https://github.com/apache/spark/pull/9218
[GitHub] spark pull request: [SPARK-9319][SPARKR] Add support for setting c...
GitHub user felixcheung opened a pull request: https://github.com/apache/spark/pull/9654 [SPARK-9319][SPARKR] Add support for setting column names, types Add support for colnames, colnames<-, coltypes<-. Also added tests for names and names<-, which had no tests previously. I merged with PR 8984 (coltypes). Clicked the wrong thing and screwed up the PR; recreated it here. Was #9218 @shivaram @sun-rui You can merge this pull request into a Git repository by running: $ git pull https://github.com/felixcheung/spark colnamescoltypes Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/9654.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #9654 commit 7eb4815ffaf62dd7678b81f7aaf1bb49878a9303 Author: felixcheung <felixcheun...@hotmail.com> Date: 2015-10-22T06:48:43Z Add support for colnames, colnames<-, coltypes<- commit 009d755c43357b8e7839783e5975fa40cc1b1b66 Author: felixcheung <felixcheun...@hotmail.com> Date: 2015-11-01T20:33:28Z Take R types instead to map to JVM types, add check for NA to keep column commit 44662e62be27b55c33c04fae7b08ad1dc52a7c48 Author: felixcheung <felixcheun...@hotmail.com> Date: 2015-11-01T21:27:10Z This seems to fix the Rd error - no idea why it worked before. commit e846f9df26f593f70a0b140cd8226dec950802ff Author: felixcheung <felixcheun...@hotmail.com> Date: 2015-11-02T00:33:40Z fix test broken from column name change from cast commit c102bc0f3436d5840f706f38bd1882fba382e088 Author: felixcheung <felixcheun...@hotmail.com> Date: 2015-11-12T06:40:08Z rebase, merge with coltypes change, fix generic, doc
[GitHub] spark pull request: [SPARK-10500][SPARKR] sparkr.zip cannot be cre...
Github user felixcheung commented on the pull request: https://github.com/apache/spark/pull/9390#issuecomment-156024874 I don't fully understand the issue, but why do we have to use `.libPaths` and not `library(..., lib.loc = )`?
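The distinction being asked about can be sketched in plain R; the directory path here is hypothetical:

```r
# Option A: prepend a directory to the library search path.
# This affects every later library()/require() call in the session.
.libPaths(c("/tmp/sparkr-lib", .libPaths()))
library(SparkR)

# Option B: point a single library() call at the directory.
# The global search path is left untouched.
library(SparkR, lib.loc = "/tmp/sparkr-lib")
```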
[GitHub] spark pull request: SPARK-11420 Updating Stddev support via Impera...
Github user felixcheung commented on the pull request: https://github.com/apache/spark/pull/9380#issuecomment-156024999 @JihongMA yep, that should fix them
[GitHub] spark pull request: [SPARK-11031][SPARKR] Method str() on a DataFr...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/9613#discussion_r44734240 --- Diff: R/pkg/R/DataFrame.R --- @@ -2200,4 +2200,107 @@ setMethod("coltypes", rTypes[naIndices] <- types[naIndices] rTypes - }) \ No newline at end of file + }) + +#' Display the structure of a DataFrame, including column names, column types, as well as a +#' a small sample of rows. +#' @name str +#' @title Compactly display the structure of a dataset +#' @rdname str_data_frame +#' @family dataframe_funcs +#' @param x a DataFrame +#' @examples \dontrun{ +#' +#' # Create a DataFrame from the Iris dataset +#' irisDF <- createDataFrame(sqlContext, iris) +#' +#' # Show the structure of the DataFrame +#' str(irisDF) +#' +#' } +setMethod("str", signature="DataFrame", definition= +function(object) { --- End diff -- This is a little odd, try with the same format of coltypes: ``` setMethod("coltypes", signature(x = "DataFrame"), function(x) { ```
[GitHub] spark pull request: [SPARK-11031][SPARKR] Method str() on a DataFr...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/9613#discussion_r44734873 --- Diff: R/pkg/inst/tests/test_sparkSQL.R --- @@ -1525,6 +1525,22 @@ test_that("Method coltypes() to get R's data types of a DataFrame", { expect_equal(coltypes(x), "map<string,string>") }) +test_that("Method str()", { + # Structure of Iiris + iris2 <- iris + iris2$col <- TRUE + irisDF2 <- createDataFrame(sqlContext, iris2) + out <- capture.output(str(irisDF2)) + expect_equal(length(out), 7) + + # A random dataset with many columns + x <- runif(200, 1, 10) + df <- data.frame(t(as.matrix(data.frame(x,x,x,x,x,x,x,x,x + DF <- createDataFrame(sqlContext, df) + out <- capture.output(str(DF)) + expect_equal(length(out), 103) --- End diff -- could you please check for some specific values/text?
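A hedged sketch of what checking specific text could look like, using base R's str() on the local iris data.frame rather than the SparkR method under review:

```r
# Capture str() output and assert on its content, not just its length.
out <- capture.output(str(iris))

# The first line describes the object and its dimensions.
stopifnot(grepl("^'data.frame'", out[1]))

# Each column name should appear somewhere in the output.
stopifnot(any(grepl("Sepal.Length", out, fixed = TRUE)))
```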
[GitHub] spark pull request: [SPARK-11031][SPARKR] Method str() on a DataFr...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/9613#discussion_r44734936 --- Diff: R/pkg/R/generics.R --- @@ -1050,4 +1049,7 @@ setGeneric("with") #' @rdname coltypes #' @export -setGeneric("coltypes", function(x) { standardGeneric("coltypes") }) \ No newline at end of file +setGeneric("coltypes", function(x) { standardGeneric("coltypes") }) + +#' @export +setGeneric("str") --- End diff -- this should be sorted
[GitHub] spark pull request: [SPARK-11031][SPARKR] Method str() on a DataFr...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/9613#discussion_r44735202 --- Diff: R/pkg/R/DataFrame.R --- @@ -2200,4 +2200,107 @@ setMethod("coltypes", rTypes[naIndices] <- types[naIndices] rTypes - }) \ No newline at end of file + }) + +#' Display the structure of a DataFrame, including column names, column types, as well as a +#' a small sample of rows. +#' @name str +#' @title Compactly display the structure of a dataset +#' @rdname str_data_frame +#' @family dataframe_funcs +#' @param x a DataFrame +#' @examples \dontrun{ +#' +#' # Create a DataFrame from the Iris dataset +#' irisDF <- createDataFrame(sqlContext, iris) +#' +#' # Show the structure of the DataFrame +#' str(irisDF) +#' +#' } +setMethod("str", signature="DataFrame", definition= +function(object) { + + # A synonym for easily concatenating strings + "%++%" <- function(x, y) { +paste(x, y, sep = "") + } + + # TODO: These could be made global parameters, though in R it's not the case + DEFAULT_HEAD_ROWS <- 6 + MAX_CHAR_PER_ROW <- 120 + MAX_COLS <- 100 + + # Get the column names and types of the DataFrame + names <- names(object) + types <- coltypes(object) + + # Get the number of rows. + # TODO: Ideally, this should be cached + cachedCount <- nrow(object) + + # Get the first elements of the dataset. Limit number of columns accordingly + dataFrame <- if (ncol(object) > MAX_COLS) { + head(object[, c(1:MAX_COLS)], DEFAULT_HEAD_ROWS) + } else { + head(object, DEFAULT_HEAD_ROWS) + } + + # The number of observations will be displayed only if the number + # of rows of the dataset has already been cached. + if (!is.null(cachedCount)) { +cat("'" %++% class(object) %++% "': " %++% cachedCount %++% " obs. of " %++% + length(names) %++% " variables:\n") + } else { +cat("'" %++% class(object) %++% "': " %++% length(names) %++% " variables:\n") + } + + # Whether the ... should be printed at the end of each row + ellipsis <- FALSE + + # Add ellipsis (i.e., "...") if there are more rows than shown + if (!is.null(cachedCount)) { +if (nrow(object) > DEFAULT_HEAD_ROWS) { + ellipsis <- TRUE +} + } + + if (nrow(dataFrame) > 0) { +for (i in 1 : ncol(dataFrame)) { + firstElements <- "" + + # Get the first elements for each column + if (types[i] == "chr") { +firstElements <- paste("\"" %++% dataFrame[,i] %++% "\"", collapse = " ") + } else { +firstElements <- paste(dataFrame[,i], collapse = " ") + } + + # Add the corresponding number of spaces for alignment + spaces <- paste(rep(" ", max(nchar(names) - nchar(names[i]))), collapse="") + + # Get the short type. For 'character', it would be 'chr'; + # 'for numeric', it's 'num', etc. + dataType <- SHORT_TYPES[[types[i]]] + if (is.null(dataType)) { +dataType <- substring(types[i], 1, 3) + } + + # Concatenate the colnames, coltypes, and first + # elements of each column + line <- " $ " %++% names[i] %++% spaces %++% ": " %++% +dataType %++% " " %++% firstElements + + # Chop off extra characters if this is too long + cat(substr(line, 1, MAX_CHAR_PER_ROW)) --- End diff -- do we need to chop off 4 extra characters so the trailing " ..." fits within the limit?
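The chopping step under discussion can be sketched in isolation; `truncateLine` is a hypothetical helper, not part of the PR:

```r
# Truncate a line to maxChars, reserving 4 characters for a " ..." marker.
truncateLine <- function(line, maxChars = 120) {
  if (nchar(line) <= maxChars) {
    line
  } else {
    paste0(substr(line, 1, maxChars - 4), " ...")
  }
}

long <- strrep("x", 130)
nchar(truncateLine(long))  # stays within the 120-character budget
```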
[GitHub] spark pull request: [SPARK-11263][SPARKR] lintr Throws Warnings on...
Github user felixcheung commented on the pull request: https://github.com/apache/spark/pull/9463#issuecomment-156289622 thanks for catching that. I did some tests; I suspect they are caused by generics.R, so removing those.
[GitHub] spark pull request: [SPARK-11715][SPARKR] Add R support corr for C...
GitHub user felixcheung opened a pull request: https://github.com/apache/spark/pull/9680 [SPARK-11715][SPARKR] Add R support corr for Column Aggregration Need to match existing method signature You can merge this pull request into a Git repository by running: $ git pull https://github.com/felixcheung/spark rcorr Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/9680.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #9680 commit 68f1254ba525297857b0fc6959ca5d54c1509af9 Author: felixcheung <felixcheun...@hotmail.com> Date: 2015-11-13T00:18:28Z Add R support corr for Column Aggregration commit 940164c32aa75f3c74bdb966ecac69df602b5aa2 Author: felixcheung <felixcheun...@hotmail.com> Date: 2015-11-13T00:20:22Z fix doc text
[GitHub] spark pull request: [SPARK-11031][SPARKR] Method str() on a DataFr...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/9613#discussion_r44734132 --- Diff: R/pkg/R/DataFrame.R --- @@ -2200,4 +2200,107 @@ setMethod("coltypes", rTypes[naIndices] <- types[naIndices] rTypes - }) \ No newline at end of file + }) + +#' Display the structure of a DataFrame, including column names, column types, as well as a +#' a small sample of rows. +#' @name str +#' @title Compactly display the structure of a dataset +#' @rdname str_data_frame +#' @family dataframe_funcs +#' @param x a DataFrame +#' @examples \dontrun{ +#' +#' # Create a DataFrame from the Iris dataset +#' irisDF <- createDataFrame(sqlContext, iris) +#' +#' # Show the structure of the DataFrame +#' str(irisDF) +#' +#' } +setMethod("str", signature="DataFrame", definition= +function(object) { + + # A synonym for easily concatenating strings + "%++%" <- function(x, y) { +paste(x, y, sep = "") + } + + # TODO: These could be made global parameters, though in R it's not the case + DEFAULT_HEAD_ROWS <- 6 + MAX_CHAR_PER_ROW <- 120 + MAX_COLS <- 100 + + # Get the column names and types of the DataFrame + names <- names(object) + types <- coltypes(object) + + # Get the number of rows. + # TODO: Ideally, this should be cached + cachedCount <- nrow(object) + + # Get the first elements of the dataset. Limit number of columns accordingly + dataFrame <- if (ncol(object) > MAX_COLS) { + head(object[, c(1:MAX_COLS)], DEFAULT_HEAD_ROWS) + } else { + head(object, DEFAULT_HEAD_ROWS) + } + + # The number of observations will be displayed only if the number + # of rows of the dataset has already been cached. + if (!is.null(cachedCount)) { +cat("'" %++% class(object) %++% "': " %++% cachedCount %++% " obs. of " %++% + length(names) %++% " variables:\n") + } else { +cat("'" %++% class(object) %++% "': " %++% length(names) %++% " variables:\n") + } + + # Whether the ... should be printed at the end of each row + ellipsis <- FALSE + + # Add ellipsis (i.e., "...") if there are more rows than shown + if (!is.null(cachedCount)) { +if (nrow(object) > DEFAULT_HEAD_ROWS) { --- End diff -- collapse to `if (!is.null(cachedCount) && nrow(object) > DEFAULT_HEAD_ROWS)`?
[GitHub] spark pull request: [SPARK-11031][SPARKR] Method str() on a DataFr...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/9613#discussion_r44734047 --- Diff: R/pkg/R/DataFrame.R --- @@ -2200,4 +2200,107 @@ setMethod("coltypes", rTypes[naIndices] <- types[naIndices] rTypes - }) \ No newline at end of file + }) + +#' Display the structure of a DataFrame, including column names, column types, as well as a +#' a small sample of rows. +#' @name str +#' @title Compactly display the structure of a dataset +#' @rdname str_data_frame +#' @family dataframe_funcs +#' @param x a DataFrame +#' @examples \dontrun{ +#' +#' # Create a DataFrame from the Iris dataset +#' irisDF <- createDataFrame(sqlContext, iris) +#' +#' # Show the structure of the DataFrame +#' str(irisDF) +#' +#' } +setMethod("str", signature="DataFrame", definition= +function(object) { + + # A synonym for easily concatenating strings + "%++%" <- function(x, y) { +paste(x, y, sep = "") --- End diff -- use `paste0`?
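For reference, `paste0(x, y)` is equivalent to `paste(x, y, sep = "")`, so the helper can be written either way; a minimal check:

```r
"%++%" <- function(x, y) paste0(x, y)

# Both spellings produce the same concatenation.
stopifnot(identical("foo" %++% "bar", paste("foo", "bar", sep = "")))
```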
[GitHub] spark pull request: [SPARK-11715][SPARKR] Add R support corr for C...
Github user felixcheung commented on the pull request: https://github.com/apache/spark/pull/9680#issuecomment-156331258 I think #9366 is about computing a corr or cov matrix, whereas this is computing corr between two columns. They seem to be useful in their own ways. Also, this is already supported in Scala and Python.
[GitHub] spark pull request: [SPARK-11715][SPARKR] Add R support corr for C...
Github user felixcheung commented on the pull request: https://github.com/apache/spark/pull/9680#issuecomment-156484659 So #9366 is for all columns in a DataFrame (x = y, or different x and y DataFrames), and this #9680 is for two columns in one DataFrame.
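The split mirrors base R, where cor() handles both shapes; a small sketch with made-up vectors:

```r
x <- c(1, 2, 3, 4)
y <- c(2, 4, 6, 8)

# Correlation between two columns (what #9680 adds for SparkR Columns).
cor(x, y)                 # perfectly linear, so this is 1

# Correlation matrix over all columns (closer to what #9366 targets).
cor(cbind(a = x, b = y))  # 2 x 2 matrix
```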
[GitHub] spark pull request: [SPARK-11031][SPARKR] Method str() on a DataFr...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/9613#discussion_r44867038 --- Diff: R/pkg/R/generics.R --- @@ -971,6 +986,9 @@ setGeneric("size", function(x) { standardGeneric("size") }) #' @export setGeneric("soundex", function(x) { standardGeneric("soundex") }) +#' @export --- End diff -- add `#' @rdname str`
[GitHub] spark pull request: [SPARK-11031][SPARKR] Method str() on a DataFr...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/9613#discussion_r44867045 --- Diff: R/pkg/R/DataFrame.R --- @@ -2200,4 +2200,101 @@ setMethod("coltypes", rTypes[naIndices] <- types[naIndices] rTypes + }) + +#' Display the structure of a DataFrame, including column names, column types, as well as a +#' a small sample of rows. +#' @name str +#' @title Compactly display the structure of a dataset +#' @rdname str_data_frame +#' @family dataframe_funcs +#' @param x a DataFrame +#' @examples \dontrun{ +#' --- End diff -- please remove unneeded empty line after `\dontrun {` and before `}`
[GitHub] spark pull request: [SPARK-11031][SPARKR] Method str() on a DataFr...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/9613#discussion_r44867068 --- Diff: R/pkg/R/DataFrame.R --- @@ -2200,4 +2200,101 @@ setMethod("coltypes", rTypes[naIndices] <- types[naIndices] rTypes + }) + +#' Display the structure of a DataFrame, including column names, column types, as well as a +#' a small sample of rows. +#' @name str +#' @title Compactly display the structure of a dataset +#' @rdname str_data_frame +#' @family dataframe_funcs +#' @param x a DataFrame +#' @examples \dontrun{ +#' +#' # Create a DataFrame from the Iris dataset +#' irisDF <- createDataFrame(sqlContext, iris) +#' +#' # Show the structure of the DataFrame +#' str(irisDF) +#' +#' } +setMethod("str", + signature(object = "DataFrame"), + function(object) { + +# TODO: These could be made global parameters, though in R it's not the case +DEFAULT_HEAD_ROWS <- 6 +MAX_CHAR_PER_ROW <- 120 +MAX_COLS <- 100 + +# Get the column names and types of the DataFrame +names <- names(object) +types <- coltypes(object) + +# Get the number of rows. +# TODO: Ideally, this should be cached +cachedCount <- nrow(object) + +# Get the first elements of the dataset. Limit number of columns accordingly +dataFrame <- if (ncol(object) > MAX_COLS) { + head(object[, c(1:MAX_COLS)], DEFAULT_HEAD_ROWS) + } else { + head(object, DEFAULT_HEAD_ROWS) + } --- End diff -- if you call `head(object)` it would return the first 6 rows by default, perhaps leave it to the default behavior instead of passing in `DEFAULT_HEAD_ROWS` here?
[GitHub] spark pull request: [SPARK-11031][SPARKR] Method str() on a DataFr...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/9613#discussion_r44867029 --- Diff: R/pkg/R/DataFrame.R --- @@ -2200,4 +2200,101 @@ setMethod("coltypes", rTypes[naIndices] <- types[naIndices] rTypes + }) + +#' Display the structure of a DataFrame, including column names, column types, as well as a +#' a small sample of rows. +#' @name str +#' @title Compactly display the structure of a dataset +#' @rdname str_data_frame +#' @family dataframe_funcs +#' @param x a DataFrame +#' @examples \dontrun{ +#' +#' # Create a DataFrame from the Iris dataset +#' irisDF <- createDataFrame(sqlContext, iris) +#' +#' # Show the structure of the DataFrame +#' str(irisDF) +#' +#' } +setMethod("str", + signature(object = "DataFrame"), + function(object) { + +# TODO: These could be made global parameters, though in R it's not the case +DEFAULT_HEAD_ROWS <- 6 +MAX_CHAR_PER_ROW <- 120 +MAX_COLS <- 100 + +# Get the column names and types of the DataFrame +names <- names(object) +types <- coltypes(object) + +# Get the number of rows. +# TODO: Ideally, this should be cached +cachedCount <- nrow(object) + +# Get the first elements of the dataset. Limit number of columns accordingly +dataFrame <- if (ncol(object) > MAX_COLS) { + head(object[, c(1:MAX_COLS)], DEFAULT_HEAD_ROWS) + } else { + head(object, DEFAULT_HEAD_ROWS) + } + +# The number of observations will be displayed only if the number +# of rows of the dataset has already been cached. +if (!is.null(cachedCount)) { + cat(paste0("'", class(object), "': ", cachedCount, " obs. of ", +length(names), " variables:\n")) +} else { + cat(paste0("'", class(object), "': ", length(names), " variables:\n")) +} + +# Whether the ... should be printed at the end of each row +ellipsis <- FALSE + +# Add ellipsis (i.e., "...") if there are more rows than shown +if (!is.null(cachedCount) && (cachedCount > DEFAULT_HEAD_ROWS)) { + ellipsis <- TRUE +} + +if (nrow(dataFrame) > 0) { + for (i in 1 : ncol(dataFrame)) { +firstElements <- "" + +# Get the first elements for each column +if (types[i] == "chr") { --- End diff -- I understand that, the check on SHORT_TYPES is below in line 2278 though? here, `types` is still "character" right?
[GitHub] spark pull request: [SPARK-11031][SPARKR] Method str() on a DataFr...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/9613#discussion_r44867049 --- Diff: R/pkg/R/DataFrame.R --- @@ -2200,4 +2200,101 @@ setMethod("coltypes", rTypes[naIndices] <- types[naIndices] rTypes + }) + +#' Display the structure of a DataFrame, including column names, column types, as well as a +#' a small sample of rows. +#' @name str +#' @title Compactly display the structure of a dataset +#' @rdname str_data_frame --- End diff -- this should be `#' @rdname str`
[GitHub] spark pull request: [SPARK-11031][SPARKR] Method str() on a DataFr...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/9613#discussion_r44867056 --- Diff: R/pkg/R/DataFrame.R --- @@ -2200,4 +2200,101 @@ setMethod("coltypes", rTypes[naIndices] <- types[naIndices] rTypes + }) + +#' Display the structure of a DataFrame, including column names, column types, as well as a +#' a small sample of rows. +#' @name str +#' @title Compactly display the structure of a dataset +#' @rdname str_data_frame +#' @family dataframe_funcs +#' @param x a DataFrame --- End diff -- replace `x` with `object` to match the signature below
[GitHub] spark pull request: [SPARK-11031][SPARKR] Method str() on a DataFr...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/9613#discussion_r44867054 --- Diff: R/pkg/R/DataFrame.R --- @@ -2200,4 +2200,101 @@ setMethod("coltypes", rTypes[naIndices] <- types[naIndices] rTypes + }) + +#' Display the structure of a DataFrame, including column names, column types, as well as a +#' a small sample of rows. +#' @name str +#' @title Compactly display the structure of a dataset +#' @rdname str_data_frame +#' @family dataframe_funcs --- End diff -- This has been updated recently - you should see when you rebase the latest in master - it would be `#' @family DataFrame functions`
[GitHub] spark pull request: [SPARK-11684] [R] [ML] [Doc] Update SparkR glm...
Github user felixcheung commented on the pull request: https://github.com/apache/spark/pull/9727#issuecomment-157179595 looks good @shivaram
[GitHub] spark pull request: [SPARK-11715][SPARKR] Add R support corr for C...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/9680#discussion_r44974143 --- Diff: R/pkg/R/functions.R --- @@ -259,6 +259,20 @@ setMethod("column", function(x) { col(x) }) +#' corr +#' +#' Computes the Pearson Correlation Coefficient for two Columns. +#' +#' @rdname corr +#' @name corr +#' @family math_funcs +#' @export +#' @examples \dontrun{corr(df$c, df$d)} +setMethod("corr", signature(x = "Column", col1 = "Column", col2 = "missing", method = "missing"), --- End diff -- as for doc, the DataFrame corr in stats.R has `@rdname statfunctions` while this one has `@rdname corr`, so they go to different HTML pages generated by roxygen2
[GitHub] spark pull request: [SPARK-11715][SPARKR] Add R support corr for C...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/9680#discussion_r44973809 --- Diff: R/pkg/R/functions.R --- @@ -259,6 +259,20 @@ setMethod("column", function(x) { col(x) }) +#' corr +#' +#' Computes the Pearson Correlation Coefficient for two Columns. +#' +#' @rdname corr +#' @name corr +#' @family math_funcs +#' @export +#' @examples \dontrun{corr(df$c, df$d)} +setMethod("corr", signature(x = "Column", col1 = "Column", col2 = "missing", method = "missing"), --- End diff -- right, I like the approach of changing the existing generic definition. Perhaps we should align the method signature with stats::cor ``` cor(x, y = NULL, use = "everything", method = c("pearson", "kendall", "spearman")) ``` Do you know why we decided to name it `corr` (vs. `cor`) in other places?
[GitHub] spark pull request: [SPARK-11756][SPARKR] Fix use of aliases - Spa...
GitHub user felixcheung opened a pull request: https://github.com/apache/spark/pull/9750 [SPARK-11756][SPARKR] Fix use of aliases - SparkR can not output help information for SparkR:::summary correctly Fix use of aliases and change uses of @rdname and @seealso. `@aliases` is the hint for `?` - it should not be linked to some other name - those should be @seealso: https://cran.r-project.org/web/packages/roxygen2/vignettes/rd.html Clean up usage of @family, as multiple uses of @family with the same @rdname cause duplicated "See Also" HTML blocks. Also changing some @rdname to the dplyr-like variant for better R user visibility in R doc, e.g. rbind, summary, mutate, summarize. @shivaram @yanboliang You can merge this pull request into a Git repository by running: $ git pull https://github.com/felixcheung/spark rdocaliases Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/9750.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #9750 commit b2b0c3f7d290c9fa54b11fae2b5782172ddc8c78 Author: felixcheung <felixcheun...@hotmail.com> Date: 2015-11-17T00:22:32Z Fix use of aliases and changes uses of @rdname and @seealso commit 474d6af0533b56165a8189f16eda2bf6e1f21ee9 Author: felixcheung <felixcheun...@hotmail.com> Date: 2015-11-17T00:30:34Z minor typo
[GitHub] spark pull request: [SPARK-11715][SPARKR] Add R support corr for C...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/9680#discussion_r45019825 --- Diff: R/pkg/R/functions.R --- @@ -259,6 +259,20 @@ setMethod("column", function(x) { col(x) }) +#' corr +#' +#' Computes the Pearson Correlation Coefficient for two Columns. +#' +#' @rdname corr +#' @name corr +#' @family math_funcs +#' @export +#' @examples \dontrun{corr(df$c, df$d)} +setMethod("corr", signature(x = "Column", col1 = "Column", col2 = "missing", method = "missing"), --- End diff -- Thinking more about this, I think what's being added in #9366 matches https://stat.ethz.ch/R-manual/R-devel/library/stats/html/cor.html better. When we are adding that support in R we could add it as `cor` matching stats::cor. Meanwhile I'll change `corr` to what you suggested with `function(x, ...)`
[GitHub] spark pull request: [SPARK-11715][SPARKR] Add R support corr for C...
Github user felixcheung commented on the pull request: https://github.com/apache/spark/pull/9680#issuecomment-157268953 @sun-rui I updated it. It's not as strongly typed as I'd like, but if I add `col2 = "Column"` to the signature I get this error: ``` Error in match.call(definition, call, expand.dots, envir) : unused argument (col2 = c("Column", "")) ```
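The error quoted above stems from a general S4 rule: `setMethod` can only dispatch on arguments that appear in the generic's formals before `...`. A minimal, self-contained reproduction (the generic `g` here is invented for illustration):

```r
library(methods)

setGeneric("g", function(x, ...) standardGeneric("g"))

# Dispatching on x alone works; y is simply absorbed from `...`.
setMethod("g", signature(x = "numeric"), function(x, y) x + y)

# Trying to also dispatch on y fails, because y is not a formal
# argument of the generic -- only x and `...` are.
bad <- tryCatch(
  setMethod("g", signature(x = "numeric", y = "numeric"),
            function(x, y) x + y),
  error = function(e) "setMethod error"
)

g(1, 2)  # 3
```

So making `col2 = "Column"` part of the signature would require widening the generic itself to name `col2` explicitly, which is the trade-off being discussed in this thread.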
[GitHub] spark pull request: [SPARK-11468][SPARKR] add stddev/variance agg ...
Github user felixcheung commented on the pull request: https://github.com/apache/spark/pull/9489#issuecomment-155244366 @mengxr possibly.. though there are usage differences between SparkR DataFrame/Column and R data.frame (e.g. `agg` vs [`aggregate`](https://stat.ethz.ch/R-manual/R-devel/library/stats/html/aggregate.html); `var(faithful$eruptions)` would have been a Column with DataFrame). It might be more confusing.
[GitHub] spark pull request: [SPARK-10863][SPARKR] Method coltypes() (New v...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/9579#discussion_r44359782 --- Diff: R/pkg/R/schema.R --- @@ -115,20 +115,7 @@ structField.jobj <- function(x) { } checkType <- function(type) { - primtiveTypes <- c("byte", - "integer", - "float", - "double", - "numeric", - "character", - "string", - "binary", - "raw", - "logical", - "boolean", - "timestamp", - "date") - if (type %in% primtiveTypes) { + if (type %in% names(PRIMITIVE_TYPES)) { --- End diff -- To avoid search, `if (!is.null(PRIMITIVE_TYPES[[type]]))`
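The two checks suggested above behave the same when the table is a list or an environment, where `[[` on an absent key returns `NULL`; on an atomic named vector `[[` would instead raise "subscript out of bounds", so the `is.null` form assumes `PRIMITIVE_TYPES` is a list or environment. A small sketch with a made-up subset of the mapping:

```r
# Illustrative subset only -- not the full mapping from SparkR's types.R.
PRIMITIVE_TYPES <- list(
  byte    = "integer",
  integer = "integer",
  double  = "numeric",
  string  = "character",
  boolean = "logical"
)

# Linear scan over the names vector:
checkByNames <- function(type) type %in% names(PRIMITIVE_TYPES)

# Direct lookup; [[ on a list returns NULL for an absent key:
checkByLookup <- function(type) !is.null(PRIMITIVE_TYPES[[type]])

checkByNames("byte")     # TRUE
checkByLookup("byte")    # TRUE
checkByLookup("map")     # FALSE
```

For a handful of keys the difference is negligible; the lookup form mainly reads more directly as "is this key in the table".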
[GitHub] spark pull request: [DOC] Missing link to R DataFrame API doc
Github user felixcheung commented on the pull request: https://github.com/apache/spark/pull/9394#issuecomment-153241163 it should be `@seealso \link{createDataFrame}`
[GitHub] spark pull request: [SPARK-9319][SPARKR] Add support for setting c...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/9218#discussion_r43718136 --- Diff: R/pkg/R/DataFrame.R --- @@ -276,6 +276,75 @@ setMethod("names<-", } }) +#' @rdname columns +#' @name colnames +setMethod("colnames", + signature(x = "DataFrame"), + function(x) { +columns(x) + }) + +#' @rdname columns +#' @name colnames<- +setMethod("colnames<-", + signature(x = "DataFrame", value = "character"), + function(x, value) { +sdf <- callJMethod(x@sdf, "toDF", as.list(value)) +dataFrame(sdf) + }) + +rToScalaTypes <- new.env() +rToScalaTypes[["integer"]] <- "integer" # in R, integer is 32bit +rToScalaTypes[["numeric"]] <- "double" # in R, numeric == double which is 64bit +rToScalaTypes[["double"]]<- "double" +rToScalaTypes[["character"]] <- "string" +rToScalaTypes[["logical"]] <- "boolean" + +#' coltypes --- End diff -- that's the plan too. could we merge #8984 now? :)
[GitHub] spark pull request: [SPARK-9319][SPARKR] Add support for setting c...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/9218#discussion_r43718270 --- Diff: R/pkg/R/DataFrame.R --- @@ -276,6 +276,75 @@ setMethod("names<-", } }) +#' @rdname columns +#' @name colnames +setMethod("colnames", + signature(x = "DataFrame"), + function(x) { +columns(x) + }) + +#' @rdname columns +#' @name colnames<- +setMethod("colnames<-", + signature(x = "DataFrame", value = "character"), + function(x, value) { +sdf <- callJMethod(x@sdf, "toDF", as.list(value)) +dataFrame(sdf) + }) + +rToScalaTypes <- new.env() +rToScalaTypes[["integer"]] <- "integer" # in R, integer is 32bit +rToScalaTypes[["numeric"]] <- "double" # in R, numeric == double which is 64bit +rToScalaTypes[["double"]]<- "double" +rToScalaTypes[["character"]] <- "string" +rToScalaTypes[["logical"]] <- "boolean" + +#' coltypes +#' +#' Set the column types of a DataFrame. +#' +#' @name coltypes +#' @param x (DataFrame) +#' @return value (character) A character vector with the target column types for the given +#'DataFrame. Column types can be one of integer, numeric/double, character, logical, or NA +#'to keep that column as-is. +#' @rdname coltypes +#' @aliases coltypes +#' @export +#' @examples +#'\dontrun{ +#' sc <- sparkR.init() +#' sqlContext <- sparkRSQL.init(sc) +#' path <- "path/to/file.json" +#' df <- jsonFile(sqlContext, path) +#' coltypes(df) <- c("character", "integer") +#' coltypes(df) <- c(NA, "numeric") +#'} +setMethod("coltypes<-", + signature(x = "DataFrame", value = "character"), + function(x, value) { +cols <- columns(x) +ncols <- length(cols) +if (length(value) == 0 || length(value) != ncols) { --- End diff -- I agree, that's why we should check for length = 0. I am not sure about supporting it though since `emptyDataFrame` is not callable from R, AFAIK.
[GitHub] spark pull request: [SPARK-11407][SPARKR] Add doc for running from...
Github user felixcheung commented on the pull request: https://github.com/apache/spark/pull/9401#issuecomment-152870335 And generally people blog about using `.libPaths`, but this could cause all packages to be installed to the SparkR location as it becomes the default: ``` .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths())) > install.packages("lubridate") Installing package into ‘/opt/spark-1.6.0-bin-hadoop2.6/R/lib’ (as ‘lib’ is unspecified) ``` https://stat.ethz.ch/R-manual/R-devel/library/utils/html/install.packages.html
[GitHub] spark pull request: [DOC] Missing link to R DataFrame API doc
Github user felixcheung commented on the pull request: https://github.com/apache/spark/pull/9394#issuecomment-152870465 why don't we point to http://spark.apache.org/docs/latest/sparkr.html for now instead?
[GitHub] spark pull request: [SPARK-11407][SPARKR] Add doc for running from...
GitHub user felixcheung opened a pull request: https://github.com/apache/spark/pull/9401 [SPARK-11407][SPARKR] Add doc for running from RStudio ![image](https://cloud.githubusercontent.com/assets/8969467/10871746/612ba44a-80a4-11e5-99a0-40b9931dee52.png) (This is without css, but you get the idea) @shivaram You can merge this pull request into a Git repository by running: $ git pull https://github.com/felixcheung/spark rstudioprogrammingguide Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/9401.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #9401 commit a0485741f86656c0f4c5a588fd69598f04f49cd1 Author: felixcheung <felixcheun...@hotmail.com> Date: 2015-11-01T22:22:28Z Add doc for running from RStudio
[GitHub] spark pull request: [SPARK-9319][SPARKR] Add support for setting c...
Github user felixcheung commented on the pull request: https://github.com/apache/spark/pull/9218#issuecomment-152862908 ``` Error : /home/jenkins/workspace/SparkPullRequestBuilder/R/pkg/man/colnames.Rd: Sections \title, and \name must exist and be unique in Rd files ERROR: installing Rd objects failed for package 'SparkR' ``` This is odd; there shouldn't be an Rd file for colnames. I haven't changed this and it worked before: ``` #' @rdname columns #' @name colnames setMethod("colnames", ```
[GitHub] spark pull request: [SPARK-10863][SPARKR] Method coltypes() to get...
Github user felixcheung commented on the pull request: https://github.com/apache/spark/pull/8984#issuecomment-153870143 I'm a bit confused - I thought `map` `array` `struct` should return NA?
[GitHub] spark pull request: [SPARK-11260][SPARKR] with() function support
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/9443#discussion_r43940871 --- Diff: R/pkg/R/DataFrame.R --- @@ -2045,3 +2045,34 @@ setMethod("attach", } attach(newEnv, pos = pos, name = name, warn.conflicts = warn.conflicts) }) + +#' Evaluate a R expression in an environment constructed from a DataFrame +#' with() allows access to columns of a DataFrame by simply referring to +#' their name. It appends every column of a DataFrame into a new +#' environment. Then, the given expression is evaluated in this new +#' environment. +#' +#' @rdname with +#' @title Evaluate a R expression in an environment constructed from a DataFrame +#' @param data (DataFrame) DataFrame to use for constructing an environment. +#' @param expr (expression) Expression to evaluate. +#' @param ... arguments to be passed to future methods. +#' @examples +#' \dontrun{ +#' with(irisDf, nrow(Sepal_Width)) +#' } +#' @seealso \link{attach} +setMethod("with", + signature(data = "DataFrame"), + function(data, expr, ...) { +stopifnot(!missing(data)) +stopifnot(inherits(data, "DataFrame")) +cols <- columns(data) +stopifnot(length(cols) > 0) + +newEnv <- new.env() +for (i in 1:length(cols)) { + assign(x = cols[i], value = data[, cols[i]], envir = newEnv) +} --- End diff -- typically we would add an internal S3 method in this file for helper function
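The pattern in the patch under review — copy each column into a fresh environment, then evaluate the expression there — is the same idea behind base R's `with.default`. A base-R sketch of the mechanism (the `withCols` helper name is invented; SparkR's real method operates on a DataFrame, not a data.frame):

```r
# Evaluate expr with the columns of `data` visible as plain variables.
withCols <- function(data, expr) {
  # Parent is the caller's frame so functions like mean() still resolve.
  env <- new.env(parent = parent.frame())
  for (col in names(data)) {
    assign(x = col, value = data[[col]], envir = env)
  }
  # substitute() captures the unevaluated expression passed by the caller.
  eval(substitute(expr), env)
}

df <- data.frame(eruptions = c(3.6, 1.8, 3.3), waiting = c(79, 54, 74))
withCols(df, mean(eruptions))  # 2.9
withCols(df, max(waiting))     # 79
```

The review comment's point is that loop bodies like the `assign` one are usually factored into an internal helper rather than inlined in the S4 method.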
[GitHub] spark pull request: [SPARK-10863][SPARKR] Method coltypes() to get...
Github user felixcheung commented on the pull request: https://github.com/apache/spark/pull/8984#issuecomment-153904230 looks good. thanks
[GitHub] spark pull request: [SPARK-10863][SPARKR] Method coltypes() to get...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/8984#discussion_r43976192 --- Diff: R/pkg/R/DataFrame.R --- @@ -1914,3 +1914,46 @@ setMethod("attach", } attach(newEnv, pos = pos, name = name, warn.conflicts = warn.conflicts) }) + +#' Returns the column types of a DataFrame. +#' +#' @name coltypes +#' @title Get column types of a DataFrame +#' @param x (DataFrame) +#' @return value (character) A character vector with the column types of the given DataFrame +#' @rdname coltypes +setMethod("coltypes", + signature(x = "DataFrame"), + function(x) { +# Get the data types of the DataFrame by invoking dtypes() function +types <- sapply(dtypes(x), function(x) {x[[2]]}) + +# Map Spark data types into R's data types using DATA_TYPES environment +rTypes <- sapply(types, USE.NAMES=F, FUN=function(x) { + + # Check for primitive types + type <- PRIMITIVE_TYPES[[x]] + if (is.null(type)) { +# Check for complex types +for (t in names(COMPLEX_TYPES)) { --- End diff -- Or ``` Filter(function(t) { grep("start", txt) == 1 }, names(COMPLEX_TYPES)) ``` ? --
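The `Filter` idea suggested above amounts to a prefix match of the Spark type string against the complex-type names. A self-contained sketch of that shape (the `COMPLEX_TYPES` table and `sparkToRType` helper here are illustrative guesses, not SparkR's actual definitions):

```r
# Hypothetical complex-type table: Spark type prefix -> R type.
COMPLEX_TYPES <- list(map = "environment", array = "list", struct = "list")

sparkToRType <- function(sparkType) {
  # grepl with an anchored ^prefix pattern plays the role of the
  # grep(...) == 1 check in the review comment.
  hits <- Filter(function(prefix) grepl(paste0("^", prefix), sparkType),
                 names(COMPLEX_TYPES))
  if (length(hits) > 0) COMPLEX_TYPES[[hits[[1]]]] else NA_character_
}

sparkToRType("map<string,int>")  # "environment"
sparkToRType("array<double>")    # "list"
sparkToRType("decimal(10,0)")    # NA
```

Compared with an explicit `for` loop over `names(COMPLEX_TYPES)`, the `Filter` form is more compact but does the same linear scan.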
[GitHub] spark pull request: [SPARK-11086][SPARKR] Use dropFactors column-w...
Github user felixcheung commented on the pull request: https://github.com/apache/spark/pull/9099#issuecomment-153938620 @zero323 you could add the test code to SPARK-11283 so that they could be added back then.
[GitHub] spark pull request: [SPARK-11263][SPARKR] lintr Throws Warnings on...
Github user felixcheung commented on the pull request: https://github.com/apache/spark/pull/9463#issuecomment-153942273 @yu-iskw btw, these are two other false positives that you might want to follow up with lintr: ``` R/pkg/inst/tests/test_sparkSQL.R:907:53: style: Commented code should be removed. expect_equal(collect(select(df2, lpad(df2$a, 8, "#")))[1, 1], "###aaads") ^~~ R/pkg/R/RDD.R:228:63: style: Commented code should be removed. #' http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence. ^~~~ ```
[GitHub] spark pull request: [SPARK-11263][SPARKR] lintr Throws Warnings on...
Github user felixcheung commented on the pull request: https://github.com/apache/spark/pull/9463#issuecomment-153938433 Done. They won't do anything because of `@noRd`.
[GitHub] spark pull request: [SPARK-11468][SPARKR] add stddev/variance agg ...
GitHub user felixcheung opened a pull request: https://github.com/apache/spark/pull/9489 [SPARK-11468][SPARKR] add stddev/variance agg functions for Column Checked names, none of them should conflict with anything in base @shivaram @davies @rxin You can merge this pull request into a Git repository by running: $ git pull https://github.com/felixcheung/spark rstddev Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/9489.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #9489 commit 680b4759e04c7d371fc555e220730e0a4da251f5 Author: felixcheung <felixcheun...@hotmail.com> Date: 2015-11-05T02:37:04Z Add stddev and friends commit f63608e3d36bae0544281a94c732760bfebfcc6d Author: felixcheung <felixcheun...@hotmail.com> Date: 2015-11-05T05:05:57Z add aggFunction by name and on Column commit e0fda371d3691a21672a94e9e0b383890cc745a6 Author: felixcheung <felixcheun...@hotmail.com> Date: 2015-11-05T08:27:15Z Add tests, checked supported functions for GroupedData
[GitHub] spark pull request: [SPARK-11567][PYTHON] Add Python API for corr ...
GitHub user felixcheung opened a pull request: https://github.com/apache/spark/pull/9536 [SPARK-11567][PYTHON] Add Python API for corr in group like `df.agg(corr("col1", "col2"))` @davies You can merge this pull request into a Git repository by running: $ git pull https://github.com/felixcheung/spark pyfunc Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/9536.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #9536 commit 49773f3aa812c70049c66b3de48fc188be6d10be Author: felixcheung <felixcheun...@hotmail.com> Date: 2015-11-07T07:43:31Z Add corr that can be used in agg
[GitHub] spark pull request: [SPARK-10863][SPARKR] Method coltypes() to get...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/8984#discussion_r43797767 --- Diff: R/pkg/R/types.R --- @@ -0,0 +1,41 @@ +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# +# types.R. This file handles the data type mapping between Spark and R + +# The primitive data types, where names(PRIMITIVE_TYPES) are Scala types whereas +# values are equivalent R types. +PRIMITIVE_TYPES <- c( + "byte"="integer", + "tinyint"="integer", + "integer"="integer", + "float"="numeric", + "double"="numeric", + "numeric"="numeric", --- End diff -- I'm concerned about this: decimal does not map exactly to numeric, and it says it is not supported: http://spark.apache.org/docs/latest/sql-programming-guide.html#data-types