[jira] [Resolved] (SPARK-8847) String concatenation with column in SparkR
[ https://issues.apache.org/jira/browse/SPARK-8847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shivaram Venkataraman resolved SPARK-8847.
Resolution: Duplicate
Fix Version/s: 1.5.0

String concatenation with column in SparkR
Key: SPARK-8847
URL: https://issues.apache.org/jira/browse/SPARK-8847
Project: Spark
Issue Type: New Feature
Components: R
Reporter: Amar Gondaliya
Fix For: 1.5.0

1. String concatenation with the values of a column, i.e. df$newcol <- paste(a, df$column) type functionality.
2. String concatenation between columns, i.e. df$newcol <- paste(df$col1, "-", df$col2)

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
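The requested behavior mirrors R's `paste()`: element-wise concatenation with scalar recycling. A plain-Python analogue (the `paste` helper here is hypothetical, written only to illustrate the semantics asked for above, not SparkR's actual API):

```python
def paste(*columns, sep=" "):
    """Element-wise string concatenation, recycling scalars like R's paste()."""
    n = max(len(c) for c in columns if isinstance(c, list))
    # Broadcast scalar arguments to column length, then join row-wise.
    expanded = [c if isinstance(c, list) else [c] * n for c in columns]
    return [sep.join(str(v) for v in row) for row in zip(*expanded)]

col1 = ["a", "b"]
col2 = ["x", "y"]
print(paste("id", col1))            # ['id a', 'id b']  (scalar recycled)
print(paste(col1, col2, sep="-"))   # ['a-x', 'b-y']    (column-to-column)
```

The two calls correspond to the two feature requests: a scalar concatenated against a column, and two columns concatenated with a separator.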
[jira] [Updated] (SPARK-10007) Update `NAMESPACE` file in SparkR for simple parameters functions
[ https://issues.apache.org/jira/browse/SPARK-10007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shivaram Venkataraman updated SPARK-10007:
Assignee: Yu Ishikawa

Update `NAMESPACE` file in SparkR for simple parameters functions
Key: SPARK-10007
URL: https://issues.apache.org/jira/browse/SPARK-10007
Project: Spark
Issue Type: Sub-task
Components: SparkR
Reporter: Yu Ishikawa
Assignee: Yu Ishikawa
Fix For: 1.5.0

I'm afraid I forgot to update the {{NAMESPACE}} file for the simple-parameter functions, such as {{ascii}}, {{base64}} and so on.
[jira] [Resolved] (SPARK-10007) Update `NAMESPACE` file in SparkR for simple parameters functions
[ https://issues.apache.org/jira/browse/SPARK-10007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shivaram Venkataraman resolved SPARK-10007.
Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 8277
[https://github.com/apache/spark/pull/8277]

Update `NAMESPACE` file in SparkR for simple parameters functions
Key: SPARK-10007
URL: https://issues.apache.org/jira/browse/SPARK-10007
Project: Spark
Issue Type: Sub-task
Components: SparkR
Reporter: Yu Ishikawa
Fix For: 1.5.0

I'm afraid I forgot to update the {{NAMESPACE}} file for the simple-parameter functions, such as {{ascii}}, {{base64}} and so on.
[jira] [Commented] (SPARK-9427) Add expression functions in SparkR
[ https://issues.apache.org/jira/browse/SPARK-9427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14700058#comment-14700058 ]

Shivaram Venkataraman commented on SPARK-9427:

Well, half of the functions are already in branch-1.5, and I guess we should have PRs for some of the other simpler parts (like 9856) come in soon. The more complex ones which require changing SerDe might not be appropriate for 1.5, but my plan is to get as many of the simple ones in as we can.

Add expression functions in SparkR
Key: SPARK-9427
URL: https://issues.apache.org/jira/browse/SPARK-9427
Project: Spark
Issue Type: New Feature
Components: SparkR
Reporter: Yu Ishikawa

The list of functions to add is based on SQL's functions, and it would be better to add them in a one-shot PR.
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala
[jira] [Commented] (SPARK-10043) Add window functions into SparkR
[ https://issues.apache.org/jira/browse/SPARK-10043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14699032#comment-14699032 ]

Shivaram Venkataraman commented on SPARK-10043:

[~yuu.ishik...@gmail.com] Could you clarify which of these functions need support for better `collect` in SparkR? We only need the collect functionality if we are fetching data back to the driver.

Add window functions into SparkR
Key: SPARK-10043
URL: https://issues.apache.org/jira/browse/SPARK-10043
Project: Spark
Issue Type: Sub-task
Components: SparkR
Reporter: Yu Ishikawa

Add window functions as follows in SparkR. I think we should improve the {{collect}} function in SparkR.
- lead
- cumeDist
- denseRank
- lag
- ntile
- percentRank
- rank
- rowNumber
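To make the semantics of a few of the listed window functions concrete, here is a plain-Python sketch of `lag`, `lead`, and `rank` over a single ordered partition (illustrative only; SparkR would delegate these to Spark SQL's window functions rather than compute them locally):

```python
def lag(xs, offset=1, default=None):
    """Value offset rows before the current row; default where none exists."""
    return [default] * offset + xs[:-offset] if offset else xs[:]

def lead(xs, offset=1, default=None):
    """Value offset rows after the current row; default where none exists."""
    return xs[offset:] + [default] * offset if offset else xs[:]

def rank(xs):
    """Competition ranking: ties share a rank, and the next rank is skipped."""
    ordered = sorted(xs)
    return [ordered.index(x) + 1 for x in xs]

vals = [10, 20, 20, 30]
print(lag(vals))    # [None, 10, 20, 20]
print(lead(vals))   # [20, 20, 30, None]
print(rank(vals))   # [1, 2, 2, 4]
```

`denseRank` would differ from `rank` only in not skipping ranks after ties, and `rowNumber` would simply enumerate rows in order.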
[jira] [Updated] (SPARK-9871) Add expression functions into SparkR which have a variable parameter
[ https://issues.apache.org/jira/browse/SPARK-9871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shivaram Venkataraman updated SPARK-9871:
Assignee: Yu Ishikawa

Add expression functions into SparkR which have a variable parameter
Key: SPARK-9871
URL: https://issues.apache.org/jira/browse/SPARK-9871
Project: Spark
Issue Type: Sub-task
Components: SparkR
Reporter: Yu Ishikawa
Assignee: Yu Ishikawa
Fix For: 1.5.0

Add expression functions into SparkR which take a variable number of parameters, like {{concat}}
[jira] [Resolved] (SPARK-9871) Add expression functions into SparkR which have a variable parameter
[ https://issues.apache.org/jira/browse/SPARK-9871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shivaram Venkataraman resolved SPARK-9871.
Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 8194
[https://github.com/apache/spark/pull/8194]

Add expression functions into SparkR which have a variable parameter
Key: SPARK-9871
URL: https://issues.apache.org/jira/browse/SPARK-9871
Project: Spark
Issue Type: Sub-task
Components: SparkR
Reporter: Yu Ishikawa
Fix For: 1.5.0

Add expression functions into SparkR which take a variable number of parameters, like {{concat}}
[jira] [Commented] (SPARK-9427) Add expression functions in SparkR
[ https://issues.apache.org/jira/browse/SPARK-9427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14699080#comment-14699080 ]

Shivaram Venkataraman commented on SPARK-9427:

Yeah, I think the simplest thing might be to add a version of `rand(seed: Int)` (or `rand(seed: Double)` if we want to maintain precision?) to the API and do a cast in Scala to call the version with Long. cc [~rxin]

Add expression functions in SparkR
Key: SPARK-9427
URL: https://issues.apache.org/jira/browse/SPARK-9427
Project: Spark
Issue Type: New Feature
Components: SparkR
Reporter: Yu Ishikawa

The list of functions to add is based on SQL's functions, and it would be better to add them in a one-shot PR.
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala
[jira] [Commented] (SPARK-8684) Update R version in Spark EC2 AMI
[ https://issues.apache.org/jira/browse/SPARK-8684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14606293#comment-14606293 ]

Shivaram Venkataraman commented on SPARK-8684:

The create_image.sh script is only used for generating new AMIs, and we don't generate new AMIs very often as it's a pretty expensive process to do this on all zones, regions etc. Instead you could try to add a new directory named rstudio, and in the init.sh file there you could try to upgrade the R package using yum. This would look somewhat like the ganglia file at https://github.com/mesos/spark-ec2/blob/branch-1.4/ganglia/init.sh

BTW the best way to test this is to try out the yum upgrade on a running spark-ec2 cluster and then put the commands into a script. Also you can point spark-ec2 to a custom repository with the flag --spark-ec2-git-repo https://github.com/apache/spark/blob/c6ba2ea341ad23de265d870669b25e6a41f461e5/ec2/spark_ec2.py#L206 (the default is github.com/mesos/spark-ec2). So for example you could point it to your fork kaoning/spark-ec2.

Update R version in Spark EC2 AMI
Key: SPARK-8684
URL: https://issues.apache.org/jira/browse/SPARK-8684
Project: Spark
Issue Type: Improvement
Components: EC2, SparkR
Reporter: Shivaram Venkataraman
Priority: Minor

Right now the R version in the AMI is 3.1. However, a number of R libraries need R version 3.2, and it would be good to update the R version on the AMI while launching an EC2 cluster.
[jira] [Commented] (SPARK-8699) Select command not working for SparkR built on Spark Version: 1.4.0 and R 3.2.0
[ https://issues.apache.org/jira/browse/SPARK-8699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14608597#comment-14608597 ]

Shivaram Venkataraman commented on SPARK-8699:

I can't reproduce this. My guess is that this is happening because you have some other packages loaded which are overriding the select function. For example, if you replace select with SparkR::select does it work?

Select command not working for SparkR built on Spark Version: 1.4.0 and R 3.2.0
Key: SPARK-8699
URL: https://issues.apache.org/jira/browse/SPARK-8699
Project: Spark
Issue Type: Bug
Components: R
Affects Versions: 1.4.0
Environment: Windows 7, 64 bit
Reporter: Kamlesh Kumar
Priority: Critical
Labels: test

I can successfully run showDF and head on the rrdd data frame in R, but it throws an unexpected error for select commands. The R console output after running a select command on the rrdd data object is the following:

Command: head(select(df, df$eruptions))
Output:
Error in head(select(df, df$eruptions)) :
  error in evaluating the argument 'x' in selecting a method for function 'head':
  Error in UseMethod("select_") :
  no applicable method for 'select_' applied to an object of class "DataFrame"
[jira] [Commented] (SPARK-8724) Need documentation on how to deploy or use SparkR in Spark 1.4.0+
[ https://issues.apache.org/jira/browse/SPARK-8724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14608573#comment-14608573 ]

Shivaram Venkataraman commented on SPARK-8724:

I'm not sure what kind of documentation we need -- could you explain more? Other than YARN cluster mode, SparkR should work in all other modes by just running bin/sparkR (for the shell) and bin/spark-submit (for batch jobs). Feel free to open a PR if you have a good idea of what would be useful.

Need documentation on how to deploy or use SparkR in Spark 1.4.0+
Key: SPARK-8724
URL: https://issues.apache.org/jira/browse/SPARK-8724
Project: Spark
Issue Type: Bug
Components: R
Affects Versions: 1.4.0
Reporter: Felix Cheung
Priority: Minor

As of now there doesn't seem to be any official documentation on how to deploy SparkR with Spark 1.4.0+. Also, cluster-manager-specific documentation (like http://spark.apache.org/docs/latest/spark-standalone.html) does not call out what mode is supported for SparkR or give details on deployment steps.
[jira] [Commented] (SPARK-8277) SparkR createDataFrame is slow
[ https://issues.apache.org/jira/browse/SPARK-8277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14608689#comment-14608689 ]

Shivaram Venkataraman commented on SPARK-8277:

Yeah, so the bottleneck is in converting R data frames from columns to a list of rows. It would be interesting to see if we can serialize each column at a time and then somehow add them as columns to the Scala DataFrame (or do a column-to-row conversion in Scala). [~cafreeman] was looking at some related stuff at some point.

SparkR createDataFrame is slow
Key: SPARK-8277
URL: https://issues.apache.org/jira/browse/SPARK-8277
Project: Spark
Issue Type: Bug
Components: SparkR
Affects Versions: 1.4.0
Reporter: Shivaram Venkataraman

For example calling `createDataFrame` on the data from http://s3-us-west-2.amazonaws.com/sparkr-data/flights.csv takes a really long time. This is mainly because we try to convert a DataFrame to a list in order to parallelize it by rows, and the conversion from DF to list is very slow for large data frames.
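The column-to-row conversion described above is essentially a transpose: an R-style data frame is a set of equal-length columns, and parallelizing by rows means turning it into a list of row records. A minimal plain-Python sketch of that conversion (illustrative only, not SparkR's implementation):

```python
def columns_to_rows(df):
    """Transpose a dict of equal-length columns into a list of row dicts."""
    names = list(df)
    return [dict(zip(names, row)) for row in zip(*df.values())]

df = {"origin": ["PDX", "SEA"], "delay": [5, 12]}
print(columns_to_rows(df))
# [{'origin': 'PDX', 'delay': 5}, {'origin': 'SEA', 'delay': 12}]
```

Doing this row materialization in R for millions of rows is what dominates the cost, which is why serializing whole columns at a time (and transposing on the Scala side, if needed) is attractive.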
[jira] [Commented] (SPARK-7210) Test matrix decompositions for speed vs. numerical stability for Gaussians
[ https://issues.apache.org/jira/browse/SPARK-7210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14609002#comment-14609002 ]

Shivaram Venkataraman commented on SPARK-7210:

A more stable way would probably be to do a QR decomposition and then get the SVD from it. There are a bunch of QR algorithms implemented at https://github.com/amplab/ml-matrix in case anybody wants to take a shot at this.

Test matrix decompositions for speed vs. numerical stability for Gaussians
Key: SPARK-7210
URL: https://issues.apache.org/jira/browse/SPARK-7210
Project: Spark
Issue Type: Improvement
Components: MLlib
Reporter: Joseph K. Bradley
Priority: Minor

We currently use SVD for inverting the Gaussian's covariance matrix and computing the determinant. SVD is numerically stable but slow. We could experiment with Cholesky, etc. to figure out a better option, or a better option for certain settings.
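To illustrate the Cholesky alternative mentioned in the issue: for a symmetric positive-definite covariance matrix S, the factorization S = L Lᵀ yields the determinant as the squared product of L's diagonal, much more cheaply than an SVD. A self-contained sketch (pure Python, for illustration; MLlib would use Breeze/LAPACK):

```python
import math

def cholesky(S):
    """Cholesky factor L of a symmetric positive-definite matrix S = L L^T."""
    n = len(S)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(S[i][i] - s)   # diagonal entry
            else:
                L[i][j] = (S[i][j] - s) / L[j][j]  # below-diagonal entry
    return L

def det_from_cholesky(L):
    """det(S) = prod(diag(L))^2 once S = L L^T is available."""
    d = 1.0
    for i in range(len(L)):
        d *= L[i][i] ** 2
    return d

S = [[4.0, 2.0], [2.0, 3.0]]
print(det_from_cholesky(cholesky(S)))  # ~8.0 (= 4*3 - 2*2)
```

The trade-off the issue asks to measure is exactly this: Cholesky is roughly an order of magnitude cheaper than SVD but fails on near-singular covariances, where SVD degrades gracefully.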
[jira] [Commented] (SPARK-7210) Test matrix decompositions for speed vs. numerical stability for Gaussians
[ https://issues.apache.org/jira/browse/SPARK-7210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14609052#comment-14609052 ]

Shivaram Venkataraman commented on SPARK-7210:

Sorry, I think I misunderstood the JIRA title a little bit. I was commenting on procedures for computing the SVD of a matrix. I am not really sure what the problem setting is inside the GMM.

Test matrix decompositions for speed vs. numerical stability for Gaussians
Key: SPARK-7210
URL: https://issues.apache.org/jira/browse/SPARK-7210
Project: Spark
Issue Type: Improvement
Components: MLlib
Reporter: Joseph K. Bradley
Priority: Minor

We currently use SVD for inverting the Gaussian's covariance matrix and computing the determinant. SVD is numerically stable but slow. We could experiment with Cholesky, etc. to figure out a better option, or a better option for certain settings.
[jira] [Commented] (SPARK-8684) Update R version in Spark EC2 AMI
[ https://issues.apache.org/jira/browse/SPARK-8684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14609054#comment-14609054 ]

Shivaram Venkataraman commented on SPARK-8684:

Building from source might take a while, and it wouldn't be a good idea to do it by default. We could put it behind a flag (--r-version=3.2) and then only build from source if the user specifies the flag. But the yum option could be made default if we could get it to work.

Update R version in Spark EC2 AMI
Key: SPARK-8684
URL: https://issues.apache.org/jira/browse/SPARK-8684
Project: Spark
Issue Type: Improvement
Components: EC2, SparkR
Reporter: Shivaram Venkataraman
Priority: Minor

Right now the R version in the AMI is 3.1. However, a number of R libraries need R version 3.2, and it would be good to update the R version on the AMI while launching an EC2 cluster.
[jira] [Commented] (SPARK-8596) Install and configure RStudio server on Spark EC2
[ https://issues.apache.org/jira/browse/SPARK-8596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14609148#comment-14609148 ]

Shivaram Venkataraman commented on SPARK-8596:

I think the assumption is that the root user is running the scripts in /root/spark/bin -- no other use cases have been tested AFAIK. On the other hand, the Spark master (i.e. the service running at spark://master_host_name:7077) doesn't do any authentication as far as I know. So we should be able to submit jobs from other user accounts, but you might need to copy Spark to that user's account before running things.

Install and configure RStudio server on Spark EC2
Key: SPARK-8596
URL: https://issues.apache.org/jira/browse/SPARK-8596
Project: Spark
Issue Type: Improvement
Components: EC2, SparkR
Reporter: Shivaram Venkataraman

This will make it convenient for R users to use SparkR from their browsers.
[jira] [Updated] (SPARK-6803) [SparkR] Support SparkR Streaming
[ https://issues.apache.org/jira/browse/SPARK-6803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shivaram Venkataraman updated SPARK-6803:
Target Version/s: (was: 1.5.0)

[SparkR] Support SparkR Streaming
Key: SPARK-6803
URL: https://issues.apache.org/jira/browse/SPARK-6803
Project: Spark
Issue Type: New Feature
Components: SparkR, Streaming
Reporter: Hao

Adds R API for Spark Streaming. An experimental version is presented in repo [1], which follows the PySpark streaming design. Also, this PR can be further broken down into sub-task issues.
[1] https://github.com/hlin09/spark/tree/SparkR-streaming/
[jira] [Commented] (SPARK-6823) Add a model.matrix like capability to DataFrames (modelDataFrame)
[ https://issues.apache.org/jira/browse/SPARK-6823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14651196#comment-14651196 ]

Shivaram Venkataraman commented on SPARK-6823:

[~ekhliang] [~mengxr] Is this addressed by the StringType PR? I'm wondering if we can resolve this issue.

Add a model.matrix like capability to DataFrames (modelDataFrame)
Key: SPARK-6823
URL: https://issues.apache.org/jira/browse/SPARK-6823
Project: Spark
Issue Type: New Feature
Components: ML, SparkR
Reporter: Shivaram Venkataraman

Currently MLlib modeling tools work only with double data. However, data tables in practice often have a set of categorical fields (factors in R) that need to be converted to a set of 0/1 indicator variables (making the data actually used in a modeling algorithm completely numeric). In R, this is handled in modeling functions using the model.matrix function. Similar functionality needs to be available within Spark.
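The factor-to-indicator expansion the issue describes can be sketched in a few lines. This is a pure-Python illustration of what a model.matrix-style expansion does (the column naming is made up for the example, not Spark's or R's actual scheme):

```python
def one_hot(values):
    """Expand a categorical column into 0/1 indicator columns, one per level."""
    levels = sorted(set(values))
    return {level: [1.0 if v == level else 0.0 for v in values]
            for level in levels}

colors = ["red", "blue", "red"]
print(one_hot(colors))
# {'blue': [0.0, 1.0, 0.0], 'red': [1.0, 0.0, 1.0]}
```

After this expansion every column is numeric, which is the precondition the MLlib modeling tools require. (R's model.matrix additionally drops one level per factor to avoid collinearity with the intercept; this sketch keeps all levels for simplicity.)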
[jira] [Updated] (SPARK-6816) Add SparkConf API to configure SparkR
[ https://issues.apache.org/jira/browse/SPARK-6816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shivaram Venkataraman updated SPARK-6816:
Target Version/s: (was: 1.5.0)

Add SparkConf API to configure SparkR
Key: SPARK-6816
URL: https://issues.apache.org/jira/browse/SPARK-6816
Project: Spark
Issue Type: New Feature
Components: SparkR
Reporter: Shivaram Venkataraman
Priority: Minor

Right now the only way to configure SparkR is to pass in arguments to sparkR.init. The goal is to add an API similar to SparkConf on Scala/Python to make configuration easier.
[jira] [Updated] (SPARK-6832) Handle partial reads in SparkR JVM to worker communication
[ https://issues.apache.org/jira/browse/SPARK-6832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shivaram Venkataraman updated SPARK-6832:
Target Version/s: (was: 1.5.0)

Handle partial reads in SparkR JVM to worker communication
Key: SPARK-6832
URL: https://issues.apache.org/jira/browse/SPARK-6832
Project: Spark
Issue Type: Improvement
Components: SparkR
Reporter: Shivaram Venkataraman
Priority: Minor

After we move to using a socket between the R worker and the JVM, it's possible that readBin() in R will return partial results (for example, when interrupted by a signal).
[jira] [Updated] (SPARK-6821) Refactor SerDe API in SparkR to be more developer friendly
[ https://issues.apache.org/jira/browse/SPARK-6821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shivaram Venkataraman updated SPARK-6821:
Target Version/s: (was: 1.5.0)

Refactor SerDe API in SparkR to be more developer friendly
Key: SPARK-6821
URL: https://issues.apache.org/jira/browse/SPARK-6821
Project: Spark
Issue Type: Improvement
Components: SparkR
Reporter: Shivaram Venkataraman

The existing SerDe API we use in the SparkR JVM backend is limited and not very easy to use. We should refactor it to make it use more of Scala's type system and also allow extensions for user-defined S3 or S4 types in R.
[jira] [Commented] (SPARK-9319) Add support for setting column names, types
[ https://issues.apache.org/jira/browse/SPARK-9319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14651193#comment-14651193 ]

Shivaram Venkataraman commented on SPARK-9319:

[~falaki] I believe we already added support for setting column names with `names(data) <- c("Date")`? Should we also just make `colnames` a synonym for `names`?

Add support for setting column names, types
Key: SPARK-9319
URL: https://issues.apache.org/jira/browse/SPARK-9319
Project: Spark
Issue Type: Sub-task
Components: SparkR
Reporter: Shivaram Venkataraman

This will help us support functions of the form
{code}
colnames(data) <- c("Date", "Arrival_Delay")
coltypes(data) <- c("numeric", "logical", "character")
{code}
[jira] [Commented] (SPARK-6798) Fix Date serialization in SparkR
[ https://issues.apache.org/jira/browse/SPARK-6798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14651199#comment-14651199 ]

Shivaram Venkataraman commented on SPARK-6798:

[~davies] Do you remember if this is an actual bug or just a clunky implementation detail? I'm thinking of changing the type of this JIRA to `Improvement` and unsetting its target version. Let me know if this sounds good to you.

Fix Date serialization in SparkR
Key: SPARK-6798
URL: https://issues.apache.org/jira/browse/SPARK-6798
Project: Spark
Issue Type: Bug
Components: SparkR
Reporter: Shivaram Venkataraman
Assignee: Davies Liu
Priority: Minor

SparkR's date serialization right now sends strings from R to the JVM. We should convert this to integers and also account for timezones correctly by using DateUtils.
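The integer-based serialization suggested above means shipping a Date as days since the Unix epoch instead of a formatted string. A small sketch of the round trip using Python's standard library (illustrative; SparkR/Spark SQL would do the equivalent in R and Scala):

```python
from datetime import date

EPOCH = date(1970, 1, 1)

def serialize_date(d):
    """Encode a calendar date as an integer: days since 1970-01-01."""
    return (d - EPOCH).days

def deserialize_date(n):
    """Recover the calendar date from the day count."""
    return date.fromordinal(EPOCH.toordinal() + n)

n = serialize_date(date(2015, 8, 18))
print(n)                    # 16665
print(deserialize_date(n))  # 2015-08-18
```

Because the encoding is a calendar-day count rather than a timestamp, it is timezone-free by construction, which is what makes it safer than string parsing on the JVM side.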
[jira] [Updated] (SPARK-6809) Make numPartitions optional in pairRDD APIs
[ https://issues.apache.org/jira/browse/SPARK-6809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shivaram Venkataraman updated SPARK-6809:
Target Version/s: (was: 1.5.0)

Make numPartitions optional in pairRDD APIs
Key: SPARK-6809
URL: https://issues.apache.org/jira/browse/SPARK-6809
Project: Spark
Issue Type: Improvement
Components: SparkR
Reporter: Davies Liu
[jira] [Updated] (SPARK-6815) Support accumulators in R
[ https://issues.apache.org/jira/browse/SPARK-6815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shivaram Venkataraman updated SPARK-6815:
Target Version/s: (was: 1.5.0)

Support accumulators in R
Key: SPARK-6815
URL: https://issues.apache.org/jira/browse/SPARK-6815
Project: Spark
Issue Type: New Feature
Components: SparkR
Reporter: Shivaram Venkataraman
Priority: Minor

SparkR doesn't support accumulators right now. It might be good to add support for this to get feature parity with PySpark.
[jira] [Updated] (SPARK-6838) Explore using Reference Classes instead of S4 objects
[ https://issues.apache.org/jira/browse/SPARK-6838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shivaram Venkataraman updated SPARK-6838:
Target Version/s: (was: 1.5.0)

Explore using Reference Classes instead of S4 objects
Key: SPARK-6838
URL: https://issues.apache.org/jira/browse/SPARK-6838
Project: Spark
Issue Type: Improvement
Components: SparkR
Reporter: Shivaram Venkataraman
Priority: Minor

The current RDD and PipelinedRDD are represented as S4 objects. R has a newer OO system: Reference Classes (RC or R5). It is a more message-passing style of OO, and instances are mutable objects. It is not an important issue, but it should also require only trivial work. It could also remove the kind-of awkward @ operator in S4.

R6 is also worth checking out; it feels closer to an ordinary object-oriented language. https://github.com/wch/R6
[jira] [Updated] (SPARK-8082) Functionality to Reset DF Schemas/Cast Multiple Columns
[ https://issues.apache.org/jira/browse/SPARK-8082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shivaram Venkataraman updated SPARK-8082:
Target Version/s: (was: 1.5.0)

Functionality to Reset DF Schemas/Cast Multiple Columns
Key: SPARK-8082
URL: https://issues.apache.org/jira/browse/SPARK-8082
Project: Spark
Issue Type: New Feature
Components: SparkR
Reporter: Aleksander Eskilson
Priority: Minor

Currently only one column can be cast at a time with the cast() function. Either a cast that takes multiple arguments and/or a function allowing the DF schema to be reset would cut down on the code needed to recast a DF in some cases.
[jira] [Updated] (SPARK-6810) Performance benchmarks for SparkR
[ https://issues.apache.org/jira/browse/SPARK-6810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shivaram Venkataraman updated SPARK-6810:
Target Version/s: (was: 1.5.0)

Performance benchmarks for SparkR
Key: SPARK-6810
URL: https://issues.apache.org/jira/browse/SPARK-6810
Project: Spark
Issue Type: New Feature
Components: SparkR
Reporter: Shivaram Venkataraman
Priority: Critical

We should port some performance benchmarks from spark-perf to SparkR for tracking performance regressions / improvements. https://github.com/databricks/spark-perf/tree/master/pyspark-tests has a list of PySpark performance benchmarks.
[jira] [Updated] (SPARK-6825) Data sources implementation to support `sequenceFile`
[ https://issues.apache.org/jira/browse/SPARK-6825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shivaram Venkataraman updated SPARK-6825:
Target Version/s: (was: 1.5.0)

Data sources implementation to support `sequenceFile`
Key: SPARK-6825
URL: https://issues.apache.org/jira/browse/SPARK-6825
Project: Spark
Issue Type: New Feature
Components: SparkR, SQL
Reporter: Shivaram Venkataraman

SequenceFiles are a widely used input format, and right now they are not supported in SparkR. It would be good to add support for SequenceFiles by implementing a new data source that can create a DataFrame from a SequenceFile. However, as SequenceFiles can have arbitrary types, we probably need to map them to user-defined types in SQL.
[jira] [Commented] (SPARK-6831) Document how to use external data sources
[ https://issues.apache.org/jira/browse/SPARK-6831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14651198#comment-14651198 ]

Shivaram Venkataraman commented on SPARK-6831:

[~yhuai] Is this something we plan to do for 1.5? If not, we can unset the target version for this JIRA.

Document how to use external data sources
Key: SPARK-6831
URL: https://issues.apache.org/jira/browse/SPARK-6831
Project: Spark
Issue Type: Improvement
Components: Documentation, PySpark, SparkR, SQL
Reporter: Shivaram Venkataraman
Priority: Critical

We should include some instructions on how to use an external data source for users who are beginners. Do they need to install it on all the machines, or just the master? Are there any special flags they need to pass to `bin/spark-submit`, etc.?
[jira] [Updated] (SPARK-8684) Update R version in Spark EC2 AMI
[ https://issues.apache.org/jira/browse/SPARK-8684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shivaram Venkataraman updated SPARK-8684:
Target Version/s: (was: 1.5.0)

Update R version in Spark EC2 AMI
Key: SPARK-8684
URL: https://issues.apache.org/jira/browse/SPARK-8684
Project: Spark
Issue Type: Improvement
Components: EC2, SparkR
Reporter: Shivaram Venkataraman
Priority: Minor
Fix For: 1.5.0

Right now the R version in the AMI is 3.1. However, a number of R libraries need R version 3.2, and it would be good to update the R version on the AMI while launching an EC2 cluster.
[jira] [Updated] (SPARK-9443) Expose sampleByKey in SparkR
[ https://issues.apache.org/jira/browse/SPARK-9443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-9443: - Summary: Expose sampleByKey in SparkR (was: Explose sampleByKey in SparkR) Expose sampleByKey in SparkR Key: SPARK-9443 URL: https://issues.apache.org/jira/browse/SPARK-9443 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 1.5.0 Reporter: Hossein Falaki There is a pull request for DataFrames (I believe close to merging) that adds sampleByKey. It would be great to expose it in SparkR. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9053) Fix spaces around parens, infix operators etc.
[ https://issues.apache.org/jira/browse/SPARK-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman resolved SPARK-9053. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7584 [https://github.com/apache/spark/pull/7584] Fix spaces around parens, infix operators etc. -- Key: SPARK-9053 URL: https://issues.apache.org/jira/browse/SPARK-9053 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Shivaram Venkataraman Fix For: 1.5.0 We have a number of style errors which look like {code} Place a space before left parenthesis ... Put spaces around all infix operators. {code} However, some of the warnings are spurious (for example, the space around the infix operator in {code} expect_equal(collect(select(df, hypot(df$a, df$b)))[4, HYPOT(a, b)], sqrt(4^2 + 8^2)) {code}). We should add an ignore rule for these spurious examples. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
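For reference, the two lint rules mentioned in the issue flag code shaped like the following. This is a minimal R sketch of the rule behavior; the exact lintr configuration used by `dev/lint-r` is an assumption.

```r
# Flagged: "Place a space before left parenthesis" in control constructs
if(x > 0) print(x)
# Compliant
if (x > 0) print(x)

# Flagged: "Put spaces around all infix operators"
y <- 4+8
# Compliant
y <- 4 + 8

# The spurious case: `^` inside a test expectation is flagged even though
# R convention keeps exponentiation unspaced, e.g. sqrt(4^2 + 8^2)
```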
[jira] [Updated] (SPARK-9053) Fix spaces around parens, infix operators etc.
[ https://issues.apache.org/jira/browse/SPARK-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-9053: - Assignee: Yu Ishikawa Fix spaces around parens, infix operators etc. -- Key: SPARK-9053 URL: https://issues.apache.org/jira/browse/SPARK-9053 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Shivaram Venkataraman Assignee: Yu Ishikawa Fix For: 1.5.0 We have a number of style errors which look like {code} Place a space before left parenthesis ... Put spaces around all infix operators. {code} However, some of the warnings are spurious (for example, the space around the infix operator in {code} expect_equal(collect(select(df, hypot(df$a, df$b)))[4, HYPOT(a, b)], sqrt(4^2 + 8^2)) {code}). We should add an ignore rule for these spurious examples. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9510) Fix remaining SparkR style violations
Shivaram Venkataraman created SPARK-9510: Summary: Fix remaining SparkR style violations Key: SPARK-9510 URL: https://issues.apache.org/jira/browse/SPARK-9510 Project: Spark Issue Type: Sub-task Reporter: Shivaram Venkataraman lint-r should report no errors / warnings before we can turn it on in Jenkins. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8742) Improve SparkR error messages for DataFrame API
[ https://issues.apache.org/jira/browse/SPARK-8742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-8742: - Assignee: Hossein Falaki Improve SparkR error messages for DataFrame API --- Key: SPARK-8742 URL: https://issues.apache.org/jira/browse/SPARK-8742 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 1.4.1 Reporter: Hossein Falaki Assignee: Hossein Falaki Priority: Blocker Fix For: 1.5.0 Currently all DataFrame API errors result in the following generic error: {code} Error: returnStatus == 0 is not TRUE {code} This is because invokeJava in backend.R does not inspect error messages. For most use cases it is critical to return better error messages. Initially, we can return the stack trace from the JVM. In the future we can inspect the errors and translate them into human-readable error messages. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8742) Improve SparkR error messages for DataFrame API
[ https://issues.apache.org/jira/browse/SPARK-8742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman resolved SPARK-8742. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7742 [https://github.com/apache/spark/pull/7742] Improve SparkR error messages for DataFrame API --- Key: SPARK-8742 URL: https://issues.apache.org/jira/browse/SPARK-8742 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 1.4.1 Reporter: Hossein Falaki Priority: Blocker Fix For: 1.5.0 Currently all DataFrame API errors result in the following generic error: {code} Error: returnStatus == 0 is not TRUE {code} This is because invokeJava in backend.R does not inspect error messages. For most use cases it is critical to return better error messages. Initially, we can return the stack trace from the JVM. In the future we can inspect the errors and translate them into human-readable error messages. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
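The fix described in the issue amounts to reading an error payload when the status code is non-zero, rather than asserting on the code alone. A rough sketch of the shape of that change; the helper names (`readInt`, `readString`, `readObject`, `conn`) are assumptions modeled on backend.R, and the merged patch in pull request 7742 may differ:

```r
invokeJava <- function(isStatic, objId, methodName, ...) {
  # ... existing request serialization and write to the JVM backend ...
  returnStatus <- readInt(conn)
  if (returnStatus != 0) {
    # Instead of the generic "returnStatus == 0 is not TRUE" assertion
    # failure, read the message sent back by the backend (e.g. the JVM
    # stack trace) and raise it as the R error.
    stop(readString(conn))
  }
  readObject(conn)
}
```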
[jira] [Assigned] (SPARK-9510) Fix remaining SparkR style violations
[ https://issues.apache.org/jira/browse/SPARK-9510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman reassigned SPARK-9510: Assignee: Shivaram Venkataraman Fix remaining SparkR style violations - Key: SPARK-9510 URL: https://issues.apache.org/jira/browse/SPARK-9510 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Shivaram Venkataraman Assignee: Shivaram Venkataraman Fix For: 1.5.0 lint-r should report no errors / warnings before we can turn it on in Jenkins. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9324) Add `unique` as a synonym for `distinct`
[ https://issues.apache.org/jira/browse/SPARK-9324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-9324: - Assignee: Hossein Falaki Add `unique` as a synonym for `distinct` Key: SPARK-9324 URL: https://issues.apache.org/jira/browse/SPARK-9324 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Shivaram Venkataraman Assignee: Hossein Falaki Fix For: 1.5.0 In R unique returns a new data.frame with duplicate rows removed. cc [~rxin] is there some different meaning for `unique` in Spark ? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9510) Fix remaining SparkR style violations
[ https://issues.apache.org/jira/browse/SPARK-9510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman resolved SPARK-9510. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7834 [https://github.com/apache/spark/pull/7834] Fix remaining SparkR style violations - Key: SPARK-9510 URL: https://issues.apache.org/jira/browse/SPARK-9510 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Shivaram Venkataraman Fix For: 1.5.0 lint-r should report no errors / warnings before we can turn it on in Jenkins. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9322) Add rbind as a synonym for `unionAll`
[ https://issues.apache.org/jira/browse/SPARK-9322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-9322: - Assignee: Hossein Falaki Add rbind as a synonym for `unionAll` - Key: SPARK-9322 URL: https://issues.apache.org/jira/browse/SPARK-9322 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Shivaram Venkataraman Assignee: Hossein Falaki Fix For: 1.5.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9321) Add nrow, ncol, dim for SparkR data frames
[ https://issues.apache.org/jira/browse/SPARK-9321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman resolved SPARK-9321. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7764 [https://github.com/apache/spark/pull/7764] Add nrow, ncol, dim for SparkR data frames -- Key: SPARK-9321 URL: https://issues.apache.org/jira/browse/SPARK-9321 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Shivaram Venkataraman Fix For: 1.5.0 `nrow` will be a synonym for `count` and `ncol` can be implemented using `columns()` or `dtypes` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9321) Add nrow, ncol, dim for SparkR data frames
[ https://issues.apache.org/jira/browse/SPARK-9321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-9321: - Assignee: Hossein Falaki Add nrow, ncol, dim for SparkR data frames -- Key: SPARK-9321 URL: https://issues.apache.org/jira/browse/SPARK-9321 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Shivaram Venkataraman Assignee: Hossein Falaki Fix For: 1.5.0 `nrow` will be a synonym for `count` and `ncol` can be implemented using `columns()` or `dtypes` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
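The mapping sketched in the issue description is direct: `nrow` delegates to `count` (a Spark job), while `ncol` only needs the schema. A hedged R sketch of how these methods could be defined for a SparkR DataFrame; the implementation merged in pull request 7764 may differ in detail:

```r
# nrow as a synonym for count() -- this triggers a Spark job
setMethod("nrow", signature(x = "DataFrame"), function(x) {
  count(x)
})

# ncol needs only the column list, so no job is run
setMethod("ncol", signature(x = "DataFrame"), function(x) {
  length(columns(x))
})

# dim combines the two, mirroring base R's dim() on a data.frame
setMethod("dim", signature(x = "DataFrame"), function(x) {
  c(count(x), length(columns(x)))
})
```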
[jira] [Resolved] (SPARK-9322) Add rbind as a synonym for `unionAll`
[ https://issues.apache.org/jira/browse/SPARK-9322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman resolved SPARK-9322. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7764 [https://github.com/apache/spark/pull/7764] Add rbind as a synonym for `unionAll` - Key: SPARK-9322 URL: https://issues.apache.org/jira/browse/SPARK-9322 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Shivaram Venkataraman Fix For: 1.5.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9324) Add `unique` as a synonym for `distinct`
[ https://issues.apache.org/jira/browse/SPARK-9324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman resolved SPARK-9324. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7764 [https://github.com/apache/spark/pull/7764] Add `unique` as a synonym for `distinct` Key: SPARK-9324 URL: https://issues.apache.org/jira/browse/SPARK-9324 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Shivaram Venkataraman Fix For: 1.5.0 In R unique returns a new data.frame with duplicate rows removed. cc [~rxin] is there some different meaning for `unique` in Spark ? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9562) Move spark-ec2 from mesos to amplab
Shivaram Venkataraman created SPARK-9562: Summary: Move spark-ec2 from mesos to amplab Key: SPARK-9562 URL: https://issues.apache.org/jira/browse/SPARK-9562 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Shivaram Venkataraman See http://apache-spark-developers-list.1001551.n3.nabble.com/Re-Should-spark-ec2-get-its-own-repo-td13151.html for more details -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9248) Closing curly-braces should always be on their own line
[ https://issues.apache.org/jira/browse/SPARK-9248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-9248: - Assignee: Yu Ishikawa Closing curly-braces should always be on their own line --- Key: SPARK-9248 URL: https://issues.apache.org/jira/browse/SPARK-9248 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Yu Ishikawa Assignee: Yu Ishikawa Priority: Minor Fix For: 1.5.0 Closing curly-braces should always be on their own line For example, {noformat} inst/tests/test_sparkSQL.R:606:3: style: Closing curly-braces should always be on their own line, unless it's followed by an else. }, error = function(err) { ^ {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
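The lint message quoted above refers to `tryCatch` handlers where a closing brace shares a line with the next argument. A minimal R illustration of the flagged shape and one compliant rewrite (the exact style adopted by the fix is an assumption):

```r
# Flagged: the closing brace is followed by ", error = ..." on the same line
tryCatch({
  read.df(sqlContext, "no-such-path")
}, error = function(err) {
  print(err)
})

# Compliant: every closing curly-brace stands on its own line
# (unless it is followed by an else)
tryCatch(
  {
    read.df(sqlContext, "no-such-path")
  },
  error = function(err) {
    print(err)
  }
)
```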
[jira] [Resolved] (SPARK-9248) Closing curly-braces should always be on their own line
[ https://issues.apache.org/jira/browse/SPARK-9248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman resolved SPARK-9248. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7795 [https://github.com/apache/spark/pull/7795] Closing curly-braces should always be on their own line --- Key: SPARK-9248 URL: https://issues.apache.org/jira/browse/SPARK-9248 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Yu Ishikawa Priority: Minor Fix For: 1.5.0 Closing curly-braces should always be on their own line For example, {noformat} inst/tests/test_sparkSQL.R:606:3: style: Closing curly-braces should always be on their own line, unless it's followed by an else. }, error = function(err) { ^ {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9437) SizeEstimator overflows for primitive arrays
[ https://issues.apache.org/jira/browse/SPARK-9437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648025#comment-14648025 ] Shivaram Venkataraman commented on SPARK-9437: -- Resolved by https://github.com/apache/spark/pull/7750 SizeEstimator overflows for primitive arrays Key: SPARK-9437 URL: https://issues.apache.org/jira/browse/SPARK-9437 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.1 Reporter: Imran Rashid Assignee: Imran Rashid Priority: Minor Fix For: 1.5.0 {{SizeEstimator}} can overflow when dealing with large primitive arrays, e.g. if you have an {{Array[Double]}} of size 1 << 28. This means that when you try to broadcast a large primitive array, you get: {noformat} java.lang.IllegalArgumentException: requirement failed: sizeInBytes was negative: -2147483608 at scala.Predef$.require(Predef.scala:233) at org.apache.spark.storage.BlockInfo.markReady(BlockInfo.scala:55) at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:815) at org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:638) ... {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9437) SizeEstimator overflows for primitive arrays
[ https://issues.apache.org/jira/browse/SPARK-9437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman resolved SPARK-9437. -- Resolution: Fixed Fix Version/s: 1.5.0 SizeEstimator overflows for primitive arrays Key: SPARK-9437 URL: https://issues.apache.org/jira/browse/SPARK-9437 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.1 Reporter: Imran Rashid Assignee: Imran Rashid Priority: Minor Fix For: 1.5.0 {{SizeEstimator}} can overflow when dealing with large primitive arrays, e.g. if you have an {{Array[Double]}} of size 1 << 28. This means that when you try to broadcast a large primitive array, you get: {noformat} java.lang.IllegalArgumentException: requirement failed: sizeInBytes was negative: -2147483608 at scala.Predef$.require(Predef.scala:233) at org.apache.spark.storage.BlockInfo.markReady(BlockInfo.scala:55) at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:815) at org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:638) ... {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
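The arithmetic behind the overflow: 2^28 doubles occupy 2^31 bytes of payload, which exceeds a signed 32-bit integer, so a size accumulated in a JVM Int wraps negative. A quick check of the numbers, written in R for consistency with the rest of this digest:

```r
n_elements <- 2^28       # size of the Array[Double], i.e. 1 << 28
bytes <- n_elements * 8  # 2^31 = 2147483648 bytes of payload
int_max <- 2^31 - 1      # Integer.MAX_VALUE on the JVM
bytes > int_max          # TRUE: the payload alone overflows an Int

# Payload plus a small object overhead (40 bytes here), interpreted as a
# signed 32-bit value, wraps around to exactly the figure in the error:
(2^31 + 40) - 2^32       # -2147483608, matching "sizeInBytes was negative"
```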
[jira] [Commented] (SPARK-8724) Need documentation on how to deploy or use SparkR in Spark 1.4.0+
[ https://issues.apache.org/jira/browse/SPARK-8724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14682219#comment-14682219 ] Shivaram Venkataraman commented on SPARK-8724: -- [~cantdutchthis] One thing we could do is add a section at the bottom of http://spark.apache.org/docs/latest/sparkr.html titled `Deploying SparkR` or `Where to go from here` and a short description of how to launch EC2 clusters with RStudio (in 1.5) and also link to the RStudio blog post. Need documentation on how to deploy or use SparkR in Spark 1.4.0+ - Key: SPARK-8724 URL: https://issues.apache.org/jira/browse/SPARK-8724 Project: Spark Issue Type: Bug Components: R Affects Versions: 1.4.0 Reporter: Felix Cheung Priority: Minor As of now there doesn't seem to be any official documentation on how to deploy SparkR with Spark 1.4.0+ Also, cluster manager specific documentation (like http://spark.apache.org/docs/latest/spark-standalone.html) does not call out what mode is supported for SparkR and details on deployment steps. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9562) Move spark-ec2 from mesos to amplab
[ https://issues.apache.org/jira/browse/SPARK-9562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman reassigned SPARK-9562: Assignee: Shivaram Venkataraman Move spark-ec2 from mesos to amplab --- Key: SPARK-9562 URL: https://issues.apache.org/jira/browse/SPARK-9562 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Shivaram Venkataraman Assignee: Shivaram Venkataraman Fix For: 1.5.0 See http://apache-spark-developers-list.1001551.n3.nabble.com/Re-Should-spark-ec2-get-its-own-repo-td13151.html for more details -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9562) Move spark-ec2 from mesos to amplab
[ https://issues.apache.org/jira/browse/SPARK-9562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman resolved SPARK-9562. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7899 [https://github.com/apache/spark/pull/7899] Move spark-ec2 from mesos to amplab --- Key: SPARK-9562 URL: https://issues.apache.org/jira/browse/SPARK-9562 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Shivaram Venkataraman Fix For: 1.5.0 See http://apache-spark-developers-list.1001551.n3.nabble.com/Re-Should-spark-ec2-get-its-own-repo-td13151.html for more details -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9603) Re-enable complex R package test in SparkSubmitSuite
[ https://issues.apache.org/jira/browse/SPARK-9603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-9603: - Component/s: SparkR Re-enable complex R package test in SparkSubmitSuite Key: SPARK-9603 URL: https://issues.apache.org/jira/browse/SPARK-9603 Project: Spark Issue Type: Test Components: Deploy, SparkR, Tests Affects Versions: 1.5.0 Reporter: Burak Yavuz For building complex Spark Packages that contain R code in addition to Scala, we have a complex procedure, where R source code is shipped inside a jar. The source code is extracted, built, and is added as a library among SparkR. The end to end test in SparkSubmitSuite (correctly builds R packages included in a jar with --packages) can't run on Jenkins now, because the pull request builder is not built with SparkR. Once the PR Builder is built with SparkR, we should re-enable the test. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9605) SparkR installation error: Error: Invalid or corrupt jarfile sbt/sbt-launch-0.13.6.jar
[ https://issues.apache.org/jira/browse/SPARK-9605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14654073#comment-14654073 ] Shivaram Venkataraman commented on SPARK-9605: -- The amplab version of SparkR is no longer supported and the SparkR project has become a part of the Apache Spark project. Please follow instructions to download and run Spark ( 1.4) at http://spark.apache.org/docs/latest/#downloading SparkR installation error: Error: Invalid or corrupt jarfile sbt/sbt-launch-0.13.6.jar -- Key: SPARK-9605 URL: https://issues.apache.org/jira/browse/SPARK-9605 Project: Spark Issue Type: Bug Environment: R version 3.1.2 (2014-10-31) Platform: x86_64-apple-darwin13.4.0 (64-bit) locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] devtools_1.6.1 rJava_0.9-7 loaded via a namespace (and not attached): [1] bitops_1.0-6 httr_0.5 magrittr_1.5 RCurl_1.95-4.5 stringi_0.5-5 stringr_1.0.0 tools_3.1.2 Reporter: Selcuk Korkmaz I am fairly new to Spark! I am trying to install SparkR package. But I am getting following error: library(devtools) install_github(amplab-extras/SparkR-pkg, subdir=pkg) Downloading github repo amplab-extras/SparkR-pkg@master Installing SparkR Installing dependencies for SparkR: '/Library/Frameworks/R.framework/Resources/bin/R' --vanilla CMD INSTALL \ '/private/var/folders/x_/y8_3xqc130n1q55fwwkmgm00gn/T/RtmpRH9vkn/devtools1ec166a2c628/amplab-extras-SparkR-pkg-e532627/pkg' \ --library='/Library/Frameworks/R.framework/Versions/3.1/Resources/library' --install-tests installing source package 'SparkR' ... libs arch - ./sbt/sbt assembly Attempting to fetch sbt 'SparkR' removing '/Library/Frameworks/R.framework/Versions/3.1/Resources/library/SparkR' Error: Command failed (1) I have installed scala-2.11.7 with following approach. 
$ brew update $ brew install scala $ brew install sbt I could not install scala-2.10. Is this the part of the problem. Thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9605) SparkR installation error: Error: Invalid or corrupt jarfile sbt/sbt-launch-0.13.6.jar
[ https://issues.apache.org/jira/browse/SPARK-9605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14654149#comment-14654149 ] Shivaram Venkataraman commented on SPARK-9605: -- You don't need to install SparkR package in R. You can download Spark 1.4.1 from http://spark.apache.org/downloads.html, unzip it and then run ./bin/sparkR. BTW this is a more appropriate question for the Spark user mailing list (http://spark.apache.org/community.html) and not for the JIRA (which is used for bug reports, development tracking etc.) SparkR installation error: Error: Invalid or corrupt jarfile sbt/sbt-launch-0.13.6.jar -- Key: SPARK-9605 URL: https://issues.apache.org/jira/browse/SPARK-9605 Project: Spark Issue Type: Bug Environment: R version 3.1.2 (2014-10-31) Platform: x86_64-apple-darwin13.4.0 (64-bit) locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] devtools_1.6.1 rJava_0.9-7 loaded via a namespace (and not attached): [1] bitops_1.0-6 httr_0.5 magrittr_1.5 RCurl_1.95-4.5 stringi_0.5-5 stringr_1.0.0 tools_3.1.2 Reporter: Selcuk Korkmaz I am fairly new to Spark! I am trying to install SparkR package. But I am getting following error: library(devtools) install_github(amplab-extras/SparkR-pkg, subdir=pkg) Downloading github repo amplab-extras/SparkR-pkg@master Installing SparkR Installing dependencies for SparkR: '/Library/Frameworks/R.framework/Resources/bin/R' --vanilla CMD INSTALL \ '/private/var/folders/x_/y8_3xqc130n1q55fwwkmgm00gn/T/RtmpRH9vkn/devtools1ec15ed080d/amplab-extras-SparkR-pkg-e532627/pkg' \ --library='/Library/Frameworks/R.framework/Versions/3.1/Resources/library' --install-tests * installing *source* package ‘SparkR’ ... 
** libs ** arch - ./sbt/sbt assembly Attempting to fetch sbt Launching sbt from sbt/sbt-launch-0.13.6.jar Error: Invalid or corrupt jarfile sbt/sbt-launch-0.13.6.jar make: *** [target/scala-2.10/sparkr-assembly-0.1.jar] Error 1 ERROR: compilation failed for package ‘SparkR’ * removing ‘/Library/Frameworks/R.framework/Versions/3.1/Resources/library/SparkR’ Error: Command failed (1) I have installed scala-2.11.7 with following approach. $ brew update $ brew install scala $ brew install sbt I could not install scala-2.10. Is this the part of the problem. Thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9605) SparkR installation error: Error: Invalid or corrupt jarfile sbt/sbt-launch-0.13.6.jar
[ https://issues.apache.org/jira/browse/SPARK-9605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman resolved SPARK-9605. -- Resolution: Not A Problem SparkR installation error: Error: Invalid or corrupt jarfile sbt/sbt-launch-0.13.6.jar -- Key: SPARK-9605 URL: https://issues.apache.org/jira/browse/SPARK-9605 Project: Spark Issue Type: Bug Environment: R version 3.1.2 (2014-10-31) Platform: x86_64-apple-darwin13.4.0 (64-bit) locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] devtools_1.6.1 rJava_0.9-7 loaded via a namespace (and not attached): [1] bitops_1.0-6 httr_0.5 magrittr_1.5 RCurl_1.95-4.5 stringi_0.5-5 stringr_1.0.0 tools_3.1.2 Reporter: Selcuk Korkmaz I am fairly new to Spark! I am trying to install SparkR package. But I am getting following error: library(devtools) install_github(amplab-extras/SparkR-pkg, subdir=pkg) Downloading github repo amplab-extras/SparkR-pkg@master Installing SparkR Installing dependencies for SparkR: '/Library/Frameworks/R.framework/Resources/bin/R' --vanilla CMD INSTALL \ '/private/var/folders/x_/y8_3xqc130n1q55fwwkmgm00gn/T/RtmpRH9vkn/devtools1ec15ed080d/amplab-extras-SparkR-pkg-e532627/pkg' \ --library='/Library/Frameworks/R.framework/Versions/3.1/Resources/library' --install-tests * installing *source* package ‘SparkR’ ... ** libs ** arch - ./sbt/sbt assembly Attempting to fetch sbt Launching sbt from sbt/sbt-launch-0.13.6.jar Error: Invalid or corrupt jarfile sbt/sbt-launch-0.13.6.jar make: *** [target/scala-2.10/sparkr-assembly-0.1.jar] Error 1 ERROR: compilation failed for package ‘SparkR’ * removing ‘/Library/Frameworks/R.framework/Versions/3.1/Resources/library/SparkR’ Error: Command failed (1) I have installed scala-2.11.7 with following approach. $ brew update $ brew install scala $ brew install sbt I could not install scala-2.10. 
Is this the part of the problem. Thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9972) Add `struct` function in SparkR
[ https://issues.apache.org/jira/browse/SPARK-9972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14697232#comment-14697232 ] Shivaram Venkataraman commented on SPARK-9972: -- Yeah this can be marked as being blocked by https://issues.apache.org/jira/browse/SPARK-6819 Add `struct` function in SparkR --- Key: SPARK-9972 URL: https://issues.apache.org/jira/browse/SPARK-9972 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Yu Ishikawa Support {{struct}} function on a DataFrame in SparkR. However, I think we need to improve {{collect}} function in SparkR in order to implement {{struct}} function. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7420) Flaky test: o.a.s.streaming.JobGeneratorSuite Do not clear received block data too soon
[ https://issues.apache.org/jira/browse/SPARK-7420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-7420: - Labels: flaky-test (was: ) Flaky test: o.a.s.streaming.JobGeneratorSuite Do not clear received block data too soon - Key: SPARK-7420 URL: https://issues.apache.org/jira/browse/SPARK-7420 Project: Spark Issue Type: Bug Components: Tests Affects Versions: 1.3.1, 1.4.0 Reporter: Andrew Or Assignee: Tathagata Das Priority: Critical Labels: flaky-test {code} The code passed to eventually never returned normally. Attempted 18 times over 10.13803606001 seconds. Last failure message: receiverTracker.hasUnallocatedBlocks was false. {code} It seems to be failing only in maven. https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.3-Maven-pre-YARN/hadoop.version=2.0.0-mr1-cdh4.1.2,label=centos/458/ https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.3-Maven-pre-YARN/hadoop.version=1.0.4,label=centos/459/ https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/hadoop.version=1.0.4,label=centos/2173/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9710) RPackageUtilsSuite fails if R is not installed
[ https://issues.apache.org/jira/browse/SPARK-9710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman resolved SPARK-9710. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 8008 [https://github.com/apache/spark/pull/8008] RPackageUtilsSuite fails if R is not installed -- Key: SPARK-9710 URL: https://issues.apache.org/jira/browse/SPARK-9710 Project: Spark Issue Type: Bug Components: Tests Affects Versions: 1.5.0 Reporter: Marcelo Vanzin Fix For: 1.5.0 That's because there's a bug in RUtils.scala. PR soon. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9710) RPackageUtilsSuite fails if R is not installed
[ https://issues.apache.org/jira/browse/SPARK-9710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-9710: - Assignee: Marcelo Vanzin RPackageUtilsSuite fails if R is not installed -- Key: SPARK-9710 URL: https://issues.apache.org/jira/browse/SPARK-9710 Project: Spark Issue Type: Bug Components: Tests Affects Versions: 1.5.0 Reporter: Marcelo Vanzin Assignee: Marcelo Vanzin Fix For: 1.5.0 That's because there's a bug in RUtils.scala. PR soon.
[jira] [Commented] (SPARK-9972) Add `struct`, `encode` and `decode` function in SparkR
[ https://issues.apache.org/jira/browse/SPARK-9972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14698012#comment-14698012 ] Shivaram Venkataraman commented on SPARK-9972: -- [~yuu.ishik...@gmail.com] Why does `sort_array` need nested types ? The sorting is only going to happen in the Java side and the return type is only a Column ? Add `struct`, `encode` and `decode` function in SparkR -- Key: SPARK-9972 URL: https://issues.apache.org/jira/browse/SPARK-9972 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Yu Ishikawa Support {{struct}} function on a DataFrame in SparkR. However, I think we need to improve {{collect}} function in SparkR in order to implement {{struct}} function. - struct - encode - decode - array_contains - sort_array
[jira] [Commented] (SPARK-9427) Add expression functions in SparkR
[ https://issues.apache.org/jira/browse/SPARK-9427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14692662#comment-14692662 ] Shivaram Venkataraman commented on SPARK-9427: -- [~yuu.ishik...@gmail.com] Breaking it into 3 PRs sounds good to me. Do you have an idea of how many functions there are of each type ? Add expression functions in SparkR -- Key: SPARK-9427 URL: https://issues.apache.org/jira/browse/SPARK-9427 Project: Spark Issue Type: New Feature Components: SparkR Reporter: Yu Ishikawa The list of functions to add is based on SQL's functions. And it would be better to add them in one shot PR. https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala
[jira] [Commented] (SPARK-9865) Flaky SparkR test: test_sparkSQL.R: sample on a DataFrame
[ https://issues.apache.org/jira/browse/SPARK-9865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14693814#comment-14693814 ] Shivaram Venkataraman commented on SPARK-9865: -- So we sample 10% in a DataFrame with 3 rows and expect to get less than 3 rows. I guess there is a very small chance that you still get back 3 rows. One fix for this might be to just sample 1% ? [~davies] Do you have any other fix in mind ? Flaky SparkR test: test_sparkSQL.R: sample on a DataFrame - Key: SPARK-9865 URL: https://issues.apache.org/jira/browse/SPARK-9865 Project: Spark Issue Type: Bug Components: SparkR Reporter: Davies Liu 1. Failure (at test_sparkSQL.R#525): sample on a DataFrame - count(sampled3) < 3 isn't true Error: Test failures Execution halted https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1468/console
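The failure odds discussed above can be sketched numerically. Assuming the sampler draws each row independently with probability equal to the fraction (a per-row Bernoulli model, not a claim about Spark's exact sampler implementation), the chance that all n rows survive is fraction^n:

```python
def p_all_rows_kept(n_rows: int, fraction: float) -> float:
    """Probability that an independent per-row Bernoulli sample with the
    given fraction keeps every one of n_rows rows -- the flaky case where
    the assertion count(sampled) < n_rows fails."""
    return fraction ** n_rows

# With 3 rows at a 10% fraction the test fails roughly once in a
# thousand runs; at 1% it would fail roughly once in a million.
print(p_all_rows_kept(3, 0.10))
print(p_all_rows_kept(3, 0.01))
```

So lowering the fraction shrinks the flake rate but never eliminates it; only a seeded sample or a deterministic assertion would.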
[jira] [Resolved] (SPARK-8313) Support Spark Packages containing R code with --packages
[ https://issues.apache.org/jira/browse/SPARK-8313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman resolved SPARK-8313. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7139 [https://github.com/apache/spark/pull/7139] Support Spark Packages containing R code with --packages Key: SPARK-8313 URL: https://issues.apache.org/jira/browse/SPARK-8313 Project: Spark Issue Type: New Feature Components: Spark Submit, SparkR Reporter: Burak Yavuz Fix For: 1.5.0
[jira] [Updated] (SPARK-8313) Support Spark Packages containing R code with --packages
[ https://issues.apache.org/jira/browse/SPARK-8313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-8313: - Assignee: Burak Yavuz Support Spark Packages containing R code with --packages Key: SPARK-8313 URL: https://issues.apache.org/jira/browse/SPARK-8313 Project: Spark Issue Type: New Feature Components: Spark Submit, SparkR Reporter: Burak Yavuz Assignee: Burak Yavuz Fix For: 1.5.0
[jira] [Commented] (SPARK-9121) Get rid of the warnings about `no visible global function definition` in SparkR
[ https://issues.apache.org/jira/browse/SPARK-9121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635300#comment-14635300 ] Shivaram Venkataraman commented on SPARK-9121: -- Yeah we can add `install-dev.sh` in Jenkins before dev/lint-r. One unfortunate thing is that we typically do a lint-check before we run the rest of the Jenkins tests (build, unit tests etc.) So it would be good to not have this be the other way around I guess Get rid of the warnings about `no visible global function definition` in SparkR --- Key: SPARK-9121 URL: https://issues.apache.org/jira/browse/SPARK-9121 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Yu Ishikawa We have a lot of warnings about {{no visible global function definition}} in SparkR. So we should get rid of them. {noformat} R/utils.R:513:5: warning: no visible global function definition for ‘processClosure’ processClosure(func.body, oldEnv, defVars, checkedFuncs, newEnv) ^~ {noformat}
[jira] [Commented] (SPARK-9053) Fix spaces around parens, infix operators etc.
[ https://issues.apache.org/jira/browse/SPARK-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14636060#comment-14636060 ] Shivaram Venkataraman commented on SPARK-9053: -- Yeah - there are a bunch of real issues to be fixed first and we can discuss the ignore rule after that. Also I don't think we should ignore all warnings of this form -- just, say, on the `^` operator, or we can mark out portions of the code that need to be ignored etc. Fix spaces around parens, infix operators etc. -- Key: SPARK-9053 URL: https://issues.apache.org/jira/browse/SPARK-9053 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Shivaram Venkataraman We have a number of style errors which look like {code} Place a space before left parenthesis ... Put spaces around all infix operators. {code} However some of the warnings are spurious (for example, the space around the infix operator in {code} expect_equal(collect(select(df, hypot(df$a, df$b)))[4, "HYPOT(a, b)"], sqrt(4^2 + 8^2)) {code}). We should add an ignore rule for these spurious examples
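If the style check runs through lintr (as dev/lint-r suggests), one way to scope the exclusion to a single expression is lintr's per-line exclusion comment, leaving the infix-operator rule active everywhere else. A sketch, assuming the Jenkins linter honors lintr's `# nolint` marker:

```r
# The bare `^` inside sqrt() trips a spurious "put spaces around all
# infix operators" warning; exclude only this line from linting:
expect_equal(collect(select(df, hypot(df$a, df$b)))[4, "HYPOT(a, b)"],
             sqrt(4^2 + 8^2))  # nolint
```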
[jira] [Resolved] (SPARK-9121) Get rid of the warnings about `no visible global function definition` in SparkR
[ https://issues.apache.org/jira/browse/SPARK-9121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman resolved SPARK-9121. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7567 [https://github.com/apache/spark/pull/7567] Get rid of the warnings about `no visible global function definition` in SparkR --- Key: SPARK-9121 URL: https://issues.apache.org/jira/browse/SPARK-9121 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Yu Ishikawa Fix For: 1.5.0 We have a lot of warnings about {{no visible global function definition}} in SparkR. So we should get rid of them. {noformat} R/utils.R:513:5: warning: no visible global function definition for ‘processClosure’ processClosure(func.body, oldEnv, defVars, checkedFuncs, newEnv) ^~ {noformat}
[jira] [Updated] (SPARK-9121) Get rid of the warnings about `no visible global function definition` in SparkR
[ https://issues.apache.org/jira/browse/SPARK-9121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-9121: - Assignee: Yu Ishikawa Get rid of the warnings about `no visible global function definition` in SparkR --- Key: SPARK-9121 URL: https://issues.apache.org/jira/browse/SPARK-9121 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Yu Ishikawa Assignee: Yu Ishikawa Fix For: 1.5.0 We have a lot of warnings about {{no visible global function definition}} in SparkR. So we should get rid of them. {noformat} R/utils.R:513:5: warning: no visible global function definition for ‘processClosure’ processClosure(func.body, oldEnv, defVars, checkedFuncs, newEnv) ^~ {noformat}
[jira] [Commented] (SPARK-9230) SparkR RFormula should support StringType features
[ https://issues.apache.org/jira/browse/SPARK-9230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635776#comment-14635776 ] Shivaram Venkataraman commented on SPARK-9230: -- [~ekhliang] [~mengxr] One more thing that would be good to do is to make these formulas also work with actual columns in R. For example in DataFrames we parse columns with df$col_name. So it will be great to support a formula of the kind df$Sepal_Length ~ df$Sepal_Width SparkR RFormula should support StringType features -- Key: SPARK-9230 URL: https://issues.apache.org/jira/browse/SPARK-9230 Project: Spark Issue Type: New Feature Components: ML, SparkR Reporter: Eric Liang StringType features will need to be encoded using OneHotEncoder to be used for regression. See umbrella design doc https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit?usp=sharing
[jira] [Commented] (SPARK-9230) SparkR RFormula should support StringType features
[ https://issues.apache.org/jira/browse/SPARK-9230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635796#comment-14635796 ] Shivaram Venkataraman commented on SPARK-9230: -- The thing to do there would be to capture it as SparkR DataFrame columns. so df$Sepal_Width actually resolves to a Java column class and then we can parse those in RFormula -- So in some sense we'll have two constructors, one from strings and one from DataFrame columns. SparkR RFormula should support StringType features -- Key: SPARK-9230 URL: https://issues.apache.org/jira/browse/SPARK-9230 Project: Spark Issue Type: New Feature Components: ML, SparkR Reporter: Eric Liang StringType features will need to be encoded using OneHotEncoder to be used for regression. See umbrella design doc https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit?usp=sharing
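The two constructors discussed in these comments could look like this from the R side. The first form is the string-based formula being built for SparkR's glm(); the column-based second form is the proposal and is illustrative only:

```r
# String-based formula: column names are resolved against the DataFrame
# schema by RFormula on the JVM side
model <- glm(Sepal_Length ~ Sepal_Width, data = df, family = "gaussian")

# Proposed column-based formula: each df$col already resolves to a Java
# Column object, which RFormula would accept directly (hypothetical form)
model <- glm(df$Sepal_Length ~ df$Sepal_Width, data = df, family = "gaussian")
```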
[jira] [Created] (SPARK-9322) Add rbind as a synonym for `unionAll`
Shivaram Venkataraman created SPARK-9322: Summary: Add rbind as a synonym for `unionAll` Key: SPARK-9322 URL: https://issues.apache.org/jira/browse/SPARK-9322 Project: Spark Issue Type: Sub-task Reporter: Shivaram Venkataraman
[jira] [Resolved] (SPARK-8364) Add crosstab to SparkR DataFrames
[ https://issues.apache.org/jira/browse/SPARK-8364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman resolved SPARK-8364. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7318 [https://github.com/apache/spark/pull/7318] Add crosstab to SparkR DataFrames - Key: SPARK-8364 URL: https://issues.apache.org/jira/browse/SPARK-8364 Project: Spark Issue Type: New Feature Components: SparkR Reporter: Xiangrui Meng Assignee: Xiangrui Meng Fix For: 1.5.0 Add `crosstab` to SparkR DataFrames, which takes two column names and returns a local R data.frame. This is similar to `table` in R. However, `table` in SparkR is used for loading SQL tables as DataFrames. The return type is data.frame instead of table for `crosstab` to be compatible with Scala/Python.
[jira] [Resolved] (SPARK-8807) Add between operator in SparkR
[ https://issues.apache.org/jira/browse/SPARK-8807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman resolved SPARK-8807. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7356 [https://github.com/apache/spark/pull/7356] Add between operator in SparkR -- Key: SPARK-8807 URL: https://issues.apache.org/jira/browse/SPARK-8807 Project: Spark Issue Type: New Feature Components: SparkR Reporter: Yu Ishikawa Fix For: 1.5.0 Add between operator in SparkR ``` df$age between c(1, 2) ```
[jira] [Updated] (SPARK-8807) Add between operator in SparkR
[ https://issues.apache.org/jira/browse/SPARK-8807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-8807: - Assignee: Liang-Chi Hsieh Add between operator in SparkR -- Key: SPARK-8807 URL: https://issues.apache.org/jira/browse/SPARK-8807 Project: Spark Issue Type: New Feature Components: SparkR Reporter: Yu Ishikawa Assignee: Liang-Chi Hsieh Fix For: 1.5.0 Add between operator in SparkR ``` df$age between c(1, 2) ```
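Based on the snippet in the description, the resolved operator can be exercised along these lines — a sketch of the SparkR column API, assuming between() takes the column plus a length-two vector of bounds:

```r
# Keep rows whose age lies in the closed interval [1, 2]
filtered <- filter(df, between(df$age, c(1, 2)))
head(filtered)
```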
[jira] [Created] (SPARK-9053) Fix spaces around parens, infix operators etc.
Shivaram Venkataraman created SPARK-9053: Summary: Fix spaces around parens, infix operators etc. Key: SPARK-9053 URL: https://issues.apache.org/jira/browse/SPARK-9053 Project: Spark Issue Type: Sub-task Reporter: Shivaram Venkataraman We have a number of style errors which look like {code} Place a space before left parenthesis ... Put spaces around all infix operators. {code} However some of the warnings are spurious (for example, the space around the infix operator in {code} expect_equal(collect(select(df, hypot(df$a, df$b)))[4, "HYPOT(a, b)"], sqrt(4^2 + 8^2)) {code}). We should add an ignore rule for these spurious examples
[jira] [Updated] (SPARK-8808) Fix assignments in SparkR
[ https://issues.apache.org/jira/browse/SPARK-8808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-8808: - Assignee: Sun Rui Fix assignments in SparkR - Key: SPARK-8808 URL: https://issues.apache.org/jira/browse/SPARK-8808 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Yu Ishikawa Assignee: Sun Rui Fix For: 1.5.0 {noformat} inst/tests/test_binary_function.R:79:12: style: Use <-, not =, for assignment. mockFile = c("Spark is pretty.", "Spark is awesome.") {noformat}
[jira] [Resolved] (SPARK-8808) Fix assignments in SparkR
[ https://issues.apache.org/jira/browse/SPARK-8808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman resolved SPARK-8808. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7395 [https://github.com/apache/spark/pull/7395] Fix assignments in SparkR - Key: SPARK-8808 URL: https://issues.apache.org/jira/browse/SPARK-8808 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Yu Ishikawa Fix For: 1.5.0 {noformat} inst/tests/test_binary_function.R:79:12: style: Use <-, not =, for assignment. mockFile = c("Spark is pretty.", "Spark is awesome.") {noformat}
[jira] [Created] (SPARK-9052) Fix comments after curly braces
Shivaram Venkataraman created SPARK-9052: Summary: Fix comments after curly braces Key: SPARK-9052 URL: https://issues.apache.org/jira/browse/SPARK-9052 Project: Spark Issue Type: Sub-task Reporter: Shivaram Venkataraman Right now we have a number of style check errors of the form {code} Opening curly braces should never go on their own line and should always be followed by a new line. {code}
[jira] [Commented] (SPARK-9121) Get rid of the warnings about `no visible global function definition` in SparkR
[ https://issues.apache.org/jira/browse/SPARK-9121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14631590#comment-14631590 ] Shivaram Venkataraman commented on SPARK-9121: -- [~yuu.ishik...@gmail.com] I think I found a fix for this problem. If we include the SparkR package in dev/lint-r.R before we call lint_package then we don't get these errors {code} library(SparkR, lib.loc = paste(SPARK_ROOT_DIR, "/R", "/lib", sep = "")) {code} Get rid of the warnings about `no visible global function definition` in SparkR --- Key: SPARK-9121 URL: https://issues.apache.org/jira/browse/SPARK-9121 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Yu Ishikawa We have a lot of warnings about {{no visible global function definition}} in SparkR. So we should get rid of them. {noformat} R/utils.R:513:5: warning: no visible global function definition for ‘processClosure’ processClosure(func.body, oldEnv, defVars, checkedFuncs, newEnv) ^~ {noformat}
[jira] [Resolved] (SPARK-8596) Install and configure RStudio server on Spark EC2
[ https://issues.apache.org/jira/browse/SPARK-8596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman resolved SPARK-8596. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7366 [https://github.com/apache/spark/pull/7366] Install and configure RStudio server on Spark EC2 - Key: SPARK-8596 URL: https://issues.apache.org/jira/browse/SPARK-8596 Project: Spark Issue Type: Improvement Components: EC2, SparkR Reporter: Shivaram Venkataraman Fix For: 1.5.0 This will make it convenient for R users to use SparkR from their browsers
[jira] [Commented] (SPARK-8952) JsonFile() of SQLContext display improper warning message for a S3 path
[ https://issues.apache.org/jira/browse/SPARK-8952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14624781#comment-14624781 ] Shivaram Venkataraman commented on SPARK-8952: -- So the reason normalizePath exists is to make the local file paths work correctly (i.e things like ~/spark/README.md) -- Maybe we could have a function that does this on the Scala side but also verifies this with the Hadoop Configuration ? cc [~davies] JsonFile() of SQLContext display improper warning message for a S3 path --- Key: SPARK-8952 URL: https://issues.apache.org/jira/browse/SPARK-8952 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 1.4.0 Reporter: Sun Rui This is an issue reported by Ben Spark ben_spar...@yahoo.com.au. {quote} Spark 1.4 deployed on AWS EMR jsonFile is working though with some warning message Warning message: In normalizePath(path) : path[1]=s3://rea-consumer-data-dev/cbr/profiler/output/20150618/part-0: No such file or directory {quote}
[jira] [Updated] (SPARK-8596) Install and configure RStudio server on Spark EC2
[ https://issues.apache.org/jira/browse/SPARK-8596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-8596: - Assignee: Vincent Warmerdam Install and configure RStudio server on Spark EC2 - Key: SPARK-8596 URL: https://issues.apache.org/jira/browse/SPARK-8596 Project: Spark Issue Type: Improvement Components: EC2, SparkR Reporter: Shivaram Venkataraman Assignee: Vincent Warmerdam Fix For: 1.5.0 This will make it convenient for R users to use SparkR from their browsers
[jira] [Resolved] (SPARK-6797) Add support for YARN cluster mode
[ https://issues.apache.org/jira/browse/SPARK-6797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman resolved SPARK-6797. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 6743 [https://github.com/apache/spark/pull/6743] Add support for YARN cluster mode - Key: SPARK-6797 URL: https://issues.apache.org/jira/browse/SPARK-6797 Project: Spark Issue Type: Improvement Components: SparkR Reporter: Shivaram Venkataraman Assignee: Sun Rui Priority: Critical Fix For: 1.5.0 SparkR currently does not work in YARN cluster mode as the R package is not shipped along with the assembly jar to the YARN AM. We could try to use the support for archives in YARN to send out the R package as a zip file.
[jira] [Resolved] (SPARK-9201) Integrate MLlib with SparkR using RFormula
[ https://issues.apache.org/jira/browse/SPARK-9201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman resolved SPARK-9201. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7483 [https://github.com/apache/spark/pull/7483] Integrate MLlib with SparkR using RFormula -- Key: SPARK-9201 URL: https://issues.apache.org/jira/browse/SPARK-9201 Project: Spark Issue Type: New Feature Components: ML, SparkR Reporter: Eric Liang Assignee: Eric Liang Fix For: 1.5.0 We need to interface R glm() and predict() with mllib R formula support. Design doc from umbrella task: https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit
[jira] [Commented] (SPARK-9121) Get rid of the warnings about `no visible global function definition` in SparkR
[ https://issues.apache.org/jira/browse/SPARK-9121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14634626#comment-14634626 ] Shivaram Venkataraman commented on SPARK-9121: -- Yeah - we can just call `install-dev.sh` before running the lint script to make sure of that if this is required. Get rid of the warnings about `no visible global function definition` in SparkR --- Key: SPARK-9121 URL: https://issues.apache.org/jira/browse/SPARK-9121 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Yu Ishikawa We have a lot of warnings about {{no visible global function definition}} in SparkR. So we should get rid of them. {noformat} R/utils.R:513:5: warning: no visible global function definition for ‘processClosure’ processClosure(func.body, oldEnv, defVars, checkedFuncs, newEnv) ^~ {noformat}
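The ordering suggested in these comments — build the SparkR package first, then lint — amounts to the following setup fragment (script paths as laid out in the Spark source tree):

```shell
# Install the SparkR package so lintr can resolve its exported functions,
# then run the R style/lint check
./R/install-dev.sh
./dev/lint-r
```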
[jira] [Commented] (SPARK-10219) Error when additional options provided as variable in write.df
[ https://issues.apache.org/jira/browse/SPARK-10219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14711642#comment-14711642 ] Shivaram Venkataraman commented on SPARK-10219: --- I think thats happening because `mode` is actually an argument name that is taken in by the write.df method -- So I am not sure you need option=mode, but just mode=mode or mode="append" should work ? Error when additional options provided as variable in write.df -- Key: SPARK-10219 URL: https://issues.apache.org/jira/browse/SPARK-10219 Project: Spark Issue Type: Bug Components: R Affects Versions: 1.4.0 Environment: SparkR shell Reporter: Samuel Alexander Labels: spark-shell, sparkR Opened a SparkR shell Created a df using df <- jsonFile(sqlContext, "examples/src/main/resources/people.json") Assigned a variable like below mode <- "append" When write.df called using below statement got the mentioned error write.df(df, source="org.apache.spark.sql.parquet", path="par_path", option=mode) Error in writeType(con, type) : Unsupported type for serialization name Whereas mode is passed as "append" itself, i.e. not via mode variable as below, everything works fine write.df(df, source="org.apache.spark.sql.parquet", path="par_path", option="append") Note: For parquet it is not needed to have option. But we are using Spark Salesforce package (http://spark-packages.org/package/springml/spark-salesforce) which requires additional options to be passed.
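Following the comment above, the variable should go through write.df's own mode argument rather than a generic option — a sketch against the SparkR 1.4-era write.df API, with an illustrative path:

```r
mode <- "append"

# `mode` is a formal argument of write.df, so pass the variable directly;
# source-specific settings go in as additional named arguments
write.df(df, path = "par_path", source = "org.apache.spark.sql.parquet",
         mode = mode)
```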
[jira] [Created] (SPARK-10214) Improve SparkR Column, DataFrame API docs
Shivaram Venkataraman created SPARK-10214: - Summary: Improve SparkR Column, DataFrame API docs Key: SPARK-10214 URL: https://issues.apache.org/jira/browse/SPARK-10214 Project: Spark Issue Type: Documentation Components: SparkR Reporter: Shivaram Venkataraman Right now the docs for functions like `agg` and `filter` have duplicate entries like `agg-method` and `filter-method` etc. We should use the `name` Rd tag and remove these duplicates.
[jira] [Commented] (SPARK-10214) Improve SparkR Column, DataFrame API docs
[ https://issues.apache.org/jira/browse/SPARK-10214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14710404#comment-14710404 ] Shivaram Venkataraman commented on SPARK-10214: --- cc [~yuu.ishik...@gmail.com] Improve SparkR Column, DataFrame API docs - Key: SPARK-10214 URL: https://issues.apache.org/jira/browse/SPARK-10214 Project: Spark Issue Type: Documentation Components: SparkR Reporter: Shivaram Venkataraman Right now the docs for functions like `agg` and `filter` have duplicate entries like `agg-method` and `filter-method` etc. We should use the `name` Rd tag and remove these duplicates.
[jira] [Updated] (SPARK-10118) Improve SparkR API docs for 1.5 release
[ https://issues.apache.org/jira/browse/SPARK-10118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-10118: -- Assignee: Yu Ishikawa Improve SparkR API docs for 1.5 release --- Key: SPARK-10118 URL: https://issues.apache.org/jira/browse/SPARK-10118 Project: Spark Issue Type: Documentation Components: Documentation, SparkR Reporter: Shivaram Venkataraman Assignee: Yu Ishikawa Fix For: 1.5.0 This includes checking if the new DataFrame functions expression show up appropriately in the roxygen docs
[jira] [Resolved] (SPARK-10118) Improve SparkR API docs for 1.5 release
[ https://issues.apache.org/jira/browse/SPARK-10118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman resolved SPARK-10118. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 8386 [https://github.com/apache/spark/pull/8386] Improve SparkR API docs for 1.5 release --- Key: SPARK-10118 URL: https://issues.apache.org/jira/browse/SPARK-10118 Project: Spark Issue Type: Documentation Components: Documentation, SparkR Reporter: Shivaram Venkataraman Fix For: 1.5.0 This includes checking if the new DataFrame functions expression show up appropriately in the roxygen docs
[jira] [Commented] (SPARK-11255) R Test build should run on R 3.1.1
[ https://issues.apache.org/jira/browse/SPARK-11255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14972765#comment-14972765 ] Shivaram Venkataraman commented on SPARK-11255: --- cc [~shaneknapp] > R Test build should run on R 3.1.1 > -- > > Key: SPARK-11255 > URL: https://issues.apache.org/jira/browse/SPARK-11255 > Project: Spark > Issue Type: Bug > Components: SparkR >Reporter: Felix Cheung >Priority: Minor > > Test should run on R 3.1.1 which is the version listed as supported. > Apparently there are few R changes that can go undetected since Jenkins Test > build is running something newer.
[jira] [Commented] (SPARK-11231) join returns schema with duplicated and ambiguous join columns
[ https://issues.apache.org/jira/browse/SPARK-11231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14967364#comment-14967364 ]

Shivaram Venkataraman commented on SPARK-11231:
-----------------------------------------------

[~Narine] [~sunrui] Is this covered by https://github.com/apache/spark/pull/9012, or does this require some changes on the Scala side? cc [~davies]

> join returns schema with duplicated and ambiguous join columns
> --------------------------------------------------------------
>
>              Key: SPARK-11231
>              URL: https://issues.apache.org/jira/browse/SPARK-11231
>          Project: Spark
>       Issue Type: Bug
>       Components: SparkR
> Affects Versions: 1.5.1
>      Environment: R
>         Reporter: Matt Pollock
>
> In the case where the key columns of two data frames are named the same
> thing, join returns a data frame in which that column is duplicated. Since
> the content of the columns is guaranteed to be the same row by row,
> consolidating the identical columns into a single column would replicate
> standard R behavior[1] and help prevent ambiguous names.
>
> Example:
> {code}
> > df1 <- data.frame(key=c("A", "B", "C"), value1=c(1, 2, 3))
> > df2 <- data.frame(key=c("A", "B", "C"), value2=c(4, 5, 6))
> > sdf1 <- createDataFrame(sqlContext, df1)
> > sdf2 <- createDataFrame(sqlContext, df2)
> > sjdf <- join(sdf1, sdf2, sdf1$key == sdf2$key, "inner")
> > schema(sjdf)
> StructType
> |-name = "key", type = "StringType", nullable = TRUE
> |-name = "value1", type = "DoubleType", nullable = TRUE
> |-name = "key", type = "StringType", nullable = TRUE
> |-name = "value2", type = "DoubleType", nullable = TRUE
> {code}
>
> The duplicated key columns cause things like:
> {code}
> > library(magrittr)
> > sjdf %>% select("key")
> 15/10/21 11:04:28 ERROR r.RBackendHandler: select on 1414 failed
> Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
>   org.apache.spark.sql.AnalysisException: Reference 'key' is ambiguous, could
>   be: key#125, key#127.;
>   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:278)
>   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:162)
>   at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4$$anonfun$20.apply(Analyzer.scala:403)
>   at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4$$anonfun$20.apply(Analyzer.scala:403)
>   at org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
>   at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:403)
>   at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:399)
>   at org.apache.spark.sql.catalyst.tree
> {code}
>
> [1] In base R there is no "join", but a similar function, "merge", is
> provided, in which a "by" argument identifies the shared key column in the
> two data frames. In the case where the key column names differ, "by.x" and
> "by.y" arguments can be used. In the case of same-named key columns, the
> consolidation behavior requested above is observed. In the case of differing
> names, the "by.x" name is retained and consolidated with the "by.y" column,
> which is dropped.
> {code}
> > df1 <- data.frame(key=c("A", "B", "C"), value1=c(1, 2, 3))
> > df2 <- data.frame(key=c("A", "B", "C"), value2=c(4, 5, 6))
> > merge(df1, df2, by="key")
>   key value1 value2
> 1   A      1      4
> 2   B      2      5
> 3   C      3      6
> > df3 <- data.frame(akey=c("A", "B", "C"), value1=c(1, 2, 3))
> > merge(df2, df3, by.x="key", by.y="akey")
>   key value2 value1
> 1   A      4      1
> 2   B      5      2
> 3   C      6      3
> > merge(df3, df2, by.x="akey", by.y="key")
>   akey value1 value2
> 1    A      1      4
> 2    B      2      5
> 3    C      3      6
> {code}
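The consolidation behavior the reporter describes for base R's merge() can be sketched outside of R as well. The following is an illustrative Python sketch (not Spark or SparkR code; `merge_on_key` is a hypothetical helper) of an inner join on a shared key that emits the key column exactly once, rather than duplicating it as SparkR's join does:

```python
# Illustrative sketch of the column-consolidation behavior base R's merge()
# performs: when two tables share a key column, the joined result keeps a
# single copy of that column instead of two ambiguous copies.
def merge_on_key(rows1, rows2, key):
    """Inner-join two lists of dicts on `key`, emitting the key once."""
    index = {r[key]: r for r in rows2}
    joined = []
    for r in rows1:
        match = index.get(r[key])
        if match is not None:
            row = dict(r)  # key + columns from the left table
            # Copy right-table columns, skipping the duplicate key.
            row.update({k: v for k, v in match.items() if k != key})
            joined.append(row)
    return joined

df1 = [{"key": k, "value1": v} for k, v in zip("ABC", (1, 2, 3))]
df2 = [{"key": k, "value2": v} for k, v in zip("ABC", (4, 5, 6))]
# Each result row has exactly one "key" field, mirroring merge(df1, df2, by="key").
print(merge_on_key(df1, df2, "key"))
```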
[jira] [Commented] (SPARK-11244) sparkR.stop doesn't clean up .sparkRSQLsc in environment
[ https://issues.apache.org/jira/browse/SPARK-11244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14967778#comment-14967778 ]

Shivaram Venkataraman commented on SPARK-11244:
-----------------------------------------------

Good catch -- could you send a PR for this?

> sparkR.stop doesn't clean up .sparkRSQLsc in environment
> --------------------------------------------------------
>
>              Key: SPARK-11244
>              URL: https://issues.apache.org/jira/browse/SPARK-11244
>          Project: Spark
>       Issue Type: Bug
>       Components: SparkR
> Affects Versions: 1.5.1
>         Reporter: Sen Fang
>
> Currently {{sparkR.stop}} removes the relevant variables from
> {{.sparkREnv}} for the SparkContext and the backend. However, it doesn't
> clean up {{.sparkRSQLsc}} and {{.sparkRHivesc}}.
>
> As a result,
> {code}
> sc <- sparkR.init("local")
> sqlContext <- sparkRSQL.init(sc)
> sparkR.stop()
> sc <- sparkR.init("local")
> sqlContext <- sparkRSQL.init(sc)
> sqlContext
> {code}
> produces
> {code}
> Error in callJMethod(x, "getClass") :
>   Invalid jobj 1. If SparkR was restarted, Spark operations need to be
>   re-executed.
> {code}
[jira] [Commented] (SPARK-11238) SparkR: Documentation change for merge function
[ https://issues.apache.org/jira/browse/SPARK-11238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14967807#comment-14967807 ]

Shivaram Venkataraman commented on SPARK-11238:
-----------------------------------------------

Also, we should mark this as a breaking API change from 1.5 in the release notes. cc [~rxin] [~pwendell]

> SparkR: Documentation change for merge function
> -----------------------------------------------
>
>          Key: SPARK-11238
>          URL: https://issues.apache.org/jira/browse/SPARK-11238
>      Project: Spark
>   Issue Type: Sub-task
>   Components: SparkR
>     Reporter: Narine Kokhlikyan
>
> As discussed in pull request https://github.com/apache/spark/pull/9012, the
> signature of the merge function will be changed; therefore, a documentation
> change is required.
[jira] [Updated] (SPARK-11294) Improve R doc for read.df, write.df, saveAsTable
[ https://issues.apache.org/jira/browse/SPARK-11294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shivaram Venkataraman updated SPARK-11294:
------------------------------------------
 Fix Version/s:     (was: 1.5.2)
                1.5.3

> Improve R doc for read.df, write.df, saveAsTable
> ------------------------------------------------
>
>              Key: SPARK-11294
>              URL: https://issues.apache.org/jira/browse/SPARK-11294
>          Project: Spark
>       Issue Type: Bug
>       Components: SparkR
> Affects Versions: 1.5.1
>         Reporter: Felix Cheung
>         Assignee: Felix Cheung
>         Priority: Minor
>          Fix For: 1.5.3, 1.6.0
>
> The API doc lacks examples and has several formatting issues.
[jira] [Updated] (SPARK-11258) Converting a Spark DataFrame into an R data.frame is slow / requires a lot of memory
[ https://issues.apache.org/jira/browse/SPARK-11258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shivaram Venkataraman updated SPARK-11258:
------------------------------------------
    Assignee: Frank Rosner

> Converting a Spark DataFrame into an R data.frame is slow / requires a lot
> of memory
> --------------------------------------------------------------------------
>
>              Key: SPARK-11258
>              URL: https://issues.apache.org/jira/browse/SPARK-11258
>          Project: Spark
>       Issue Type: Improvement
>       Components: SparkR
> Affects Versions: 1.5.1
>         Reporter: Frank Rosner
>         Assignee: Frank Rosner
>          Fix For: 1.6.0
>
> h4. Problem
> We tried to collect a DataFrame with more than 1 million rows and a few
> hundred columns in SparkR. This took a huge amount of time (much more than
> in the Spark REPL). Looking into the code, I found that the
> {{org.apache.spark.sql.api.r.SQLUtils.dfToCols}} method does some map and
> then {{.toArray}}, which might cause the problem.
>
> h4. Solution
> Directly transpose the row-wise representation to the column-wise
> representation with one pass through the data. I will create a pull request
> for this.
>
> h4. Runtime comparison
> On a test data frame with 1 million rows and 22 columns, the old
> {{dfToCols}} method takes 2267 ms on average to complete. My implementation
> takes only 554 ms on average. This effect might be due to garbage
> collection, especially if you consider that the old implementation didn't
> complete on an even bigger data frame.
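The single-pass transpose described in the solution above can be sketched compactly. This is an illustrative Python analogue (not the Scala of `SQLUtils.dfToCols`; `rows_to_cols` is a hypothetical name) of building the column-wise representation in one pass over the rows, instead of extracting one column at a time:

```python
# Minimal sketch: transpose a row-wise table to column-wise in a single pass.
def rows_to_cols(rows, ncols):
    cols = [[] for _ in range(ncols)]  # one output list per column
    for row in rows:                   # one pass through the data
        for j, cell in enumerate(row):
            cols[j].append(cell)
    return cols

rows = [(1, "a"), (2, "b"), (3, "c")]
print(rows_to_cols(rows, 2))  # → [[1, 2, 3], ['a', 'b', 'c']]
```

Each cell is touched exactly once, which avoids the repeated per-column scans that the original map-then-{{.toArray}} approach implies.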
[jira] [Updated] (SPARK-10979) SparkR: Add merge to DataFrame
[ https://issues.apache.org/jira/browse/SPARK-10979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shivaram Venkataraman updated SPARK-10979:
------------------------------------------
    Assignee: Narine Kokhlikyan

> SparkR: Add merge to DataFrame
> ------------------------------
>
>          Key: SPARK-10979
>          URL: https://issues.apache.org/jira/browse/SPARK-10979
>      Project: Spark
>   Issue Type: Sub-task
>   Components: SparkR
>     Reporter: Narine Kokhlikyan
>     Assignee: Narine Kokhlikyan
>      Fix For: 1.6.0
>
> Add a merge function to DataFrame which supports the R merge signature.
> https://stat.ethz.ch/R-manual/R-devel/library/base/html/merge.html
[jira] [Resolved] (SPARK-10979) SparkR: Add merge to DataFrame
[ https://issues.apache.org/jira/browse/SPARK-10979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shivaram Venkataraman resolved SPARK-10979.
-------------------------------------------
    Resolution: Fixed
 Fix Version/s: 1.6.0

Issue resolved by pull request 9012
[https://github.com/apache/spark/pull/9012]

> SparkR: Add merge to DataFrame
> ------------------------------
>
>          Key: SPARK-10979
>          URL: https://issues.apache.org/jira/browse/SPARK-10979
>      Project: Spark
>   Issue Type: Sub-task
>   Components: SparkR
>     Reporter: Narine Kokhlikyan
>      Fix For: 1.6.0
>
> Add a merge function to DataFrame which supports the R merge signature.
> https://stat.ethz.ch/R-manual/R-devel/library/base/html/merge.html