[jira] [Commented] (SPARK-23650) Slow SparkR udf (dapply)

2018-03-14 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16398174#comment-16398174
 ] 

Felix Cheung commented on SPARK-23650:
--

I see one RRunner - do you have more of the log?

> Slow SparkR udf (dapply)
> 
>
> Key: SPARK-23650
> URL: https://issues.apache.org/jira/browse/SPARK-23650
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Shell, SparkR, Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Deepansh
>Priority: Major
> Attachments: sparkRlag.txt
>
>
> For example, I am getting streams from Kafka and I want to apply a model built 
> in R to those streams. For this, I am using dapply.
> My code is:
> iris_model <- readRDS("./iris_model.rds")
> randomBr <- SparkR:::broadcast(sc, iris_model)
> kafka <- read.stream("kafka", subscribe = "source",
>                      kafka.bootstrap.servers = "localhost:9092", topic = "source")
> lines <- select(kafka, cast(kafka$value, "string"))
> schema <- schema(lines)
> df1 <- dapply(lines, function(x) {
>   i_model <- SparkR:::value(randomBr)
>   for (row in 1:nrow(x)) {
>     y <- fromJSON(as.character(x[row, "value"]))
>     y$predict <- predict(i_model, y)
>     y <- toJSON(y)
>     x[row, "value"] <- y
>   }
>   x
> }, schema)
> Every time Kafka streams are fetched, the dapply method creates a new 
> runner thread and ships the variables again, which causes a huge lag (~2s for 
> shipping the model) each time. I even tried without broadcast variables, but it 
> takes the same time to ship the variables. Can some other technique be applied to 
> improve its performance?
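
For reference, a minimal sketch of one possible workaround (not the reporter's code): keep the serialized model at a path that is readable on every worker (the /tmp path below is a placeholder) and load it inside the UDF, so the model is not shipped from the driver with each micro-batch. This assumes jsonlite is installed on the worker nodes, and it does not remove the cost of starting a new R runner per batch.

{code:r}
library(SparkR)

df1 <- dapply(lines, function(x) {
  # Load the model from a path visible to the executor instead of pulling a
  # broadcast variable that is shipped from the driver on every micro-batch.
  i_model <- readRDS("/tmp/iris_model.rds")
  x$value <- vapply(as.character(x$value), function(v) {
    y <- jsonlite::fromJSON(v)
    y$predict <- predict(i_model, y)
    as.character(jsonlite::toJSON(y))
  }, character(1), USE.NAMES = FALSE)
  x
}, schema)
{code}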



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23632) sparkR.session() error with spark packages - JVM is not ready after 10 seconds

2018-03-13 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16397213#comment-16397213
 ] 

Felix Cheung commented on SPARK-23632:
--

could you explain how you think these environment variables can help?

> sparkR.session() error with spark packages - JVM is not ready after 10 seconds
> --
>
> Key: SPARK-23632
> URL: https://issues.apache.org/jira/browse/SPARK-23632
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0, 2.2.1, 2.3.0
>Reporter: Jaehyeon Kim
>Priority: Minor
>
> Hi
> When I execute _sparkR.session()_ with _org.apache.hadoop:hadoop-aws:2.8.2_ 
> as follows,
> {code:java}
> library(SparkR, lib.loc=file.path(Sys.getenv('SPARK_HOME'),'R', 'lib'))
> ext_opts <- '-Dhttp.proxyHost=10.74.1.25 -Dhttp.proxyPort=8080 
> -Dhttps.proxyHost=10.74.1.25 -Dhttps.proxyPort=8080'
> sparkR.session(master = "spark://master:7077",
>appName = 'ml demo',
>sparkConfig = list(spark.driver.memory = '2g'), 
>sparkPackages = 'org.apache.hadoop:hadoop-aws:2.8.2',
>spark.driver.extraJavaOptions = ext_opts)
> {code}
> I see *JVM is not ready after 10 seconds* error. Below shows some of the log 
> messages.
> {code:java}
> Ivy Default Cache set to: /home/rstudio/.ivy2/cache
> The jars for the packages stored in: /home/rstudio/.ivy2/jars
> :: loading settings :: url = 
> jar:file:/usr/local/spark-2.2.1/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
> org.apache.hadoop#hadoop-aws added as a dependency
> :: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
>   confs: [default]
>   found org.apache.hadoop#hadoop-aws;2.8.2 in central
> ...
> ...
>   found javax.servlet.jsp#jsp-api;2.1 in central
> Error in sparkR.sparkContext(master, appName, sparkHome, sparkConfigMap,  : 
>   JVM is not ready after 10 seconds
> ...
> ...
>   found joda-time#joda-time;2.9.4 in central
> downloading 
> https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.8.2/hadoop-aws-2.8.2.jar
>  ...
> ...
> ...
>   xmlenc#xmlenc;0.52 from central in [default]
>   -
>   |  |modules||   artifacts   |
>   |   conf   | number| search|dwnlded|evicted|| number|dwnlded|
>   -
>   |  default |   76  |   76  |   76  |   0   ||   76  |   76  |
>   -
> :: retrieving :: org.apache.spark#spark-submit-parent
>   confs: [default]
>   76 artifacts copied, 0 already retrieved (27334kB/56ms)
> {code}
> It's fine if I re-execute it after the package and its dependencies are 
> downloaded.
> I believe it's because of this part - 
> https://github.com/apache/spark/blob/master/R/pkg/R/sparkR.R#L181
> {code:java}
> if (!file.exists(path)) {
>   stop("JVM is not ready after 10 seconds")
> }
> {code}
> I just wonder if it may be possible to update this so that a user can determine how 
> long to wait?
> Thanks.
> Regards
> Jaehyeon



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23618) docker-image-tool.sh Fails While Building Image

2018-03-13 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16397211#comment-16397211
 ] 

Felix Cheung commented on SPARK-23618:
--

[~foxish] - Jira has a different user role system; I've added you to the right 
role now. Could you try again?

 

> docker-image-tool.sh Fails While Building Image
> ---
>
> Key: SPARK-23618
> URL: https://issues.apache.org/jira/browse/SPARK-23618
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Ninad Ingole
>Priority: Major
>
> I am trying to build the Kubernetes image for version 2.3.0 using 
> {code:java}
> ./bin/docker-image-tool.sh -r ninadingole/spark-docker -t v2.3.0 build
> {code}
> which gives me the following docker build error:
> {code:java}
> "docker build" requires exactly 1 argument.
> See 'docker build --help'.
> Usage: docker build [OPTIONS] PATH | URL | - [flags]
> Build an image from a Dockerfile
> {code}
>  
> I am executing the command within the Spark distribution directory. Please let me 
> know what the issue is.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: [VOTE] Accept Pinot into Apache Incubator

2018-03-13 Thread Felix Cheung
+1

On Sun, Mar 11, 2018 at 5:34 AM Willem Jiang  wrote:

> +1 (binding)
>
>
> Willem Jiang
>
> Blog: http://willemjiang.blogspot.com (English)
>   http://jnn.iteye.com  (Chinese)
> Twitter: willemjiang
> Weibo: 姜宁willem
>
> On Sun, Mar 11, 2018 at 7:51 PM, Pierre Smits 
> wrote:
>
> > +1
> >
> >
> >
> > Best regards,
> >
> > Pierre Smits
> >
> > V.P. Apache Trafodion
> >
> > On Sun, Mar 11, 2018 at 12:59 AM, Julian Hyde  wrote:
> >
> > > +1 (binding)
> > >
> > > Ironic that Druid — a similar project — has just entered incubation
> too.
> > > But of course that is not a conflict. Both are great projects. Good
> luck!
> > >
> > > Julian
> > >
> > >
> > > > On Mar 9, 2018, at 7:37 PM, Carl Steinbach  wrote:
> > > >
> > > > +1 (binding)
> > > >
> > > > On Fri, Mar 9, 2018, 7:29 PM kishore g  wrote:
> > > >
> > > >> Added Jim Jagielski to the mentor's list.
> > > >>
> > > >> On Fri, Mar 9, 2018 at 6:35 PM, Olivier Lamy 
> > wrote:
> > > >>
> > > >>> +1
> > > >>>
> > > >>> On 9 March 2018 at 17:11, kishore g  wrote:
> > > >>>
> > >  Hi all,
> > > 
> > >  I would like to call a VOTE to accept Pinot into the Apache
> > Incubator.
> > > >>> The
> > >  full proposal is available on the wiki
> > >  
> > > 
> > >  Please cast your vote:
> > > 
> > >   [ ] +1, bring Pinot into Incubator
> > >   [ ] +0, I don't care either way,
> > >   [ ] -1, do not bring Pinot into Incubator, because...
> > > 
> > >  The vote will open at least for 72 hours and only votes from the
> > > >>> Incubator
> > >  PMC are binding.
> > > 
> > >  Thanks,
> > >  Kishore G
> > > 
> > >  Discussion thread:
> > > 
> https://lists.apache.org/thread.html/8119f9478ea1811371f1bf6685290b
> > >  22b57b1a3e0849d1d778d77dcb@%3Cgeneral.incubator.apache.org
> > > 
> > > 
> > >  = Pinot Proposal =
> > > 
> > >  == Abstract ==
> > > 
> > >  Pinot is a distributed columnar storage engine that can ingest
> data
> > in
> > >  real-time and serve analytical queries at low latency. There are
> two
> > > >>> modes
> > >  of data ingestion - batch and/or realtime. Batch mode allows users
> > to
> > >  generate pinot segments externally using systems such as Hadoop.
> > These
> > >  segments can be uploaded into Pinot via simple curl calls. Pinot
> can
> > > >>> ingest
> > >  data in near real-time from streaming sources such as Kafka. Data
> > > >>> ingested
> > >  into Pinot is stored in a columnar format. Pinot provides a SQL
> like
> > >  interface (PQL) that supports filters, aggregations, and group by
> > >  operations. It does not support joins by design, in order to
> > guarantee
> > >  predictable latency. It leverages other Apache projects such as
> > > >>> Zookeeper,
> > >  Kafka, and Helix, along with many libraries from the ASF.
> > > 
> > >  == Proposal ==
> > > 
> > >  Pinot was open sourced by LinkedIn and hosted on GitHub. Majority
> of
> > > >> the
> > >  development happens at LinkedIn with other contributions from Uber
> > and
> > >  Slack. We believe that being a part of Apache Software Foundation
> > will
> > >  improve the diversity and help form a strong community around the
> > > >>> project.
> > > 
> > >  LinkedIn submits this proposal to donate the code base to Apache
> > > >> Software
> > >  Foundation. The code is already under Apache License 2.0.  Code
> and
> > > the
> > >  documentation are hosted on Github.
> > >  * Code: http://github.com/linkedin/pinot
> > >  * Documentation: https://github.com/linkedin/pinot/wiki
> > > 
> > > 
> > >  == Background ==
> > > 
> > >  LinkedIn, similar to other companies, has many applications that
> > > >> provide
> > >  rich real-time insights to members and customers (internal and
> > > >> external).
> > >  The workload characteristics for these applications vary a lot.
> Some
> > >  internal applications simply need ad-hoc query capabilities with
> > > >>> sub-second
> > >  to multiple seconds latency. But external site facing applications
> > > >>> require
> > >  strong SLA even very high workloads. Prior to Pinot, LinkedIn had
> > > >>> multiple
> > >  solutions depending on the workload generated by the application
> and
> > > >> this
> > >  was inefficient. Pinot was developed to be the one single platform
> > > that
> > >  addresses all classes of applications. Today at LinkedIn, Pinot
> > powers
> > > >>> more
> > >  than 50 site facing products with workload ranging from few
> queries
> > > per
> > >  second to 1000’s of queries per second while maintaining the 99th
> > >  percentile latency which can be as low as few milliseconds. All
> > > >> internal
> > >  dashboards at LinkedIn are powered by Pinot.
> > > 
> > >  == Rationale ==
> > > 
> > >  We b

[jira] [Commented] (SPARK-23650) Slow SparkR udf (dapply)

2018-03-12 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16396607#comment-16396607
 ] 

Felix Cheung commented on SPARK-23650:
--

which system/platform are you running on?

> Slow SparkR udf (dapply)
> 
>
> Key: SPARK-23650
> URL: https://issues.apache.org/jira/browse/SPARK-23650
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Shell, SparkR, Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Deepansh
>Priority: Major
>
> For example, I am getting streams from Kafka and I want to apply a model built 
> in R to those streams. For this, I am using dapply.
> My code is:
> iris_model <- readRDS("./iris_model.rds")
> randomBr <- SparkR:::broadcast(sc, iris_model)
> kafka <- read.stream("kafka", subscribe = "source",
>                      kafka.bootstrap.servers = "localhost:9092", topic = "source")
> lines <- select(kafka, cast(kafka$value, "string"))
> schema <- schema(lines)
> df1 <- dapply(lines, function(x) {
>   i_model <- SparkR:::value(randomBr)
>   for (row in 1:nrow(x)) {
>     y <- fromJSON(as.character(x[row, "value"]))
>     y$predict <- predict(i_model, y)
>     y <- toJSON(y)
>     x[row, "value"] <- y
>   }
>   x
> }, schema)
> Every time Kafka streams are fetched, the dapply method creates a new 
> runner thread and ships the variables again, which causes a huge lag (~2s for 
> shipping the model) each time. I even tried without broadcast variables, but it 
> takes the same time to ship the variables. Can some other technique be applied to 
> improve its performance?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23632) sparkR.session() error with spark packages - JVM is not ready after 10 seconds

2018-03-12 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16396606#comment-16396606
 ] 

Felix Cheung commented on SPARK-23632:
--

Well, if the package download is taking that long, then there isn't much we can 
do on the R side. Perhaps the spark-packages code could be changed to perform the 
download asynchronously, but then we would still need to check whether the JVM is 
ready.

We could also increase the timeout, but when the JVM is unresponsive a short 
timeout is useful to provide quick termination.
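
For illustration only, a minimal sketch of what a configurable wait could look like, assuming path is the backend port file written by spark-submit (as in the snippet quoted below) and using a hypothetical SPARKR_JVM_WAIT_SECONDS environment variable; this is not the actual sparkR.R code.

{code:r}
# Hypothetical override of the default 10-second wait.
wait_secs <- as.numeric(Sys.getenv("SPARKR_JVM_WAIT_SECONDS", "10"))
deadline <- Sys.time() + wait_secs

# Poll for the port file the JVM backend writes, rather than failing on a
# single fixed check.
while (!file.exists(path) && Sys.time() < deadline) {
  Sys.sleep(0.1)
}
if (!file.exists(path)) {
  stop("JVM is not ready after ", wait_secs, " seconds")
}
{code}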

> sparkR.session() error with spark packages - JVM is not ready after 10 seconds
> --
>
> Key: SPARK-23632
> URL: https://issues.apache.org/jira/browse/SPARK-23632
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0, 2.2.1, 2.3.0
>Reporter: Jaehyeon Kim
>Priority: Minor
>
> Hi
> When I execute _sparkR.session()_ with _org.apache.hadoop:hadoop-aws:2.8.2_ 
> as follows,
> {code:java}
> library(SparkR, lib.loc=file.path(Sys.getenv('SPARK_HOME'),'R', 'lib'))
> ext_opts <- '-Dhttp.proxyHost=10.74.1.25 -Dhttp.proxyPort=8080 
> -Dhttps.proxyHost=10.74.1.25 -Dhttps.proxyPort=8080'
> sparkR.session(master = "spark://master:7077",
>appName = 'ml demo',
>sparkConfig = list(spark.driver.memory = '2g'), 
>sparkPackages = 'org.apache.hadoop:hadoop-aws:2.8.2',
>spark.driver.extraJavaOptions = ext_opts)
> {code}
> I see *JVM is not ready after 10 seconds* error. Below shows some of the log 
> messages.
> {code:java}
> Ivy Default Cache set to: /home/rstudio/.ivy2/cache
> The jars for the packages stored in: /home/rstudio/.ivy2/jars
> :: loading settings :: url = 
> jar:file:/usr/local/spark-2.2.1/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
> org.apache.hadoop#hadoop-aws added as a dependency
> :: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
>   confs: [default]
>   found org.apache.hadoop#hadoop-aws;2.8.2 in central
> ...
> ...
>   found javax.servlet.jsp#jsp-api;2.1 in central
> Error in sparkR.sparkContext(master, appName, sparkHome, sparkConfigMap,  : 
>   JVM is not ready after 10 seconds
> ...
> ...
>   found joda-time#joda-time;2.9.4 in central
> downloading 
> https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.8.2/hadoop-aws-2.8.2.jar
>  ...
> ...
> ...
>   xmlenc#xmlenc;0.52 from central in [default]
>   -
>   |  |modules||   artifacts   |
>   |   conf   | number| search|dwnlded|evicted|| number|dwnlded|
>   -
>   |  default |   76  |   76  |   76  |   0   ||   76  |   76  |
>   -
> :: retrieving :: org.apache.spark#spark-submit-parent
>   confs: [default]
>   76 artifacts copied, 0 already retrieved (27334kB/56ms)
> {code}
> It's fine if I re-execute it after the package and its dependencies are 
> downloaded.
> I believe it's because of this part - 
> https://github.com/apache/spark/blob/master/R/pkg/R/sparkR.R#L181
> {code:java}
> if (!file.exists(path)) {
>   stop("JVM is not ready after 10 seconds")
> }
> {code}
> I just wonder if it may be possible to update this so that a user can determine how 
> long to wait?
> Thanks.
> Regards
> Jaehyeon



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: [DISCUSS] Apache Pinot Incubator Proposal

2018-03-09 Thread Felix Cheung
Hi Kishore - do you need one more mentor?


On Tue, Feb 13, 2018 at 12:10 AM kishore g  wrote:

> Hello,
>
> I would like to propose Pinot as an Apache Incubator project. The proposal
> is available as a draft at https://wiki.apache.org/incubator/PinotProposal.
> I
> have also included the text of the proposal below.
>
> Any feedback from the community is much appreciated.
>
> Regards,
> Kishore G
>
> = Pinot Proposal =
>
> == Abstract ==
>
> Pinot is a distributed columnar storage engine that can ingest data in
> real-time and serve analytical queries at low latency. There are two modes
> of data ingestion: batch and real-time. Batch mode allows users to
> generate Pinot segments externally using systems such as Hadoop. These
> segments can be uploaded into Pinot via simple curl calls. Pinot can ingest
> data in near real-time from streaming sources such as Kafka. Data ingested
> into Pinot is stored in a columnar format. Pinot provides a SQL-like
> interface (PQL) that supports filters, aggregations, and group-by
> operations. It does not support joins by design, in order to guarantee
> predictable latency. It leverages other Apache projects such as Zookeeper,
> Kafka, and Helix, along with many libraries from the ASF.
>
> == Proposal ==
>
> Pinot was open sourced by LinkedIn and is hosted on GitHub. The majority of the
> development happens at LinkedIn, with other contributions from Uber and
> Slack. We believe that being a part of the Apache Software Foundation will
> improve the project's diversity and help form a strong community around it.
>
> LinkedIn submits this proposal to donate the code base to the Apache Software
> Foundation. The code is already under the Apache License 2.0. Code and
> documentation are hosted on GitHub.
>  * Code: http://github.com/linkedin/pinot
>  * Documentation: https://github.com/linkedin/pinot/wiki
>
>
> == Background ==
>
> LinkedIn, similar to other companies, has many applications that provide
> rich real-time insights to members and customers (internal and external).
> The workload characteristics for these applications vary a lot. Some
> internal applications simply need ad-hoc query capabilities with sub-second
> to multi-second latency, but external, site-facing applications require
> strong SLAs even under very high workloads. Prior to Pinot, LinkedIn had multiple
> solutions depending on the workload generated by the application and this
> was inefficient. Pinot was developed to be the one single platform that
> addresses all classes of applications. Today at LinkedIn, Pinot powers more
> than 50 site-facing products, with workloads ranging from a few queries per
> second to thousands of queries per second, while maintaining 99th-percentile
> latencies as low as a few milliseconds. All internal
> dashboards at LinkedIn are powered by Pinot.
>
> == Rationale ==
>
> We believe that the requirement to develop rich real-time analytic applications
> applies to other organizations as well. Both Pinot and the interested
> communities would benefit from this work being openly available.
>
> == Current Status ==
>
> Pinot is currently open sourced under the Apache License Version 2.0 and
> available at github.com/linkedin/pinot. All the development is done using
> GitHub Pull Requests. We cut releases on a weekly basis and deploy it at
> LinkedIn. mp-0.1.468 is the latest release tag that is deployed in
> production.
>
> == Meritocracy ==
>
> Following the Apache meritocracy model, we intend to build an open and
> diverse community around Pinot. We will encourage the community to
> contribute to discussion and codebase.
>
> == Community ==
>
> Pinot is currently used extensively at LinkedIn and Uber. Several companies
> have expressed interest in the project. We hope to extend the contributor
> base significantly by bringing Pinot into Apache.
>
> == Core Developers ==
>
> Pinot was started by engineers at LinkedIn, and now has committers from
> Uber.
>
> == Alignment ==
>
> Apache is the most natural home for taking Pinot forward. Pinot leverages
> several existing Apache Projects such as Kafka, Helix, Zookeeper, and Avro.
> As Pinot gains adoption, we plan to add support for the ORC and Parquet
> formats, as well as integration with YARN and Mesos.
>
> == Known Risks ==
>
> === Orphaned Products ===
>
> The risk of the Pinot project being abandoned is minimal. The teams at
> LinkedIn and Uber are highly incentivized to continue development of Pinot
> as it is a critical part of their infrastructure.
>
> === Inexperience with Open Source ===
>
> Post open sourcing, Pinot was completely developed on GitHub. All the
> current developers on Pinot are well aware of the open source development
> process. However, most of the developers are new to the Apache process.
> Kishore Gopalakrishna, one of the lead developers in Pinot, is VP and
> committer of the Apache Helix project.
>
> === Homogenous Developers ===
>
> The current core developers are all from LinkedIn and Uber. However, 

[jira] [Commented] (SPARK-23632) sparkR.session() error with spark packages - JVM is not ready after 10 seconds

2018-03-08 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16392507#comment-16392507
 ] 

Felix Cheung commented on SPARK-23632:
--

To clarify, are you running into problem because the package download is taking 
longer than the fixed 10 sec?

> sparkR.session() error with spark packages - JVM is not ready after 10 seconds
> --
>
> Key: SPARK-23632
> URL: https://issues.apache.org/jira/browse/SPARK-23632
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0, 2.2.1, 2.3.0
>Reporter: Jaehyeon Kim
>Priority: Minor
>
> Hi
> When I execute _sparkR.session()_ with _org.apache.hadoop:hadoop-aws:2.8.2_ 
> as follows,
> {code:java}
> library(SparkR, lib.loc=file.path(Sys.getenv('SPARK_HOME'),'R', 'lib'))
> ext_opts <- '-Dhttp.proxyHost=10.74.1.25 -Dhttp.proxyPort=8080 
> -Dhttps.proxyHost=10.74.1.25 -Dhttps.proxyPort=8080'
> sparkR.session(master = "spark://master:7077",
>appName = 'ml demo',
>sparkConfig = list(spark.driver.memory = '2g'), 
>sparkPackages = 'org.apache.hadoop:hadoop-aws:2.8.2',
>spark.driver.extraJavaOptions = ext_opts)
> {code}
> I see *JVM is not ready after 10 seconds* error. Below shows some of the log 
> messages.
> {code:java}
> Ivy Default Cache set to: /home/rstudio/.ivy2/cache
> The jars for the packages stored in: /home/rstudio/.ivy2/jars
> :: loading settings :: url = 
> jar:file:/usr/local/spark-2.2.1/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
> org.apache.hadoop#hadoop-aws added as a dependency
> :: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
>   confs: [default]
>   found org.apache.hadoop#hadoop-aws;2.8.2 in central
> ...
> ...
>   found javax.servlet.jsp#jsp-api;2.1 in central
> Error in sparkR.sparkContext(master, appName, sparkHome, sparkConfigMap,  : 
>   JVM is not ready after 10 seconds
> ...
> ...
>   found joda-time#joda-time;2.9.4 in central
> downloading 
> https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.8.2/hadoop-aws-2.8.2.jar
>  ...
> ...
> ...
>   xmlenc#xmlenc;0.52 from central in [default]
>   -
>   |  |modules||   artifacts   |
>   |   conf   | number| search|dwnlded|evicted|| number|dwnlded|
>   -
>   |  default |   76  |   76  |   76  |   0   ||   76  |   76  |
>   -
> :: retrieving :: org.apache.spark#spark-submit-parent
>   confs: [default]
>   76 artifacts copied, 0 already retrieved (27334kB/56ms)
> {code}
> It's fine if I re-execute it after the package and its dependencies are 
> downloaded.
> I believe it's because of this part - 
> https://github.com/apache/spark/blob/master/R/pkg/R/sparkR.R#L181
> {code:java}
> if (!file.exists(path)) {
>   stop("JVM is not ready after 10 seconds")
> }
> {code}
> I just wonder if it may be possible to update this so that a user can determine how 
> long to wait?
> Thanks.
> Regards
> Jaehyeon



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23291) SparkR : substr : In SparkR dataframe , starting and ending position arguments in "substr" is giving wrong result when the position is greater than 1

2018-03-07 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-23291:
-
Affects Version/s: 2.1.2
   2.2.0
   2.3.0

> SparkR : substr : In SparkR dataframe , starting and ending position 
> arguments in "substr" is giving wrong result  when the position is greater 
> than 1
> --
>
> Key: SPARK-23291
> URL: https://issues.apache.org/jira/browse/SPARK-23291
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.2, 2.2.0, 2.2.1, 2.3.0
>Reporter: Narendra
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Fix For: 2.4.0
>
>
> Defect Description :
> -
> For example, an input string "2017-12-01" is read into a SparkR dataframe 
> "df" with column name "col1".
>  The target is to create a new column named "col2" with the value "12", 
> which is inside the string. "12" can be extracted with "starting position" 
> "6" and "ending position" "7"
>  (the starting position of the first character is considered to be "1").
> But the code that currently needs to be written is:
>  
>  df <- withColumn(df, "col2", substr(df$col1, 7, 8))
> Observe that the first argument of the "substr" API, which indicates the 
> 'starting position', is given as "7".
>  Also observe that the second argument of the "substr" API, which indicates 
> the 'ending position', is given as "8".
> i.e., the number that must be supplied to indicate a position is the 
> "actual position + 1".
> Expected behavior:
> 
> The code that should be required is:
>  
>  df <- withColumn(df, "col2", substr(df$col1, 6, 7))
> Note:
> ---
>  This defect is observed only when the starting position is greater than 
> 1.
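
For illustration, a minimal sketch of the documented 1-based behavior once the fix is in place (assuming a local session; the fix is targeted at 2.4.0):

{code:r}
library(SparkR)
sparkR.session(master = "local[1]")  # local session, for illustration only

df <- createDataFrame(data.frame(col1 = "2017-12-01", stringsAsFactors = FALSE))
# Characters 6 through 7 of "2017-12-01" are "12", so with 1-based positions:
df <- withColumn(df, "col2", substr(df$col1, 6, 7))
head(df)  # col2 should be "12"
{code}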



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23291) SparkR : substr : In SparkR dataframe , starting and ending position arguments in "substr" is giving wrong result when the position is greater than 1

2018-03-07 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-23291.
--
  Resolution: Fixed
Assignee: Liang-Chi Hsieh
   Fix Version/s: 2.4.0
Target Version/s: 2.4.0

> SparkR : substr : In SparkR dataframe , starting and ending position 
> arguments in "substr" is giving wrong result  when the position is greater 
> than 1
> --
>
> Key: SPARK-23291
> URL: https://issues.apache.org/jira/browse/SPARK-23291
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.1
>Reporter: Narendra
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Fix For: 2.4.0
>
>
> Defect Description :
> -
> For example, an input string "2017-12-01" is read into a SparkR dataframe 
> "df" with column name "col1".
>  The target is to create a new column named "col2" with the value "12", 
> which is inside the string. "12" can be extracted with "starting position" 
> "6" and "ending position" "7"
>  (the starting position of the first character is considered to be "1").
> But the code that currently needs to be written is:
>  
>  df <- withColumn(df, "col2", substr(df$col1, 7, 8))
> Observe that the first argument of the "substr" API, which indicates the 
> 'starting position', is given as "7".
>  Also observe that the second argument of the "substr" API, which indicates 
> the 'ending position', is given as "8".
> i.e., the number that must be supplied to indicate a position is the 
> "actual position + 1".
> Expected behavior:
> 
> The code that should be required is:
>  
>  df <- withColumn(df, "col2", substr(df$col1, 6, 7))
> Note:
> ---
>  This defect is observed only when the starting position is greater than 
> 1.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



IPMC join request

2018-03-06 Thread Felix Cheung
Hi all,

I'd like to join the IPMC, initially to help mentor Dr. Elephant as an incubator
project, but I am also looking forward to helping mentor other Apache incubator
projects.

I am a PPMC/PMC member of Apache Zeppelin (from incubation through TLP) and a PMC
member of Apache Spark, and have served as Release Manager for releases.

Thanks!
Felix


[jira] [Assigned] (SPARK-22430) Unknown tag warnings when building R docs with Roxygen 6.0.1

2018-03-05 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung reassigned SPARK-22430:


Assignee: Rekha Joshi

> Unknown tag warnings when building R docs with Roxygen 6.0.1
> 
>
> Key: SPARK-22430
> URL: https://issues.apache.org/jira/browse/SPARK-22430
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.3.0
> Environment: Roxygen 6.0.1
>Reporter: Joel Croteau
>Assignee: Rekha Joshi
>Priority: Trivial
> Fix For: 2.4.0
>
>
> When building R docs using create-rd.sh with Roxygen 6.0.1, a large number of 
> unknown tag warnings are generated:
> {noformat}
> Warning: @export [schema.R#33]: unknown tag
> Warning: @export [schema.R#53]: unknown tag
> Warning: @export [schema.R#63]: unknown tag
> Warning: @export [schema.R#80]: unknown tag
> Warning: @export [schema.R#123]: unknown tag
> Warning: @export [schema.R#141]: unknown tag
> Warning: @export [schema.R#216]: unknown tag
> Warning: @export [generics.R#388]: unknown tag
> Warning: @export [generics.R#403]: unknown tag
> Warning: @export [generics.R#407]: unknown tag
> Warning: @export [generics.R#414]: unknown tag
> Warning: @export [generics.R#418]: unknown tag
> Warning: @export [generics.R#422]: unknown tag
> Warning: @export [generics.R#428]: unknown tag
> Warning: @export [generics.R#432]: unknown tag
> Warning: @export [generics.R#438]: unknown tag
> Warning: @export [generics.R#442]: unknown tag
> Warning: @export [generics.R#446]: unknown tag
> Warning: @export [generics.R#450]: unknown tag
> Warning: @export [generics.R#454]: unknown tag
> Warning: @export [generics.R#459]: unknown tag
> Warning: @export [generics.R#467]: unknown tag
> Warning: @export [generics.R#475]: unknown tag
> Warning: @export [generics.R#479]: unknown tag
> Warning: @export [generics.R#483]: unknown tag
> Warning: @export [generics.R#487]: unknown tag
> Warning: @export [generics.R#498]: unknown tag
> Warning: @export [generics.R#502]: unknown tag
> Warning: @export [generics.R#506]: unknown tag
> Warning: @export [generics.R#512]: unknown tag
> Warning: @export [generics.R#518]: unknown tag
> Warning: @export [generics.R#526]: unknown tag
> Warning: @export [generics.R#530]: unknown tag
> Warning: @export [generics.R#534]: unknown tag
> Warning: @export [generics.R#538]: unknown tag
> Warning: @export [generics.R#542]: unknown tag
> Warning: @export [generics.R#549]: unknown tag
> Warning: @export [generics.R#556]: unknown tag
> Warning: @export [generics.R#560]: unknown tag
> Warning: @export [generics.R#567]: unknown tag
> Warning: @export [generics.R#571]: unknown tag
> Warning: @export [generics.R#575]: unknown tag
> Warning: @export [generics.R#579]: unknown tag
> Warning: @export [generics.R#583]: unknown tag
> Warning: @export [generics.R#587]: unknown tag
> Warning: @export [generics.R#591]: unknown tag
> Warning: @export [generics.R#595]: unknown tag
> Warning: @export [generics.R#599]: unknown tag
> Warning: @export [generics.R#603]: unknown tag
> Warning: @export [generics.R#607]: unknown tag
> Warning: @export [generics.R#611]: unknown tag
> Warning: @export [generics.R#615]: unknown tag
> Warning: @export [generics.R#619]: unknown tag
> Warning: @export [generics.R#623]: unknown tag
> Warning: @export [generics.R#627]: unknown tag
> Warning: @export [generics.R#631]: unknown tag
> Warning: @export [generics.R#635]: unknown tag
> Warning: @export [generics.R#639]: unknown tag
> Warning: @export [generics.R#643]: unknown tag
> Warning: @export [generics.R#647]: unknown tag
> Warning: @export [generics.R#654]: unknown tag
> Warning: @export [generics.R#658]: unknown tag
> Warning: @export [generics.R#663]: unknown tag
> Warning: @export [generics.R#667]: unknown tag
> Warning: @export [generics.R#672]: unknown tag
> Warning: @export [generics.R#676]: unknown tag
> Warning: @export [generics.R#680]: unknown tag
> Warning: @export [generics.R#684]: unknown tag
> Warning: @export [generics.R#690]: unknown tag
> Warning: @export [generics.R#696]: unknown tag
> Warning: @export [generics.R#702]: unknown tag
> Warning: @export [generics.R#706]: unknown tag
> Warning: @export [generics.R#710]: unknown tag
> Warning: @export [generics.R#716]: unknown tag
> Warning: @export [generics.R#720]: unknown tag
> Warning: @export [generics.R#726]: unknown tag
> Warning: @export [generics.R#730]: unknown tag
> Warning: @export [generics.R#734]: unknown tag

[jira] [Resolved] (SPARK-22430) Unknown tag warnings when building R docs with Roxygen 6.0.1

2018-03-05 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-22430.
--
  Resolution: Fixed
   Fix Version/s: 2.4.0
Target Version/s: 2.4.0

> Unknown tag warnings when building R docs with Roxygen 6.0.1
> 
>
> Key: SPARK-22430
> URL: https://issues.apache.org/jira/browse/SPARK-22430
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.3.0
> Environment: Roxygen 6.0.1
>Reporter: Joel Croteau
>Priority: Trivial
> Fix For: 2.4.0
>
>
> When building R docs using create-rd.sh with Roxygen 6.0.1, a large number of 
> unknown tag warnings are generated:
> {noformat}
> Warning: @export [schema.R#33]: unknown tag
> Warning: @export [schema.R#53]: unknown tag
> Warning: @export [schema.R#63]: unknown tag
> Warning: @export [schema.R#80]: unknown tag
> Warning: @export [schema.R#123]: unknown tag
> Warning: @export [schema.R#141]: unknown tag
> Warning: @export [schema.R#216]: unknown tag
> Warning: @export [generics.R#388]: unknown tag
> Warning: @export [generics.R#403]: unknown tag
> Warning: @export [generics.R#407]: unknown tag
> Warning: @export [generics.R#414]: unknown tag
> Warning: @export [generics.R#418]: unknown tag
> Warning: @export [generics.R#422]: unknown tag
> Warning: @export [generics.R#428]: unknown tag
> Warning: @export [generics.R#432]: unknown tag
> Warning: @export [generics.R#438]: unknown tag
> Warning: @export [generics.R#442]: unknown tag
> Warning: @export [generics.R#446]: unknown tag
> Warning: @export [generics.R#450]: unknown tag
> Warning: @export [generics.R#454]: unknown tag
> Warning: @export [generics.R#459]: unknown tag
> Warning: @export [generics.R#467]: unknown tag
> Warning: @export [generics.R#475]: unknown tag
> Warning: @export [generics.R#479]: unknown tag
> Warning: @export [generics.R#483]: unknown tag
> Warning: @export [generics.R#487]: unknown tag
> Warning: @export [generics.R#498]: unknown tag
> Warning: @export [generics.R#502]: unknown tag
> Warning: @export [generics.R#506]: unknown tag
> Warning: @export [generics.R#512]: unknown tag
> Warning: @export [generics.R#518]: unknown tag
> Warning: @export [generics.R#526]: unknown tag
> Warning: @export [generics.R#530]: unknown tag
> Warning: @export [generics.R#534]: unknown tag
> Warning: @export [generics.R#538]: unknown tag
> Warning: @export [generics.R#542]: unknown tag
> Warning: @export [generics.R#549]: unknown tag
> Warning: @export [generics.R#556]: unknown tag
> Warning: @export [generics.R#560]: unknown tag
> Warning: @export [generics.R#567]: unknown tag
> Warning: @export [generics.R#571]: unknown tag
> Warning: @export [generics.R#575]: unknown tag
> Warning: @export [generics.R#579]: unknown tag
> Warning: @export [generics.R#583]: unknown tag
> Warning: @export [generics.R#587]: unknown tag
> Warning: @export [generics.R#591]: unknown tag
> Warning: @export [generics.R#595]: unknown tag
> Warning: @export [generics.R#599]: unknown tag
> Warning: @export [generics.R#603]: unknown tag
> Warning: @export [generics.R#607]: unknown tag
> Warning: @export [generics.R#611]: unknown tag
> Warning: @export [generics.R#615]: unknown tag
> Warning: @export [generics.R#619]: unknown tag
> Warning: @export [generics.R#623]: unknown tag
> Warning: @export [generics.R#627]: unknown tag
> Warning: @export [generics.R#631]: unknown tag
> Warning: @export [generics.R#635]: unknown tag
> Warning: @export [generics.R#639]: unknown tag
> Warning: @export [generics.R#643]: unknown tag
> Warning: @export [generics.R#647]: unknown tag
> Warning: @export [generics.R#654]: unknown tag
> Warning: @export [generics.R#658]: unknown tag
> Warning: @export [generics.R#663]: unknown tag
> Warning: @export [generics.R#667]: unknown tag
> Warning: @export [generics.R#672]: unknown tag
> Warning: @export [generics.R#676]: unknown tag
> Warning: @export [generics.R#680]: unknown tag
> Warning: @export [generics.R#684]: unknown tag
> Warning: @export [generics.R#690]: unknown tag
> Warning: @export [generics.R#696]: unknown tag
> Warning: @export [generics.R#702]: unknown tag
> Warning: @export [generics.R#706]: unknown tag
> Warning: @export [generics.R#710]: unknown tag
> Warning: @export [generics.R#716]: unknown tag
> Warning: @export [generics.R#720]: unknown tag
> Warning: @export [generics.R#726]: unknown tag
> Warning: @export [generics.R#730]: unknown tag
> Warning: @export [generics.R#734]: unknown tag

Re: Question on Spark-kubernetes integration

2018-03-02 Thread Felix Cheung
For pyspark specifically, IMO it should be very high on the list to port back...

As for roadmap - should be sharing more soon.


From: lucas.g...@gmail.com 
Sent: Friday, March 2, 2018 9:41:46 PM
To: user@spark.apache.org
Cc: Felix Cheung
Subject: Re: Question on Spark-kubernetes integration

Oh interesting, given that pyspark was working in spark on kub 2.2 I assumed it 
would be part of what got merged.

Is there a roadmap in terms of when that may get merged up?

Thanks!



On 2 March 2018 at 21:32, Felix Cheung <felixcheun...@hotmail.com> wrote:
That’s in the plan. We should be sharing a bit more about the roadmap in future 
releases shortly.

In the mean time this is in the official documentation on what is coming:
https://spark.apache.org/docs/latest/running-on-kubernetes.html#future-work

This support started as a fork of the Apache Spark project, and that fork has 
dynamic executor scaling support you can check out here:
https://apache-spark-on-k8s.github.io/userdocs/running-on-kubernetes.html#dynamic-executor-scaling



From: Lalwani, Jayesh <jayesh.lalw...@capitalone.com>
Sent: Friday, March 2, 2018 8:08:55 AM
To: user@spark.apache.org
Subject: Question on Spark-kubernetes integration

Does the Resource scheduler support dynamic resource allocation? Are there any 
plans to add in the future?



The information contained in this e-mail is confidential and/or proprietary to 
Capital One and/or its affiliates and may only be used solely in performance of 
work or services for Capital One. The information transmitted herewith is 
intended only for use by the individual or entity to which it is addressed. If 
the reader of this message is not the intended recipient, you are hereby 
notified that any review, retransmission, dissemination, distribution, copying 
or other use of, or taking of any action in reliance upon this information is 
strictly prohibited. If you have received this communication in error, please 
contact the sender and delete the material from your computer.



Re: Question on Spark-kubernetes integration

2018-03-02 Thread Felix Cheung
That's in the plan. We should be sharing a bit more about the roadmap in future 
releases shortly.

In the mean time this is in the official documentation on what is coming:
https://spark.apache.org/docs/latest/running-on-kubernetes.html#future-work

This support started as a fork of the Apache Spark project, and that fork has 
dynamic executor scaling support you can check out here:
https://apache-spark-on-k8s.github.io/userdocs/running-on-kubernetes.html#dynamic-executor-scaling



From: Lalwani, Jayesh 
Sent: Friday, March 2, 2018 8:08:55 AM
To: user@spark.apache.org
Subject: Question on Spark-kubernetes integration

Does the Resource scheduler support dynamic resource allocation? Are there any 
plans to add in the future?



The information contained in this e-mail is confidential and/or proprietary to 
Capital One and/or its affiliates and may only be used solely in performance of 
work or services for Capital One. The information transmitted herewith is 
intended only for use by the individual or entity to which it is addressed. If 
the reader of this message is not the intended recipient, you are hereby 
notified that any review, retransmission, dissemination, distribution, copying 
or other use of, or taking of any action in reliance upon this information is 
strictly prohibited. If you have received this communication in error, please 
contact the sender and delete the material from your computer.


Re: Welcoming some new committers

2018-03-02 Thread Felix Cheung
Congrats and welcome!


From: Dongjoon Hyun 
Sent: Friday, March 2, 2018 4:27:10 PM
To: Spark dev list
Subject: Re: Welcoming some new committers

Congrats to all!

Bests,
Dongjoon.

On Fri, Mar 2, 2018 at 4:13 PM, Wenchen Fan <cloud0...@gmail.com> wrote:
Congratulations to everyone and welcome!

On Sat, Mar 3, 2018 at 7:26 AM, Cody Koeninger <c...@koeninger.org> wrote:
Congrats to the new committers, and I appreciate the vote of confidence.

On Fri, Mar 2, 2018 at 4:41 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
> Hi everyone,
>
> The Spark PMC has recently voted to add several new committers to the 
> project, based on their contributions to Spark 2.3 and other past work:
>
> - Anirudh Ramanathan (contributor to Kubernetes support)
> - Bryan Cutler (contributor to PySpark and Arrow support)
> - Cody Koeninger (contributor to streaming and Kafka support)
> - Erik Erlandson (contributor to Kubernetes support)
> - Matt Cheah (contributor to Kubernetes support and other parts of Spark)
> - Seth Hendrickson (contributor to MLlib and PySpark)
>
> Please join me in welcoming Anirudh, Bryan, Cody, Erik, Matt and Seth as 
> committers!
>
> Matei
> -
> To unsubscribe e-mail: 
> dev-unsubscr...@spark.apache.org
>

-
To unsubscribe e-mail: 
dev-unsubscr...@spark.apache.org





Re: Using bundler for Jekyll?

2018-03-01 Thread Felix Cheung
Also part of the problem is that the latest news panel is static on each page, 
so any new link added changes hundreds of files?


From: holden.ka...@gmail.com  on behalf of Holden Karau 

Sent: Thursday, March 1, 2018 6:36:43 PM
To: dev
Subject: Using bundler for Jekyll?

One of the things which comes up when folks update the Spark website is that we 
often get lots of unnecessarily changed files. I _think_ some of this might 
come from different jekyll versions on different machines, would folks be OK if 
we added a requirements that folks use bundler so we can have more consistent 
versions?

--
Twitter: https://twitter.com/holdenkarau


Re: Help needed in R documentation generation

2018-02-27 Thread Felix Cheung
I had agreed it was a compromise when it was proposed back in May 2017.

I don’t think I can capture the long reviews and many discussions that went in; 
for further discussion please start from JIRA SPARK-20889.




From: Marcelo Vanzin 
Sent: Tuesday, February 27, 2018 10:26:23 AM
To: Felix Cheung
Cc: Mihály Tóth; Mihály Tóth; dev@spark.apache.org
Subject: Re: Help needed in R documentation generation

I followed Misi's instructions:
- click on 
https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc5-docs/_site/api/R/index.html
- click on "s" at the top
- find "sin" and click on it

And that does not give me the documentation for the "sin" function.
That leads you to a really ugly list of functions that's basically
unreadable. There are lots of things like this:

## S4 method for signature 'Column'
abs(x)

Which look to me like the docs weren't properly generated. So it
doesn't look like it's a discoverability problem, it seems there's
something odd going on with the new docs.

On the previous version those same steps take me to a nicely formatted
doc for the "sin" function.



On Tue, Feb 27, 2018 at 10:14 AM, Felix Cheung
 wrote:
> I think what you are calling out is discoverability of names from index - I
> agree this should be improved.
>
> There are several reasons for this change, if I recall, some are:
>
> - we have too many doc pages and a very long index page because of the
> atypical large number of functions - many R packages only have dozens (or a
> dozen) and we have hundreds; this also affects discoverability
>
> - a side effect of high number of functions is that we have hundreds of
> pages of cross links between functions in the same and different categories
> that are very hard to read or find
>
> - many function examples are too simple or incomplete - it would be good to
> make them runnable, for instance
>
> There was a proposal for a search feature on the doc index at one point, IMO
> that would be very useful and would address the discoverability issue.
>
>
> 
> From: Mihály Tóth 
> Sent: Tuesday, February 27, 2018 9:13:18 AM
> To: Felix Cheung
> Cc: Mihály Tóth; dev@spark.apache.org
>
> Subject: Re: Help needed in R documentation generation
>
> Hi,
>
> Earlier, at https://spark.apache.org/docs/latest/api/R/index.html I see
>
> sin as a title
> description describes what sin does
> usage, arguments, note, see also are specific to sin function
>
> When opening sin from
> https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc5-docs/_site/api/R/index.html:
>
> Title is 'Math functions for Column operations', not very specific to sin
> Description is 'Math functions defined for Column.'
> Usage contains a list of functions, scrolling down you can see sin as well
> though ...
>
> To me that sounds like a problem. Do I overlook something here?
>
> Best Regards,
>   Misi
>
>
> 2018-02-27 16:15 GMT+00:00 Felix Cheung :
>>
>> The help content on sin is in
>>
>> https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc5-docs/_site/api/R/column_math_functions.html
>>
>> It’s a fairly long list but sin is in there. Is that not what you are
>> seeing?
>>
>>
>> 
>> From: Mihály Tóth 
>> Sent: Tuesday, February 27, 2018 8:03:34 AM
>> To: dev@spark.apache.org
>> Subject: Fwd: Help needed in R documentation generation
>>
>> Hi,
>>
>> Actually, when I open the link you provided and click on - for example -
>> 'sin' the page does not seem to describe that function at all. Actually I
>> get same effect that I get locally. I have attached a screenshot about that:
>>
>>
>>
>>
>>
>> I tried with Chrome and then with Safari too and got the same result.
>>
>> When I go to https://spark.apache.org/docs/latest/api/R/index.html (Spark
>> 2.2.1) and select 'sin' I get a proper Description, Usage, Arguments, etc.
>> sections.
>>
>> This sounds like a bug in the documentation of Spark R, does'nt it? Shall
>> I file a Jira about it?
>>
>> Locally I ran SPARK_HOME/R/create-docs.sh and it returned successfully.
>> Unfortunately with the result mentioned above.
>>
>> Best Regards,
>>
>>   Misi
>>
>>
>>>
>>> 
>>>
>>> From: Felix Cheung 
>>> Date: 2018-02-26 20:42 GMT+00:00
>>> Subject: Re: Help needed in R documentation generation
>>> To: Mihály Tóth 
>>> Cc: "dev@spark.apache.org" 
>>>
>&

Re: Help needed in R documentation generation

2018-02-27 Thread Felix Cheung
I think what you are calling out is discoverability of names from index - I 
agree this should be improved.

There are several reasons for this change, if I recall, some are:

- we have too many doc pages and a very long index page because of the atypical 
large number of functions - many R packages only have dozens (or a dozen) and 
we have hundreds; this also affects discoverability

- a side effect of high number of functions is that we have hundreds of pages 
of cross links between functions in the same and different categories that are 
very hard to read or find

- many function examples are too simple or incomplete - it would be good to 
make them runnable, for instance

There was a proposal for a search feature on the doc index at one point, IMO 
that would be very useful and would address the discoverability issue.



From: Mihály Tóth 
Sent: Tuesday, February 27, 2018 9:13:18 AM
To: Felix Cheung
Cc: Mihály Tóth; dev@spark.apache.org
Subject: Re: Help needed in R documentation generation

Hi,

Earlier, at https://spark.apache.org/docs/latest/api/R/index.html I see

  1.  sin as a title
  2.  description describes what sin does
  3.  usage, arguments, note, see also are specific to sin function

When opening sin from 
https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc5-docs/_site/api/R/index.html:

  1.  Title is 'Math functions for Column operations', not very specific to sin
  2.  Description is 'Math functions defined for Column.'
  3.  Usage contains a list of functions, scrolling down you can see sin as 
well though ...

To me that sounds like a problem. Am I overlooking something here?

Best Regards,
  Misi


2018-02-27 16:15 GMT+00:00 Felix Cheung <felixcheun...@hotmail.com>:
The help content on sin is in
https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc5-docs/_site/api/R/column_math_functions.html

It’s a fairly long list but sin is in there. Is that not what you are seeing?



From: Mihály Tóth <mt...@cloudera.com>
Sent: Tuesday, February 27, 2018 8:03:34 AM
To: dev@spark.apache.org
Subject: Fwd: Help needed in R documentation generation

Hi,

Actually, when I open the link you provided and click on - for example - 'sin' 
the page does not seem to describe that function at all. Actually I get the same 
effect that I get locally. I have attached a screenshot of that:


[Inline image 1]


I tried with Chrome and then with Safari too and got the same result.

When I go to https://spark.apache.org/docs/latest/api/R/index.html (Spark 
2.2.1) and select 'sin' I get a proper Description, Usage, Arguments, etc. 
sections.

This sounds like a bug in the documentation of Spark R, doesn't it? Shall I 
file a Jira about it?

Locally I ran SPARK_HOME/R/create-docs.sh and it returned successfully. 
Unfortunately with the result mentioned above.

Best Regards,

  Misi





From: Felix Cheung <felixcheun...@hotmail.com>
Date: 2018-02-26 20:42 GMT+00:00
Subject: Re: Help needed in R documentation generation
To: Mihály Tóth <misut...@gmail.com>
Cc: "dev@spark.apache.org" <dev@spark.apache.org>


Could you tell me more about the steps you are taking? Which page you are 
clicking on?

Could you try 
https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc5-docs/_site/api/R/index.html

____
From: Mihály Tóth <misut...@gmail.com>
Sent: Monday, February 26, 2018 8:06:59 AM
To: Felix Cheung
Cc: dev@spark.apache.org
Subject: Re: Help needed in R documentation generation

I see.

When I click on such a selected function, like 'sin', the page falls apart and 
does not tell me anything about the sin function. How is it supposed to work when all 
functions link to the same column_math_functions.html?

Thanks,

  Misi


On Sun, Feb 25, 2018, 22:53 Felix Cheung <felixcheun...@hotmail.com> wrote:
This is recent change. The html file column_math_functions.html should have the 
right help content.

What is the problem you are experiencing?


From: Mihály Tóth <misut...@gmail.com>
Sent: Sunday, February 25, 2018 10:42:50 PM
To: dev@spark.apache.org
Subject: Help needed in R documentation generation

Hi,

I am having difficulties generating R documentation.

In the R/pkg/html/index.html file, the individual function entries reference
column_math_functions.html instead of the function page itself. Like

<a href="column_math_functions.html">asin</a>

Have you met with such a problem?

Thanks,

  Misi








Re: Help needed in R documentation generation

2018-02-27 Thread Felix Cheung
The help content on sin is in
https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc5-docs/_site/api/R/column_math_functions.html

It’s a fairly long list but sin is in there. Is that not what you are seeing?
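
A minimal sketch, assuming the installed SparkR package uses the same grouped help topic name as the HTML file above:

library(SparkR)
# Per-function pages were consolidated into grouped topics, so sin() and the
# other Column math functions now share one help page:
?column_math_functions
# or equivalently:
help("column_math_functions", package = "SparkR")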



From: Mihály Tóth 
Sent: Tuesday, February 27, 2018 8:03:34 AM
To: dev@spark.apache.org
Subject: Fwd: Help needed in R documentation generation

Hi,

Actually, when I open the link you provided and click on - for example - 'sin' 
the page does not seem to describe that function at all. Actually I get the same 
effect that I get locally. I have attached a screenshot of that:


[Inline image 1]


I tried with Chrome and then with Safari too and got the same result.

When I go to https://spark.apache.org/docs/latest/api/R/index.html (Spark 
2.2.1) and select 'sin' I get a proper Description, Usage, Arguments, etc. 
sections.

This sounds like a bug in the documentation of Spark R, doesn't it? Shall I 
file a Jira about it?

Locally I ran SPARK_HOME/R/create-docs.sh and it returned successfully. 
Unfortunately with the result mentioned above.

Best Regards,

  Misi



--------

From: Felix Cheung <felixcheun...@hotmail.com>
Date: 2018-02-26 20:42 GMT+00:00
Subject: Re: Help needed in R documentation generation
To: Mihály Tóth <misut...@gmail.com>
Cc: "dev@spark.apache.org" <dev@spark.apache.org>


Could you tell me more about the steps you are taking? Which page you are 
clicking on?

Could you try 
https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc5-docs/_site/api/R/index.html


From: Mihály Tóth <misut...@gmail.com>
Sent: Monday, February 26, 2018 8:06:59 AM
To: Felix Cheung
Cc: dev@spark.apache.org
Subject: Re: Help needed in R documentation generation

I see.

When I click on such a selected function, like 'sin', the page falls apart and 
does not tell me anything about the sin function. How is it supposed to work when all 
functions link to the same column_math_functions.html?

Thanks,

  Misi


On Sun, Feb 25, 2018, 22:53 Felix Cheung <felixcheun...@hotmail.com> wrote:
This is recent change. The html file column_math_functions.html should have the 
right help content.

What is the problem you are experiencing?


From: Mihály Tóth <misut...@gmail.com>
Sent: Sunday, February 25, 2018 10:42:50 PM
To: dev@spark.apache.org
Subject: Help needed in R documentation generation

Hi,

I am having difficulties generating R documentation.

In the R/pkg/html/index.html file, the individual function entries reference
column_math_functions.html instead of the function page itself. Like

<a href="column_math_functions.html">asin</a>

Have you met with such a problem?

Thanks,

  Misi







Re: Spark on K8s - using files fetched by init-container?

2018-02-27 Thread Felix Cheung
Yes you were pointing to HDFS on a loopback address...


From: Jenna Hoole 
Sent: Monday, February 26, 2018 1:11:35 PM
To: Yinan Li; user@spark.apache.org
Subject: Re: Spark on K8s - using files fetched by init-container?

Oh, duh. I completely forgot that file:// is a prefix I can use. Up and running 
now :)

Thank you so much!
Jenna

On Mon, Feb 26, 2018 at 1:00 PM, Yinan Li <liyinan...@gmail.com> wrote:
OK, it looks like you will need to use 
`file:///var/spark-data/spark-files/flights.csv` instead. The 'file://' scheme 
must be explicitly used as it seems it defaults to 'hdfs' in your setup.
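
A minimal sketch of reading the localized file from SparkR once the explicit file:// scheme is used (paths as in this thread; the header/inferSchema options are assumptions for the flights.csv example):

flights <- read.df("file:///var/spark-data/spark-files/flights.csv",
                   source = "csv", header = "true", inferSchema = "true")
head(flights)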

On Mon, Feb 26, 2018 at 12:57 PM, Jenna Hoole <jenna.ho...@gmail.com> wrote:
Thank you for the quick response! However, I'm still having problems.

When I try to look for /var/spark-data/spark-files/flights.csv I get told:

Error: Error in loadDF : analysis error - Path does not exist: 
hdfs://192.168.0.1:8020/var/spark-data/spark-files/flights.csv;

Execution halted

Exception in thread "main" org.apache.spark.SparkUserAppException: User 
application exited with 1

at org.apache.spark.deploy.RRunner$.main(RRunner.scala:104)

at org.apache.spark.deploy.RRunner.main(RRunner.scala)

And when I try to look for local:///var/spark-data/spark-files/flights.csv, I 
get:

Error in file(file, "rt") : cannot open the connection

Calls: read.csv -> read.table -> file

In addition: Warning message:

In file(file, "rt") :

  cannot open file 'local:///var/spark-data/spark-files/flights.csv': No such 
file or directory

Execution halted

Exception in thread "main" org.apache.spark.SparkUserAppException: User 
application exited with 1

at org.apache.spark.deploy.RRunner$.main(RRunner.scala:104)

at org.apache.spark.deploy.RRunner.main(RRunner.scala)

I can see from a kubectl describe that the directory is getting mounted.

Mounts:

  /etc/hadoop/conf from hadoop-properties (rw)

  /var/run/secrets/kubernetes.io/serviceaccount from spark-token-pxz79 (ro)

  /var/spark-data/spark-files from download-files (rw)

  /var/spark-data/spark-jars from download-jars-volume (rw)

  /var/spark/tmp from spark-local-dir-0-tmp (rw)

Is there something else I need to be doing in my set up?

Thanks,
Jenna

On Mon, Feb 26, 2018 at 12:02 PM, Yinan Li 
mailto:liyinan...@gmail.com>> wrote:
The files specified through --files are localized by the init-container to 
/var/spark-data/spark-files by default. So in your case, the file should be 
located at /var/spark-data/spark-files/flights.csv locally in the container.

On Mon, Feb 26, 2018 at 10:51 AM, Jenna Hoole 
mailto:jenna.ho...@gmail.com>> wrote:
This is probably stupid user error, but I can't for the life of me figure out 
how to access the files that are staged by the init-container.

I'm trying to run the SparkR example data-manipulation.R which requires the 
path to its datafile. I supply the hdfs location via --files and then the full 
hdfs path.


--files hdfs://192.168.0.1:8020/user/jhoole/flights.csv local:///opt/spark/examples/src/main/r/data-manipulation.R hdfs://192.168.0.1:8020/user/jhoole/flights.csv

The init-container seems to load my file.

18/02/26 18:29:09 INFO spark.SparkContext: Added file hdfs://192.168.0.1:8020/user/jhoole/flights.csv at hdfs://192.168.0.1:8020/user/jhoole/flights.csv with timestamp 1519669749519

18/02/26 18:29:09 INFO util.Utils: Fetching hdfs://192.168.0.1:8020/user/jhoole/flights.csv to /var/spark/tmp/spark-d943dae6-9b95-4df0-87a3-9f7978d6d4d2/userFiles-4112b7aa-b9e7-47a9-bcbc-7f7a01f93e38/fetchFileTemp7872615076522023165.tmp

However, I get an error that my file does not exist.

Error in file(file, "rt") : cannot open the connection

Calls: read.csv -> read.table -> file

In addition: Warning message:

In file(file, "rt") :

  cannot open file 'hdfs://192.168.0.1:8020/user/jhoole/flights.csv': No such file or directory

Execution halted

Exception in thread "main" org.apache.spark.SparkUserAppException: User 
application exited with 1

at org.apache.spark.deploy.RRunner$.main(RRunner.scala:104)

at org.apache.spark.deploy.RRunner.main(RRunner.scala)

If I try supplying just flights.csv, I get a different error

--files hdfs://192.168.0.1:8020/user/jhoole/flights.csv local:///opt/spark/examples/src/main/r/data-manipulation.R flights.csv


Error: Error in loadDF : analysis error - Path does not exist: 
hdfs://192.168.0.1:8020/user/root/flights.csv;

Execution halted

Exception in thread

Re: [VOTE] Spark 2.3.0 (RC5)

2018-02-27 Thread Felix Cheung
+1

Tested R:

install from package, CRAN tests, manual tests, help check, vignettes check

Filed this https://issues.apache.org/jira/browse/SPARK-23461
This is not a regression so not a blocker of the release.

Tested this on win-builder and r-hub. On r-hub everything passed on multiple 
platforms. For win-builder, tests failed on x86 but passed on x64 - perhaps due to an 
intermittent download issue causing a gzip error; re-testing now but won't hold 
the release on this.


From: Nan Zhu 
Sent: Monday, February 26, 2018 4:03:22 PM
To: Michael Armbrust
Cc: dev
Subject: Re: [VOTE] Spark 2.3.0 (RC5)

+1  (non-binding), tested with internal workloads and benchmarks

On Mon, Feb 26, 2018 at 12:09 PM, Michael Armbrust <mich...@databricks.com> wrote:
+1 all our pipelines have been running the RC for several days now.

On Mon, Feb 26, 2018 at 10:33 AM, Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
+1 (non-binding).

Bests,
Dongjoon.



On Mon, Feb 26, 2018 at 9:14 AM, Ryan Blue <rb...@netflix.com.invalid> wrote:
+1 (non-binding)

On Sat, Feb 24, 2018 at 4:17 PM, Xiao Li <gatorsm...@gmail.com> wrote:
+1 (binding) in Spark SQL, Core and PySpark.

Xiao

2018-02-24 14:49 GMT-08:00 Ricardo Almeida <ricardo.alme...@actnowib.com>:
+1 (non-binding)

same as previous RC

On 24 February 2018 at 11:10, Hyukjin Kwon <gurwls...@gmail.com> wrote:
+1

2018-02-24 16:57 GMT+09:00 Bryan Cutler <cutl...@gmail.com>:
+1
Tests passed and additionally ran Arrow related tests and did some perf checks 
with python 2.7.14

On Fri, Feb 23, 2018 at 6:18 PM, Holden Karau <hol...@pigscanfly.ca> wrote:
Note: given the state of Jenkins I'd love to see Bryan Cutler or someone with 
Arrow experience sign off on this release.

On Fri, Feb 23, 2018 at 6:13 PM, Cheng Lian <lian.cs@gmail.com> wrote:

+1 (binding)

Passed all the tests, looks good.

Cheng

On 2/23/18 15:00, Holden Karau wrote:
+1 (binding)
PySpark artifacts install in a fresh Py3 virtual env

On Feb 23, 2018 7:55 AM, "Denny Lee" 
mailto:denny.g@gmail.com>> wrote:
+1 (non-binding)

On Fri, Feb 23, 2018 at 07:08 Josh Goldsborough 
mailto:joshgoldsboroughs...@gmail.com>> wrote:
New to testing out Spark RCs for the community but I was able to run some of 
the basic unit tests without error so for what it's worth, I'm a +1.

On Thu, Feb 22, 2018 at 4:23 PM, Sameer Agarwal 
mailto:samee...@apache.org>> wrote:
Please vote on releasing the following candidate as Apache Spark version 2.3.0. 
The vote is open until Tuesday February 27, 2018 at 8:00:00 am UTC and passes 
if a majority of at least 3 PMC +1 votes are cast.


[ ] +1 Release this package as Apache Spark 2.3.0

[ ] -1 Do not release this package because ...


To learn more about Apache Spark, please see https://spark.apache.org/

The tag to be voted on is v2.3.0-rc5: 
https://github.com/apache/spark/tree/v2.3.0-rc5 
(992447fb30ee9ebb3cf794f2d06f4d63a2d792db)

List of JIRA tickets resolved in this release can be found here: 
https://issues.apache.org/jira/projects/SPARK/versions/12339551

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc5-bin/

Release artifacts are signed with the following key:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1266/

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc5-docs/_site/index.html


FAQ

===
What are the unresolved issues targeted for 2.3.0?
===

Please see https://s.apache.org/oXKi. At the time of writing, there are 
currently no known release blockers.

=
How can I help test this release?
=

If you are a Spark user, you can help us test this release by taking an 
existing Spark workload and running on this release candidate, then reporting 
any regressions.

If you're working in PySpark you can set up a virtual env and install the 
current RC and see if anything important breaks; in Java/Scala you can add 
the staging repository to your project's resolvers and test with the RC (make 
sure to clean up the artifact cache before/after so you don't end up building 
with an out-of-date RC going forward).

===
What should happen to JIRA tickets still targeting 2.3.0?
===

Committers should look at those and triage. Extremely important bug fixes, 
documentation, and API tweaks that impact compatibility should be worked on 
immediately. Everything else please retarget to 2.3.1 or 2.4.0 as appropriate.

===
Why is my bug not fixed?
===

In order to make timely releases, we will typically not hold the release unless 
the bug in question is a regression from 2.2.0.

[jira] [Commented] (SPARK-23206) Additional Memory Tuning Metrics

2018-02-26 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16377913#comment-16377913
 ] 

Felix Cheung commented on SPARK-23206:
--

[~elu] Hi Edwina, we are interested in this as well. We have requirements on 
shuffle that we are currently looking into, and a different approach to metric 
collection that we could discuss. Let me know if there is any 
sync/call/discussion being planned?

 

> Additional Memory Tuning Metrics
> 
>
> Key: SPARK-23206
> URL: https://issues.apache.org/jira/browse/SPARK-23206
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Edwina Lu
>Priority: Major
> Attachments: ExecutorsTab.png, ExecutorsTab2.png, 
> MemoryTuningMetricsDesignDoc.pdf, StageTab.png
>
>
> At LinkedIn, we have multiple clusters, running thousands of Spark 
> applications, and these numbers are growing rapidly. We need to ensure that 
> these Spark applications are well tuned – cluster resources, including 
> memory, should be used efficiently so that the cluster can support running 
> more applications concurrently, and applications should run quickly and 
> reliably.
> Currently there is limited visibility into how much memory executors are 
> using, and users are guessing numbers for executor and driver memory sizing. 
> These estimates are often much larger than needed, leading to memory wastage. 
> Examining the metrics for one cluster for a month, the average percentage of 
> used executor memory (max JVM used memory across executors /  
> spark.executor.memory) is 35%, leading to an average of 591GB unused memory 
> per application (number of executors * (spark.executor.memory - max JVM used 
> memory)). Spark has multiple memory regions (user memory, execution memory, 
> storage memory, and overhead memory), and to understand how memory is being 
> used and fine-tune allocation between regions, it would be useful to have 
> information about how much memory is being used for the different regions.
> To improve visibility into memory usage for the driver and executors and 
> different memory regions, the following additional memory metrics can be be 
> tracked for each executor and driver:
>  * JVM used memory: the JVM heap size for the executor/driver.
>  * Execution memory: memory used for computation in shuffles, joins, sorts 
> and aggregations.
>  * Storage memory: memory used caching and propagating internal data across 
> the cluster.
>  * Unified memory: sum of execution and storage memory.
> The peak values for each memory metric can be tracked for each executor, and 
> also per stage. This information can be shown in the Spark UI and the REST 
> APIs. Information for peak JVM used memory can help with determining 
> appropriate values for spark.executor.memory and spark.driver.memory, and 
> information about the unified memory region can help with determining 
> appropriate values for spark.memory.fraction and 
> spark.memory.storageFraction. Stage memory information can help identify 
> which stages are most memory intensive, and users can look into the relevant 
> code to determine if it can be optimized.
> The memory metrics can be gathered by adding the current JVM used memory, 
> execution memory and storage memory to the heartbeat. SparkListeners are 
> modified to collect the new metrics for the executors, stages and Spark 
> history log. Only interesting values (peak values per stage per executor) are 
> recorded in the Spark history log, to minimize the amount of additional 
> logging.
> We have attached our design documentation with this ticket and would like to 
> receive feedback from the community for this proposal.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Help needed in R documentation generation

2018-02-26 Thread Felix Cheung
Could you tell me more about the steps you are taking? Which page you are 
clicking on?

Could you try 
https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc5-docs/_site/api/R/index.html


From: Mihály Tóth 
Sent: Monday, February 26, 2018 8:06:59 AM
To: Felix Cheung
Cc: dev@spark.apache.org
Subject: Re: Help needed in R documentation generation

I see.

When I click on such a function, like 'sin', the page falls apart and 
does not say anything about the sin function. How is it supposed to work when all 
functions link to the same column_math_functions.html?

Thanks,

  Misi


On Sun, Feb 25, 2018, 22:53 Felix Cheung <felixcheun...@hotmail.com> wrote:
This is a recent change. The html file column_math_functions.html should have the 
right help content.

What is the problem you are experiencing?


From: Mihály Tóth <misut...@gmail.com>
Sent: Sunday, February 25, 2018 10:42:50 PM
To: dev@spark.apache.org
Subject: Help needed in R documentation generation

Hi,

I am having difficulties generating R documentation.

In the R/pkg/html/index.html file, the individual function entries reference
column_math_functions.html instead of the function's own page. Like

asin

Have you encountered such a problem?

Thanks,

  Misi




Re: Help needed in R documentation generation

2018-02-25 Thread Felix Cheung
This is a recent change. The html file column_math_functions.html should have the 
right help content.

What is the problem you are experiencing?


From: Mihály Tóth 
Sent: Sunday, February 25, 2018 10:42:50 PM
To: dev@spark.apache.org
Subject: Help needed in R documentation generation

Hi,

I am having difficulties generating R documentation.

In the R/pkg/html/index.html file, the individual function entries reference
column_math_functions.html instead of the function's own page. Like

asin

Have you encountered such a problem?

Thanks,

  Misi




Re: Github pull requests

2018-02-21 Thread Felix Cheung
Re JIRA - the merge PR script in Spark closes the JIRA automatically..

_
From: Julian Hyde 
Sent: Wednesday, February 21, 2018 8:46 PM
Subject: Re: Github pull requests
To: Jonas Pfefferle 
Cc: , Patrick Stuedi 


I believe that there are tools to do git / CI / JIRA integration. Spark is one 
of the projects with the most integration. Search their lists and JIRA to find 
out how they did it.

Speaking for my own project: Calcite doesn’t have very much integration because 
we don’t have spare cycles to research and troubleshoot. A documented manual 
process suffices.

Julian


> On Feb 21, 2018, at 2:26 AM, Jonas Pfefferle  wrote:
>
> We just closed our first pull request and were wondering if there is also a 
> way to automatically close the corresponding JIRA ticket? Also, is there a way 
> we can technically enforce that a certain number of people have 
> approved the code? Or do we have to do this informally?
>
> Thanks,
> Jonas
>
> On Wed, 14 Feb 2018 10:53:04 -0800
> Julian Hyde  wrote:
>> The nice thing about git is that every git repo is capable of being a master 
>> / slave. (The ASF git repo is special only in that it gathers audit logs 
>> when people push to it, e.g. the IP address where the push came from. Those 
>> logs will be useful if the provenance of our IP is ever challenged.)
>> So, the merging doesn’t happen on the GitHub repo. It happens in the repo on 
>> your laptop. Before merging, you pull the latest from the apache master 
>> branch (it doesn’t matter whether this comes from the GitHub mirror or the 
>> ASF repo - it is bitwise identical, as the commit SHAs will attest), and you 
>> pull from a GitHub repo the commit(s) referenced in the GitHub PR. You 
>> append these commits to the commit chain, test, then push to the ASF master 
>> branch.
>> If you add ‘Close #NN’ to the commit comments (and you generally will), an 
>> ASF commit hook will close PR #NN at the time that the commit arrives in ASF 
>> git.
>> Julian
>>> On Feb 14, 2018, at 6:59 AM, Jonas Pfefferle  wrote:
>>> I think you are missing a 3rd option:
>>> Basically option 1) but we merge the pull request on github and push the 
>>> changes to the apache git. So no need to delete the PRs. However we have to 
>>> be careful to only commit changes to github to not get the histories out of 
>>> sync.
>>> Jonas
>>> On Wed, 14 Feb 2018 13:58:58 +0100
>>> Patrick Stuedi  wrote:
 Hi all,
 If the github repo is synced with git repo only in one direction, then
 what is the recommended way to handle new code contributions
 (including code reviews)? We see two options here:
 1) Code contributions are issued as PRs on the Crail Apache github
 (and reviewed there), then merged outside in a private repo and
 committed back to the Apache git repo (the PR may need to be deleted
 once the commit has happened), from where the Apache Crail github repo
 will again pick it up (sync).
 2) We don't use the git repo at all, only the github repo. PRs are
 reviewed and merged directly at the github level.
 Option (1) looks complicated, option (2) might not be according to the
 Apache policies (?). What is the recommended way?
 -Patrick
 On Mon, Feb 12, 2018 at 5:25 PM, Julian Hyde  
 wrote:
> No.
> Julian
>> On Feb 12, 2018, at 08:03, Jonas Pfefferle  wrote:
>> Hi @all,
>> Is the Apache Crail github repository synced both ways with the Apache 
>> Crail git? I.e. can we merge pull request in github?
>> Regards,
>> Jonas
>





Re: [graphframes]how Graphframes Deal With BidirectionalRelationships

2018-02-20 Thread Felix Cheung
No, it does not support bidirectional edges as of now.

_
From: xiaobo 
Sent: Tuesday, February 20, 2018 4:35 AM
Subject: Re: [graphframes]how Graphframes Deal With BidirectionalRelationships
To: Felix Cheung , 


So the question becomes: does GraphFrames support a bidirectional relationship 
natively with only one edge?



-- Original --
From: Felix Cheung 
Date: Tue,Feb 20,2018 10:01 AM
To: xiaobo , user@spark.apache.org 
Subject: Re: [graphframes]how Graphframes Deal With BidirectionalRelationships

Generally that would be the approach.
But since you have effectively doubled the number of edges, this will likely 
affect the scale at which your job will run.
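
As a rough illustration of that approach in SparkR (a sketch only, with made-up columns; this builds the edge data frame, not the GraphFrame itself):

# Toy edge list; 'relationship' is an illustrative attribute column.
edges <- createDataFrame(data.frame(src = c("a", "b"),
                                    dst = c("b", "c"),
                                    relationship = c("follows", "follows")))
# Add the reverse direction for every edge so directed algorithms see both.
reversed <- selectExpr(edges, "dst as src", "src as dst", "relationship")
bidirectionalEdges <- union(edges, reversed)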


From: xiaobo 
Sent: Monday, February 19, 2018 3:22:02 AM
To: user@spark.apache.org
Subject: [graphframes]how Graphframes Deal With Bidirectional Relationships

Hi,
To represent a bidirectional relationship, one solution is to insert two edges 
for the vertex pair; my question is, do the algorithms of GraphFrames still 
work when we do this?

Thanks





Re: [VOTE] Spark 2.3.0 (RC4)

2018-02-19 Thread Felix Cheung
Ah sorry I realize my wordings were unclear (not enough zzz or coffee)

So to clarify,
1) when searching for a word in the Sql function doc, it does return the 
search result page correctly; however, none of the links in the results open the 
actual doc page. To take the search I included as an example, if you click on 
approx_percentile, for instance, it brings up the web directory instead.

2) The second is that the dist location we are voting on has a .iml file, which is 
normally not included in a release or release RC, and it is unsigned and without a 
hash (therefore it seems like it should not be in the release)

Thanks!

_
From: Shivaram Venkataraman 
Sent: Tuesday, February 20, 2018 2:24 AM
Subject: Re: [VOTE] Spark 2.3.0 (RC4)
To: Felix Cheung 
Cc: Sean Owen , dev 


FWIW The search result link works for me

Shivaram

On Mon, Feb 19, 2018 at 6:21 PM, Felix Cheung <felixcheun...@hotmail.com> wrote:
These are two separate things:

Do the search result links work for you?

The second is that the dist location we are voting on has a .iml file.

_
From: Sean Owen <sro...@gmail.com>
Sent: Tuesday, February 20, 2018 2:19 AM
Subject: Re: [VOTE] Spark 2.3.0 (RC4)
To: Felix Cheung <felixcheun...@hotmail.com>
Cc: dev <dev@spark.apache.org>



Maybe I misunderstand, but I don't see any .iml file in the 4 results on that 
page? it looks reasonable.

On Mon, Feb 19, 2018 at 8:02 PM Felix Cheung <felixcheun...@hotmail.com> wrote:
Any idea why the sql func docs search results return broken links, as below?

From: Felix Cheung <felixcheun...@hotmail.com>
Sent: Sunday, February 18, 2018 10:05:22 AM
To: Sameer Agarwal; Sameer Agarwal

Cc: dev
Subject: Re: [VOTE] Spark 2.3.0 (RC4)
Quick questions:

is the search link for sql functions quite right? 
https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc4-docs/_site/api/sql/search.html?q=app

this file shouldn't be included? 
https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc4-bin/spark-parent_2.11.iml








Re: [VOTE] Spark 2.3.0 (RC4)

2018-02-19 Thread Felix Cheung
These are two separate things:

Do the search result links work for you?

The second is that the dist location we are voting on has a .iml file.

_
From: Sean Owen 
Sent: Tuesday, February 20, 2018 2:19 AM
Subject: Re: [VOTE] Spark 2.3.0 (RC4)
To: Felix Cheung 
Cc: dev 


Maybe I misunderstand, but I don't see any .iml file in the 4 results on that 
page? it looks reasonable.

On Mon, Feb 19, 2018 at 8:02 PM Felix Cheung <felixcheun...@hotmail.com> wrote:
Any idea why the sql func docs search results return broken links, as below?

From: Felix Cheung <felixcheun...@hotmail.com>
Sent: Sunday, February 18, 2018 10:05:22 AM
To: Sameer Agarwal; Sameer Agarwal

Cc: dev
Subject: Re: [VOTE] Spark 2.3.0 (RC4)
Quick questions:

is the search link for sql functions quite right? 
https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc4-docs/_site/api/sql/search.html?q=app

this file shouldn't be included? 
https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc4-bin/spark-parent_2.11.iml





Re: [VOTE] Spark 2.3.0 (RC4)

2018-02-19 Thread Felix Cheung
Any idea why the sql func docs search results return broken links, as below?


From: Felix Cheung 
Sent: Sunday, February 18, 2018 10:05:22 AM
To: Sameer Agarwal; Sameer Agarwal
Cc: dev
Subject: Re: [VOTE] Spark 2.3.0 (RC4)

Quick questions:

is the search link for sql functions quite right? 
https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc4-docs/_site/api/sql/search.html?q=app

this file shouldn't be included? 
https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc4-bin/spark-parent_2.11.iml



From: Sameer Agarwal 
Sent: Saturday, February 17, 2018 1:43:39 PM
To: Sameer Agarwal
Cc: dev
Subject: Re: [VOTE] Spark 2.3.0 (RC4)

I'll start with a +1 once again.

All blockers reported against RC3 have been resolved and the builds are healthy.

On 17 February 2018 at 13:41, Sameer Agarwal <samee...@apache.org> wrote:
Please vote on releasing the following candidate as Apache Spark version 2.3.0. 
The vote is open until Thursday February 22, 2018 at 8:00:00 am UTC and passes 
if a majority of at least 3 PMC +1 votes are cast.


[ ] +1 Release this package as Apache Spark 2.3.0

[ ] -1 Do not release this package because ...


To learn more about Apache Spark, please see https://spark.apache.org/

The tag to be voted on is v2.3.0-rc4: 
https://github.com/apache/spark/tree/v2.3.0-rc4 
(44095cb65500739695b0324c177c19dfa1471472)

List of JIRA tickets resolved in this release can be found here: 
https://issues.apache.org/jira/projects/SPARK/versions/12339551

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc4-bin/

Release artifacts are signed with the following key:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1265/

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc4-docs/_site/index.html


FAQ

===
What are the unresolved issues targeted for 2.3.0?
===

Please see https://s.apache.org/oXKi. At the time of writing, there are 
currently no known release blockers.

=
How can I help test this release?
=

If you are a Spark user, you can help us test this release by taking an 
existing Spark workload and running on this release candidate, then reporting 
any regressions.

If you're working in PySpark you can set up a virtual env and install the 
current RC and see if anything important breaks; in Java/Scala you can add 
the staging repository to your project's resolvers and test with the RC (make 
sure to clean up the artifact cache before/after so you don't end up building 
with an out-of-date RC going forward).

===
What should happen to JIRA tickets still targeting 2.3.0?
===

Committers should look at those and triage. Extremely important bug fixes, 
documentation, and API tweaks that impact compatibility should be worked on 
immediately. Everything else please retarget to 2.3.1 or 2.4.0 as appropriate.

===
Why is my bug not fixed?
===

In order to make timely releases, we will typically not hold the release unless 
the bug in question is a regression from 2.2.0. That being said, if there is 
something which is a regression from 2.2.0 and has not been correctly targeted 
please ping me or a committer to help target the issue (you can see the open 
issues listed as impacting Spark 2.3.0 at https://s.apache.org/WmoI).



--
Sameer Agarwal
Computer Science | UC Berkeley
http://cs.berkeley.edu/~sameerag


Re: [graphframes]how Graphframes Deal With Bidirectional Relationships

2018-02-19 Thread Felix Cheung
Generally that would be the approach.
But since you have effectively doubled the number of edges, this will likely 
affect the scale at which your job will run.


From: xiaobo 
Sent: Monday, February 19, 2018 3:22:02 AM
To: user@spark.apache.org
Subject: [graphframes]how Graphframes Deal With Bidirectional Relationships

Hi,
To represent a bidirectional relationship, one solution is to insert two edges 
for the vertex pair; my question is, do the algorithms of GraphFrames still 
work when we do this?

Thanks



[jira] [Updated] (SPARK-23461) vignettes should include model predictions for some ML models

2018-02-18 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-23461:
-
Description: 
eg. 

Linear Support Vector Machine (SVM) Classifier
h4. Logistic Regression

Tree - GBT, RF, DecisionTree

(and ALS was disabled)

By doing something like {{head(select(gmmFitted, "V1", "V2", "prediction"))}}

  was:
eg. 

Linear Support Vector Machine (SVM) Classifier
h4. Logistic Regression

Tree

(and ALS was disabled)

By doing something like {{head(select(gmmFitted, "V1", "V2", "prediction"))}}


> vignettes should include model predictions for some ML models
> -
>
> Key: SPARK-23461
> URL: https://issues.apache.org/jira/browse/SPARK-23461
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Felix Cheung
>Priority: Major
>
> eg. 
> Linear Support Vector Machine (SVM) Classifier
> h4. Logistic Regression
> Tree - GBT, RF, DecisionTree
> (and ALS was disabled)
> By doing something like {{head(select(gmmFitted, "V1", "V2", "prediction"))}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: [VOTE] Spark 2.3.0 (RC4)

2018-02-18 Thread Felix Cheung
Quick questions:

is the search link for sql functions quite right? 
https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc4-docs/_site/api/sql/search.html?q=app

this file shouldn't be included? 
https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc4-bin/spark-parent_2.11.iml



From: Sameer Agarwal 
Sent: Saturday, February 17, 2018 1:43:39 PM
To: Sameer Agarwal
Cc: dev
Subject: Re: [VOTE] Spark 2.3.0 (RC4)

I'll start with a +1 once again.

All blockers reported against RC3 have been resolved and the builds are healthy.

On 17 February 2018 at 13:41, Sameer Agarwal <samee...@apache.org> wrote:
Please vote on releasing the following candidate as Apache Spark version 2.3.0. 
The vote is open until Thursday February 22, 2018 at 8:00:00 am UTC and passes 
if a majority of at least 3 PMC +1 votes are cast.


[ ] +1 Release this package as Apache Spark 2.3.0

[ ] -1 Do not release this package because ...


To learn more about Apache Spark, please see https://spark.apache.org/

The tag to be voted on is v2.3.0-rc4: 
https://github.com/apache/spark/tree/v2.3.0-rc4 
(44095cb65500739695b0324c177c19dfa1471472)

List of JIRA tickets resolved in this release can be found here: 
https://issues.apache.org/jira/projects/SPARK/versions/12339551

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc4-bin/

Release artifacts are signed with the following key:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1265/

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc4-docs/_site/index.html


FAQ

===
What are the unresolved issues targeted for 2.3.0?
===

Please see https://s.apache.org/oXKi. At the time of writing, there are 
currently no known release blockers.

=
How can I help test this release?
=

If you are a Spark user, you can help us test this release by taking an 
existing Spark workload and running on this release candidate, then reporting 
any regressions.

If you're working in PySpark you can set up a virtual env and install the 
current RC and see if anything important breaks; in Java/Scala you can add 
the staging repository to your project's resolvers and test with the RC (make 
sure to clean up the artifact cache before/after so you don't end up building 
with an out-of-date RC going forward).

===
What should happen to JIRA tickets still targeting 2.3.0?
===

Committers should look at those and triage. Extremely important bug fixes, 
documentation, and API tweaks that impact compatibility should be worked on 
immediately. Everything else please retarget to 2.3.1 or 2.4.0 as appropriate.

===
Why is my bug not fixed?
===

In order to make timely releases, we will typically not hold the release unless 
the bug in question is a regression from 2.2.0. That being said, if there is 
something which is a regression from 2.2.0 and has not been correctly targeted 
please ping me or a committer to help target the issue (you can see the open 
issues listed as impacting Spark 2.3.0 at https://s.apache.org/WmoI).



--
Sameer Agarwal
Computer Science | UC Berkeley
http://cs.berkeley.edu/~sameerag


Re: Does Pyspark Support Graphx?

2018-02-18 Thread Felix Cheung
Hi - I’m maintaining it. As of now there is an issue with 2.2 that breaks 
personalized page rank, and that’s largely the reason there isn’t a release for 
2.2 support.

There are attempts to address this issue - if you are interested we would love 
for your help.


From: Nicolas Paris 
Sent: Sunday, February 18, 2018 12:31:27 AM
To: Denny Lee
Cc: xiaobo; user@spark.apache.org
Subject: Re: Does Pyspark Support Graphx?

> Most likely not as most of the effort is currently on GraphFrames - a great
> blog post on what GraphFrames offers can be found at: https://

Is the graphframes package still active? The github repository
indicates it's not extremely active. Right now, there is no available
package for spark-2.2, so one needs to compile it from sources.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



[jira] [Created] (SPARK-23461) vignettes should include model predictions for some ML models

2018-02-18 Thread Felix Cheung (JIRA)
Felix Cheung created SPARK-23461:


 Summary: vignettes should include model predictions for some ML 
models
 Key: SPARK-23461
 URL: https://issues.apache.org/jira/browse/SPARK-23461
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.2.1, 2.3.0
Reporter: Felix Cheung


eg. 

Linear Support Vector Machine (SVM) Classifier
h4. Logistic Regression

Tree

(and ALS was disabled)

By doing something like {{head(select(gmmFitted, "V1", "V2", "prediction"))}}
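
For instance, the logistic regression section could show predictions along these lines (an illustrative sketch only; the vignette's actual data set and variable names may differ):

# Fit a multinomial logistic regression on a toy data set and show predictions.
irisDF <- createDataFrame(iris)
logitModel <- spark.logit(irisDF, Species ~ ., regParam = 0.5)
logitPred <- predict(logitModel, irisDF)
head(select(logitPred, "Species", "prediction"))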



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23435) R tests should support latest testthat

2018-02-17 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16368351#comment-16368351
 ] 

Felix Cheung commented on SPARK-23435:
--

Working on this. Debugging a problem.

> R tests should support latest testthat
> --
>
> Key: SPARK-23435
> URL: https://issues.apache.org/jira/browse/SPARK-23435
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.1, 2.4.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>Priority: Major
>
> To follow up on SPARK-22817, the latest version of testthat, 2.0.0 was 
> released in Dec 2017, and its method has been changed.
> In order for our tests to keep working, we need to detect that and call a 
> different method.
> Jenkins is running 1.0.1 though, we need to check if it is going to work.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23435) R tests should support latest testthat

2018-02-16 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung reassigned SPARK-23435:


Assignee: Felix Cheung

> R tests should support latest testthat
> --
>
> Key: SPARK-23435
> URL: https://issues.apache.org/jira/browse/SPARK-23435
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.1, 2.4.0
>Reporter: Felix Cheung
>    Assignee: Felix Cheung
>Priority: Major
>
> To follow up on SPARK-22817, the latest version of testthat, 2.0.0 was 
> released in Dec 2017, and its method has been changed.
> In order for our tests to keep working, we need to detect that and call a 
> different method.
> Jenkins is running 1.0.1 though, we need to check if it is going to work.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23435) R tests should support latest testthat

2018-02-15 Thread Felix Cheung (JIRA)
Felix Cheung created SPARK-23435:


 Summary: R tests should support latest testthat
 Key: SPARK-23435
 URL: https://issues.apache.org/jira/browse/SPARK-23435
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.3.1, 2.4.0
Reporter: Felix Cheung


To follow up on SPARK-22817, the latest version of testthat, 2.0.0 was released 
in Dec 2017, and its method has been changed.

In order for our tests to keep working, we need to detect that and call a 
different method.

Jenkins is running 1.0.1 though, we need to check if it is going to work.
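
A minimal sketch of the kind of version detection described above (an assumption about the shape of the fix, not the actual patch; the test directory path is illustrative):

library(testthat)

run_sparkr_tests <- function(path = "R/pkg/tests/fulltests") {
  if (packageVersion("testthat") >= "2.0.0") {
    # testthat 2.x dropped the internal helper, so use the public API
    test_dir(path, reporter = "summary")
  } else {
    # older testthat (1.x), as currently on Jenkins
    test_package("SparkR")
  }
}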



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22817) Use fixed testthat version for SparkR tests in AppVeyor

2018-02-15 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16365263#comment-16365263
 ] 

Felix Cheung edited comment on SPARK-22817 at 2/15/18 9:13 AM:
---

I should have caught this - -we need to fix the test because it will fail in 
CRAN - another option is to fix the dependency version in DESCRIPTION file-

scratch that. in CRAN we are calling test_package, which works fine.


was (Author: felixcheung):
I should have caught this - we need to fix the test because it will fail in 
CRAN - another option is to fix the dependency version in DESCRIPTION file

> Use fixed testthat version for SparkR tests in AppVeyor
> ---
>
> Key: SPARK-22817
> URL: https://issues.apache.org/jira/browse/SPARK-22817
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 2.2.2, 2.3.0
>
>
> We happened to access to the internal {{run_tests}} - 
> https://github.com/r-lib/testthat/blob/v1.0.2/R/test-package.R#L62-L75. 
> https://github.com/apache/spark/blob/master/R/pkg/tests/run-all.R#L58
> This seems removed out in 2.0.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22817) Use fixed testthat version for SparkR tests in AppVeyor

2018-02-15 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16365263#comment-16365263
 ] 

Felix Cheung commented on SPARK-22817:
--

I should have caught this - we need to fix the test because it will fail in 
CRAN - another option is to fix the dependency version in DESCRIPTION file

> Use fixed testthat version for SparkR tests in AppVeyor
> ---
>
> Key: SPARK-22817
> URL: https://issues.apache.org/jira/browse/SPARK-22817
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 2.2.2, 2.3.0
>
>
> We happened to access to the internal {{run_tests}} - 
> https://github.com/r-lib/testthat/blob/v1.0.2/R/test-package.R#L62-L75. 
> https://github.com/apache/spark/blob/master/R/pkg/tests/run-all.R#L58
> This seems removed out in 2.0.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: NullPointerException in paragraph when getting batched TableEnvironment

2018-02-14 Thread Felix Cheung
Does it work within the Flink Scala Shell?


From: André Schütz 
Sent: Wednesday, February 14, 2018 4:02:30 AM
To: us...@zeppelin.incubator.apache.org
Subject: NullPointerException in paragraph when getting batched TableEnvironment

Hi,

within the Flink Interpreter context, we try to get a Batch
TableEnvironment with the following code.

[code]
import org.apache.flink.table.api._
import org.apache.flink.table.api.scala._
import org.apache.flink.table.sources._

val batchEnvironment = benv
val batchTableEnvironment = TableEnvironment.getTableEnvironment(batchEnvironment)
[/code]

When executing the paragraph, we get the following error.

[error]
Caused by: java.lang.ExceptionInInitializerError: java.lang.NullPointerException
Caused by: java.lang.NullPointerException
  at org.apache.flink.table.api.scala.BatchTableEnvironment.<init>(BatchTableEnvironment.scala:47)
  at org.apache.flink.table.api.TableEnvironment$.getTableEnvironment(TableEnvironment.scala:1049)
[/error]

Any ideas why there is the NullPointerException?

I am grateful for any ideas.

Kind regards,
Andre

--
Andre Schütz
COO / Founder - Wegtam GmbH
an...@wegtam.com | P: +49 (0) 381-80 699 041 | M: +49 (0) 176-218 02 604
www.wegtam.com | 
www.tensei-data.com | 
www.wegtam.net


Re: SparkR test script issue: unable to run run-tests.h on spark 2.2

2018-02-14 Thread Felix Cheung
Yes, it is an issue with the newer release of testthat.

To work around it, could you install an earlier version with devtools? Will follow 
up for a fix.
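
Something along these lines should do it (the exact 1.x version number is an assumption; use whichever earlier release matches your setup):

# Install an earlier testthat with devtools, then re-run ./R/run-tests.sh
install.packages("devtools")
devtools::install_version("testthat", version = "1.0.2",
                          repos = "https://cloud.r-project.org")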

_
From: Hyukjin Kwon 
Sent: Wednesday, February 14, 2018 6:49 PM
Subject: Re: SparkR test script issue: unable to run run-tests.h on spark 2.2
To: chandan prakash 
Cc: user @spark 


From a very quick look, I think it's a testthat version issue with SparkR.

I had to fix that version to 1.x before in AppVeyor. There are few details in 
https://github.com/apache/spark/pull/20003

Can you check and lower testthat version?


On 14 Feb 2018 6:09 pm, "chandan prakash" <chandanbaran...@gmail.com> wrote:
Hi All,
I am trying to run test script of R under ./R/run-tests.sh but hitting same 
ERROR everytime.
I tried running on mac as well as centos machine, same issue coming up.
I am using spark 2.2 (branch-2.2)
I followed from apache doc and followed the steps:
1. installed R
2. installed packages like testthat as mentioned in doc
3. run run-tests.h


Every time I am getting this error line:

Error in get(name, envir = asNamespace(pkg), inherits = FALSE) :
  object 'run_tests' not found
Calls: ::: -> get
Execution halted


Any Help?

--
Chandan Prakash





[jira] [Commented] (SPARK-23285) Allow spark.executor.cores to be fractional

2018-02-08 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357755#comment-16357755
 ] 

Felix Cheung commented on SPARK-23285:
--

Sounds reasonable to me



> Allow spark.executor.cores to be fractional
> ---
>
> Key: SPARK-23285
> URL: https://issues.apache.org/jira/browse/SPARK-23285
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Scheduler, Spark Submit
>Affects Versions: 2.4.0
>Reporter: Anirudh Ramanathan
>Priority: Minor
>
> There is a strong check for an integral number of cores per executor in 
> [SparkSubmitArguments.scala#L270-L272|https://github.com/apache/spark/blob/3f4060c340d6bac412e8819c4388ccba226efcf3/core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala#L270-L272].
>  Given we're reusing that property in K8s, does it make sense to relax it?
>  
> K8s treats CPU as a "compressible resource" and can actually assign millicpus 
> to individual containers. Also to be noted - spark.driver.cores has no such 
> check in place.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23314) Pandas grouped udf on dataset with timestamp column error

2018-02-03 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16351333#comment-16351333
 ] 

Felix Cheung commented on SPARK-23314:
--

I've isolated this down to this particular file

[https://raw.githubusercontent.com/BuzzFeedNews/2016-04-federal-surveillance-planes/master/data/feds/feds3.csv]

without converting to pandas it seems to read fine, so not sure if it's a data 
problem.

> Pandas grouped udf on dataset with timestamp column error 
> --
>
> Key: SPARK-23314
> URL: https://issues.apache.org/jira/browse/SPARK-23314
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Felix Cheung
>Priority: Major
>
> Under  SPARK-22216
> When testing pandas_udf on group bys, I saw this error with the timestamp 
> column.
> File "pandas/_libs/tslib.pyx", line 3593, in 
> pandas._libs.tslib.tz_localize_to_utc
> AmbiguousTimeError: Cannot infer dst time from Timestamp('2015-11-01 
> 01:29:30'), try using the 'ambiguous' argument
> For details, see Comment box. I'm able to reproduce this on the latest 
> branch-2.3 (last change from Feb 1 UTC)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23314) Pandas grouped udf on dataset with timestamp column error

2018-02-02 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16351188#comment-16351188
 ] 

Felix Cheung commented on SPARK-23314:
--

Thanks. I have isolated this to a different subset of data, but have not yet been 
able to pinpoint the exact row (mostly the value displayed is local but the data is 
UTC, and there is no match after adjusting for time zone). It might be an issue 
with the data; in that case, is there a way to help debug this?


> Pandas grouped udf on dataset with timestamp column error 
> --
>
> Key: SPARK-23314
> URL: https://issues.apache.org/jira/browse/SPARK-23314
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Felix Cheung
>Priority: Major
>
> Under  SPARK-22216
> When testing pandas_udf on group bys, I saw this error with the timestamp 
> column.
> File "pandas/_libs/tslib.pyx", line 3593, in 
> pandas._libs.tslib.tz_localize_to_utc
> AmbiguousTimeError: Cannot infer dst time from Timestamp('2015-11-01 
> 01:29:30'), try using the 'ambiguous' argument
> For details, see Comment box. I'm able to reproduce this on the latest 
> branch-2.3 (last change from Feb 1 UTC)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23314) Pandas grouped udf on dataset with timestamp column error

2018-02-02 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16350819#comment-16350819
 ] 

Felix Cheung commented on SPARK-23314:
--

I'm running python 2
Pandas 0.22.0
Pyarrow 0.8.0



> Pandas grouped udf on dataset with timestamp column error 
> --
>
> Key: SPARK-23314
> URL: https://issues.apache.org/jira/browse/SPARK-23314
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Felix Cheung
>Priority: Major
>
> Under  SPARK-22216
> When testing pandas_udf on group bys, I saw this error with the timestamp 
> column.
> File "pandas/_libs/tslib.pyx", line 3593, in 
> pandas._libs.tslib.tz_localize_to_utc
> AmbiguousTimeError: Cannot infer dst time from Timestamp('2015-11-01 
> 01:29:30'), try using the 'ambiguous' argument
> For details, see Comment box. I'm able to reproduce this on the latest 
> branch-2.3 (last change from Feb 1 UTC)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23314) Pandas grouped udf on dataset with timestamp column error

2018-02-01 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-23314:
-
Description: 
Under  SPARK-22216

When testing pandas_udf on group bys, I saw this error with the timestamp 
column.

File "pandas/_libs/tslib.pyx", line 3593, in 
pandas._libs.tslib.tz_localize_to_utc

AmbiguousTimeError: Cannot infer dst time from Timestamp('2015-11-01 
01:29:30'), try using the 'ambiguous' argument

For details, see Comment box. I'm able to reproduce this on the latest 
branch-2.3 (last change from Feb 1 UTC)

  was:
Under  SPARK-22216

When testing pandas_udf on group bys, I saw this error with the timestamp 
column.

File "pandas/_libs/tslib.pyx", line 3593, in 
pandas._libs.tslib.tz_localize_to_utc

AmbiguousTimeError: Cannot infer dst time from Timestamp('2015-11-01 
01:29:30'), try using the 'ambiguous' argument

For details, see Comment box


> Pandas grouped udf on dataset with timestamp column error 
> --
>
> Key: SPARK-23314
> URL: https://issues.apache.org/jira/browse/SPARK-23314
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Felix Cheung
>Priority: Major
>
> Under  SPARK-22216
> When testing pandas_udf on group bys, I saw this error with the timestamp 
> column.
> File "pandas/_libs/tslib.pyx", line 3593, in 
> pandas._libs.tslib.tz_localize_to_utc
> AmbiguousTimeError: Cannot infer dst time from Timestamp('2015-11-01 
> 01:29:30'), try using the 'ambiguous' argument
> For details, see Comment box. I'm able to reproduce this on the latest 
> branch-2.3 (last change from Feb 1 UTC)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23314) Pandas grouped udf on dataset with timestamp column error

2018-02-01 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-23314:
-
Description: 
Under  SPARK-22216

When testing pandas_udf on group bys, I saw this error with the timestamp 
column.

File "pandas/_libs/tslib.pyx", line 3593, in 
pandas._libs.tslib.tz_localize_to_utc

AmbiguousTimeError: Cannot infer dst time from Timestamp('2015-11-01 
01:29:30'), try using the 'ambiguous' argument

For details, see Comment box

  was:
Under  SPARK-22216

When testing pandas_udf on group bys, I saw this error with the timestamp 
column.

File "pandas/_libs/tslib.pyx", line 3593, in 
pandas._libs.tslib.tz_localize_to_utc

AmbiguousTimeError: Cannot infer dst time from Timestamp('2015-11-01 
01:29:30'), try using the 'ambiguous' argument

For detailed on repo, see Comment box


> Pandas grouped udf on dataset with timestamp column error 
> --
>
> Key: SPARK-23314
> URL: https://issues.apache.org/jira/browse/SPARK-23314
> Project: Spark
>  Issue Type: Bug
>      Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Felix Cheung
>Priority: Major
>
> Under  SPARK-22216
> When testing pandas_udf on group bys, I saw this error with the timestamp 
> column.
> File "pandas/_libs/tslib.pyx", line 3593, in 
> pandas._libs.tslib.tz_localize_to_utc
> AmbiguousTimeError: Cannot infer dst time from Timestamp('2015-11-01 
> 01:29:30'), try using the 'ambiguous' argument
> For details, see Comment box



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23314) Pandas grouped udf on dataset with timestamp column error

2018-02-01 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-23314:
-
Description: 
Under  SPARK-22216

When testing pandas_udf on group bys, I saw this error with the timestamp 
column.

File "pandas/_libs/tslib.pyx", line 3593, in 
pandas._libs.tslib.tz_localize_to_utc

AmbiguousTimeError: Cannot infer dst time from Timestamp('2015-11-01 
01:29:30'), try using the 'ambiguous' argument

For detailed on repo, see Comment box

  was:
Under  SPARK-22216

When testing pandas_udf on group bys, I saw this error with the timestamp 
column.

AmbiguousTimeError: Cannot infer dst time from Timestamp('2015-11-01 
01:29:30'), try using the 'ambiguous' argument

For detailed on repo, see Comment box


> Pandas grouped udf on dataset with timestamp column error 
> --
>
> Key: SPARK-23314
> URL: https://issues.apache.org/jira/browse/SPARK-23314
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Felix Cheung
>Priority: Major
>
> Under  SPARK-22216
> When testing pandas_udf on group bys, I saw this error with the timestamp 
> column.
> File "pandas/_libs/tslib.pyx", line 3593, in 
> pandas._libs.tslib.tz_localize_to_utc
> AmbiguousTimeError: Cannot infer dst time from Timestamp('2015-11-01 
> 01:29:30'), try using the 'ambiguous' argument
> For detailed on repo, see Comment box



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23314) Pandas grouped udf on dataset with timestamp column error

2018-02-01 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16349899#comment-16349899
 ] 

Felix Cheung commented on SPARK-23314:
--

[~icexelloss] [~bryanc]

> Pandas grouped udf on dataset with timestamp column error 
> --
>
> Key: SPARK-23314
> URL: https://issues.apache.org/jira/browse/SPARK-23314
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Felix Cheung
>Priority: Major
>
> Under  SPARK-22216
> When testing pandas_udf on group bys, I saw this error with the timestamp 
> column.
> AmbiguousTimeError: Cannot infer dst time from Timestamp('2015-11-01 
> 01:29:30'), try using the 'ambiguous' argument
> For detailed on repo, see Comment box



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23314) Pandas grouped udf on dataset with timestamp column error

2018-02-01 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16349898#comment-16349898
 ] 

Felix Cheung commented on SPARK-23314:
--

log


[Stage 3:=> (195 + 5) / 
200]18/02/01 19:17:26 ERROR Executor: Exception in task 7.0 in stage 3.0 (TID 
205)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/Users/felixcheung/spark/python/lib/pyspark.zip/pyspark/worker.py", line 
229, in main
process()
File "/Users/felixcheung/spark/python/lib/pyspark.zip/pyspark/worker.py", line 
224, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/Users/felixcheung/spark/python/pyspark/serializers.py", line 257, in 
dump_stream
batch = _create_batch(series, self._timezone)
File "/Users/felixcheung/spark/python/pyspark/serializers.py", line 235, in 
_create_batch
arrs = [create_array(s, t) for s, t in series]
File "/Users/felixcheung/spark/python/pyspark/serializers.py", line 230, in 
create_array
s = _check_series_convert_timestamps_internal(s.fillna(0), timezone)
File "/Users/felixcheung/spark/python/pyspark/sql/types.py", line 1733, in 
_check_series_convert_timestamps_internal
return s.dt.tz_localize(tz).dt.tz_convert('UTC')
File "/usr/local/lib/python2.7/site-packages/pandas/core/accessor.py", line 
115, in f
return self._delegate_method(name, *args, **kwargs)
File "/usr/local/lib/python2.7/site-packages/pandas/core/indexes/accessors.py", 
line 131, in _delegate_method
result = method(*args, **kwargs)
File "/usr/local/lib/python2.7/site-packages/pandas/util/_decorators.py", line 
118, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python2.7/site-packages/pandas/core/indexes/datetimes.py", 
line 1858, in tz_localize
errors=errors)
File "pandas/_libs/tslib.pyx", line 3593, in 
pandas._libs.tslib.tz_localize_to_utc
AmbiguousTimeError: Cannot infer dst time from Timestamp('2015-11-01 
01:29:30'), try using the 'ambiguous' argument

at 
org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:298)
at 
org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:164)
at 
org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:114)
at 
org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:252)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.agg_doAggregateWithoutKey$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
18/02/01 19:17:26 WARN TaskSetManager: Lost task 7.0 in stage 3.0 (TID 205, 
localhost, executor driver): org.apache.spark.api.python.PythonException: 
Traceback (most recent call last):
File "/Users/felixcheung/spark/python/lib/pyspark.zip/pyspark/worker.py", line 
229, in main
process()
File "/Users/felixcheung/spark/python/lib/pyspark.zip/pyspark/worker.py", line 
224, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/Users/felixcheung/spark/python/pyspark/serializers.py", line 257, in 
dump_stream
batch = _create_batch(series, self._timezone)
File "/Users/felixcheung/spark/python/pyspark/serializers.py", line 235, in 
_create_batch
arrs = [create_array(s, t) for s, t in series]
File "/Users/felixcheung/spark/python/pyspark/serializers.py", line 230, in 
create_array
s = _check_series_convert_timestamps_internal(s.fillna(0), timezone)
File "/Users/felixcheung/spark/python/pyspark/sql/types.py

[jira] [Commented] (SPARK-23314) Pandas grouped udf on dataset with timestamp column error

2018-02-01 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16349897#comment-16349897
 ] 

Felix Cheung commented on SPARK-23314:
--

code

 

>>> flights = spark.read.option("inferSchema", True).option("header", 
>>> True).option("dateFormat", "yyyy-MM-dd HH:mm:ss").csv("data*.csv")
>>> from pyspark.sql.functions import pandas_udf, PandasUDFType
>>> @pandas_udf(flights.schema, PandasUDFType.GROUPED_MAP)
... def subtract_mean_year_mfr(pdf):
... return pdf.assign(year_mfr=pdf.year_mfr - pdf.year_mfr.mean())
...
g = flights.groupby('mfr').apply(subtract_mean_year_mfr)

>>> g = flights.groupby('mfr').apply(subtract_mean_year_mfr)
>>>
>>> g.count()
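
The error comes from pandas' tz_localize: on 2015-11-01, US/Pacific falls back 
from PDT (UTC-7) to PST (UTC-8) at 02:00, so the wall-clock time 01:29:30 exists 
twice and pandas refuses to pick one by default. A minimal standalone sketch 
(hypothetical, not taken from this thread) that reproduces the error and shows 
the effect of the 'ambiguous' argument mentioned in the message:

{code}
import pandas as pd

# 2015-11-01 01:29:30 maps to two different UTC instants in US/Pacific because
# DST ends at 02:00 that morning.
s = pd.Series(pd.to_datetime(["2015-11-01 01:29:30"]))

try:
    s.dt.tz_localize("US/Pacific")              # default is ambiguous='raise'
except Exception as exc:                        # pytz AmbiguousTimeError
    print(type(exc).__name__, exc)

# Passing the 'ambiguous' argument resolves the ambiguity explicitly,
# e.g. by marking ambiguous wall-clock times as NaT:
print(s.dt.tz_localize("US/Pacific", ambiguous="NaT"))
{code}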

> Pandas grouped udf on dataset with timestamp column error 
> --
>
> Key: SPARK-23314
> URL: https://issues.apache.org/jira/browse/SPARK-23314
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Felix Cheung
>Priority: Major
>
> Under  SPARK-22216
> When testing pandas_udf on group bys, I saw this error with the timestamp 
> column.
> AmbiguousTimeError: Cannot infer dst time from Timestamp('2015-11-01 
> 01:29:30'), try using the 'ambiguous' argument
> For details on the repro, see the Comment box



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23314) Pandas grouped udf on dataset with timestamp column error

2018-02-01 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-23314:
-
Environment: (was: data sample

adshex,flight_id,latitude,longitude,altitude,speed,track,squawk,type,timestamp,name,other_names1,other_names2,n_number,serial_number,mfr_mdl_code,mfr,model,year_mfr,type_aircraft,agency
A72AA1,72791e8,33.2552,-117.91699,5499,111,137,4401,B350,2015-08-18T07:58:54Z,US
 DEPARTMENT OF HOMELAND SECURITY,US CUSTOMS & BORDER PROTECTION,OFFICE OF AIR & 
MARINE,561A,FM-36,4220012,HAWKER BEECHCRAFT CORP,B300C,2010,5,dhs
A72AA1,72791e8,33.2659,-117.928,5500,109,138,4401,B350,2015-08-18T07:58:39Z,US 
DEPARTMENT OF HOMELAND SECURITY,US CUSTOMS & BORDER PROTECTION,OFFICE OF AIR & 
MARINE,561A,FM-36,4220012,HAWKER BEECHCRAFT CORP,B300C,2010,5,dhs
A72AA1,72791e8,33.2741,-117.93599,5500,109,137,4401,B350,2015-08-18T07:58:28Z,US
 DEPARTMENT OF HOMELAND SECURITY,US CUSTOMS & BORDER PROTECTION,OFFICE OF AIR & 
MARINE,561A,FM-36,4220012,HAWKER BEECHCRAFT CORP,B300C,2010,5,dhs
A72AA1,72791e8,33.28251,-117.945,5500,112,138,4401,B350,2015-08-18T07:58:13Z,US 
DEPARTMENT OF HOMELAND SECURITY,US CUSTOMS & BORDER PROTECTION,OFFICE OF AIR & 
MARINE,561A,FM-36,4220012,HAWKER BEECHCRAFT CORP,B300C,2010,5,dhs
A72AA1,72791e8,33.29341,-117.95699,5500,102,134,4401,B350,2015-08-18T07:57:58Z,US
 DEPARTMENT OF HOMELAND SECURITY,US CUSTOMS & BORDER PROTECTION,OFFICE OF AIR & 
MARINE,561A,FM-36,4220012,HAWKER BEECHCRAFT CORP,B300C,2010,5,dhs


>>> flights = spark.read.option("inferSchema", True).option("header", 
>>> True).option("dateFormat", "yyyy-MM-dd HH:mm:ss").csv("data*.csv")
>>> from pyspark.sql.functions import pandas_udf, PandasUDFType
>>> @pandas_udf(flights.schema, PandasUDFType.GROUPED_MAP)
... def subtract_mean_year_mfr(pdf):
... return pdf.assign(year_mfr=pdf.year_mfr - pdf.year_mfr.mean())
...
g = flights.groupby('mfr').apply(subtract_mean_year_mfr)

>>> g = flights.groupby('mfr').apply(subtract_mean_year_mfr)
>>>
>>> g.count()
[Stage 3:=> (195 + 5) / 
200]18/02/01 19:17:26 ERROR Executor: Exception in task 7.0 in stage 3.0 (TID 
205)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File 
"/Users/felixcheung/Uber/spark-chamber/python/lib/pyspark.zip/pyspark/worker.py",
 line 229, in main
process()
File 
"/Users/felixcheung/Uber/spark-chamber/python/lib/pyspark.zip/pyspark/worker.py",
 line 224, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/Users/felixcheung/Uber/spark-chamber/python/pyspark/serializers.py", 
line 257, in dump_stream
batch = _create_batch(series, self._timezone)
File "/Users/felixcheung/Uber/spark-chamber/python/pyspark/serializers.py", 
line 235, in _create_batch
arrs = [create_array(s, t) for s, t in series]
File "/Users/felixcheung/Uber/spark-chamber/python/pyspark/serializers.py", 
line 230, in create_array
s = _check_series_convert_timestamps_internal(s.fillna(0), timezone)
File "/Users/felixcheung/Uber/spark-chamber/python/pyspark/sql/types.py", line 
1733, in _check_series_convert_timestamps_internal
return s.dt.tz_localize(tz).dt.tz_convert('UTC')
File "/usr/local/lib/python2.7/site-packages/pandas/core/accessor.py", line 
115, in f
return self._delegate_method(name, *args, **kwargs)
File "/usr/local/lib/python2.7/site-packages/pandas/core/indexes/accessors.py", 
line 131, in _delegate_method
result = method(*args, **kwargs)
File "/usr/local/lib/python2.7/site-packages/pandas/util/_decorators.py", line 
118, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python2.7/site-packages/pandas/core/indexes/datetimes.py", 
line 1858, in tz_localize
errors=errors)
File "pandas/_libs/tslib.pyx", line 3593, in 
pandas._libs.tslib.tz_localize_to_utc
AmbiguousTimeError: Cannot infer dst time from Timestamp('2015-11-01 
01:29:30'), try using the 'ambiguous' argument

at 
org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:298)
at 
org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:164)
at 
org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:114)
at 
org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:252)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.sql.catalyst.express

[jira] [Commented] (SPARK-23314) Pandas grouped udf on dataset with timestamp column error

2018-02-01 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16349896#comment-16349896
 ] 

Felix Cheung commented on SPARK-23314:
--

data sample

adshex,flight_id,latitude,longitude,altitude,speed,track,squawk,type,timestamp,name,other_names1,other_names2,n_number,serial_number,mfr_mdl_code,mfr,model,year_mfr,type_aircraft,agency
A72AA1,72791e8,33.2552,-117.91699,5499,111,137,4401,B350,2015-08-18T07:58:54Z,US
 DEPARTMENT OF HOMELAND SECURITY,US CUSTOMS & BORDER PROTECTION,OFFICE OF AIR & 
MARINE,561A,FM-36,4220012,HAWKER BEECHCRAFT CORP,B300C,2010,5,dhs
A72AA1,72791e8,33.2659,-117.928,5500,109,138,4401,B350,2015-08-18T07:58:39Z,US 
DEPARTMENT OF HOMELAND SECURITY,US CUSTOMS & BORDER PROTECTION,OFFICE OF AIR & 
MARINE,561A,FM-36,4220012,HAWKER BEECHCRAFT CORP,B300C,2010,5,dhs
A72AA1,72791e8,33.2741,-117.93599,5500,109,137,4401,B350,2015-08-18T07:58:28Z,US
 DEPARTMENT OF HOMELAND SECURITY,US CUSTOMS & BORDER PROTECTION,OFFICE OF AIR & 
MARINE,561A,FM-36,4220012,HAWKER BEECHCRAFT CORP,B300C,2010,5,dhs
A72AA1,72791e8,33.28251,-117.945,5500,112,138,4401,B350,2015-08-18T07:58:13Z,US 
DEPARTMENT OF HOMELAND SECURITY,US CUSTOMS & BORDER PROTECTION,OFFICE OF AIR & 
MARINE,561A,FM-36,4220012,HAWKER BEECHCRAFT CORP,B300C,2010,5,dhs
A72AA1,72791e8,33.29341,-117.95699,5500,102,134,4401,B350,2015-08-18T07:57:58Z,US
 DEPARTMENT OF HOMELAND SECURITY,US CUSTOMS & BORDER PROTECTION,OFFICE OF AIR & 
MARINE,561A,FM-36,4220012,HAWKER BEECHCRAFT CORP,B300C,2010,5,dhs

> Pandas grouped udf on dataset with timestamp column error 
> --
>
> Key: SPARK-23314
> URL: https://issues.apache.org/jira/browse/SPARK-23314
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Felix Cheung
>Priority: Major
>
> Under  SPARK-22216
> When testing pandas_udf on group bys, I saw this error with the timestamp 
> column.
> AmbiguousTimeError: Cannot infer dst time from Timestamp('2015-11-01 
> 01:29:30'), try using the 'ambiguous' argument
> For details on the repro, see the Comment box



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23314) Pandas grouped udf on dataset with timestamp column error

2018-02-01 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-23314:
-
Description: 
Under  SPARK-22216

When testing pandas_udf on group bys, I saw this error with the timestamp 
column.

AmbiguousTimeError: Cannot infer dst time from Timestamp('2015-11-01 
01:29:30'), try using the 'ambiguous' argument

For details on the repro, see the Comment box

  was:
Under  SPARK-22216

When testing pandas_udf on group bys, I saw this error with the timestamp 
column.

AmbiguousTimeError: Cannot infer dst time from Timestamp('2015-11-01 
01:29:30'), try using the 'ambiguous' argument

For details on the repro, see the Environment box


> Pandas grouped udf on dataset with timestamp column error 
> --
>
> Key: SPARK-23314
> URL: https://issues.apache.org/jira/browse/SPARK-23314
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Felix Cheung
>Priority: Major
>
> Under  SPARK-22216
> When testing pandas_udf on group bys, I saw this error with the timestamp 
> column.
> AmbiguousTimeError: Cannot infer dst time from Timestamp('2015-11-01 
> 01:29:30'), try using the 'ambiguous' argument
> For details on the repro, see the Comment box



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23314) Pandas grouped udf on dataset with timestamp column error

2018-02-01 Thread Felix Cheung (JIRA)
Felix Cheung created SPARK-23314:


 Summary: Pandas grouped udf on dataset with timestamp column error 
 Key: SPARK-23314
 URL: https://issues.apache.org/jira/browse/SPARK-23314
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.3.0
 Environment: data sample

adshex,flight_id,latitude,longitude,altitude,speed,track,squawk,type,timestamp,name,other_names1,other_names2,n_number,serial_number,mfr_mdl_code,mfr,model,year_mfr,type_aircraft,agency
A72AA1,72791e8,33.2552,-117.91699,5499,111,137,4401,B350,2015-08-18T07:58:54Z,US
 DEPARTMENT OF HOMELAND SECURITY,US CUSTOMS & BORDER PROTECTION,OFFICE OF AIR & 
MARINE,561A,FM-36,4220012,HAWKER BEECHCRAFT CORP,B300C,2010,5,dhs
A72AA1,72791e8,33.2659,-117.928,5500,109,138,4401,B350,2015-08-18T07:58:39Z,US 
DEPARTMENT OF HOMELAND SECURITY,US CUSTOMS & BORDER PROTECTION,OFFICE OF AIR & 
MARINE,561A,FM-36,4220012,HAWKER BEECHCRAFT CORP,B300C,2010,5,dhs
A72AA1,72791e8,33.2741,-117.93599,5500,109,137,4401,B350,2015-08-18T07:58:28Z,US
 DEPARTMENT OF HOMELAND SECURITY,US CUSTOMS & BORDER PROTECTION,OFFICE OF AIR & 
MARINE,561A,FM-36,4220012,HAWKER BEECHCRAFT CORP,B300C,2010,5,dhs
A72AA1,72791e8,33.28251,-117.945,5500,112,138,4401,B350,2015-08-18T07:58:13Z,US 
DEPARTMENT OF HOMELAND SECURITY,US CUSTOMS & BORDER PROTECTION,OFFICE OF AIR & 
MARINE,561A,FM-36,4220012,HAWKER BEECHCRAFT CORP,B300C,2010,5,dhs
A72AA1,72791e8,33.29341,-117.95699,5500,102,134,4401,B350,2015-08-18T07:57:58Z,US
 DEPARTMENT OF HOMELAND SECURITY,US CUSTOMS & BORDER PROTECTION,OFFICE OF AIR & 
MARINE,561A,FM-36,4220012,HAWKER BEECHCRAFT CORP,B300C,2010,5,dhs


>>> flights = spark.read.option("inferSchema", True).option("header", 
>>> True).option("dateFormat", "yyyy-MM-dd HH:mm:ss").csv("data*.csv")
>>> from pyspark.sql.functions import pandas_udf, PandasUDFType
>>> @pandas_udf(flights.schema, PandasUDFType.GROUPED_MAP)
... def subtract_mean_year_mfr(pdf):
... return pdf.assign(year_mfr=pdf.year_mfr - pdf.year_mfr.mean())
...
g = flights.groupby('mfr').apply(subtract_mean_year_mfr)

>>> g = flights.groupby('mfr').apply(subtract_mean_year_mfr)
>>>
>>> g.count()
[Stage 3:=> (195 + 5) / 
200]18/02/01 19:17:26 ERROR Executor: Exception in task 7.0 in stage 3.0 (TID 
205)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File 
"/Users/felixcheung/Uber/spark-chamber/python/lib/pyspark.zip/pyspark/worker.py",
 line 229, in main
process()
File 
"/Users/felixcheung/Uber/spark-chamber/python/lib/pyspark.zip/pyspark/worker.py",
 line 224, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/Users/felixcheung/Uber/spark-chamber/python/pyspark/serializers.py", 
line 257, in dump_stream
batch = _create_batch(series, self._timezone)
File "/Users/felixcheung/Uber/spark-chamber/python/pyspark/serializers.py", 
line 235, in _create_batch
arrs = [create_array(s, t) for s, t in series]
File "/Users/felixcheung/Uber/spark-chamber/python/pyspark/serializers.py", 
line 230, in create_array
s = _check_series_convert_timestamps_internal(s.fillna(0), timezone)
File "/Users/felixcheung/Uber/spark-chamber/python/pyspark/sql/types.py", line 
1733, in _check_series_convert_timestamps_internal
return s.dt.tz_localize(tz).dt.tz_convert('UTC')
File "/usr/local/lib/python2.7/site-packages/pandas/core/accessor.py", line 
115, in f
return self._delegate_method(name, *args, **kwargs)
File "/usr/local/lib/python2.7/site-packages/pandas/core/indexes/accessors.py", 
line 131, in _delegate_method
result = method(*args, **kwargs)
File "/usr/local/lib/python2.7/site-packages/pandas/util/_decorators.py", line 
118, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python2.7/site-packages/pandas/core/indexes/datetimes.py", 
line 1858, in tz_localize
errors=errors)
File "pandas/_libs/tslib.pyx", line 3593, in 
pandas._libs.tslib.tz_localize_to_utc
AmbiguousTimeError: Cannot infer dst time from Timestamp('2015-11-01 
01:29:30'), try using the 'ambiguous' argument

at 
org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:298)
at 
org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:164)
at 
org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:114)
at 
org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:252)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collec
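
A possible session-level mitigation (a sketch only, not something proposed in 
this thread, and assuming the spark, flights and subtract_mean_year_mfr objects 
from the repro above): run the session in UTC, which has no DST transitions, so 
the conversion never has to localize an ambiguous wall-clock time.

{code}
# Sketch of a session-level workaround, reusing the objects defined above.
# UTC has no DST transitions, so tz_localize cannot hit an ambiguous time.
spark.conf.set("spark.sql.session.timeZone", "UTC")

g = flights.groupby('mfr').apply(subtract_mean_year_mfr)
g.count()
{code}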

Re: data source v2 online meetup

2018-02-01 Thread Felix Cheung
+1 hangout


From: Xiao Li 
Sent: Wednesday, January 31, 2018 10:46:26 PM
To: Ryan Blue
Cc: Reynold Xin; dev; Wenchen Fen; Russell Spitzer
Subject: Re: data source v2 online meetup

Hi, Ryan,

wow, your Iceberg already uses the data source V2 API! That is pretty cool! I am 
just afraid these new APIs are not stable yet. We might deprecate or change some 
data source v2 APIs in the next version (2.4). Sorry for the inconvenience this 
might introduce.

Thanks for your feedback always,

Xiao


2018-01-31 15:54 GMT-08:00 Ryan Blue 
mailto:rb...@netflix.com.invalid>>:
Thanks for suggesting this, I think it's a great idea. I'll definitely attend 
and can talk about the changes that we've made to DataSourceV2 to enable our new 
table format, Iceberg.

On Wed, Jan 31, 2018 at 2:35 PM, Reynold Xin 
mailto:r...@databricks.com>> wrote:
The data source v2 API is one of the larger changes in Spark 2.3; what has 
already been committed is only the first version, and we'd need more work 
post-2.3 to improve and stabilize it.

I think at this point we should stop making changes to it in branch-2.3, and 
instead focus on using the existing API and getting feedback for 2.4. Would 
people be interested in doing an online hangout to discuss this, perhaps in the 
month of Feb?

It'd be more productive if people attending the hangout have tried the API by 
implementing some new sources or porting an existing source over.





--
Ryan Blue
Software Engineer
Netflix



[jira] [Updated] (SPARK-23291) SparkR : substr : In SparkR dataframe , starting and ending position arguments in "substr" is giving wrong result when the position is greater than 1

2018-01-31 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-23291:
-
Shepherd: Felix Cheung  (was: Hossein Falaki)

> SparkR : substr : In SparkR dataframe , starting and ending position 
> arguments in "substr" is giving wrong result  when the position is greater 
> than 1
> --
>
> Key: SPARK-23291
> URL: https://issues.apache.org/jira/browse/SPARK-23291
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.1
>Reporter: Narendra
>Priority: Major
>
> Defect Description :
> -
> For example ,an input string "2017-12-01" is read into a SparkR dataframe 
> "df" with column name "col1".
> The target is to create a new column named "col2" with the value "12", which 
> is inside the string. "12" can be extracted with "starting position" as "6" 
> and "ending position" as "7".
>  (the starting position of the first character is considered as "1" )
> But,the current code that needs to be written is :
>  
>  df <- withColumn(df,"col2",substr(df$col1,7,8)))
> Observe that the first argument in the "substr" API, which indicates the 
> 'starting position', is mentioned as "7".
>  Also, observe that the second argument in the "substr" API, which indicates 
> the 'ending position', is mentioned as "8".
> i.e., the number that should be mentioned to indicate the position should be 
> the "actual position + 1"
> Expected behavior :
> 
> The code that needs to be written is :
>  
>  df <- withColumn(df,"col2",substr(df$col1,6,7)))
> Note :
> ---
>  This defect is observed only when the starting position is greater than 
> 1.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23114) Spark R 2.3 QA umbrella

2018-01-27 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16342441#comment-16342441
 ] 

Felix Cheung commented on SPARK-23114:
--

Sure!


> Spark R 2.3 QA umbrella
> ---
>
> Key: SPARK-23114
> URL: https://issues.apache.org/jira/browse/SPARK-23114
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Assignee: Felix Cheung
>Priority: Critical
>
> This JIRA lists tasks for the next Spark release's QA period for SparkR.
> The list below gives an overview of what is involved, and the corresponding 
> JIRA issues are linked below that.
> h2. API
> * Audit new public APIs (from the generated html doc)
> ** relative to Spark Scala/Java APIs
> ** relative to popular R libraries
> h2. Documentation and example code
> * For new algorithms, create JIRAs for updating the user guide sections & 
> examples
> * Update Programming Guide
> * Update website



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23114) Spark R 2.3 QA umbrella

2018-01-27 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-23114.
--
Resolution: Fixed

> Spark R 2.3 QA umbrella
> ---
>
> Key: SPARK-23114
> URL: https://issues.apache.org/jira/browse/SPARK-23114
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Assignee: Felix Cheung
>Priority: Critical
>
> This JIRA lists tasks for the next Spark release's QA period for SparkR.
> The list below gives an overview of what is involved, and the corresponding 
> JIRA issues are linked below that.
> h2. API
> * Audit new public APIs (from the generated html doc)
> ** relative to Spark Scala/Java APIs
> ** relative to popular R libraries
> h2. Documentation and example code
> * For new algorithms, create JIRAs for updating the user guide sections & 
> examples
> * Update Programming Guide
> * Update website



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23114) Spark R 2.3 QA umbrella

2018-01-27 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16342394#comment-16342394
 ] 

Felix Cheung commented on SPARK-23114:
--

Resolving.

[~sameerag] please see release note above.

> Spark R 2.3 QA umbrella
> ---
>
> Key: SPARK-23114
> URL: https://issues.apache.org/jira/browse/SPARK-23114
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Assignee: Felix Cheung
>Priority: Critical
>
> This JIRA lists tasks for the next Spark release's QA period for SparkR.
> The list below gives an overview of what is involved, and the corresponding 
> JIRA issues are linked below that.
> h2. API
> * Audit new public APIs (from the generated html doc)
> ** relative to Spark Scala/Java APIs
> ** relative to popular R libraries
> h2. Documentation and example code
> * For new algorithms, create JIRAs for updating the user guide sections & 
> examples
> * Update Programming Guide
> * Update website



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23117) SparkR 2.3 QA: Check for new R APIs requiring example code

2018-01-27 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-23117.
--
Resolution: Won't Fix
  Assignee: Felix Cheung

> SparkR 2.3 QA: Check for new R APIs requiring example code
> --
>
> Key: SPARK-23117
> URL: https://issues.apache.org/jira/browse/SPARK-23117
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Assignee: Felix Cheung
>Priority: Major
>
> Audit list of new features added to MLlib's R API, and see which major items 
> are missing example code (in the examples folder).  We do not need examples 
> for everything, only for major items such as new algorithms.
> For any such items:
> * Create a JIRA for that feature, and assign it to the author of the feature 
> (or yourself if interested).
> * Link it to (a) the original JIRA which introduced that feature ("related 
> to") and (b) to this JIRA ("requires").



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23107) ML, Graph 2.3 QA: API: New Scala APIs, docs

2018-01-26 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16341232#comment-16341232
 ] 

Felix Cheung commented on SPARK-23107:
--

Thanks
My bad, RFormula does have a page:

https://spark.apache.org/docs/2.2.0/ml-features.html#rformula
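
For reference, a minimal PySpark RFormula sketch along the lines of that page 
(illustrative only; the column names and data below are made up):

{code}
from pyspark.ml.feature import RFormula

# Toy DataFrame; 'clicked' is the label, 'country' and 'hour' are features.
df = spark.createDataFrame(
    [(7, "US", 18, 1.0), (8, "CA", 12, 0.0), (9, "NZ", 15, 0.0)],
    ["id", "country", "hour", "clicked"])

formula = RFormula(formula="clicked ~ country + hour",
                   featuresCol="features", labelCol="label")
formula.fit(df).transform(df).select("features", "label").show()
{code}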


> ML, Graph 2.3 QA: API: New Scala APIs, docs
> ---
>
> Key: SPARK-23107
> URL: https://issues.apache.org/jira/browse/SPARK-23107
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
>Priority: Blocker
>
> Audit new public Scala APIs added to MLlib & GraphX.  Take note of:
> * Protected/public classes or methods.  If access can be more private, then 
> it should be.
> * Also look for non-sealed traits.
> * Documentation: Missing?  Bad links or formatting?
> *Make sure to check the object doc!*
> As you find issues, please create JIRAs and link them to this issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23107) ML, Graph 2.3 QA: API: New Scala APIs, docs

2018-01-26 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16340876#comment-16340876
 ] 

Felix Cheung commented on SPARK-23107:
--

We have never had any doc for it, and it’s not new in 2.3.0, so I figure it’s 
not a blocker for the release.


> ML, Graph 2.3 QA: API: New Scala APIs, docs
> ---
>
> Key: SPARK-23107
> URL: https://issues.apache.org/jira/browse/SPARK-23107
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
>Priority: Blocker
>
> Audit new public Scala APIs added to MLlib & GraphX.  Take note of:
> * Protected/public classes or methods.  If access can be more private, then 
> it should be.
> * Also look for non-sealed traits.
> * Documentation: Missing?  Bad links or formatting?
> *Make sure to check the object doc!*
> As you find issues, please create JIRAs and link them to this issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23200) Reset configuration when restarting from checkpoints

2018-01-26 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung reassigned SPARK-23200:


Assignee: Santiago Saavedra  (was: Anirudh Ramanathan)

> Reset configuration when restarting from checkpoints
> 
>
> Key: SPARK-23200
> URL: https://issues.apache.org/jira/browse/SPARK-23200
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Anirudh Ramanathan
>Assignee: Santiago Saavedra
>Priority: Major
> Fix For: 2.4.0
>
>
> Streaming workloads and restarting from checkpoints may need additional 
> changes, i.e. resetting properties -  see 
> https://github.com/apache-spark-on-k8s/spark/pull/516



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23213) SparkR:::textFile(sc1,"/opt/test333") can not work on spark2.2.1

2018-01-25 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16339748#comment-16339748
 ] 

Felix Cheung commented on SPARK-23213:
--

To clarify, we don’t support RDDs in R.

Anything you access via SparkR:::, including unionRDD, is not supported.


> SparkR:::textFile(sc1,"/opt/test333") can not work on spark2.2.1 
> -
>
> Key: SPARK-23213
> URL: https://issues.apache.org/jira/browse/SPARK-23213
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.1
> Environment: JAVA_HOME=/opt/jdk1.8.0_161/
> spark 2.2.1
> R version 3.4.3 (2017-11-30) – "Kite-Eating Tree"
>Reporter: Tony 
>Priority: Major
>
> Welcome to
>                   __ 
>    / __/__  ___ _/ /__ 
>   _\ \/ _ \/ _ `/ __/  '_/ 
>  /___/ .__/\_,_/_/ /_/\_\   version  2.2.1 
>     /_/ 
>  
>  
>  SparkSession available as 'spark'.
> > 
> sc1 <- sparkR.session(appName = "wordcount")
> lines <- SparkR:::textFile(sc1,"/opt/test333")
> 18/01/25 02:33:37 ERROR RBackendHandler: defaultParallelism on 1 failed
> java.lang.IllegalArgumentException: invalid method defaultParallelism for 
> object 1
> at 
> org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:193)
> at 
> org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:108)
> at 
> org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:40)
> at 
> io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)
> at 
> io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:287)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)
> at 
> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)
> at 
> io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:293)
> at 
> io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:267)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)
> at 
> io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1294)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
> at 
> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:911)
> at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
> at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:643)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
> at 
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
> at java.lang.Thread.run(Thread.java:748)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23213) SparkR:::textFile(sc1,"/opt/test333") can not work on spark2.2.1

2018-01-25 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16339559#comment-16339559
 ] 

Felix Cheung commented on SPARK-23213:
--

If you have any specifics on what you need, we should have an alternative API 
you can use that is not RDD based.

Anything you access with SparkR::: (3 colons) is accessing methods inside the 
namespace that are not exported, so they are not public API.


> SparkR:::textFile(sc1,"/opt/test333") can not work on spark2.2.1 
> -
>
> Key: SPARK-23213
> URL: https://issues.apache.org/jira/browse/SPARK-23213
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.1
> Environment: JAVA_HOME=/opt/jdk1.8.0_161/
> spark 2.2.1
> R version 3.4.3 (2017-11-30) – "Kite-Eating Tree"
>Reporter: Tony 
>Priority: Major
>
> Welcome to
>                   __ 
>    / __/__  ___ _/ /__ 
>   _\ \/ _ \/ _ `/ __/  '_/ 
>  /___/ .__/\_,_/_/ /_/\_\   version  2.2.1 
>     /_/ 
>  
>  
>  SparkSession available as 'spark'.
> > 
> sc1 <- sparkR.session(appName = "wordcount")
> lines <- SparkR:::textFile(sc1,"/opt/test333")
> 18/01/25 02:33:37 ERROR RBackendHandler: defaultParallelism on 1 failed
> java.lang.IllegalArgumentException: invalid method defaultParallelism for 
> object 1
> at 
> org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:193)
> at 
> org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:108)
> at 
> org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:40)
> at 
> io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)
> at 
> io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:287)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)
> at 
> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)
> at 
> io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:293)
> at 
> io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:267)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)
> at 
> io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1294)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
> at 
> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:911)
> at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
> at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:643)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
> at 
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
> at java.lang.Thread.run(Thread.java:748)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23213) SparkR:::textFile(sc1,"/opt/test333") can not work on spark2.2.1

2018-01-25 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16339522#comment-16339522
 ] 

Felix Cheung commented on SPARK-23213:
--

You can convert a DataFrame into an RDD, but again, textFile and RDD (all RDD 
APIs) are not supported public APIs, sorry.

It would help if you could elaborate on what you are trying to do and what you 
might need.


> SparkR:::textFile(sc1,"/opt/test333") can not work on spark2.2.1 
> -
>
> Key: SPARK-23213
> URL: https://issues.apache.org/jira/browse/SPARK-23213
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.1
> Environment: JAVA_HOME=/opt/jdk1.8.0_161/
> spark 2.2.1
> R version 3.4.3 (2017-11-30) – "Kite-Eating Tree"
>Reporter: Tony 
>Priority: Major
>
> Welcome to
>                   __ 
>    / __/__  ___ _/ /__ 
>   _\ \/ _ \/ _ `/ __/  '_/ 
>  /___/ .__/\_,_/_/ /_/\_\   version  2.2.1 
>     /_/ 
>  
>  
>  SparkSession available as 'spark'.
> > 
> sc1 <- sparkR.session(appName = "wordcount")
> lines <- SparkR:::textFile(sc1,"/opt/test333")
> 18/01/25 02:33:37 ERROR RBackendHandler: defaultParallelism on 1 failed
> java.lang.IllegalArgumentException: invalid method defaultParallelism for 
> object 1
> at 
> org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:193)
> at 
> org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:108)
> at 
> org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:40)
> at 
> io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)
> at 
> io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:287)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)
> at 
> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)
> at 
> io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:293)
> at 
> io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:267)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)
> at 
> io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1294)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
> at 
> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:911)
> at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
> at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:643)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
> at 
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
> at java.lang.Thread.run(Thread.java:748)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23213) SparkR:::textFile(sc1,"/opt/test333") can not work on spark2.2.1

2018-01-25 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338987#comment-16338987
 ] 

Felix Cheung commented on SPARK-23213:
--

Try read.text instead?

[http://spark.apache.org/docs/latest/api/R/read.text.html]

SparkR:::textFile is an internal method. Is there a reason you need it?

> SparkR:::textFile(sc1,"/opt/test333") can not work on spark2.2.1 
> -
>
> Key: SPARK-23213
> URL: https://issues.apache.org/jira/browse/SPARK-23213
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.1
> Environment: JAVA_HOME=/opt/jdk1.8.0_161/
> spark 2.2.1
> R version 3.4.3 (2017-11-30) – "Kite-Eating Tree"
>Reporter: Tony 
>Priority: Major
>
> Welcome to
>                   __ 
>    / __/__  ___ _/ /__ 
>   _\ \/ _ \/ _ `/ __/  '_/ 
>  /___/ .__/\_,_/_/ /_/\_\   version  2.2.1 
>     /_/ 
>  
>  
>  SparkSession available as 'spark'.
> > 
> sc1 <- sparkR.session(appName = "wordcount")
> lines <- SparkR:::textFile(sc1,"/opt/test333")
> 18/01/25 02:33:37 ERROR RBackendHandler: defaultParallelism on 1 failed
> java.lang.IllegalArgumentException: invalid method defaultParallelism for 
> object 1
> at 
> org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:193)
> at 
> org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:108)
> at 
> org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:40)
> at 
> io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)
> at 
> io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:287)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)
> at 
> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)
> at 
> io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:293)
> at 
> io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:267)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)
> at 
> io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1294)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
> at 
> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:911)
> at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
> at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:643)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
> at 
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
> at java.lang.Thread.run(Thread.java:748)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23117) SparkR 2.3 QA: Check for new R APIs requiring example code

2018-01-24 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338043#comment-16338043
 ] 

Felix Cheung commented on SPARK-23117:
--

I'm ok to sign off even if we don't have examples for SPARK-20307 or SPARK-21381.

Perhaps this is something we should explain more in the ML guide, since the 
changes go into the Python and Scala APIs as well.

> SparkR 2.3 QA: Check for new R APIs requiring example code
> --
>
> Key: SPARK-23117
> URL: https://issues.apache.org/jira/browse/SPARK-23117
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Priority: Major
>
> Audit list of new features added to MLlib's R API, and see which major items 
> are missing example code (in the examples folder).  We do not need examples 
> for everything, only for major items such as new algorithms.
> For any such items:
> * Create a JIRA for that feature, and assign it to the author of the feature 
> (or yourself if interested).
> * Link it to (a) the original JIRA which introduced that feature ("related 
> to") and (b) to this JIRA ("requires").



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23115) SparkR 2.3 QA: New R APIs and API docs

2018-01-24 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-23115.
--
Resolution: Fixed
  Assignee: Felix Cheung

> SparkR 2.3 QA: New R APIs and API docs
> --
>
> Key: SPARK-23115
> URL: https://issues.apache.org/jira/browse/SPARK-23115
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Assignee: Felix Cheung
>Priority: Blocker
>
> Audit new public R APIs.  Take note of:
> * Correctness and uniformity of API
> * Documentation: Missing?  Bad links or formatting?
> ** Check both the generated docs linked from the user guide and the R command 
> line docs `?read.df`. These are generated using roxygen.
> As you find issues, please create JIRAs and link them to this issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23115) SparkR 2.3 QA: New R APIs and API docs

2018-01-24 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16337947#comment-16337947
 ] 

Felix Cheung commented on SPARK-23115:
--

done

> SparkR 2.3 QA: New R APIs and API docs
> --
>
> Key: SPARK-23115
> URL: https://issues.apache.org/jira/browse/SPARK-23115
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Priority: Blocker
>
> Audit new public R APIs.  Take note of:
> * Correctness and uniformity of API
> * Documentation: Missing?  Bad links or formatting?
> ** Check both the generated docs linked from the user guide and the R command 
> line docs `?read.df`. These are generated using roxygen.
> As you find issues, please create JIRAs and link them to this issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23115) SparkR 2.3 QA: New R APIs and API docs

2018-01-23 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16333677#comment-16333677
 ] 

Felix Cheung edited comment on SPARK-23115 at 1/24/18 7:17 AM:
---

Another pass, we should add API doc for

-SPARK-20906 (PR pending)-


was (Author: felixcheung):
Another pass, we should add API doc for

SPARK-20906

> SparkR 2.3 QA: New R APIs and API docs
> --
>
> Key: SPARK-23115
> URL: https://issues.apache.org/jira/browse/SPARK-23115
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Priority: Blocker
>
> Audit new public R APIs.  Take note of:
> * Correctness and uniformity of API
> * Documentation: Missing?  Bad links or formatting?
> ** Check both the generated docs linked from the user guide and the R command 
> line docs `?read.df`. These are generated using roxygen.
> As you find issues, please create JIRAs and link them to this issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21727) Operating on an ArrayType in a SparkR DataFrame throws error

2018-01-23 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-21727.
--
  Resolution: Fixed
   Fix Version/s: 2.4.0
  2.3.0
Target Version/s: 2.3.0

> Operating on an ArrayType in a SparkR DataFrame throws error
> 
>
> Key: SPARK-21727
> URL: https://issues.apache.org/jira/browse/SPARK-21727
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Neil Alexander McQuarrie
>Assignee: Neil Alexander McQuarrie
>Priority: Major
> Fix For: 2.3.0, 2.4.0
>
>
> Previously 
> [posted|https://stackoverflow.com/questions/45056973/sparkr-dataframe-with-r-lists-as-elements]
>  this as a stack overflow question but it seems to be a bug.
> If I have an R data.frame where one of the column data types is an integer 
> *list* -- i.e., each of the elements in the column embeds an entire R list of 
> integers -- then it seems I can convert this data.frame to a SparkR DataFrame 
> just fine... SparkR treats the column as ArrayType(Double). 
> However, any subsequent operation on this SparkR DataFrame appears to throw 
> an error.
> Create an example R data.frame:
> {code}
> indices <- 1:4
> myDf <- data.frame(indices)
> myDf$data <- list(rep(0, 20))
> {code}
> Examine it to make sure it looks okay:
> {code}
> > str(myDf) 
> 'data.frame':   4 obs. of  2 variables:  
>  $ indices: int  1 2 3 4  
>  $ data   :List of 4
>..$ : num  0 0 0 0 0 0 0 0 0 0 ...
>..$ : num  0 0 0 0 0 0 0 0 0 0 ...
>..$ : num  0 0 0 0 0 0 0 0 0 0 ...
>..$ : num  0 0 0 0 0 0 0 0 0 0 ...
> > head(myDf)   
>   indices   data 
> 1   1 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
> 2   2 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
> 3   3 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
> 4   4 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
> {code}
> Convert it to a SparkR DataFrame:
> {code}
> library(SparkR, lib.loc=paste0(Sys.getenv("SPARK_HOME"),"/R/lib"))
> sparkR.session(master = "local[*]")
> mySparkDf <- as.DataFrame(myDf)
> {code}
> Examine the SparkR DataFrame schema; notice that the list column was 
> successfully converted to ArrayType:
> {code}
> > schema(mySparkDf)
> StructType
> |-name = "indices", type = "IntegerType", nullable = TRUE
> |-name = "data", type = "ArrayType(DoubleType,true)", nullable = TRUE
> {code}
> However, operating on the SparkR DataFrame throws an error:
> {code}
> > collect(mySparkDf)
> 17/07/13 17:23:00 ERROR executor.Executor: Exception in task 0.0 in stage 1.0 
> (TID 1)
> java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: 
> java.lang.Double is not a valid external type for schema of array
> if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null 
> else validateexternaltype(getexternalrowfield(assertnotnull(input[0, 
> org.apache.spark.sql.Row, true]), 0, indices), IntegerType) AS indices#0
> ... long stack trace ...
> {code}
> Using Spark 2.2.0, R 3.4.0, Java 1.8.0_131, Windows 10.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22522) Convert to apache-release to publish Maven artifacts

2018-01-23 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16336254#comment-16336254
 ] 

Felix Cheung commented on SPARK-22522:
--

It’s publishing to the right place, but not through the supported plugin.

It’s curling the endpoint directly, which could be fragile.


> Convert to apache-release to publish Maven artifacts 
> -
>
> Key: SPARK-22522
> URL: https://issues.apache.org/jira/browse/SPARK-22522
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: Felix Cheung
>Priority: Minor
>
> see http://www.apache.org/dev/publishing-maven-artifacts.html
> to publish to Nexus/repository.apache.org which can be promoted to maven 
> central (when released).
> this is the same repo we are publishing to today. this JIRA is only tracking 
> the tooling changes.
> ...at the very least we need to revisit all the calls to curl (and/or gpg) in 
> the release-build.sh for the publish-release path - seems like some errors 
> are ignored (running into that myself) and it would be very easy to miss 
> publishing one or more or all files.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22522) Convert to apache-release to publish Maven artifacts

2018-01-23 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16336204#comment-16336204
 ] 

Felix Cheung commented on SPARK-22522:
--

No it’s not done AFAIK


> Convert to apache-release to publish Maven artifacts 
> -
>
> Key: SPARK-22522
> URL: https://issues.apache.org/jira/browse/SPARK-22522
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: Felix Cheung
>Priority: Minor
>
> see http://www.apache.org/dev/publishing-maven-artifacts.html
> to publish to Nexus/repository.apache.org which can be promoted to maven 
> central (when released).
> this is the same repo we are publishing to today. this JIRA is only tracking 
> the tooling changes.
> ...at the very least we need to revisit all the calls to curl (and/or gpg) in 
> the release-build.sh for the publish-release path - seems like some errors 
> are ignored (running into that myself) and it would be very easy to miss 
> publishing one or more or all files.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23114) Spark R 2.3 QA umbrella

2018-01-23 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16335872#comment-16335872
 ] 

Felix Cheung commented on SPARK-23114:
--

I’m merely asking if anyone has a real workload to test this fix with; issues
with job timeouts have been reported against earlier releases, so there must be
some long-running jobs out there.

I don’t have access to real customer datasets myself.

Anyway, as for the other issues you have reported, I think we have had
follow-ups, and it would be great for everyone in the community to chime in.


> Spark R 2.3 QA umbrella
> ---
>
> Key: SPARK-23114
> URL: https://issues.apache.org/jira/browse/SPARK-23114
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Assignee: Felix Cheung
>Priority: Critical
>
> This JIRA lists tasks for the next Spark release's QA period for SparkR.
> The list below gives an overview of what is involved, and the corresponding 
> JIRA issues are linked below that.
> h2. API
> * Audit new public APIs (from the generated html doc)
> ** relative to Spark Scala/Java APIs
> ** relative to popular R libraries
> h2. Documentation and example code
> * For new algorithms, create JIRAs for updating the user guide sections & 
> examples
> * Update Programming Guide
> * Update website



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23107) ML, Graph 2.3 QA: API: New Scala APIs, docs

2018-01-21 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16333712#comment-16333712
 ] 

Felix Cheung edited comment on SPARK-23107 at 1/21/18 11:08 PM:


We don't have doc on RFormula, but it would be a good idea to add one now, and
also to allow for documenting changes like SPARK-20619 and SPARK-20899 in a
language-independent way


was (Author: felixcheung):
We don't have doc on RFormula but it'll be good idea to also allow for 
documenting changes like 
h1. SPARK-20619 
h1. SPARK-20899

in a language independent way

> ML, Graph 2.3 QA: API: New Scala APIs, docs
> ---
>
> Key: SPARK-23107
> URL: https://issues.apache.org/jira/browse/SPARK-23107
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
>Priority: Blocker
>
> Audit new public Scala APIs added to MLlib & GraphX.  Take note of:
> * Protected/public classes or methods.  If access can be more private, then 
> it should be.
> * Also look for non-sealed traits.
> * Documentation: Missing?  Bad links or formatting?
> *Make sure to check the object doc!*
> As you find issues, please create JIRAs and link them to this issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21727) Operating on an ArrayType in a SparkR DataFrame throws error

2018-01-21 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16333733#comment-16333733
 ] 

Felix Cheung commented on SPARK-21727:
--

how are we doing?

> Operating on an ArrayType in a SparkR DataFrame throws error
> 
>
> Key: SPARK-21727
> URL: https://issues.apache.org/jira/browse/SPARK-21727
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Neil Alexander McQuarrie
>Assignee: Neil Alexander McQuarrie
>Priority: Major
>
> Previously 
> [posted|https://stackoverflow.com/questions/45056973/sparkr-dataframe-with-r-lists-as-elements]
>  this as a stack overflow question but it seems to be a bug.
> If I have an R data.frame where one of the column data types is an integer 
> *list* -- i.e., each of the elements in the column embeds an entire R list of 
> integers -- then it seems I can convert this data.frame to a SparkR DataFrame 
> just fine... SparkR treats the column as ArrayType(Double). 
> However, any subsequent operation on this SparkR DataFrame appears to throw 
> an error.
> Create an example R data.frame:
> {code}
> indices <- 1:4
> myDf <- data.frame(indices)
> myDf$data <- list(rep(0, 20))
> {code}
> Examine it to make sure it looks okay:
> {code}
> > str(myDf) 
> 'data.frame':   4 obs. of  2 variables:  
>  $ indices: int  1 2 3 4  
>  $ data   :List of 4
>..$ : num  0 0 0 0 0 0 0 0 0 0 ...
>..$ : num  0 0 0 0 0 0 0 0 0 0 ...
>..$ : num  0 0 0 0 0 0 0 0 0 0 ...
>..$ : num  0 0 0 0 0 0 0 0 0 0 ...
> > head(myDf)   
>   indices   data 
> 1   1 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
> 2   2 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
> 3   3 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
> 4   4 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
> {code}
> Convert it to a SparkR DataFrame:
> {code}
> library(SparkR, lib.loc=paste0(Sys.getenv("SPARK_HOME"),"/R/lib"))
> sparkR.session(master = "local[*]")
> mySparkDf <- as.DataFrame(myDf)
> {code}
> Examine the SparkR DataFrame schema; notice that the list column was 
> successfully converted to ArrayType:
> {code}
> > schema(mySparkDf)
> StructType
> |-name = "indices", type = "IntegerType", nullable = TRUE
> |-name = "data", type = "ArrayType(DoubleType,true)", nullable = TRUE
> {code}
> However, operating on the SparkR DataFrame throws an error:
> {code}
> > collect(mySparkDf)
> 17/07/13 17:23:00 ERROR executor.Executor: Exception in task 0.0 in stage 1.0 
> (TID 1)
> java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: 
> java.lang.Double is not a valid external type for schema of array
> if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null 
> else validateexternaltype(getexternalrowfield(assertnotnull(input[0, 
> org.apache.spark.sql.Row, true]), 0, indices), IntegerType) AS indices#0
> ... long stack trace ...
> {code}
> Using Spark 2.2.0, R 3.4.0, Java 1.8.0_131, Windows 10.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23114) Spark R 2.3 QA umbrella

2018-01-21 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16333730#comment-16333730
 ] 

Felix Cheung edited comment on SPARK-23114 at 1/21/18 11:03 PM:


[~falaki] [~hyukjin.kwon]

About SPARK-21093, do you think you might have real data and a real workload to
test with, for long-haul, heavy-load, or many short/bursty tasks?

 


was (Author: felixcheung):
[~falaki] [~hyukjin.kwon]

About SPARK-21093, do you think you could have real data and real workload to 
test for long haul or heavy load or many tasks?

 

> Spark R 2.3 QA umbrella
> ---
>
> Key: SPARK-23114
> URL: https://issues.apache.org/jira/browse/SPARK-23114
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Assignee: Felix Cheung
>Priority: Critical
>
> This JIRA lists tasks for the next Spark release's QA period for SparkR.
> The list below gives an overview of what is involved, and the corresponding 
> JIRA issues are linked below that.
> h2. API
> * Audit new public APIs (from the generated html doc)
> ** relative to Spark Scala/Java APIs
> ** relative to popular R libraries
> h2. Documentation and example code
> * For new algorithms, create JIRAs for updating the user guide sections & 
> examples
> * Update Programming Guide
> * Update website



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23114) Spark R 2.3 QA umbrella

2018-01-21 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16333725#comment-16333725
 ] 

Felix Cheung edited comment on SPARK-23114 at 1/21/18 11:02 PM:


[~sameerag]

Here are some ideas for the release notes (that goes to spark-website in the 
announcements)

For SparkR, new in 2.3.0:

SQL changes:

SQL functions, cubing & nested structure

collect_list, collect_set, split_string, repeat_string, rollup, cube
 explode_outer posexplode_outer, %<=>%, !, not, create_array, create_map, 
grouping_bit, grouping_id
 input_file_name, alias, trunc, date_trunc, map_keys, map_values, current_date, 
current_timestamp, trim/trimString,
 dayofweek, unionByName,

to_json (map or array of maps)

Data Source -  multiLine (json/csv)

 

ML changes:

Decision Tree (regression and classification)

Constrained Logistic Regression
 offset in SparkR GLM [https://github.com/apache/spark/pull/18831]
 stringIndexerOrderType
 handleInvalid (spark.svmLinear, spark.logit, spark.mlp, spark.naiveBayes, 
spark.gbt, spark.decisionTree, spark.randomForest)

 

SS changes:

Structured Streaming API for withWatermark, trigger (once, processingTime), 
partitionBy

stream-stream join

 

Documentation:

major overhaul and simplification of API doc for SQL functions
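
To illustrate the Structured Streaming items above, a rough sketch (not from the original notes); the path and column names are made up, and it assumes the 2.3.0 SparkR API for read.stream, withWatermark, and the write.stream trigger arguments:

{code}
# sketch only; the input path and schema are illustrative
evSchema <- structType(structField("eventTime", "timestamp"),
                       structField("deviceId", "string"))
events <- read.stream("json", path = "/tmp/events/in", schema = evSchema)

# withWatermark is new to the SparkR streaming API in 2.3.0
late_ok <- withWatermark(events, "eventTime", "10 minutes")
counts  <- count(groupBy(late_ok, window(late_ok$eventTime, "5 minutes")))

# trigger.processingTime / trigger.once are new write.stream arguments in 2.3.0
q <- write.stream(counts, "console", outputMode = "append",
                  trigger.processingTime = "30 seconds")
{code}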

 


was (Author: felixcheung):
[~sameerag]

Here are some ideas for the release notes (that goes to spark-website in the 
announcements)

For SparkR, new in 2.3.0:

SQL changes:

SQL functions, cubing & nested structure

collect_list, collect_set, split_string, repeat_string, rollup, cube
 explode_outer posexplode_outer, %<=>%, !, not, create_array, create_map, 
grouping_bit, grouping_id
 input_file_name, alias, trunc, date_trunc, map_keys, map_values, current_date, 
current_timestamp, trim/trimString,
 dayofweek, unionByName,

to_json (map or array of maps)

Data Source -  multiLine (json/csv)

 

ML changes:

Decision Tree (regression and classification)

Constrained Logistic Regression
offset in SparkR GLM https://github.com/apache/spark/pull/18831
stringIndexerOrderType
handleInvalid (spark.svmLinear, spark.logit, spark.mlp, spark.naiveBayes, 
spark.gbt, spark.decisionTree, spark.randomForest)

 

SS changes:

Structured Streaming API for withWatermark, trigger (once, processingTime), 
partitionBy

stream-stream join

 

Documentation:

major overhaul and simplification of API doc

 

> Spark R 2.3 QA umbrella
> ---
>
> Key: SPARK-23114
> URL: https://issues.apache.org/jira/browse/SPARK-23114
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Assignee: Felix Cheung
>Priority: Critical
>
> This JIRA lists tasks for the next Spark release's QA period for SparkR.
> The list below gives an overview of what is involved, and the corresponding 
> JIRA issues are linked below that.
> h2. API
> * Audit new public APIs (from the generated html doc)
> ** relative to Spark Scala/Java APIs
> ** relative to popular R libraries
> h2. Documentation and example code
> * For new algorithms, create JIRAs for updating the user guide sections & 
> examples
> * Update Programming Guide
> * Update website



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23114) Spark R 2.3 QA umbrella

2018-01-21 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16333730#comment-16333730
 ] 

Felix Cheung commented on SPARK-23114:
--

[~falaki] [~hyukjin.kwon]

About SPARK-21093, do you think you might have real data and a real workload to
test with, for long-haul, heavy-load, or many tasks?

 

> Spark R 2.3 QA umbrella
> ---
>
> Key: SPARK-23114
> URL: https://issues.apache.org/jira/browse/SPARK-23114
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Assignee: Felix Cheung
>Priority: Critical
>
> This JIRA lists tasks for the next Spark release's QA period for SparkR.
> The list below gives an overview of what is involved, and the corresponding 
> JIRA issues are linked below that.
> h2. API
> * Audit new public APIs (from the generated html doc)
> ** relative to Spark Scala/Java APIs
> ** relative to popular R libraries
> h2. Documentation and example code
> * For new algorithms, create JIRAs for updating the user guide sections & 
> examples
> * Update Programming Guide
> * Update website



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23114) Spark R 2.3 QA umbrella

2018-01-21 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16333725#comment-16333725
 ] 

Felix Cheung commented on SPARK-23114:
--

[~sameerag]

Here are some ideas for the release notes (that goes to spark-website in the 
announcements)

For SparkR, new in 2.3.0:

SQL changes:

SQL functions, cubing & nested structure

collect_list, collect_set, split_string, repeat_string, rollup, cube
 explode_outer posexplode_outer, %<=>%, !, not, create_array, create_map, 
grouping_bit, grouping_id
 input_file_name, alias, trunc, date_trunc, map_keys, map_values, current_date, 
current_timestamp, trim/trimString,
 dayofweek, unionByName,

to_json (map or array of maps)

Data Source -  multiLine (json/csv)

 

ML changes:

Decision Tree (regression and classification)

Constrained Logistic Regression
offset in SparkR GLM https://github.com/apache/spark/pull/18831
stringIndexerOrderType
handleInvalid (spark.svmLinear, spark.logit, spark.mlp, spark.naiveBayes, 
spark.gbt, spark.decisionTree, spark.randomForest)

 

SS changes:

Structured Streaming API for withWatermark, trigger (once, processingTime), 
partitionBy

stream-stream join

 

Documentation:

major overhaul and simplification of API doc
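
To make the SQL items above concrete, a small sketch (not from the original notes), assuming a SparkR 2.3.0 session:

{code}
# sketch only; assumes a running SparkR 2.3.0 session
df <- createDataFrame(data.frame(g = c("a", "a", "b"),
                                 v = c(1, 2, 3),
                                 d = as.Date(c("2018-01-01", "2018-01-02", "2018-01-07")),
                                 stringsAsFactors = FALSE))

# collect_list / collect_set are among the new aggregate functions
head(agg(groupBy(df, "g"), vals = collect_list(df$v)))

# dayofweek and trunc are among the new date helpers
head(select(df, dayofweek(df$d), trunc(df$d, "month")))

# cube / rollup style grouping
head(agg(cube(df, "g"), avg_v = avg(df$v)))
{code}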

 

> Spark R 2.3 QA umbrella
> ---
>
> Key: SPARK-23114
> URL: https://issues.apache.org/jira/browse/SPARK-23114
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Assignee: Felix Cheung
>Priority: Critical
>
> This JIRA lists tasks for the next Spark release's QA period for SparkR.
> The list below gives an overview of what is involved, and the corresponding 
> JIRA issues are linked below that.
> h2. API
> * Audit new public APIs (from the generated html doc)
> ** relative to Spark Scala/Java APIs
> ** relative to popular R libraries
> h2. Documentation and example code
> * For new algorithms, create JIRAs for updating the user guide sections & 
> examples
> * Update Programming Guide
> * Update website



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23117) SparkR 2.3 QA: Check for new R APIs requiring example code

2018-01-21 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16333718#comment-16333718
 ] 

Felix Cheung edited comment on SPARK-23117 at 1/21/18 10:47 PM:


I did a pass; I think these could use an example, preferably a more detailed one

SPARK-20307

SPARK-21381

 

Others:

Constrained Logistic Regression - SPARK-20906 - should go to ML guide

stringIndexerOrderType - SPARK-20619 SPARK-14659 SPARK-20899 - should have 
RFormula in ML guide


was (Author: felixcheung):
I did a pass, I think these could use an example, preferably a bit more detail 
one

SPARK-20307

SPARK-21381

 

Others:

Constrained Logistic Regression - SPARK-20906 - should go to ML guide

stringIndexerOrderType - SPARK-20899 - should have RFormula in ML guide

> SparkR 2.3 QA: Check for new R APIs requiring example code
> --
>
> Key: SPARK-23117
> URL: https://issues.apache.org/jira/browse/SPARK-23117
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Priority: Major
>
> Audit list of new features added to MLlib's R API, and see which major items 
> are missing example code (in the examples folder).  We do not need examples 
> for everything, only for major items such as new algorithms.
> For any such items:
> * Create a JIRA for that feature, and assign it to the author of the feature 
> (or yourself if interested).
> * Link it to (a) the original JIRA which introduced that feature ("related 
> to") and (b) to this JIRA ("requires").



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23116) SparkR 2.3 QA: Update user guide for new features & APIs

2018-01-21 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16333717#comment-16333717
 ] 

Felix Cheung commented on SPARK-23116:
--

I did a pass.

> SparkR 2.3 QA: Update user guide for new features & APIs
> 
>
> Key: SPARK-23116
> URL: https://issues.apache.org/jira/browse/SPARK-23116
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Priority: Critical
> Fix For: 2.3.0
>
>
> Check the user guide vs. a list of new APIs (classes, methods, data members) 
> to see what items require updates to the user guide.
> For each feature missing user guide doc:
> * Create a JIRA for that feature, and assign it to the author of the feature
> * Link it to (a) the original JIRA which introduced that feature ("related 
> to") and (b) to this JIRA ("requires").
> If you would like to work on this task, please comment, and we can create & 
> link JIRAs for parts of this work.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23116) SparkR 2.3 QA: Update user guide for new features & APIs

2018-01-21 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-23116.
--
   Resolution: Fixed
 Assignee: Felix Cheung
Fix Version/s: 2.3.0

> SparkR 2.3 QA: Update user guide for new features & APIs
> 
>
> Key: SPARK-23116
> URL: https://issues.apache.org/jira/browse/SPARK-23116
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Assignee: Felix Cheung
>Priority: Critical
> Fix For: 2.3.0
>
>
> Check the user guide vs. a list of new APIs (classes, methods, data members) 
> to see what items require updates to the user guide.
> For each feature missing user guide doc:
> * Create a JIRA for that feature, and assign it to the author of the feature
> * Link it to (a) the original JIRA which introduced that feature ("related 
> to") and (b) to this JIRA ("requires").
> If you would like to work on this task, please comment, and we can create & 
> link JIRAs for parts of this work.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23117) SparkR 2.3 QA: Check for new R APIs requiring example code

2018-01-21 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16333718#comment-16333718
 ] 

Felix Cheung commented on SPARK-23117:
--

I did a pass; I think these could use an example, preferably a more detailed one

SPARK-20307

SPARK-21381

 

Others:

Constrained Logistic Regression - SPARK-20906 - should go to ML guide

stringIndexerOrderType - SPARK-20899 - should have RFormula in ML guide

> SparkR 2.3 QA: Check for new R APIs requiring example code
> --
>
> Key: SPARK-23117
> URL: https://issues.apache.org/jira/browse/SPARK-23117
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Priority: Major
>
> Audit list of new features added to MLlib's R API, and see which major items 
> are missing example code (in the examples folder).  We do not need examples 
> for everything, only for major items such as new algorithms.
> For any such items:
> * Create a JIRA for that feature, and assign it to the author of the feature 
> (or yourself if interested).
> * Link it to (a) the original JIRA which introduced that feature ("related 
> to") and (b) to this JIRA ("requires").



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20307) SparkR: pass on setHandleInvalid to spark.mllib functions that use StringIndexer

2018-01-21 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16333716#comment-16333716
 ] 

Felix Cheung commented on SPARK-20307:
--

I think [~wm624] could, if you have the time

> SparkR: pass on setHandleInvalid to spark.mllib functions that use 
> StringIndexer
> 
>
> Key: SPARK-20307
> URL: https://issues.apache.org/jira/browse/SPARK-20307
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Anne Rutten
>Assignee: Miao Wang
>Priority: Minor
> Fix For: 2.3.0
>
>
> when training a model in SparkR with string variables (tested with 
> spark.randomForest, but i assume is valid for all spark.xx functions that 
> apply a StringIndexer under the hood), testing on a new dataset with factor 
> levels that are not in the training set will throw an "Unseen label" error. 
> I think this can be solved if there's a method to pass setHandleInvalid on to 
> the StringIndexers when calling spark.randomForest.
> code snippet:
> {code}
> # (i've run this in Zeppelin which already has SparkR and the context loaded)
> #library(SparkR)
> #sparkR.session(master = "local[*]") 
> data = data.frame(clicked = base::sample(c(0,1),100,replace=TRUE),
>   someString = base::sample(c("this", "that"), 
> 100, replace=TRUE), stringsAsFactors=FALSE)
> trainidxs = base::sample(nrow(data), nrow(data)*0.7)
> traindf = as.DataFrame(data[trainidxs,])
> testdf = as.DataFrame(rbind(data[-trainidxs,],c(0,"the other")))
> rf = spark.randomForest(traindf, clicked~., type="classification", 
> maxDepth=10, 
> maxBins=41,
> numTrees = 100)
> predictions = predict(rf, testdf)
> SparkR::collect(predictions)
> {code}
> stack trace:
> {quote}
> Error in handleErrors(returnStatus, conn): org.apache.spark.SparkException: 
> Job aborted due to stage failure: Task 0 in stage 607.0 failed 1 times, most 
> recent failure: Lost task 0.0 in stage 607.0 (TID 1581, localhost, executor 
> driver): org.apache.spark.SparkException: Failed to execute user defined 
> function($anonfun$4: (string) => double)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:99)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.SparkException: Unseen label: the other.
> at 
> org.apache.spark.ml.feature.StringIndexerModel$$anonfun$4.apply(StringIndexer.scala:170)
> at 
> org.apache.spark.ml.feature.StringIndexerModel$$anonfun$4.apply(StringIndexer.scala:166)
> ... 16 more
> Driver stacktrace:
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBu

[jira] [Comment Edited] (SPARK-20307) SparkR: pass on setHandleInvalid to spark.mllib functions that use StringIndexer

2018-01-21 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16333682#comment-16333682
 ] 

Felix Cheung edited comment on SPARK-20307 at 1/21/18 10:40 PM:


Hi Felix,
 I can do that, but I have had a family emergency recently, so it will not happen soon.
 Best
 Joseph

 


was (Author: monday0927!):
Hi Felix,
I can do that but I have a family emergency lately. It will not occur soon.
Best
Joseph

On 1/21/18, 2:45 PM, "Felix Cheung (JIRA)"  wrote:


[ 
https://issues.apache.org/jira/browse/SPARK-20307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16333663#comment-16333663
 ] 
    
Felix Cheung commented on SPARK-20307:
--

for SPARK-20307 and SPARK-21381, do you think you can write up example on 
how to use them and also a mention in the R programming guide?

> SparkR: pass on setHandleInvalid to spark.mllib functions that use 
StringIndexer
> 

>
> Key: SPARK-20307
> URL: https://issues.apache.org/jira/browse/SPARK-20307
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Anne Rutten
>Assignee: Miao Wang
>Priority: Minor
> Fix For: 2.3.0
>
>
> when training a model in SparkR with string variables (tested with 
spark.randomForest, but i assume is valid for all spark.xx functions that apply 
a StringIndexer under the hood), testing on a new dataset with factor levels 
that are not in the training set will throw an "Unseen label" error. 
> I think this can be solved if there's a method to pass setHandleInvalid 
on to the StringIndexers when calling spark.randomForest.
> code snippet:
> {code}
> # (i've run this in Zeppelin which already has SparkR and the context 
loaded)
> #library(SparkR)
> #sparkR.session(master = "local[*]") 
> data = data.frame(clicked = base::sample(c(0,1),100,replace=TRUE),
>   someString = base::sample(c("this", 
"that"), 100, replace=TRUE), stringsAsFactors=FALSE)
> trainidxs = base::sample(nrow(data), nrow(data)*0.7)
> traindf = as.DataFrame(data[trainidxs,])
> testdf = as.DataFrame(rbind(data[-trainidxs,],c(0,"the other")))
> rf = spark.randomForest(traindf, clicked~., type="classification", 
> maxDepth=10, 
> maxBins=41,
> numTrees = 100)
> predictions = predict(rf, testdf)
> SparkR::collect(predictions)
> {code}
> stack trace:
> {quote}
> Error in handleErrors(returnStatus, conn): 
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 607.0 failed 1 times, most recent failure: Lost task 0.0 in stage 607.0 
(TID 1581, localhost, executor driver): org.apache.spark.SparkException: Failed 
to execute user defined function($anonfun$4: (string) => double)
> at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
> at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
> at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
> at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
> at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
> at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
> at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:99)
> at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
> at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
&

[jira] [Commented] (SPARK-23107) ML, Graph 2.3 QA: API: New Scala APIs, docs

2018-01-21 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16333712#comment-16333712
 ] 

Felix Cheung commented on SPARK-23107:
--

We don't have doc on RFormula, but it would be a good idea to also allow for
documenting changes like SPARK-20619 and SPARK-20899 in a language-independent
way

> ML, Graph 2.3 QA: API: New Scala APIs, docs
> ---
>
> Key: SPARK-23107
> URL: https://issues.apache.org/jira/browse/SPARK-23107
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
>Priority: Blocker
>
> Audit new public Scala APIs added to MLlib & GraphX.  Take note of:
> * Protected/public classes or methods.  If access can be more private, then 
> it should be.
> * Also look for non-sealed traits.
> * Documentation: Missing?  Bad links or formatting?
> *Make sure to check the object doc!*
> As you find issues, please create JIRAs and link them to this issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20906) Constrained Logistic Regression for SparkR

2018-01-21 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16333654#comment-16333654
 ] 

Felix Cheung edited comment on SPARK-20906 at 1/21/18 10:30 PM:


[~wm624] would you like to add example of this in the API doc?

roxygen2 doc for spark.logit


was (Author: felixcheung):
[~wm624] would you like to add example of this in the API doc?

> Constrained Logistic Regression for SparkR
> --
>
> Key: SPARK-20906
> URL: https://issues.apache.org/jira/browse/SPARK-20906
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Miao Wang
>Assignee: Miao Wang
>Priority: Major
> Fix For: 2.3.0
>
>
> PR https://github.com/apache/spark/pull/17715 Added Constrained Logistic 
> Regression for ML. We should add it to SparkR.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23108) ML, Graph 2.3 QA: API: Experimental, DeveloperApi, final, sealed audit

2018-01-21 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16333708#comment-16333708
 ] 

Felix Cheung commented on SPARK-23108:
--

From reviewing R, it would be good to document constrained optimization for
logistic regression SPARK-20906 (R guide just links to ML guide, so we should
add doc there)

> ML, Graph 2.3 QA: API: Experimental, DeveloperApi, final, sealed audit
> --
>
> Key: SPARK-23108
> URL: https://issues.apache.org/jira/browse/SPARK-23108
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>Priority: Blocker
>
> We should make a pass through the items marked as Experimental or 
> DeveloperApi and see if any are stable enough to be unmarked.
> We should also check for items marked final or sealed to see if they are 
> stable enough to be opened up as APIs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23118) SparkR 2.3 QA: Programming guide, migration guide, vignettes updates

2018-01-21 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-23118.
--
   Resolution: Fixed
 Assignee: Felix Cheung
Fix Version/s: 2.3.0

> SparkR 2.3 QA: Programming guide, migration guide, vignettes updates
> 
>
> Key: SPARK-23118
> URL: https://issues.apache.org/jira/browse/SPARK-23118
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Assignee: Felix Cheung
>Priority: Critical
> Fix For: 2.3.0
>
>
> Before the release, we need to update the SparkR Programming Guide, its 
> migration guide, and the R vignettes. Updates will include:
>  * Add migration guide subsection.
>  ** Use the results of the QA audit JIRAs.
>  * Check phrasing, especially in main sections (for outdated items such as 
> "In this release, ...")
>  * Update R vignettes
> Note: This task is for large changes to the guides. New features are handled 
> in SPARK-23116.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23118) SparkR 2.3 QA: Programming guide, migration guide, vignettes updates

2018-01-21 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16333707#comment-16333707
 ] 

Felix Cheung commented on SPARK-23118:
--

for programming guide, perhaps 

SPARK-20906

But it mostly just links to API doc and ML programming guide. Will add a 
comment on ML programming guide instead.

> SparkR 2.3 QA: Programming guide, migration guide, vignettes updates
> 
>
> Key: SPARK-23118
> URL: https://issues.apache.org/jira/browse/SPARK-23118
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Priority: Critical
> Fix For: 2.3.0
>
>
> Before the release, we need to update the SparkR Programming Guide, its 
> migration guide, and the R vignettes. Updates will include:
>  * Add migration guide subsection.
>  ** Use the results of the QA audit JIRAs.
>  * Check phrasing, especially in main sections (for outdated items such as 
> "In this release, ...")
>  * Update R vignettes
> Note: This task is for large changes to the guides. New features are handled 
> in SPARK-23116.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23115) SparkR 2.3 QA: New R APIs and API docs

2018-01-21 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16333677#comment-16333677
 ] 

Felix Cheung commented on SPARK-23115:
--

Another pass, we should add API doc for

SPARK-20906

> SparkR 2.3 QA: New R APIs and API docs
> --
>
> Key: SPARK-23115
> URL: https://issues.apache.org/jira/browse/SPARK-23115
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Priority: Blocker
>
> Audit new public R APIs.  Take note of:
> * Correctness and uniformity of API
> * Documentation: Missing?  Bad links or formatting?
> ** Check both the generated docs linked from the user guide and the R command 
> line docs `?read.df`. These are generated using roxygen.
> As you find issues, please create JIRAs and link them to this issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20906) Constrained Logistic Regression for SparkR

2018-01-21 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16333654#comment-16333654
 ] 

Felix Cheung edited comment on SPARK-20906 at 1/21/18 8:54 PM:
---

[~wm624] would you like to add example of this in the API doc?


was (Author: felixcheung):
[~wm624] would you like to add example of this in the R vignettes?

> Constrained Logistic Regression for SparkR
> --
>
> Key: SPARK-20906
> URL: https://issues.apache.org/jira/browse/SPARK-20906
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Miao Wang
>Assignee: Miao Wang
>Priority: Major
> Fix For: 2.3.0
>
>
> PR https://github.com/apache/spark/pull/17715 Added Constrained Logistic 
> Regression for ML. We should add it to SparkR.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22208) Improve percentile_approx by not rounding up targetError and starting from index 0

2018-01-21 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-22208:
-
Labels: releasenotes  (was: )

> Improve percentile_approx by not rounding up targetError and starting from 
> index 0
> --
>
> Key: SPARK-22208
> URL: https://issues.apache.org/jira/browse/SPARK-22208
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Zhenhua Wang
>Assignee: Zhenhua Wang
>Priority: Major
>  Labels: releasenotes
> Fix For: 2.3.0
>
>
> percentile_approx never returns the first element when percentile is in 
> (relativeError, 1/N], where the default relativeError is 1/10000, and N is the 
> total number of elements. But ideally, percentiles in [0, 1/N] should all 
> return the first element as the answer.
> For example, given input data 1 to 10, if a user queries 10% (or even less) 
> percentile, it should return 1, because the first value 1 already reaches 
> 10%. Currently it returns 2.
> Based on the paper, targetError is not rounded up, and searching index should 
> start from 0 instead of 1. By following the paper, we should be able to fix 
> the cases mentioned above.
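
As a concrete illustration of the behaviour described above, a small SparkR sketch (the expected result reflects the 2.3.0 fix; before it, this query returned 2):

{code}
# sketch only; builds the 1..10 example from the description
df <- createDataFrame(data.frame(v = as.numeric(1:10)))
createOrReplaceTempView(df, "t")

# 10% of 10 values is already covered by the first element, so this should return 1
head(sql("SELECT percentile_approx(v, 0.1) FROM t"))
{code}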



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20307) SparkR: pass on setHandleInvalid to spark.mllib functions that use StringIndexer

2018-01-21 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16333663#comment-16333663
 ] 

Felix Cheung commented on SPARK-20307:
--

for SPARK-20307 and SPARK-21381, do you think you can write up example on how 
to use them and also a mention in the R programming guide?
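
For reference, a rough sketch of what such an example could look like (it assumes the handleInvalid argument added to spark.randomForest for 2.3.0 by this JIRA, with the usual StringIndexer values "error", "keep", "skip"):

{code}
# sketch only; mirrors the reproduction in the issue description below
data <- data.frame(clicked = base::sample(c(0, 1), 100, replace = TRUE),
                   someString = base::sample(c("this", "that"), 100, replace = TRUE),
                   stringsAsFactors = FALSE)
traindf <- as.DataFrame(data)
testdf  <- as.DataFrame(rbind(data, c(0, "the other")))   # unseen label "the other"

# handleInvalid = "keep" (or "skip") avoids the "Unseen label" failure at predict time
rf <- spark.randomForest(traindf, clicked ~ ., type = "classification",
                         numTrees = 20, handleInvalid = "keep")
head(predict(rf, testdf))
{code}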

> SparkR: pass on setHandleInvalid to spark.mllib functions that use 
> StringIndexer
> 
>
> Key: SPARK-20307
> URL: https://issues.apache.org/jira/browse/SPARK-20307
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Anne Rutten
>Assignee: Miao Wang
>Priority: Minor
> Fix For: 2.3.0
>
>
> when training a model in SparkR with string variables (tested with 
> spark.randomForest, but i assume is valid for all spark.xx functions that 
> apply a StringIndexer under the hood), testing on a new dataset with factor 
> levels that are not in the training set will throw an "Unseen label" error. 
> I think this can be solved if there's a method to pass setHandleInvalid on to 
> the StringIndexers when calling spark.randomForest.
> code snippet:
> {code}
> # (i've run this in Zeppelin which already has SparkR and the context loaded)
> #library(SparkR)
> #sparkR.session(master = "local[*]") 
> data = data.frame(clicked = base::sample(c(0,1),100,replace=TRUE),
>   someString = base::sample(c("this", "that"), 
> 100, replace=TRUE), stringsAsFactors=FALSE)
> trainidxs = base::sample(nrow(data), nrow(data)*0.7)
> traindf = as.DataFrame(data[trainidxs,])
> testdf = as.DataFrame(rbind(data[-trainidxs,],c(0,"the other")))
> rf = spark.randomForest(traindf, clicked~., type="classification", 
> maxDepth=10, 
> maxBins=41,
> numTrees = 100)
> predictions = predict(rf, testdf)
> SparkR::collect(predictions)
> {code}
> stack trace:
> {quote}
> Error in handleErrors(returnStatus, conn): org.apache.spark.SparkException: 
> Job aborted due to stage failure: Task 0 in stage 607.0 failed 1 times, most 
> recent failure: Lost task 0.0 in stage 607.0 (TID 1581, localhost, executor 
> driver): org.apache.spark.SparkException: Failed to execute user defined 
> function($anonfun$4: (string) => double)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:99)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.SparkException: Unseen label: the other.
> at 
> org.apache.spark.ml.feature.StringIndexerModel$$anonfun$4.apply(StringIndexer.scala:170)
> at 
> org.apache.spark.ml.feature.StringIndexerModel$$anonfun$4.apply(StringIndexer.scala:166)
> ... 16 more
> Driver stacktrace:
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
> at 
> scala.collection.mut

[jira] [Commented] (SPARK-22208) Improve percentile_approx by not rounding up targetError and starting from index 0

2018-01-21 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16333659#comment-16333659
 ] 

Felix Cheung commented on SPARK-22208:
--

Is this documented in the SQL programming guide/ migration guide?

[~ZenWzh]

[~smilegator]

 

> Improve percentile_approx by not rounding up targetError and starting from 
> index 0
> --
>
> Key: SPARK-22208
> URL: https://issues.apache.org/jira/browse/SPARK-22208
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Zhenhua Wang
>Assignee: Zhenhua Wang
>Priority: Major
> Fix For: 2.3.0
>
>
> percentile_approx never returns the first element when percentile is in 
> (relativeError, 1/N], where the default relativeError is 1/10000, and N is the 
> total number of elements. But ideally, percentiles in [0, 1/N] should all 
> return the first element as the answer.
> For example, given input data 1 to 10, if a user queries 10% (or even less) 
> percentile, it should return 1, because the first value 1 already reaches 
> 10%. Currently it returns 2.
> Based on the paper, targetError is not rounded up, and searching index should 
> start from 0 instead of 1. By following the paper, we should be able to fix 
> the cases mentioned above.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20906) Constrained Logistic Regression for SparkR

2018-01-21 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16333654#comment-16333654
 ] 

Felix Cheung commented on SPARK-20906:
--

[~wm624] would you like to add example of this in the R vignettes?
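
For what it's worth, a rough sketch of what a vignette example could look like (assuming the bound arguments exposed for 2.3.0 follow the Scala API names, e.g. lowerBoundsOnCoefficients):

{code}
# sketch only: binomial case with 4 features, so the bound matrix is 1 x 4
training <- createDataFrame(iris[iris$Species != "setosa", ])

lower <- matrix(0, nrow = 1, ncol = 4)     # constrain all coefficients to be >= 0
model <- spark.logit(training, Species ~ ., lowerBoundsOnCoefficients = lower)
summary(model)
{code}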

> Constrained Logistic Regression for SparkR
> --
>
> Key: SPARK-20906
> URL: https://issues.apache.org/jira/browse/SPARK-20906
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Miao Wang
>Assignee: Miao Wang
>Priority: Major
> Fix For: 2.3.0
>
>
> PR https://github.com/apache/spark/pull/17715 Added Constrained Logistic 
> Regression for ML. We should add it to SparkR.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


