Re: Handling nulls in vector columns is non-trivial

2017-06-21 Thread Franklyn D'souza
From the documentation it states that `The input columns should be of
DoubleType or FloatType.`, so I don't think that is what I'm looking for.
Also, in general the API around vectors is sorely lacking, especially on
the PySpark side.

Very common vector operations like addition, subtraction, and dot products
can't be performed. I'm wondering what the direction is with vector support
in Spark.
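
To illustrate, here is a rough sketch of what we resort to today (assuming
two hypothetical vector columns v1 and v2 in a DataFrame df); everything
round-trips through Python rather than staying as column expressions:

from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType
from pyspark.ml.linalg import Vectors, VectorUDT

# element-wise addition via the numpy arrays backing the DenseVectors
vec_add = F.udf(
    lambda a, b: Vectors.dense((a.toArray() + b.toArray()).tolist()),
    VectorUDT())

# dot product, returned as a plain double
vec_dot = F.udf(lambda a, b: float(a.dot(b)), DoubleType())

df = df.withColumn("sum", vec_add("v1", "v2")) \
       .withColumn("dot", vec_dot("v1", "v2"))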

On Wed, Jun 21, 2017 at 9:19 PM, Maciej Szymkiewicz 
wrote:

> Since 2.2 there is Imputer:
>
> https://github.com/apache/spark/blob/branch-2.2/
> examples/src/main/python/ml/imputer_example.py
>
> which should at least partially address the problem.
>
> On 06/22/2017 03:03 AM, Franklyn D'souza wrote:
> > I just wanted to highlight some of the rough edges around using
> > vectors in DataFrame columns.
> >
> > If there is a null in a DataFrame column containing vectors, PySpark ML
> > models like logistic regression will completely fail.
> >
> > However, from what I've read there is no good way to fill in these
> > nulls with empty vectors.
> >
> > It's not possible to create a literal vector column expression and
> > coalesce it with the column from PySpark.
> >
> > So we're left with writing a Python UDF which does this coalesce; this
> > is really inefficient on large datasets and becomes a bottleneck for
> > ML pipelines working with real-world data.
> >
> > I'd like to know how other users are dealing with this and what plans
> > there are to extend vector support for dataframes.
> >
> > Thanks!
> >
> > Franklyn


Re: Handling nulls in vector columns is non-trivial

2017-06-21 Thread Maciej Szymkiewicz
Since 2.2 there is Imputer:

https://github.com/apache/spark/blob/branch-2.2/examples/src/main/python/ml/imputer_example.py

which should at least partially address the problem.
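
A minimal sketch of its use in PySpark (the column names here are made up,
and note that Imputer operates on numeric DoubleType/FloatType columns,
not on vector columns):

from pyspark.ml.feature import Imputer

# assuming a DataFrame df with DoubleType columns "a" and "b";
# the default strategy fills missing values with the column mean
imputer = Imputer(inputCols=["a", "b"], outputCols=["a_out", "b_out"])
model = imputer.fit(df)
imputed = model.transform(df)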

On 06/22/2017 03:03 AM, Franklyn D'souza wrote:
> I just wanted to highlight some of the rough edges around using
> vectors in DataFrame columns.
>
> If there is a null in a DataFrame column containing vectors, PySpark ML
> models like logistic regression will completely fail.
>
> However, from what I've read there is no good way to fill in these
> nulls with empty vectors.
>
> It's not possible to create a literal vector column expression and
> coalesce it with the column from PySpark.
>
> So we're left with writing a Python UDF which does this coalesce; this
> is really inefficient on large datasets and becomes a bottleneck for
> ML pipelines working with real-world data.
>
> I'd like to know how other users are dealing with this and what plans
> there are to extend vector support for dataframes.
>
> Thanks!
>
> Franklyn




Why does Spark SQL use custom spark.sql.execution.id local property not SparkContext.setJobGroup?

2017-06-21 Thread Jacek Laskowski
Hi,

Just noticed that Spark SQL uses the spark.sql.execution.id local property
(via SQLExecution.withNewExecutionId [1]) to group Spark jobs logically
together, while Structured Streaming uses SparkContext.setJobGroup [2] to
do the same.

I think Structured Streaming is more correct, as it uses the mechanism
Spark Core introduced and already surfaces in the web UI (without
introducing a custom solution).

Why does Spark SQL introduce a custom solution based on the
spark.sql.execution.id local property? What's wrong with
SparkContext.setJobGroup?

[1] 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SQLExecution.scala#L63
[2] 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala#L265
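
For reference, a minimal PySpark sketch of the Spark Core mechanism (the
group id and description below are arbitrary):

# sc is an existing SparkContext
sc.setJobGroup("my-group", "all jobs for one logical operation")
sc.parallelize(range(100)).count()  # shows up under "my-group" in the web UI
sc.cancelJobGroup("my-group")       # the group id also enables cancellation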

Regards,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski




Handling nulls in vector columns is non-trivial

2017-06-21 Thread Franklyn D'souza
I just wanted to highlight some of the rough edges around using vectors
in DataFrame columns.

If there is a null in a DataFrame column containing vectors, PySpark ML
models like logistic regression will completely fail.

However, from what I've read there is no good way to fill in these nulls
with empty vectors.

It's not possible to create a literal vector column expression and
coalesce it with the column from PySpark.

So we're left with writing a Python UDF which does this coalesce; this is
really inefficient on large datasets and becomes a bottleneck for ML
pipelines working with real-world data.
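
To make that concrete, here is roughly what such a UDF looks like (the
vector size of 3 and the column name "features" are made up for
illustration):

from pyspark.sql import functions as F
from pyspark.ml.linalg import Vectors, VectorUDT

# replace nulls with an all-zero vector of a known, fixed size
fill_null_vec = F.udf(
    lambda v: v if v is not None else Vectors.dense([0.0] * 3),
    VectorUDT())

df = df.withColumn("features", fill_null_vec(F.col("features")))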

I'd like to know how other users are dealing with this and what plans there
are to extend vector support for dataframes.

Thanks!

Franklyn


Re: [build system] when it rains... berkeley lost power. again. use new url to visit jenkins

2017-06-21 Thread shane knapp
ok, amplab.cs.berkeley.edu is back up and you can reach jenkins.

On Wed, Jun 21, 2017 at 4:18 PM, shane knapp  wrote:
> a lot of berkeley cs infrastructure we depend on is still down.  no
> ETA as to when they'll be up.
>
> On Wed, Jun 21, 2017 at 3:43 PM, shane knapp  wrote:
>> a construction crew working outside hit an underground power line, and
>> power has just been restored.  our servers are coming back up, and
>> access to jenkins should be restored shortly.
>>
>> On Wed, Jun 21, 2017 at 2:14 PM, shane knapp  wrote:
>>> ...it pours.
>>>
>>> we lost power in our building, including the machine room where
>>> amplab.cs.berkeley.edu lives.  jenkins is still up and you can visit
>>> the site by ignoring the reverse proxy:
>>> https://hadrian.ist.berkeley.edu/jenkins/
>>>
>>> the bad news is that pull request builds won't run.  ETA on power
>>> restoration is probably not until tonight.  i'll post more details as
>>> i get them.
>>>
>>> :\
>>>
>>> shane




Re: [build system] when it rains... berkeley lost power. again. use new url to visit jenkins

2017-06-21 Thread shane knapp
a lot of berkeley cs infrastructure we depend on is still down.  no
ETA as to when they'll be up.

On Wed, Jun 21, 2017 at 3:43 PM, shane knapp  wrote:
> a construction crew working outside hit an underground power line, and
> power has just been restored.  our servers are coming back up, and
> access to jenkins should be restored shortly.
>
> On Wed, Jun 21, 2017 at 2:14 PM, shane knapp  wrote:
>> ...it pours.
>>
>> we lost power in our building, including the machine room where
>> amplab.cs.berkeley.edu lives.  jenkins is still up and you can visit
>> the site by ignoring the reverse proxy:
>> https://hadrian.ist.berkeley.edu/jenkins/
>>
>> the bad news is that pull request builds won't run.  ETA on power
>> restoration is probably not until tonight.  i'll post more details as
>> i get them.
>>
>> :\
>>
>> shane




Re: [build system] when it rains... berkeley lost power. again. use new url to visit jenkins

2017-06-21 Thread shane knapp
a construction crew working outside hit an underground power line, and
power has just been restored.  our servers are coming back up, and
access to jenkins should be restored shortly.

On Wed, Jun 21, 2017 at 2:14 PM, shane knapp  wrote:
> ...it pours.
>
> we lost power in our building, including the machine room where
> amplab.cs.berkeley.edu lives.  jenkins is still up and you can visit
> the site by ignoring the reverse proxy:
> https://hadrian.ist.berkeley.edu/jenkins/
>
> the bad news is that pull request builds won't run.  ETA on power
> restoration is probably not until tonight.  i'll post more details as
> i get them.
>
> :\
>
> shane




[build system] when it rains... berkeley lost power. again. use new url to visit jenkins

2017-06-21 Thread shane knapp
...it pours.

we lost power in our building, including the machine room where
amplab.cs.berkeley.edu lives.  jenkins is still up and you can visit
the site by ignoring the reverse proxy:
https://hadrian.ist.berkeley.edu/jenkins/

the bad news is that pull request builds won't run.  ETA on power
restoration is probably not until tonight.  i'll post more details as
i get them.

:\

shane




Re: [VOTE] Apache Spark 2.2.0 (RC5)

2017-06-21 Thread Imran Rashid
-1

I'm sorry for discovering this so late, but I just filed
https://issues.apache.org/jira/browse/SPARK-21165, which I think should be
a blocker; it's a regression from 2.1.

On Wed, Jun 21, 2017 at 1:43 PM, Nick Pentreath 
wrote:

> As before, the release looks good; all Scala and Python tests pass. R tests
> fail with the same issue as in SPARK-21093, but it's not a blocker.
>
> +1 (binding)
>
>
> On Wed, 21 Jun 2017 at 01:49 Michael Armbrust 
> wrote:
>
>> I will kick off the voting with a +1.
>>
>> On Tue, Jun 20, 2017 at 4:49 PM, Michael Armbrust wrote:
>>
>>> Please vote on releasing the following candidate as Apache Spark
>>> version 2.2.0. The vote is open until Friday, June 23rd, 2017 at 18:00
>>> PST and passes if a majority of at least 3 +1 PMC votes are cast.
>>>
>>> [ ] +1 Release this package as Apache Spark 2.2.0
>>> [ ] -1 Do not release this package because ...
>>>
>>>
>>> To learn more about Apache Spark, please see https://spark.apache.org/
>>>
>>> The tag to be voted on is v2.2.0-rc5
>>> (62e442e73a2fa663892d2edaff5f7d72d7f402ed)
>>>
>>> List of JIRA tickets resolved can be found with this filter.
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> https://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc5-bin/
>>>
>>> Release artifacts are signed with the following key:
>>> https://people.apache.org/keys/committer/pwendell.asc
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1243/
>>>
>>> The documentation corresponding to this release can be found at:
>>> https://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc5-docs/
>>>
>>>
>>> *FAQ*
>>>
>>> *How can I help test this release?*
>>>
>>> If you are a Spark user, you can help us test this release by taking an
>>> existing Spark workload and running on this release candidate, then
>>> reporting any regressions.
>>>
>>> *What should happen to JIRA tickets still targeting 2.2.0?*
>>>
>>> Committers should look at those and triage. Extremely important bug
>>> fixes, documentation, and API tweaks that impact compatibility should be
>>> worked on immediately. Everything else please retarget to 2.3.0 or 2.2.1.
>>>
>>> *But my bug isn't fixed!??!*
>>>
>>> In order to make timely releases, we will typically not hold the release
>>> unless the bug in question is a regression from 2.1.1.
>>>
>>
>>


Re: [VOTE] Apache Spark 2.2.0 (RC5)

2017-06-21 Thread Nick Pentreath
As before, the release looks good; all Scala and Python tests pass. R tests
fail with the same issue as in SPARK-21093, but it's not a blocker.

+1 (binding)


On Wed, 21 Jun 2017 at 01:49 Michael Armbrust 
wrote:

> I will kick off the voting with a +1.
>
> On Tue, Jun 20, 2017 at 4:49 PM, Michael Armbrust 
> wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 2.2.0. The vote is open until Friday, June 23rd, 2017 at 18:00 PST and
>> passes if a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 2.2.0
>> [ ] -1 Do not release this package because ...
>>
>>
>> To learn more about Apache Spark, please see https://spark.apache.org/
>>
>> The tag to be voted on is v2.2.0-rc5
>> (62e442e73a2fa663892d2edaff5f7d72d7f402ed)
>>
>> List of JIRA tickets resolved can be found with this filter.
>>
>> The release files, including signatures, digests, etc. can be found at:
>> https://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc5-bin/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1243/
>>
>> The documentation corresponding to this release can be found at:
>> https://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc5-docs/
>>
>>
>> *FAQ*
>>
>> *How can I help test this release?*
>>
>> If you are a Spark user, you can help us test this release by taking an
>> existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> *What should happen to JIRA tickets still targeting 2.2.0?*
>>
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should be
>> worked on immediately. Everything else please retarget to 2.3.0 or 2.2.1.
>>
>> *But my bug isn't fixed!??!*
>>
>> In order to make timely releases, we will typically not hold the release
>> unless the bug in question is a regression from 2.1.1.
>>
>
>


Re: [VOTE] Apache Spark 2.2.0 (RC5)

2017-06-21 Thread Sean Owen
+1

Sigs/hashes look good. Tests pass on Java 8 / Ubuntu 17 with -Pyarn -Phive
-Phadoop-2.7 for me.

The only open issues for 2.2.0 are:

SPARK-21144 Unexpected results when the data schema and partition schema
have the duplicate columns
SPARK-18267 Distribute PySpark via Python Package Index (pypi)

The first one was created recently, and isn't marked as terribly important,
so should it be untargeted for 2.2? Not sure what the pypi status is, so I
think that might be un-targetable too.

On Wed, Jun 21, 2017 at 12:49 AM Michael Armbrust 
wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 2.2.0. The vote is open until Friday, June 23rd, 2017 at 18:00 PST and
> passes if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 2.2.0
> [ ] -1 Do not release this package because ...
>
>
> To learn more about Apache Spark, please see https://spark.apache.org/
>
> The tag to be voted on is v2.2.0-rc5
> (62e442e73a2fa663892d2edaff5f7d72d7f402ed)
>
> List of JIRA tickets resolved can be found with this filter.
>
> The release files, including signatures, digests, etc. can be found at:
> https://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc5-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1243/
>
> The documentation corresponding to this release can be found at:
> https://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc5-docs/
>
>
> *FAQ*
>
> *How can I help test this release?*
>
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> *What should happen to JIRA tickets still targeting 2.2.0?*
>
> Committers should look at those and triage. Extremely important bug fixes,
> documentation, and API tweaks that impact compatibility should be worked on
> immediately. Everything else please retarget to 2.3.0 or 2.2.1.
>
> *But my bug isn't fixed!??!*
>
> In order to make timely releases, we will typically not hold the release
> unless the bug in question is a regression from 2.1.1.
>


[build system] patching post-mortem: back to normal!

2017-06-21 Thread shane knapp
all systems were updated fully, as it had been over a year since i'd
last done it.  risky, i know but...

things that went right:
* a lot of vulnerabilities in the systems were patched.  short list:
  - CVE-2017-1000364 (stack guard)
  - CVE-2017-1000363 (stack overflow)
  - CVE-2017-1000366 (gnu C libs)
  - CVE-2017-1000369 (exim, stack overflow)
  - CVE-2017-1000367 (sudo)

* applying the updates for the workers was easy, and all rebooted w/o issue

* this should hopefully be the last time i update these centos boxes,
as the ubuntu staging workers are much more solid and easier to deal
with (as well as being completely ansible-ized)

things that went wrong:
* update to system pypy package overwrote the symlink /usr/bin/pypy
and changed it to point back to pypy-2.0.2.  i had to delete the
symlink and create a new one pointing at
/usr/lib64/pypy-2.5.1/bin/pypy

* all of the R-3.1.1 packages i installed manually via yum were
updated, causing the PRB to hang.  after uninstalling the updated
RPMs, reinstalling the original ones and rebuilding the CRAN packages,
PRB builds went green

things are looking good right now, but please don't hesitate to ping
me here (or on github:  @shaneknapp) if something looks amiss.

thanks again, and sorry about the inconvenience!

shane




Re: [VOTE] Apache Spark 2.2.0 (RC4)

2017-06-21 Thread Michael Armbrust
This vote fails.  Please test RC5.

On Jun 21, 2017 6:50 AM, "Nick Pentreath"  wrote:

> Thanks, I added the details of my environment to the JIRA (for what it's
> worth now, as the issue is identified)
>
> On Wed, 14 Jun 2017 at 11:28 Hyukjin Kwon  wrote:
>
>> Actually, I opened - https://issues.apache.org/jira/browse/SPARK-21093.
>>
>> 2017-06-14 17:08 GMT+09:00 Hyukjin Kwon :
>>
>>> For a shorter reproducer ...
>>>
>>>
>>> df <- createDataFrame(list(list(1L, 1, "1", 0.1)), c("a", "b", "c", "d"))
>>> collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>>>
>>> And running the below multiple times (5~7):
>>>
>>> collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>>>
>>> occasionally throws an error.
>>>
>>>
>>> I will leave it here and can explain in more detail if a JIRA is
>>> opened. This does not look like a regression anyway.
>>>
>>>
>>>
>>> 2017-06-14 16:22 GMT+09:00 Hyukjin Kwon :
>>>

 Per https://github.com/apache/spark/tree/v2.1.1,

 1. CentOS 7.2.1511 / R 3.3.3 - this test hangs.

 I messed it up a bit while downgrading R to 3.3.3 (it was an actual
 machine, not a VM), so it took me a while to retry this.
 I rebuilt it and checked that the R version is 3.3.3, at least. I hope
 this one can be double-checked.

 Here is the self-reproducer:

 irisDF <- suppressWarnings(createDataFrame (iris))
 schema <-  structType(structField("Sepal_Length", "double"),
 structField("Avg", "double"))
 df4 <- gapply(
   cols = "Sepal_Length",
   irisDF,
   function(key, x) {
 y <- data.frame(key, mean(x$Sepal_Width), stringsAsFactors = FALSE)
   },
   schema)
 collect(df4)



 2017-06-14 16:07 GMT+09:00 Felix Cheung :

> Thanks! Will try to set up RHEL/CentOS to test it out
>
> _
> From: Nick Pentreath 
> Sent: Tuesday, June 13, 2017 11:38 PM
> Subject: Re: [VOTE] Apache Spark 2.2.0 (RC4)
> To: Felix Cheung , Hyukjin Kwon <
> gurwls...@gmail.com>, dev 
>
> Cc: Sean Owen 
>
>
> Hi, yeah, sorry for the slow response - I was on RHEL and OpenJDK but
> will have to report back later with the versions, as I am AFK.
>
> R version I'm not totally sure about, but again will revert ASAP
> On Wed, 14 Jun 2017 at 05:09, Felix Cheung 
> wrote:
>
>> Thanks
>> This was with an external package and unrelated
>>
>>   >> macOS Sierra 10.12.3 / R 3.2.3 - passed with a warning (
>> https://gist.github.com/HyukjinKwon/85cbcfb245825852df20ed6a9ecfd845)
>>
>> As for CentOS - would it be possible to test against R older than
>> 3.4.0? This is the same error reported by Nick below.
>>
>> _
>> From: Hyukjin Kwon 
>> Sent: Tuesday, June 13, 2017 8:02 PM
>>
>> Subject: Re: [VOTE] Apache Spark 2.2.0 (RC4)
>> To: dev 
>> Cc: Sean Owen , Nick Pentreath <
>> nick.pentre...@gmail.com>, Felix Cheung 
>>
>>
>>
>> For the test failure on R, I checked:
>>
>>
>> Per https://github.com/apache/spark/tree/v2.2.0-rc4,
>>
>> 1. Windows Server 2012 R2 / R 3.3.1 - passed (
>> https://ci.appveyor.com/project/spark-test/spark/build/755-r-test-v2.2.0-rc4)
>> 2. macOS Sierra 10.12.3 / R 3.4.0 - passed
>> 3. macOS Sierra 10.12.3 / R 3.2.3 - passed with a warning (
>> https://gist.github.com/HyukjinKwon/85cbcfb245825852df20ed6a9ecfd845)
>> 4. CentOS 7.2.1511 / R 3.4.0 - reproduced (
>> https://gist.github.com/HyukjinKwon/2a736b9f80318618cc147ac2bb1a987d)
>>
>>
>> Per https://github.com/apache/spark/tree/v2.1.1,
>>
>> 1. CentOS 7.2.1511 / R 3.4.0 - reproduced (
>> https://gist.github.com/HyukjinKwon/6064b0d10bab8fc1dc6212452d83b301)
>>
>>
>> This appears to fail only on CentOS 7.2.1511 / R 3.4.0, given my
>> tests and observations.
>>
>> It also fails in Spark 2.1.1, so it sounds like it is not a regression,
>> although it is a bug that should be fixed (whether in Spark or R).
>>
>>
>> 2017-06-14 8:28 GMT+09:00 Xiao Li :
>>
>>> -1
>>>
>>> Spark 2.2 is unable to read partitioned tables created by Spark
>>> 2.1 or earlier.
>>>
>>> Opened a JIRA https://issues.apache.org/jira/browse/SPARK-21085
>>>
>>> Will fix it soon.
>>>
>>> Thanks,
>>>
>>> Xiao Li
>>>
>>>
>>>
>>> 2017-06-13 9:39 GMT-07:00 Joseph Bradley :
>>>
 Re: the QA JIRAs:
 Thanks for discussing them.  I still feel they are very helpful; I
 particularly notice not having to spend a solid 2-3 weeks of time QAing
 (unlike in earlier Spark releases).  One other point not mentioned above:
 I think they serve as a very helpful reminder/training for the community
 for rigor in development.  Since we instituted QA JIRAs, contributors have
 been a lot better about adding in docs early, rather than waiting until
 the end of the cycle (though I know this is drawing conclusions from
 correlations).

Re: [VOTE] Apache Spark 2.2.0 (RC4)

2017-06-21 Thread Nick Pentreath
Thanks, I added the details of my environment to the JIRA (for what it's
worth now, as the issue is identified)

On Wed, 14 Jun 2017 at 11:28 Hyukjin Kwon  wrote:

> Actually, I opened - https://issues.apache.org/jira/browse/SPARK-21093.
>
> 2017-06-14 17:08 GMT+09:00 Hyukjin Kwon :
>
>> For a shorter reproducer ...
>>
>>
>> df <- createDataFrame(list(list(1L, 1, "1", 0.1)), c("a", "b", "c", "d"))
>> collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>>
>> And running the below multiple times (5~7):
>>
>> collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>>
>> occasionally throws an error.
>>
>>
>> I will leave it here and can explain in more detail if a JIRA is
>> opened. This does not look like a regression anyway.
>>
>>
>>
>> 2017-06-14 16:22 GMT+09:00 Hyukjin Kwon :
>>
>>>
>>> Per https://github.com/apache/spark/tree/v2.1.1,
>>>
>>> 1. CentOS 7.2.1511 / R 3.3.3 - this test hangs.
>>>
>>> I messed it up a bit while downgrading R to 3.3.3 (it was an actual
>>> machine, not a VM), so it took me a while to retry this.
>>> I rebuilt it and checked that the R version is 3.3.3, at least. I hope
>>> this one can be double-checked.
>>>
>>> Here is the self-reproducer:
>>>
>>> irisDF <- suppressWarnings(createDataFrame (iris))
>>> schema <-  structType(structField("Sepal_Length", "double"),
>>> structField("Avg", "double"))
>>> df4 <- gapply(
>>>   cols = "Sepal_Length",
>>>   irisDF,
>>>   function(key, x) {
>>> y <- data.frame(key, mean(x$Sepal_Width), stringsAsFactors = FALSE)
>>>   },
>>>   schema)
>>> collect(df4)
>>>
>>>
>>>
>>> 2017-06-14 16:07 GMT+09:00 Felix Cheung :
>>>
 Thanks! Will try to set up RHEL/CentOS to test it out

 _
 From: Nick Pentreath 
 Sent: Tuesday, June 13, 2017 11:38 PM
 Subject: Re: [VOTE] Apache Spark 2.2.0 (RC4)
 To: Felix Cheung , Hyukjin Kwon <
 gurwls...@gmail.com>, dev 

 Cc: Sean Owen 


 Hi, yeah, sorry for the slow response - I was on RHEL and OpenJDK but
 will have to report back later with the versions, as I am AFK.

 R version I'm not totally sure about, but again will revert ASAP
 On Wed, 14 Jun 2017 at 05:09, Felix Cheung 
 wrote:

> Thanks
> This was with an external package and unrelated
>
>   >> macOS Sierra 10.12.3 / R 3.2.3 - passed with a warning (
> https://gist.github.com/HyukjinKwon/85cbcfb245825852df20ed6a9ecfd845)
>
> As for CentOS - would it be possible to test against R older than
> 3.4.0? This is the same error reported by Nick below.
>
> _
> From: Hyukjin Kwon 
> Sent: Tuesday, June 13, 2017 8:02 PM
>
> Subject: Re: [VOTE] Apache Spark 2.2.0 (RC4)
> To: dev 
> Cc: Sean Owen , Nick Pentreath <
> nick.pentre...@gmail.com>, Felix Cheung 
>
>
>
> For the test failure on R, I checked:
>
>
> Per https://github.com/apache/spark/tree/v2.2.0-rc4,
>
> 1. Windows Server 2012 R2 / R 3.3.1 - passed (
> https://ci.appveyor.com/project/spark-test/spark/build/755-r-test-v2.2.0-rc4)
> 2. macOS Sierra 10.12.3 / R 3.4.0 - passed
> 3. macOS Sierra 10.12.3 / R 3.2.3 - passed with a warning (
> https://gist.github.com/HyukjinKwon/85cbcfb245825852df20ed6a9ecfd845)
> 4. CentOS 7.2.1511 / R 3.4.0 - reproduced (
> https://gist.github.com/HyukjinKwon/2a736b9f80318618cc147ac2bb1a987d)
>
>
> Per https://github.com/apache/spark/tree/v2.1.1,
>
> 1. CentOS 7.2.1511 / R 3.4.0 - reproduced (
> https://gist.github.com/HyukjinKwon/6064b0d10bab8fc1dc6212452d83b301)
>
>
> This appears to fail only on CentOS 7.2.1511 / R 3.4.0, given my
> tests and observations.
>
> It also fails in Spark 2.1.1, so it sounds like it is not a regression,
> although it is a bug that should be fixed (whether in Spark or R).
>
>
> 2017-06-14 8:28 GMT+09:00 Xiao Li :
>
>> -1
>>
>> Spark 2.2 is unable to read partitioned tables created by Spark
>> 2.1 or earlier.
>>
>> Opened a JIRA https://issues.apache.org/jira/browse/SPARK-21085
>>
>> Will fix it soon.
>>
>> Thanks,
>>
>> Xiao Li
>>
>>
>>
>> 2017-06-13 9:39 GMT-07:00 Joseph Bradley :
>>
>>> Re: the QA JIRAs:
>>> Thanks for discussing them.  I still feel they are very helpful; I
>>> particularly notice not having to spend a solid 2-3 weeks of time QAing
>>> (unlike in earlier Spark releases).  One other point not mentioned above:
>>> I think they serve as a very helpful reminder/training for the community
>>> for rigor in development.  Since we instituted QA JIRAs, contributors
>>> have been a lot better about adding in docs early, rather than waiting
>>> until the end of the cycle (though I know this is drawing conclusions
>>> from correlations).
>>>
>>> I would vote in favor of th