GraphFrames 0.5.0 - critical bug fix + other improvements

2017-05-19 Thread Joseph Bradley
Hi Spark community,

I'd like to announce a new release of GraphFrames, a Spark Package for
DataFrame-based graphs!

*We strongly encourage all users to use this latest release for the bug fix
described below.*

*Critical bug fix*
This release fixes a bug in indexing vertices.  This may have affected your
results if:
* your graph uses non-Integer IDs, and
* you use ConnectedComponents or other algorithms that are wrappers around GraphX.
The bug occurs when the input DataFrame is non-deterministic. For example,
running an algorithm on a DataFrame just loaded from disk should be fine in
previous releases, but running that algorithm on a DataFrame produced via
shuffles, unions, or other operators could give incorrect results. This
issue is fixed in this release.
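
For illustration, here is a rough Scala sketch of the two input patterns
described above (paths, column names, and partition counts are made up; it
assumes the spark-shell's `spark` session and the usual GraphFrame(vertices,
edges) constructor):

```
import org.graphframes.GraphFrame

// Pattern 1: vertices and edges loaded straight from disk -- a deterministic
// input, which was fine even in previous releases.
val vertices = spark.read.parquet("/data/vertices")  // column: id (String, i.e. non-Integer IDs)
val edges    = spark.read.parquet("/data/edges")     // columns: src, dst

// Pattern 2: a vertex DataFrame produced by unions and shuffles -- a
// non-deterministic input, which is the kind of input that could have been
// affected by the bug in releases before 0.5.0.
val moreVertices     = spark.read.parquet("/data/more-vertices")
val shuffledVertices = vertices.union(moreVertices).repartition(200)

spark.sparkContext.setCheckpointDir("/tmp/graphframes-ckpt")  // checkpointing may be required by connectedComponents
val g = GraphFrame(shuffledVertices, edges)
val components = g.connectedComponents.run()  // vertices DataFrame with an added "component" column
```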

*New features*
* Python API for aggregateMessages for building custom graph algorithms
* Scala API for parallel personalized PageRank, wrapping the GraphX
implementation (see the sketch below). This is only available when using
GraphFrames with Spark 2.1+.
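
As a quick illustration, here is a minimal sketch of the new Scala wrapper.
It assumes the builder-style API GraphFrames uses for its other algorithms
(resetProbability, maxIter, sourceIds, run) on an existing GraphFrame `g`
with Long vertex IDs; the parameter values are just examples:

```
// Run personalized PageRank from several source vertices in parallel.
val ranks = g.parallelPersonalizedPageRank
  .resetProbability(0.15)             // reset (teleport) probability
  .maxIter(10)                        // run a fixed number of iterations
  .sourceIds(Array[Any](1L, 2L, 3L))  // personalization source vertices
  .run()

ranks.vertices.show()  // per-source PageRank scores attached to the vertices
ranks.edges.show()
```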

Support for Spark 1.6, 2.0, and 2.1

*Special thanks to Felix Cheung for his work as a new committer for
GraphFrames!*

*Full release notes*:
https://github.com/graphframes/graphframes/releases/tag/release-0.5.0
*Docs*: http://graphframes.github.io/
*Spark Package*: https://spark-packages.org/package/graphframes/graphframes
*Source*: https://github.com/graphframes/graphframes

Thanks to all contributors and to the community for feedback!
Joseph

-- 

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.



Re: [build system] jenkins got itself wedged...

2017-05-19 Thread shane knapp
last update of the week:

things are looking great...  we're GCing happily and staying well
within our memory limits.

i'm going to do one more restart after the two pull request builds
finish to re-enable backups, and call it a weekend.  :)

shane

On Fri, May 19, 2017 at 8:29 AM, shane knapp  wrote:
> this is hopefully my final email on the subject...   :)
>
> things seem to have settled down after my GC tuning, and system
> load/cpu usage/memory has been nice and flat all night.  i'll continue
> to keep an eye on things but it looks like we've weathered the worst
> part of the storm.
>
> On Thu, May 18, 2017 at 6:40 PM, shane knapp  wrote:
>> after needing another restart this afternoon, i did some homework and
>> aggressively twiddled some GC settings[1].  since then, things have
>> definitely smoothed out w/regards to memory and cpu usage spikes.
>>
>> i've attached a screenshot of slightly happier looking graphs.
>>
>> still keeping an eye on things, and hoping that i can go back to being
>> a lurker...  ;)
>>
>> shane
>>
>> 1 - https://jenkins.io/blog/2016/11/21/gc-tuning/
>>
>> On Thu, May 18, 2017 at 11:20 AM, shane knapp  wrote:
>>> ok, more updates:
>>>
>>> 1) i audited all of the builds, and found that the spark-*-compile-*
>>> and spark-*-test-* jobs were set to the identical cron time trigger,
>>> so josh rosen and i updated them to run at H/5 (instead of */5).  load
>>> balancing ftw.
>>>
>>> 2) the jenkins master is now running on java8, which has moar bettar
>>> GC management under the hood.
>>>
>>> i'll be keeping an eye on this today, and if we start seeing GC
>>> overhead failures, i'll start doing more GC performance tuning.
>>> thankfully, cloudbees has a relatively decent guide that i'll be
>>> following here:  https://jenkins.io/blog/2016/11/21/gc-tuning/
>>>
>>> shane
>>>
>>> On Thu, May 18, 2017 at 8:39 AM, shane knapp  wrote:
 yeah, i spoke too soon.  jenkins is still misbehaving, but FINALLY i'm
 getting some error messages in the logs...   looks like jenkins is
 thrashing on GC.

 now that i know what's up, i should be able to get this sorted today.

 On Thu, May 18, 2017 at 12:39 AM, Sean Owen  wrote:
> I'm not sure if it's related, but I still can't get Jenkins to test PRs. For
> example, triggering it through the spark-prs.appspot.com UI gives me...
>
> https://spark-prs.appspot.com/trigger-jenkins/18012
>
> Internal Server Error
>
> That might be from the appspot app though?
>
> But posting "Jenkins test this please" on PRs doesn't seem to work, and I
> can't reach Jenkins:
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/
>
> On Thu, May 18, 2017 at 12:44 AM shane knapp  wrote:
>>
>> after another couple of restarts due to high load and system
>> unresponsiveness, i finally found what is the most likely culprit:
>>
>> a typo in the jenkins config where the java heap size was configured.
>> instead of -Xmx16g, we had -Dmx16G...  which could easily explain the
>> random and non-deterministic system hangs we've had over the past
>> couple of years.
>>
>> anyways, it's been corrected and the master seems to be humming along,
>> for real this time, w/o issue.  i'll continue to keep an eye on this
>> for the rest of the week, but things are looking MUCH better now.
>>
>> sorry again for the interruptions in service.
>>
>> shane
>>
>> On Wed, May 17, 2017 at 9:59 AM, shane knapp  wrote:
>> > ok, we're back up, system load looks cromulent and we're happily
>> > building (again).
>> >
>> > shane
>> >
>> > On Wed, May 17, 2017 at 9:50 AM, shane knapp 
>> > wrote:
>> >> i'm going to need to perform a quick reboot on the jenkins master.  it
>> >> looks like it's hung again.
>> >>
>> >> sorry about this!
>> >>
>> >> shane
>> >>
>> >> On Tue, May 16, 2017 at 12:55 PM, shane knapp 
>> >> wrote:
>> >>> ...but just now i started getting alerts on system load, which was
>> >>> rather high.  i had to kick jenkins again, and will keep an eye on 
>> >>> the
>> >>> master and possible need to reboot.
>> >>>
>> >>> sorry about the interruption of service...
>> >>>
>> >>> shane
>> >>>
>> >>> On Tue, May 16, 2017 at 8:18 AM, shane knapp 
>> >>> wrote:
>>  ...so i kicked it and it's now back up and happily building.
>>
>>
>




Re: [VOTE] Apache Spark 2.2.0 (RC2)

2017-05-19 Thread Nick Pentreath
All the outstanding ML QA doc and user guide items are done for 2.2 so from
that side we should be good to cut another RC :)

On Thu, 18 May 2017 at 00:18 Russell Spitzer 
wrote:

> Seeing an issue with the DataSourceScanExec and some of our integration tests
> for the SCC. Running dataframe read and writes from the shell seems fine
> but the Redaction code seems to get a "None" when doing
> SparkSession.getActiveSession.get in our integration tests. I'm not sure
> why but I'll dig into this later if I get a chance.
>
> Example Failed Test
>
> https://github.com/datastax/spark-cassandra-connector/blob/v2.0.1/spark-cassandra-connector/src/it/scala/com/datastax/spark/connector/sql/CassandraSQLSpec.scala#L311
>
> ```[info]   org.apache.spark.SparkException: Job aborted due to stage
> failure: Task serialization failed: java.util.NoSuchElementException:
> None.get
> [info] java.util.NoSuchElementException: None.get
> [info] at scala.None$.get(Option.scala:347)
> [info] at scala.None$.get(Option.scala:345)
> [info] at org.apache.spark.sql.execution.DataSourceScanExec$class.org
> $apache$spark$sql$execution$DataSourceScanExec$$redact(DataSourceScanExec.scala:70)
> [info] at
> org.apache.spark.sql.execution.DataSourceScanExec$$anonfun$4.apply(DataSourceScanExec.scala:54)
> [info] at
> org.apache.spark.sql.execution.DataSourceScanExec$$anonfun$4.apply(DataSourceScanExec.scala:52)
> ```
>
> Again this only seems to repro in our IT suite, so I'm not sure if this is a
> real issue.
>
>
> On Tue, May 16, 2017 at 1:40 PM Joseph Bradley 
> wrote:
>
>> All of the ML/Graph/SparkR QA blocker JIRAs have been resolved.  Thanks
>> everyone who helped out on those!
>>
>> We still have open ML/Graph/SparkR JIRAs targeted at 2.2, but they are
>> essentially all for documentation.
>>
>> Joseph
>>
>> On Thu, May 11, 2017 at 3:08 PM, Marcelo Vanzin 
>> wrote:
>>
>>> Since you'll be creating a new RC, I'd wait until SPARK-20666 is
>>> fixed, since the change that caused it is in branch-2.2. Probably a
>>> good idea to raise it to blocker at this point.
>>>
>>> On Thu, May 11, 2017 at 2:59 PM, Michael Armbrust
>>>  wrote:
>>> > I'm going to -1 given the outstanding issues and lack of +1s.  I'll create
>>> > another RC once ML has had time to take care of the more critical problems.
>>> > In the meantime please keep testing this release!
>>> >
>>> > On Tue, May 9, 2017 at 2:00 AM, Kazuaki Ishizaki 
>>> > wrote:
>>> >>
>>> >> +1 (non-binding)
>>> >>
>>> >> I tested it on Ubuntu 16.04 and OpenJDK8 on ppc64le. All of the tests for
>>> >> core have passed.
>>> >>
>>> >> $ java -version
>>> >> openjdk version "1.8.0_111"
>>> >> OpenJDK Runtime Environment (build
>>> >> 1.8.0_111-8u111-b14-2ubuntu0.16.04.2-b14)
>>> >> OpenJDK 64-Bit Server VM (build 25.111-b14, mixed mode)
>>> >> $ build/mvn -DskipTests -Phive -Phive-thriftserver -Pyarn -Phadoop-2.7
>>> >> package install
>>> >> $ build/mvn -Phive -Phive-thriftserver -Pyarn -Phadoop-2.7 test -pl core
>>> >> ...
>>> >> Run completed in 15 minutes, 12 seconds.
>>> >> Total number of tests run: 1940
>>> >> Suites: completed 206, aborted 0
>>> >> Tests: succeeded 1940, failed 0, canceled 4, ignored 8, pending 0
>>> >> All tests passed.
>>> >> [INFO]
>>> >>
>>> 
>>> >> [INFO] BUILD SUCCESS
>>> >> [INFO]
>>> >>
>>> 
>>> >> [INFO] Total time: 16:51 min
>>> >> [INFO] Finished at: 2017-05-09T17:51:04+09:00
>>> >> [INFO] Final Memory: 53M/514M
>>> >> [INFO]
>>> >>
>>> 
>>> >> [WARNING] The requested profile "hive" could not be activated because it
>>> >> does not exist.
>>> >>
>>> >>
>>> >> Kazuaki Ishizaki,
>>> >>
>>> >>
>>> >>
>>> >> From: Michael Armbrust 
>>> >> To: "dev@spark.apache.org" 
>>> >> Date: 2017/05/05 02:08
>>> >> Subject: [VOTE] Apache Spark 2.2.0 (RC2)
>>> >> 
>>> >>
>>> >>
>>> >>
>>> >> Please vote on releasing the following candidate as Apache Spark version
>>> >> 2.2.0. The vote is open until Tues, May 9th, 2017 at 12:00 PST and passes
>>> >> if a majority of at least 3 +1 PMC votes are cast.
>>> >>
>>> >> [ ] +1 Release this package as Apache Spark 2.2.0
>>> >> [ ] -1 Do not release this package because ...
>>> >>
>>> >>
>>> >> To learn more about Apache Spark, please see http://spark.apache.org/
>>> >>
>>> >> The tag to be voted on is v2.2.0-rc2
>>> >> (1d4017b44d5e6ad156abeaae6371747f111dd1f9)
>>> >>
>>> >> List of JIRA tickets resolved can be found with this filter.
>>> >>
>>> >> The release files, including signatures, digests, etc. can be found at:
>>> >> http://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc2-bin/
>>> >>
>>> >> Release artifacts are signed with the following key:
>>> >> https://people.apache.org/keys/committer/pwendell.asc
>>> >>
>>> >> The staging repos

Re: [Spark SQL] ceil and floor functions on doubles

2017-05-19 Thread Anton Okolnychyi
Hi Dongjoon,

yeah, it seems to be the same. So, was it done on purpose to match the
behavior of Hive?

Best regards,
Anton

2017-05-19 16:39 GMT+02:00 Dong Joon Hyun :

> Hi, Anton.
>
>
>
> It’s the same result with Hive, isn’t it?
>
>
>
> hive> select 9.223372036854786E20, ceil(9.223372036854786E20);
>
> OK
>
> _c0  _c1
>
> 9.223372036854786E20 9223372036854775807
>
> Time taken: 2.041 seconds, Fetched: 1 row(s)
>
>
>
> Bests,
>
> Dongjoon.
>
>
>
> *From: *Anton Okolnychyi 
> *Date: *Friday, May 19, 2017 at 7:26 AM
> *To: *"dev@spark.apache.org" 
> *Subject: *[Spark SQL] ceil and floor functions on doubles
>
>
>
> Hi all,
>
>
>
> I am wondering why the results of ceil and floor functions on doubles are
> internally cast to longs. This causes loss of precision since doubles can
> hold bigger numbers.
>
>
>
> Consider the following example:
>
>
>
> // 9.223372036854786E20 is greater than Long.MaxValue
>
> val df = sc.parallelize(Array(("col", 9.223372036854786E20))).toDF()
>
> df.createOrReplaceTempView("tbl")
>
> spark.sql("select _2 AS original_value, ceil(_2) as ceil_result from
> tbl").show()
>
>
>
> +--------------------+-------------------+
> |      original_value|        ceil_result|
> +--------------------+-------------------+
> |9.223372036854786E20|9223372036854775807|
> +--------------------+-------------------+
>
>
>
> So, the original double value is rounded to 9223372036854775807, which is
> Long.MaxValue.
>
> I think that it would be better to return 9.223372036854786E20 as it was
> (and as it is actually returned by math.ceil before the cast to long). If
> it is a problem, then I can fix this.
>
>
>
> Best regards,
>
> Anton
>


Re: [Spark SQL] ceil and floor functions on doubles

2017-05-19 Thread Dong Joon Hyun
Hi, Anton.

It’s the same result with Hive, isn’t it?

hive> select 9.223372036854786E20, ceil(9.223372036854786E20);
OK
_c0  _c1
9.223372036854786E20 9223372036854775807
Time taken: 2.041 seconds, Fetched: 1 row(s)

Bests,
Dongjoon.

From: Anton Okolnychyi 
Date: Friday, May 19, 2017 at 7:26 AM
To: "dev@spark.apache.org" 
Subject: [Spark SQL] ceil and floor functions on doubles

Hi all,

I am wondering why the results of ceil and floor functions on doubles are 
internally cast to longs. This causes loss of precision since doubles can 
hold bigger numbers.

Consider the following example:

// 9.223372036854786E20 is greater than Long.MaxValue
val df = sc.parallelize(Array(("col", 9.223372036854786E20))).toDF()
df.createOrReplaceTempView("tbl")
spark.sql("select _2 AS original_value, ceil(_2) as ceil_result from 
tbl").show()

+--------------------+-------------------+
|      original_value|        ceil_result|
+--------------------+-------------------+
|9.223372036854786E20|9223372036854775807|
+--------------------+-------------------+

So, the original double value is rounded to 9223372036854775807, which is 
Long.MaxValue.
I think that it would be better to return 9.223372036854786E20 as it was (and 
as it is actually returned by math.ceil before the cast to long). If it is a 
problem, then I can fix this.

Best regards,
Anton


[Spark SQL] ceil and floor functions on doubles

2017-05-19 Thread Anton Okolnychyi
Hi all,

I am wondering why the results of ceil and floor functions on doubles are
internally cast to longs. This causes loss of precision since doubles can
hold bigger numbers.

Consider the following example:

// 9.223372036854786E20 is greater than Long.MaxValue
val df = sc.parallelize(Array(("col", 9.223372036854786E20))).toDF()
df.createOrReplaceTempView("tbl")
spark.sql("select _2 AS original_value, ceil(_2) as ceil_result from
tbl").show()

+--------------------+-------------------+
|      original_value|        ceil_result|
+--------------------+-------------------+
|9.223372036854786E20|9223372036854775807|
+--------------------+-------------------+

So, the original double value is rounded to 9223372036854775807, which is
Long.MaxValue.
I think that it would be better to return 9.223372036854786E20 as it was
(and as it is actually returned by math.ceil before the cast to long). If
it is a problem, then I can fix this.
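
For reference, a quick sketch in plain Scala of where the precision actually
goes -- math.ceil itself keeps the double, and it is the subsequent cast to
long that saturates:

```
val x = 9.223372036854786E20   // greater than Long.MaxValue (~9.22e18)

math.ceil(x)          // 9.223372036854786E20 -- still a Double, no precision lost here
math.ceil(x).toLong   // 9223372036854775807  -- the Double-to-Long conversion saturates at Long.MaxValue
```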

Best regards,
Anton