[ANNOUNCE] CFP open for ApacheCon North America 2016

2015-11-25 Thread Rich Bowen
Community growth starts by talking with those interested in your
project. ApacheCon North America is coming, are you?

We are delighted to announce that the Call For Presentations (CFP) is
now open for ApacheCon North America. You can submit your proposed
sessions at
http://events.linuxfoundation.org/events/apache-big-data-north-america/program/cfp
for big data talks and
http://events.linuxfoundation.org/events/apachecon-north-america/program/cfp
for all other topics.

ApacheCon North America will be held in Vancouver, Canada, May 9-13th
2016. ApacheCon has been running every year since 2000, and is the place
to build your project communities.

While we will consider individual talks, we prefer to see related
sessions that are likely to draw users and community members. When
submitting your talk, work with your project community and with related
communities to come up with a full program that will walk attendees
through the basics and on into mastery of your project in example use
cases. Content that introduces what's new in your latest release is also
of particular interest, especially when it builds upon existing
well-known application models. The goal should be to showcase your
project in ways that will attract participants and encourage engagement
in your community. Please remember to involve your whole project
community (user and dev lists) when building content. This is your
chance to create a project-specific event within the broader ApacheCon
conference.

Content at ApacheCon North America will be cross-promoted as
mini-conferences, such as ApacheCon Big Data and ApacheCon Mobile, so be
sure to indicate which larger category your proposed sessions fit into.

Finally, please plan to attend ApacheCon, even if you're not proposing a
talk. The biggest value of the event is community building, and we count
on you to make it a place where your project community is likely to
congregate, not just for the technical content in sessions, but for
hackathons, project summits, and good old fashioned face-to-face networking.

-- 
rbo...@apache.org
http://apache.org/




Re: VerifyError running Spark SQL code?

2015-11-25 Thread Josh Rosen
I think I've seen this issue as well, but in a different suite. I
wasn't able to easily get to the bottom of it, though. What JDK / JRE are
you using? I'm on


Java(TM) SE Runtime Environment (build 1.7.0_65-b17)
Java HotSpot(TM) 64-Bit Server VM (build 24.65-b04, mixed mode)

on OSX.

On Wed, Nov 25, 2015 at 4:51 PM, Marcelo Vanzin  wrote:

> I've been running into this error when running Spark SQL recently;
> nothing I try (a completely clean build or anything else) seems to fix
> it. Does anyone have any idea what's wrong?
>
> [info] Exception encountered when attempting to run a suite with class
> name: org.apache.spark.sql.execution.ui.SQLListenerMemoryLeakSuite ***
> ABORTED *** (4 seconds, 111 milliseconds)
> [info]   java.lang.VerifyError: Bad <init> method call from inside of a
> branch
> [info] Exception Details:
> [info]   Location:
> [info]
> org/apache/spark/sql/catalyst/expressions/aggregate/HyperLogLogPlusPlus.<init>(Lorg/apache/spark/sql/catalyst/expressions/Expression;Lorg/apache/spark/sql/catalyst/expressions/Expression;)V
> @82: invokespecial
>
> The same happens with the Spark shell (when instantiating SQLContext),
> so it's not an issue with the test code...
>
> --
> Marcelo
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


VerifyError running Spark SQL code?

2015-11-25 Thread Marcelo Vanzin
I've been running into this error when running Spark SQL recently;
nothing I try (a completely clean build or anything else) seems to fix
it. Does anyone have any idea what's wrong?

[info] Exception encountered when attempting to run a suite with class
name: org.apache.spark.sql.execution.ui.SQLListenerMemoryLeakSuite ***
ABORTED *** (4 seconds, 111 milliseconds)
[info]   java.lang.VerifyError: Bad <init> method call from inside of a branch
[info] Exception Details:
[info]   Location:
[info]
org/apache/spark/sql/catalyst/expressions/aggregate/HyperLogLogPlusPlus.<init>(Lorg/apache/spark/sql/catalyst/expressions/Expression;Lorg/apache/spark/sql/catalyst/expressions/Expression;)V
@82: invokespecial

The same happens with the Spark shell (when instantiating SQLContext),
so it's not an issue with the test code...

-- 
Marcelo





Re: VerifyError running Spark SQL code?

2015-11-25 Thread Marcelo Vanzin
$ java -version
java version "1.7.0_67"
Java(TM) SE Runtime Environment (build 1.7.0_67-b01)

(On Linux.) It's not just that particular suite, though; it's anything
I do that touches Spark SQL...

On Wed, Nov 25, 2015 at 4:54 PM, Josh Rosen  wrote:
> I think I've also seen this issue as well, but in a different suite. I
> wasn't able to easily get to the bottom of it, though. What JDK / JRE are
> you using? I'm on
>
>
> Java(TM) SE Runtime Environment (build 1.7.0_65-b17)
> Java HotSpot(TM) 64-Bit Server VM (build 24.65-b04, mixed mode)
>
> on OSX.
>
> On Wed, Nov 25, 2015 at 4:51 PM, Marcelo Vanzin  wrote:
>>
>> I've been running into this error when running Spark SQL recently;
>> nothing I try (a completely clean build or anything else) seems to fix
>> it. Does anyone have any idea what's wrong?
>>
>> [info] Exception encountered when attempting to run a suite with class
>> name: org.apache.spark.sql.execution.ui.SQLListenerMemoryLeakSuite ***
>> ABORTED *** (4 seconds, 111 milliseconds)
>> [info]   java.lang.VerifyError: Bad <init> method call from inside of a
>> branch
>> [info] Exception Details:
>> [info]   Location:
>> [info]
>> org/apache/spark/sql/catalyst/expressions/aggregate/HyperLogLogPlusPlus.<init>(Lorg/apache/spark/sql/catalyst/expressions/Expression;Lorg/apache/spark/sql/catalyst/expressions/Expression;)V
>> @82: invokespecial
>>
>> The same happens with the Spark shell (when instantiating SQLContext),
>> so it's not an issue with the test code...
>>
>> --
>> Marcelo
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>



-- 
Marcelo




Re: VerifyError running Spark SQL code?

2015-11-25 Thread Marcelo Vanzin
It seems to be a new issue with recent JDK updates, according to the
intertubes. This patch seems to work around it:

--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/HyperLogLogPlusPlus.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/HyperLogLogPlusPlus.scala
@@ -63,11 +63,7 @@ case class HyperLogLogPlusPlus(
   def this(child: Expression, relativeSD: Expression) = {
     this(
       child = child,
-      relativeSD = relativeSD match {
-        case Literal(d: Double, DoubleType) => d
-        case _ =>
-          throw new AnalysisException("The second argument should be a double literal.")
-      },
+      relativeSD = HyperLogLogPlusPlus.validateRelativeSd(relativeSD),
       mutableAggBufferOffset = 0,
       inputAggBufferOffset = 0)
   }
@@ -448,4 +444,11 @@ object HyperLogLogPlusPlus {
     Array(189083, 185696.913, /* ... large bias-correction constant array, unchanged context, elided ... */ -713.30899892)
   )
   // scalastyle:on
+
+  private def validateRelativeSd(relativeSD: Expression): Double = relativeSD match {
+    case Literal(d: Double, DoubleType) => d
+    case _ =>
+      throw new AnalysisException("The second argument should be a double literal.")
+  }
+
 }
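
Not the Spark code itself, just a minimal standalone sketch (hypothetical Widget
class, all names invented) of the same workaround shape, assuming the verifier's
complaint is about the branch that the match compiles to being evaluated before
the delegated this(...) call, which the stricter verifier in recent JDK 7
updates appears to reject:

// Hypothetical example illustrating the pattern used in the patch above:
// keep the auxiliary constructor's argument a plain call, and push the match
// into a companion-object helper, so no branch precedes the this(...) call.
case class Widget(name: String, size: Double) {
  def this(name: String, raw: Any) = this(name, Widget.validateSize(raw))
}

object Widget {
  private def validateSize(raw: Any): Double = raw match {
    case d: Double => d
    case _ => throw new IllegalArgumentException("size must be a Double")
  }
}

object WidgetDemo extends App {
  val raw: Any = 2.5
  println(new Widget("box", raw))  // goes through the auxiliary constructor
}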


On Wed, Nov 25, 2015 at 5:29 PM, Marcelo Vanzin  wrote:
> $ java -version
> java version "1.7.0_67"
> Java(TM) SE Runtime Environment (build 1.7.0_67-b01)
>
> (On Linux.) It's not that particular suite, though, it's anything I do
> that touches Spark SQL...
>
> On Wed, Nov 25, 2015 at 4:54 PM, Josh Rosen  wrote:
>> I think I've also seen this issue as well, but in a different suite. I
>> wasn't able to easily get to the bottom of it, though. What JDK / JRE are
>> you using? I'm on
>>
>>
>> Java(TM) SE Runtime Environment (build 1.7.0_65-b17)
>> Java HotSpot(TM) 64-Bit Server VM (build 24.65-b04, mixed mode)
>>
>> on OSX.
>>
>> On Wed, Nov 25, 2015 at 4:51 PM, Marcelo Vanzin  wrote:
>>>
>>> I've been running into this error when running Spark SQL recently; no
>>> matter what I try (completely clean build or anything else) doesn't
>>> seem to fix it. Anyone has some 

How to add 1.5.2 support to ec2/spark_ec2.py ?

2015-11-25 Thread Alexander Pivovarov
Hi Everyone

I noticed that the spark-ec2 script is outdated.
How do I add 1.5.2 support to ec2/spark_ec2.py?
What else (besides updating the Spark version in the script) needs to be done
to add 1.5.2 support?

We also need to update Scala to 2.10.4 (it's currently 2.10.3).

Alex


RE: Spark checkpoint problem

2015-11-25 Thread 张志强(旺轩)
What’s your Spark version?

 

From: wyphao.2007 [mailto:wyphao.2...@163.com]
Sent: 26 November 2015 10:04
To: user
Cc: dev@spark.apache.org
Subject: Spark checkpoint problem

 

 

 

I am testing checkpointing to understand how it works. My code is as follows:

 

scala> val data = sc.parallelize(List("a", "b", "c"))

data: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at
parallelize at <console>:15

 

scala> sc.setCheckpointDir("/tmp/checkpoint")

15/11/25 18:09:07 WARN spark.SparkContext: Checkpoint directory must be
non-local if Spark is running on a cluster: /tmp/checkpoint1

 

scala> data.checkpoint

 

scala> val temp = data.map(item => (item, 1))

temp: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[3] at map
at <console>:17

 

scala> temp.checkpoint

 

scala> temp.count

 

but I found that only the temp RDD is checkpointed in the /tmp/checkpoint
directory; the data RDD is not checkpointed! I found the doCheckpoint
function in the org.apache.spark.rdd.RDD class:

 

  private[spark] def doCheckpoint(): Unit = {
    RDDOperationScope.withScope(sc, "checkpoint", allowNesting = false,
        ignoreParent = true) {
      if (!doCheckpointCalled) {
        doCheckpointCalled = true
        if (checkpointData.isDefined) {
          checkpointData.get.checkpoint()
        } else {
          dependencies.foreach(_.rdd.doCheckpoint())
        }
      }
    }
  }

 

From the code above, only the last RDD (in my case, temp) will be
checkpointed. My question: is this deliberately designed, or is it a bug?

 

Thank you.

 

 

 

 

 

 

 



Incremental Analysis with Spark

2015-11-25 Thread Sachith Withana
Hi folks!

I'm wondering if Spark supports, or hopes to support, incremental data
analysis.

There are a few use cases that prompted me to wonder.

For example, if we need to summarize the last 30 days' worth of data every day:

1. Does Spark support time-range-based query execution?
select * from foo where timestamp in last30days

2. Can we reuse previously computed results?
e.g. use the sum of the last 29 days and add the last day's worth of data?

The second use case is very tempting, as it would improve performance
drastically.
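
For the first question, a minimal sketch against the 1.5-era DataFrame API (the
sample data, the foo table name, the column names, and the local master are all
assumed for illustration): there is no built-in last30days predicate, but an
ordinary range filter on a timestamp column expresses the same query.

import java.sql.Timestamp
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.sum

object Last30Days {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("last30days").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val now = System.currentTimeMillis()
    val events = sc.parallelize(Seq(
      (new Timestamp(now), 10L),                          // inside the window
      (new Timestamp(now - 40L * 24 * 3600 * 1000), 99L)  // outside the window
    )).toDF("timestamp", "value")
    events.registerTempTable("foo")  // so the same query could also be phrased in SQL

    // "last 30 days" as a plain range predicate on the timestamp column
    val cutoff = new Timestamp(now - 30L * 24 * 3600 * 1000)
    val last30 = events.filter($"timestamp" >= cutoff).agg(sum("value"))
    last30.show()  // should show 10

    sc.stop()
  }
}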

Any suggestions would be greatly appreciated.

-- 
Thanks,
Sachith Withana


Re: Using spark MLlib without installing Spark

2015-11-25 Thread Stavros Kontopoulos
You can even use it without Spark as well (besides local mode). For example, I
have used the following algorithm in a web app:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala

Essentially, some algorithms (I haven't checked them all) run the same steps in
each partition, so if you set aside the distribution-oriented (Spark-specific)
parts of the code, there is a lot of reusable stuff. You just have to use the
API where it is public and conform to its input/output contract.

There used to be some dependencies, like Breeze for example, exposed in the API
that are hidden now (e.g.
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala),
but of course this is a hint, not a list of what is available for your use
case. Mind, though, that this may not be the cleanest way to implement your use
case and might sound like a hack ;)

As an alternative to Spark local mode, you could use a job server
(https://github.com/spark-jobserver/spark-jobserver) to integrate with your app
as a proxy for Spark, or have a Spark service there to respond with results.
From a design point of view, it's best to separate the concerns, for several
reasons: scaling, utilization, etc.
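
As a concrete illustration of the "just run Spark in local mode" point quoted
below, here is a minimal, hedged sketch: the only setup assumed is declaring
spark-core and spark-mllib (a 1.5.x version is assumed) as ordinary Maven/sbt
dependencies of the web app; nothing is installed on the machine.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

object LocalMllibExample {
  def main(args: Array[String]): Unit = {
    // "local[*]" runs Spark entirely in-process: no cluster, no installation.
    val sc = new SparkContext(
      new SparkConf().setAppName("mllib-local").setMaster("local[*]"))

    val training = sc.parallelize(Seq(
      LabeledPoint(0.0, Vectors.dense(1.0, 0.0)),
      LabeledPoint(1.0, Vectors.dense(0.0, 1.0))
    ))

    val model = NaiveBayes.train(training, lambda = 1.0)
    println(model.predict(Vectors.dense(0.0, 1.0)))  // should print 1.0

    sc.stop()
  }
}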

On Sun, Nov 22, 2015 at 5:03 AM, Reynold Xin  wrote:

> You can use MLlib and Spark directly without "installing anything". Just
> run Spark in local mode.
>
>
> On Sat, Nov 21, 2015 at 4:05 PM, Rad Gruchalski 
> wrote:
>
>> Bowen,
>>
>> What Andy is doing in the notebook is a slightly different thing. He's
>> using sbt to bring in all the Spark jars (core, mllib, repl, what have you).
>> You could use Maven for that. He then creates a REPL and submits all the
>> Spark code into it.
>> Pretty sure the Spark unit tests cover similar use cases, maybe not mllib
>> per se, but this kind of submission.
>>
>> Kind regards,
>> Radek Gruchalski
>> ra...@gruchalski.com 
>> de.linkedin.com/in/radgruchalski/
>>
>>
>>
>> On Sunday, 22 November 2015 at 01:01, bowen zhang wrote:
>>
>> Thanks Rad for info. I looked into the repo and see some .snb file using
>> spark mllib. Can you give me a more specific place to look for when
>> invoking the mllib functions? What if I just want to invoke some of the ML
>> functions in my HelloWorld.java?
>>
>> --
>> *From:* Rad Gruchalski 
>> *To:* bowen zhang 
>> *Cc:* "dev@spark.apache.org" 
>> *Sent:* Saturday, November 21, 2015 3:43 PM
>> *Subject:* Re: Using spark MLlib without installing Spark
>>
>> Bowen,
>>
>> One project to look at could be spark-notebook:
>> https://github.com/andypetrella/spark-notebook
>> It uses Spark in the way you intend to use it.
>> Kind regards,
>> Radek Gruchalski
>> ra...@gruchalski.com 
>> de.linkedin.com/in/radgruchalski/
>>
>>
>>
>>
>> On Sunday, 22 November 2015 at 00:38, bowen zhang wrote:
>>
>> Hi folks,
>> I am a big fan of Spark's Mllib package. I have a java web app where I
>> want to run some ml jobs inside the web app. My question is: is there a way
>> to just import spark-core and spark-mllib jars to invoke my ML jobs without
>> installing the entire Spark package? All the tutorials related Spark seems
>> to indicate installing Spark is a pre-condition for this.
>>
>> Thanks,
>> Bowen
>>
>>
>>
>>
>>
>>
>


-- 

Stavros Kontopoulos







Re: A proposal for Spark 2.0

2015-11-25 Thread Reynold Xin
I don't think we should drop support for Scala 2.10, or make it harder in
terms of operations for people to upgrade.

If there are no further objections, I'm going to remove the 1.7 version
and retarget things to 2.0 on JIRA.


On Wed, Nov 25, 2015 at 12:54 AM, Sandy Ryza 
wrote:

> I see.  My concern is / was that cluster operators will be reluctant to
> upgrade to 2.0, meaning that developers using those clusters need to stay
> on 1.x, and, if they want to move to DataFrames, essentially need to port
> their app twice.
>
> I misunderstood and thought part of the proposal was to drop support for
> 2.10 though.  If your broad point is that there aren't changes in 2.0 that
> will make it less palatable to cluster administrators than releases in the
> 1.x line, then yes, 2.0 as the next release sounds fine to me.
>
> -Sandy
>
>
> On Tue, Nov 24, 2015 at 11:55 AM, Matei Zaharia 
> wrote:
>
>> What are the other breaking changes in 2.0 though? Note that we're not
>> removing Scala 2.10, we're just making the default build be against Scala
>> 2.11 instead of 2.10. There seem to be very few changes that people would
>> worry about. If people are going to update their apps, I think it's better
>> to make the other small changes in 2.0 at the same time than to update once
>> for Dataset and another time for 2.0.
>>
>> BTW just refer to Reynold's original post for the other proposed API
>> changes.
>>
>> Matei
>>
>> On Nov 24, 2015, at 12:27 PM, Sandy Ryza  wrote:
>>
>> I think that Kostas' logic still holds.  The majority of Spark users, and
>> likely an even vaster majority of people running vaster jobs, are still on
>> RDDs and on the cusp of upgrading to DataFrames.  Users will probably want
>> to upgrade to the stable version of the Dataset / DataFrame API so they
>> don't need to do so twice.  Requiring that they absorb all the other ways
>> that Spark breaks compatibility in the move to 2.0 makes it much more
>> difficult for them to make this transition.
>>
>> Using the same set of APIs also means that it will be easier to backport
>> critical fixes to the 1.x line.
>>
>> It's not clear to me that avoiding breakage of an experimental API in the
>> 1.x line outweighs these issues.
>>
>> -Sandy
>>
>> On Mon, Nov 23, 2015 at 10:51 PM, Reynold Xin 
>> wrote:
>>
>>> I actually think the next one (after 1.6) should be Spark 2.0. The
>>> reason is that I already know we have to break some part of the
>>> DataFrame/Dataset API as part of the Dataset design. (e.g. DataFrame.map
>>> should return Dataset rather than RDD). In that case, I'd rather break this
>>> sooner (in one release) than later (in two releases), so the damage is
>>> smaller.
>>>
>>> I don't think whether we call Dataset/DataFrame experimental or not
>>> matters too much for 2.0. We can still call Dataset experimental in 2.0 and
>>> then mark them as stable in 2.1. Despite being "experimental", there has
>>> been no breaking changes to DataFrame from 1.3 to 1.6.
>>>
>>>
>>>
>>> On Wed, Nov 18, 2015 at 3:43 PM, Mark Hamstra 
>>> wrote:
>>>
 Ah, got it; by "stabilize" you meant changing the API, not just bug
 fixing.  We're on the same page now.

 On Wed, Nov 18, 2015 at 3:39 PM, Kostas Sakellis 
 wrote:

> A 1.6.x release will only fix bugs - we typically don't change APIs in
> z releases. The Dataset API is experimental and so we might be changing 
> the
> APIs before we declare it stable. This is why I think it is important to
> first stabilize the Dataset API with a Spark 1.7 release before moving to
> Spark 2.0. This will benefit users that would like to use the new Dataset
> APIs but can't move to Spark 2.0 because of the backwards incompatible
> changes, like removal of deprecated APIs, Scala 2.11 etc.
>
> Kostas
>
>
> On Fri, Nov 13, 2015 at 12:26 PM, Mark Hamstra <
> m...@clearstorydata.com> wrote:
>
>> Why does stabilization of those two features require a 1.7 release
>> instead of 1.6.1?
>>
>> On Fri, Nov 13, 2015 at 11:40 AM, Kostas Sakellis <
>> kos...@cloudera.com> wrote:
>>
>>> We have veered off the topic of Spark 2.0 a little bit here - yes we
>>> can talk about RDD vs. DS/DF more but lets refocus on Spark 2.0. I'd 
>>> like
>>> to propose we have one more 1.x release after Spark 1.6. This will 
>>> allow us
>>> to stabilize a few of the new features that were added in 1.6:
>>>
>>> 1) the experimental Datasets API
>>> 2) the new unified memory manager.
>>>
>>> I understand our goal for Spark 2.0 is to offer an easy transition
>>> but there will be users that won't be able to seamlessly upgrade given 
>>> what
>>> we have discussed as in scope for 2.0. For these users, having a 1.x
>>> release with these new features/APIs 

Re: A proposal for Spark 2.0

2015-11-25 Thread Sandy Ryza
I see.  My concern is / was that cluster operators will be reluctant to
upgrade to 2.0, meaning that developers using those clusters need to stay
on 1.x, and, if they want to move to DataFrames, essentially need to port
their app twice.

I misunderstood and thought part of the proposal was to drop support for
2.10 though.  If your broad point is that there aren't changes in 2.0 that
will make it less palatable to cluster administrators than releases in the
1.x line, then yes, 2.0 as the next release sounds fine to me.

-Sandy


On Tue, Nov 24, 2015 at 11:55 AM, Matei Zaharia 
wrote:

> What are the other breaking changes in 2.0 though? Note that we're not
> removing Scala 2.10, we're just making the default build be against Scala
> 2.11 instead of 2.10. There seem to be very few changes that people would
> worry about. If people are going to update their apps, I think it's better
> to make the other small changes in 2.0 at the same time than to update once
> for Dataset and another time for 2.0.
>
> BTW just refer to Reynold's original post for the other proposed API
> changes.
>
> Matei
>
> On Nov 24, 2015, at 12:27 PM, Sandy Ryza  wrote:
>
> I think that Kostas' logic still holds.  The majority of Spark users, and
> likely an even vaster majority of people running vaster jobs, are still on
> RDDs and on the cusp of upgrading to DataFrames.  Users will probably want
> to upgrade to the stable version of the Dataset / DataFrame API so they
> don't need to do so twice.  Requiring that they absorb all the other ways
> that Spark breaks compatibility in the move to 2.0 makes it much more
> difficult for them to make this transition.
>
> Using the same set of APIs also means that it will be easier to backport
> critical fixes to the 1.x line.
>
> It's not clear to me that avoiding breakage of an experimental API in the
> 1.x line outweighs these issues.
>
> -Sandy
>
> On Mon, Nov 23, 2015 at 10:51 PM, Reynold Xin  wrote:
>
>> I actually think the next one (after 1.6) should be Spark 2.0. The reason
>> is that I already know we have to break some part of the DataFrame/Dataset
>> API as part of the Dataset design. (e.g. DataFrame.map should return
>> Dataset rather than RDD). In that case, I'd rather break this sooner (in
>> one release) than later (in two releases), so the damage is smaller.
>>
>> I don't think whether we call Dataset/DataFrame experimental or not
>> matters too much for 2.0. We can still call Dataset experimental in 2.0 and
>> then mark them as stable in 2.1. Despite being "experimental", there has
>> been no breaking changes to DataFrame from 1.3 to 1.6.
>>
>>
>>
>> On Wed, Nov 18, 2015 at 3:43 PM, Mark Hamstra 
>> wrote:
>>
>>> Ah, got it; by "stabilize" you meant changing the API, not just bug
>>> fixing.  We're on the same page now.
>>>
>>> On Wed, Nov 18, 2015 at 3:39 PM, Kostas Sakellis 
>>> wrote:
>>>
 A 1.6.x release will only fix bugs - we typically don't change APIs in
 z releases. The Dataset API is experimental and so we might be changing the
 APIs before we declare it stable. This is why I think it is important to
 first stabilize the Dataset API with a Spark 1.7 release before moving to
 Spark 2.0. This will benefit users that would like to use the new Dataset
 APIs but can't move to Spark 2.0 because of the backwards incompatible
 changes, like removal of deprecated APIs, Scala 2.11 etc.

 Kostas


 On Fri, Nov 13, 2015 at 12:26 PM, Mark Hamstra  wrote:

> Why does stabilization of those two features require a 1.7 release
> instead of 1.6.1?
>
> On Fri, Nov 13, 2015 at 11:40 AM, Kostas Sakellis  > wrote:
>
>> We have veered off the topic of Spark 2.0 a little bit here - yes we
>> can talk about RDD vs. DS/DF more but lets refocus on Spark 2.0. I'd like
>> to propose we have one more 1.x release after Spark 1.6. This will allow 
>> us
>> to stabilize a few of the new features that were added in 1.6:
>>
>> 1) the experimental Datasets API
>> 2) the new unified memory manager.
>>
>> I understand our goal for Spark 2.0 is to offer an easy transition
>> but there will be users that won't be able to seamlessly upgrade given 
>> what
>> we have discussed as in scope for 2.0. For these users, having a 1.x
>> release with these new features/APIs stabilized will be very beneficial.
>> This might make Spark 1.7 a lighter release but that is not necessarily a
>> bad thing.
>>
>> Any thoughts on this timeline?
>>
>> Kostas Sakellis
>>
>>
>>
>> On Thu, Nov 12, 2015 at 8:39 PM, Cheng, Hao 
>> wrote:
>>
>>> Agree, more features/apis/optimization need to be added in DF/DS.
>>>
>>>
>>>
>>> I mean, we need to 

Spark checkpoint problem

2015-11-25 Thread wyphao.2007
Hi, 


I am testing checkpointing to understand how it works. My code is as follows:


scala> val data = sc.parallelize(List("a", "b", "c"))
data: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at
parallelize at <console>:15


scala> sc.setCheckpointDir("/tmp/checkpoint")
15/11/25 18:09:07 WARN spark.SparkContext: Checkpoint directory must be 
non-local if Spark is running on a cluster: /tmp/checkpoint1


scala> data.checkpoint


scala> val temp = data.map(item => (item, 1))
temp: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[3] at map at
<console>:17


scala> temp.checkpoint


scala> temp.count


but I found that only the temp RDD is checkpointed in the /tmp/checkpoint
directory; the data RDD is not checkpointed! I found the doCheckpoint function
in the org.apache.spark.rdd.RDD class:


  private[spark] def doCheckpoint(): Unit = {
    RDDOperationScope.withScope(sc, "checkpoint", allowNesting = false,
        ignoreParent = true) {
      if (!doCheckpointCalled) {
        doCheckpointCalled = true
        if (checkpointData.isDefined) {
          checkpointData.get.checkpoint()
        } else {
          dependencies.foreach(_.rdd.doCheckpoint())
        }
      }
    }
  }


From the code above, only the last RDD (in my case, temp) will be
checkpointed. My question: is this deliberately designed, or is it a bug?


Thank you.
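
A hedged reading of the behavior (Spark 1.5.x assumed, my reading of the code
rather than an authoritative answer): checkpoint() only marks an RDD, the actual
write happens in doCheckpoint() at the end of a job, and the recursion above
stops at the first RDD that already has checkpoint data, so it never reaches
data when the job's final RDD is temp. This looks deliberate rather than a bug.
One way to get both RDDs checkpointed is to run an action whose final RDD is
data as well, for example:

scala> val data = sc.parallelize(List("a", "b", "c"))
scala> sc.setCheckpointDir("/tmp/checkpoint")
scala> data.checkpoint
scala> data.count      // this job ends with data, so data is checkpointed

scala> val temp = data.map(item => (item, 1))
scala> temp.checkpoint
scala> temp.count      // this job ends with temp, so temp is checkpointed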







Re:RE: Spark checkpoint problem

2015-11-25 Thread wyphao.2007
Spark 1.5.2.


On 2015-11-26 13:19:39, "张志强(旺轩)" wrote:


What’s your spark version?

From: wyphao.2007 [mailto:wyphao.2...@163.com]
Sent: 26 November 2015 10:04
To: user
Cc: dev@spark.apache.org
Subject: Spark checkpoint problem

I am testing checkpointing to understand how it works. My code is as follows:

 

scala> val data = sc.parallelize(List("a", "b", "c"))

data: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at
parallelize at <console>:15

 

scala> sc.setCheckpointDir("/tmp/checkpoint")

15/11/25 18:09:07 WARN spark.SparkContext: Checkpoint directory must be 
non-local if Spark is running on a cluster: /tmp/checkpoint1

 

scala> data.checkpoint

 

scala> val temp = data.map(item => (item, 1))

temp: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[3] at map at
<console>:17

 

scala> temp.checkpoint

 

scala> temp.count

 

but I found that only the temp RDD is checkpointed in the /tmp/checkpoint
directory; the data RDD is not checkpointed! I found the doCheckpoint function
in the org.apache.spark.rdd.RDD class:

 

  private[spark] def doCheckpoint(): Unit = {
    RDDOperationScope.withScope(sc, "checkpoint", allowNesting = false,
        ignoreParent = true) {
      if (!doCheckpointCalled) {
        doCheckpointCalled = true
        if (checkpointData.isDefined) {
          checkpointData.get.checkpoint()
        } else {
          dependencies.foreach(_.rdd.doCheckpoint())
        }
      }
    }
  }

 

From the code above, only the last RDD (in my case, temp) will be
checkpointed. My question: is this deliberately designed, or is it a bug?

 

Thank you.

 

 

 

 

 

 

 

Re: Incremental Analysis with Spark

2015-11-25 Thread chester
For the 2nd use case, can you save the results for the first 29 days, then just
get the last day's result and add it yourself? This can be done outside of
Spark. Does that work for you?
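
A rough sketch of this approach, in spark-shell style (sqlContext, the foo
table, the day/value columns, and the daily-totals store are all assumed for
illustration): Spark computes only the current day's total, and combining it
with the 29 previously saved daily totals happens outside Spark.

// previously saved daily totals, loaded from wherever they are kept (file, DB, ...)
val savedDailyTotals: Map[String, Long] =
  Map("2015-11-24" -> 1200L, "2015-11-23" -> 980L)  // ... 29 entries in practice

// only today's slice is computed by Spark
val todaysTotal: Long = sqlContext
  .sql("SELECT SUM(value) FROM foo WHERE day = '2015-11-25'")
  .collect()(0).getLong(0)

// the rolling 30-day summary is a cheap combination done outside Spark
val last30DaysSum = savedDailyTotals.values.sum + todaysTotal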



Sent from my iPad

> On Nov 25, 2015, at 9:46 PM, Sachith Withana  wrote:
> 
> Hi folks!
> 
> I'm wondering if Spark supports, or hopes to support, incremental data
> analysis.


> 
> There are a few use cases that prompted me to wonder.
> 
> For example, if we need to summarize the last 30 days' worth of data every day:
> 
> 1. Does Spark support time-range-based query execution?
> select * from foo where timestamp in last30days
> 
> 2. Can we reuse previously computed results?
> e.g. use the sum of the last 29 days and add the last day's worth of data?
> 
> The second use case is very tempting, as it would improve performance
> drastically.
> 
> Any suggestions would be greatly appreciated. 
> 
> -- 
> Thanks,
> Sachith Withana
>