Re: [VOTE] Apache Spark 2.1.0 (RC1)

2016-12-01 Thread Robert Kruszewski
-1, since https://issues.apache.org/jira/browse/SPARK-17213 is a correctness
regression from the 2.0 release. The commit that caused it is
776d183c82b424ef7c3cae30537d8afe9b9eee83.

 

Robert

 

From: Reynold Xin 
Date: Tuesday, November 29, 2016 at 1:25 AM
To: "dev@spark.apache.org" 
Subject: [VOTE] Apache Spark 2.1.0 (RC1)

 

Please vote on releasing the following candidate as Apache Spark version 2.1.0. 
The vote is open until Thursday, December 1, 2016 at 18:00 UTC and passes if a 
majority of at least 3 +1 PMC votes are cast.

 

[ ] +1 Release this package as Apache Spark 2.1.0

[ ] -1 Do not release this package because ...

 

 

To learn more about Apache Spark, please see http://spark.apache.org/

 

The tag to be voted on is v2.1.0-rc1 (80aabc0bd33dc5661a90133156247e7a8c1bf7f5)

 

The release files, including signatures, digests, etc. can be found at:

http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc1-bin/

 

Release artifacts are signed with the following key:

https://people.apache.org/keys/committer/pwendell.asc

 

The staging repository for this release can be found at:

https://repository.apache.org/content/repositories/orgapachespark-1216/

 

The documentation corresponding to this release can be found at:

http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc1-docs/

 

 

===

How can I help test this release?

===

If you are a Spark user, you can help us test this release by taking an
existing Spark workload and running it on this release candidate, then
reporting any regressions.
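
For example, a minimal PySpark sanity check against the candidate might look like the following (an illustrative sketch only; the app name and data are arbitrary, and real testing should exercise your own workloads):

    # Run with the RC's bin/spark-submit or pyspark.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[2]").appName("rc-smoke-test").getOrCreate()

    # 1000 rows bucketed into 10 groups of 100 rows each.
    counts = (spark.range(0, 1000)
              .selectExpr("id % 10 AS bucket")
              .groupBy("bucket")
              .count()
              .orderBy("bucket"))

    assert counts.count() == 10
    assert all(row["count"] == 100 for row in counts.collect())

    spark.stop()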

 

===

What should happen to JIRA tickets still targeting 2.1.0?

===

Committers should look at those and triage. Extremely important bug fixes,
documentation, and API tweaks that impact compatibility should be worked on
immediately. Please retarget everything else to 2.1.1 or 2.2.0.

 

 





Re: [SPARK-17845] [SQL][PYTHON] More self-evident window function frame boundary API

2016-12-01 Thread Maciej Szymkiewicz
It could be something like this:
https://github.com/zero323/spark/commit/b1f4d8218629b56b0982ee58f5b93a40305985e0
but I am not fully satisfied with it.
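
For context, a minimal sketch of the boundary handling discussed in this thread (the constant and helper names below are illustrative, not the code in the commit above):

    import sys

    # Java's long range; anything outside it would overflow on the JVM side.
    _JAVA_MIN_LONG = -(1 << 63)
    _JAVA_MAX_LONG = (1 << 63) - 1

    # Keep the 2.0-era convention on the Python side.
    unboundedPreceding = -sys.maxsize
    unboundedFollowing = sys.maxsize

    def _to_java_boundary(value):
        """Map a Python frame boundary to a value that is safe to pass to Java.

        Anything at or below -sys.maxsize is treated as UNBOUNDED PRECEDING and
        anything at or above sys.maxsize as UNBOUNDED FOLLOWING, so the old
        rowsBetween(-sys.maxsize, sys.maxsize) idiom keeps working and no
        Python int can overflow a Java long.
        """
        if value <= unboundedPreceding:
            return _JAVA_MIN_LONG
        if value >= unboundedFollowing:
            return _JAVA_MAX_LONG
        return int(value)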

On 11/30/2016 07:34 PM, Reynold Xin wrote:
> Yes, I'd define unboundedPreceding as -sys.maxsize, but any value
> less than min(-sys.maxsize, _JAVA_MIN_LONG) is considered
> unboundedPreceding too. We need to be careful with long overflow when
> transferring data over to Java.
>
>
> On Wed, Nov 30, 2016 at 10:04 AM, Maciej Szymkiewicz wrote:
>
> It is platform specific, so it can theoretically be larger, but 2**63
> - 1 is the standard on 64-bit platforms and 2**31 - 1 on 32-bit
> platforms. I can submit a patch, but I am not sure how to proceed.
> Personally I would set
>
> unboundedPreceding = -sys.maxsize
>
> unboundedFollowing = sys.maxsize
>
> to keep backwards compatibility.
>
> On 11/30/2016 06:52 PM, Reynold Xin wrote:
>> Ah ok for some reason when I did the pull request sys.maxsize was
>> much larger than 2^63. Do you want to submit a patch to fix this?
>>
>>
>> On Wed, Nov 30, 2016 at 9:48 AM, Maciej Szymkiewicz wrote:
>>
>> The problem is that -(1 << 63) is -(sys.maxsize + 1) so the
>> code which used to work before is off by one.
>>
>> On 11/30/2016 06:43 PM, Reynold Xin wrote:
>>> Can you give a repro? Anything less than -(1 << 63) is
>>> considered negative infinity (i.e. unbounded preceding).
>>>
>>> On Wed, Nov 30, 2016 at 8:27 AM, Maciej Szymkiewicz wrote:
>>>
>>> Hi,
>>>
>>> I've been looking at SPARK-17845 and I am curious if there is any
>>> reason to make it a breaking change. In Spark 2.0 and below we could use:
>>>
>>> Window().partitionBy("foo").orderBy("bar").rowsBetween(-sys.maxsize, sys.maxsize)
>>>
>>> In 2.1.0 this code will silently produce incorrect results (ROWS BETWEEN
>>> -1 PRECEDING AND UNBOUNDED FOLLOWING). Couldn't we use
>>> Window.unboundedPreceding equal to -sys.maxsize to ensure backward
>>> compatibility?
>>>
>>> --
>>>
>>> Maciej Szymkiewicz
>>>
>>>
>>> 
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>> 
>>>
>>>
>>
>> -- 
>> Maciej Szymkiewicz
>>
>>
>
> -- 
> Maciej Szymkiewicz
>
>

-- 
Maciej Szymkiewicz



unsubscribe

2016-12-01 Thread Vishal Soni



Re: [SPARK-17845] [SQL][PYTHON] More self-evident window function frame boundary API

2016-12-01 Thread Reynold Xin
Can you submit a pull request with test cases based on that change?
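
A rough sketch of the kind of regression test being requested (illustrative only; the scaffolding below is an assumption and does not follow the actual pyspark.sql test layout):

    # Backward-compatibility check for the legacy "unbounded" frame idiom.
    import sys
    from pyspark.sql import SparkSession, Window
    from pyspark.sql import functions as F

    spark = SparkSession.builder.master("local[2]").appName("frame-bounds-test").getOrCreate()
    df = spark.createDataFrame([("a", 1), ("a", 2), ("a", 3)], ["k", "v"])

    legacy = Window.partitionBy("k").orderBy("v").rowsBetween(-sys.maxsize, sys.maxsize)
    explicit = Window.partitionBy("k").orderBy("v").rowsBetween(
        Window.unboundedPreceding, Window.unboundedFollowing)

    legacy_sums = [r[0] for r in df.select(F.sum("v").over(legacy)).collect()]
    explicit_sums = [r[0] for r in df.select(F.sum("v").over(explicit)).collect()]

    # Both frames should cover the whole partition: 1 + 2 + 3 = 6 for every row.
    assert legacy_sums == explicit_sums == [6, 6, 6]

    spark.stop()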


On Dec 1, 2016, 9:39 AM -0800, Maciej Szymkiewicz wrote:
> This doesn't affect that. The only concern is what we consider to be
> UNBOUNDED on the Python side.
>
> On 12/01/2016 07:56 AM, assaf.mendelson wrote:
> > I may be mistaken, but if I remember correctly Spark behaves differently
> > when the frame is bounded in the past and when it is not. Specifically, I
> > seem to recall a fix which made sure that when there is no lower bound,
> > the aggregation is done row by row instead of recomputing the whole
> > range for each window. So I believe the Python side should be configured
> > exactly the same as in Scala/Java so that the optimization takes place.
> > Assaf.
> >
> > From: rxin [via Apache Spark Developers List] [mailto:ml-node+[hidden 
> > email]]
> > Sent: Wednesday, November 30, 2016 8:35 PM
> > To: Mendelson, Assaf
> > Subject: Re: [SPARK-17845] [SQL][PYTHON] More self-evident window function 
> > frame boundary API
> >
> > Yes, I'd define unboundedPreceding as -sys.maxsize, but any value less
> > than min(-sys.maxsize, _JAVA_MIN_LONG) is considered unboundedPreceding
> > too. We need to be careful with long overflow when transferring data over
> > to Java.
> >
> >
> > On Wed, Nov 30, 2016 at 10:04 AM, Maciej Szymkiewicz <[hidden email]> wrote:
> > It is platform specific, so it can theoretically be larger, but 2**63 - 1
> > is the standard on 64-bit platforms and 2**31 - 1 on 32-bit platforms. I
> > can submit a patch, but I am not sure how to proceed. Personally I would set
> >
> > unboundedPreceding = -sys.maxsize
> > unboundedFollowing = sys.maxsize
> >
> > to keep backwards compatibility.
> > On 11/30/2016 06:52 PM, Reynold Xin wrote:
> > > Ah ok for some reason when I did the pull request sys.maxsize was much 
> > > larger than 2^63. Do you want to submit a patch to fix this?
> > >
> > >
> > > On Wed, Nov 30, 2016 at 9:48 AM, Maciej Szymkiewicz <[hidden email]> 
> > > wrote:
> > > The problem is that -(1 << 63) is -(sys.maxsize + 1) so the code which 
> > > used to work before is off by one.
> > > On 11/30/2016 06:43 PM, Reynold Xin wrote:
> > > > Can you give a repro? Anything less than -(1 << 63) is considered 
> > > > negative infinity (i.e. unbounded preceding).
> > > >
> > > > On Wed, Nov 30, 2016 at 8:27 AM, Maciej Szymkiewicz <[hidden email]> 
> > > > wrote:
> > > > Hi,
> > > >
> > > > I've been looking at SPARK-17845 and I am curious if there is any
> > > > reason to make it a breaking change. In Spark 2.0 and below we could use:
> > > >
> > > >     Window().partitionBy("foo").orderBy("bar").rowsBetween(-sys.maxsize, sys.maxsize)
> > > >
> > > > In 2.1.0 this code will silently produce incorrect results (ROWS BETWEEN
> > > > -1 PRECEDING AND UNBOUNDED FOLLOWING). Couldn't we use
> > > > Window.unboundedPreceding equal to -sys.maxsize to ensure backward
> > > > compatibility?
> > > >
> > > > --
> > > >
> > > > Maciej Szymkiewicz
> > > >
> > > >
> > > > -
> > > > To unsubscribe e-mail: [hidden email]
> > > >
> > >
> > >
> > > --
> > >
> > > Maciej Szymkiewicz
> > >
> >
> >
> > --
> >
> > Maciej Szymkiewicz
> >
> >
>
>
> --
> Maciej Szymkiewicz


Re: [SPARK-17845] [SQL][PYTHON] More self-evident window function frame boundary API

2016-12-01 Thread Maciej Szymkiewicz
This doesn't affect that. The only concern is what we consider to be
UNBOUNDED on the Python side.
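
As a reminder, the off-by-one discussed in the quoted thread below boils down to this (assuming 64-bit CPython):

    import sys

    # sys.maxsize is 2**63 - 1 on a 64-bit build, so -sys.maxsize sits one
    # above the old "negative infinity" threshold of -(1 << 63).
    assert sys.maxsize == (1 << 63) - 1
    assert -sys.maxsize == -(1 << 63) + 1

    # Hence rowsBetween(-sys.maxsize, sys.maxsize) is no longer recognized as
    # an unbounded frame in 2.1.0 RC1 and, as reported earlier in the thread,
    # silently becomes ROWS BETWEEN -1 PRECEDING AND UNBOUNDED FOLLOWING.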


On 12/01/2016 07:56 AM, assaf.mendelson wrote:
>
> I may be mistaken, but if I remember correctly Spark behaves
> differently when the frame is bounded in the past and when it is not.
> Specifically, I seem to recall a fix which made sure that when there is
> no lower bound, the aggregation is done row by row instead of
> recomputing the whole range for each window. So I believe the Python
> side should be configured exactly the same as in Scala/Java so that the
> optimization takes place.
>
> Assaf.
>
>  
>
> *From:*rxin [via Apache Spark Developers List] [mailto:ml-node+[hidden
> email] ]
> *Sent:* Wednesday, November 30, 2016 8:35 PM
> *To:* Mendelson, Assaf
> *Subject:* Re: [SPARK-17845] [SQL][PYTHON] More self-evident window
> function frame boundary API
>
>  
>
> Yes, I'd define unboundedPreceding as -sys.maxsize, but any value
> less than min(-sys.maxsize, _JAVA_MIN_LONG) is considered
> unboundedPreceding too. We need to be careful with long overflow when
> transferring data over to Java.
>
>  
>
>  
>
> On Wed, Nov 30, 2016 at 10:04 AM, Maciej Szymkiewicz <[hidden email]> wrote:
>
> It is platform specific, so it can theoretically be larger, but 2**63 - 1
> is the standard on 64-bit platforms and 2**31 - 1 on 32-bit platforms. I
> can submit a patch, but I am not sure how to proceed. Personally I
> would set
>
> unboundedPreceding = -sys.maxsize
> unboundedFollowing = sys.maxsize
>
> to keep backwards compatibility.
>
> On 11/30/2016 06:52 PM, Reynold Xin wrote:
>
> Ah ok for some reason when I did the pull request sys.maxsize was
> much larger than 2^63. Do you want to submit a patch to fix this?
>
>  
>
>  
>
> On Wed, Nov 30, 2016 at 9:48 AM, Maciej Szymkiewicz <[hidden email]> wrote:
>
> The problem is that -(1 << 63) is -(sys.maxsize + 1) so the code
> which used to work before is off by one.
>
> On 11/30/2016 06:43 PM, Reynold Xin wrote:
>
> Can you give a repro? Anything less than -(1 << 63) is
> considered negative infinity (i.e. unbounded preceding).
>
>  
>
> On Wed, Nov 30, 2016 at 8:27 AM, Maciej Szymkiewicz <[hidden email]> wrote:
>
> Hi,
>
> I've been looking at SPARK-17845 and I am curious if there is any
> reason to make it a breaking change. In Spark 2.0 and below we could use:
>
> Window().partitionBy("foo").orderBy("bar").rowsBetween(-sys.maxsize, sys.maxsize)
>
> In 2.1.0 this code will silently produce incorrect results (ROWS BETWEEN
> -1 PRECEDING AND UNBOUNDED FOLLOWING). Couldn't we use
> Window.unboundedPreceding equal to -sys.maxsize to ensure backward
> compatibility?
>
> --
>
> Maciej Szymkiewicz
>
>
> -
> To unsubscribe e-mail: [hidden email]
> 
>
>  
>
>  
>
> -- 
>
> Maciej Szymkiewicz
>
>  
>
>  
>
> -- 
> Maciej Szymkiewicz
>
>  
>
>  
>

-- 
Maciej Szymkiewicz



Re: REST api for monitoring Spark Streaming

2016-12-01 Thread Chan Chor Pang

Hi everyone,

I have done the coding and created the PR.
The implementation is straightforward and similar to the API in spark-core,
but we still need someone with a streaming background to verify the patch,
just to make sure everything is OK.

So, can anyone please help?
https://github.com/apache/spark/pull/16000
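
To give a sense of how it would be consumed, here is a rough client-side sketch (the streaming path below is an assumption mirroring the existing /api/v1 layout, not necessarily the endpoint exposed by the PR):

    # Poll the driver UI's REST API. Only /applications is an existing
    # spark-core endpoint; the streaming path is hypothetical.
    import json
    from urllib.request import urlopen

    BASE = "http://localhost:4040/api/v1"  # default driver UI port; adjust as needed

    def get_json(path):
        with urlopen(BASE + path) as resp:
            return json.loads(resp.read().decode("utf-8"))

    app_id = get_json("/applications")[0]["id"]

    # Hypothetical endpoint registered through StreamingTab:
    stats = get_json("/applications/%s/streaming/statistics" % app_id)
    print(stats)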


On 11/8/16 1:46 PM, Chan Chor Pang wrote:


Thank you.

This should take me at least a few days; I will let you know as soon
as the PR is ready.



On 11/8/16 11:44 AM, Tathagata Das wrote:
This may be a good addition. I suggest you read our guidelines on 
contributing code to Spark.


https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-PreparingtoContributeCodeChanges

It's a long document, but it should have everything you need to figure out
how to contribute your changes. I hope to see your changes in a
GitHub PR soon!


TD

On Mon, Nov 7, 2016 at 5:30 PM, Chan Chor Pang wrote:


Hi everyone,

It seems that not many people are interested in creating an API
for Streaming. Nevertheless, I still really want the API for
monitoring, so I tried to see if I can implement it on my own.

After some testing, I believe I can achieve the goal by:
1. implementing a package (org.apache.spark.streaming.status.api.v1)
that serves the same purpose as org.apache.spark.status.api.v1,
2. registering the API path through StreamingTab, and
3. retrieving the streaming information through
StreamingJobProgressListener.

My biggest concern now is whether my implementation will be able to
be merged upstream.

I'm new to open source projects, so could anyone please shed some
light on this? How should I proceed to make my implementation
mergeable upstream?


Here is my test code, based on v1.6.0:
###
diff --git a/streaming/src/main/scala/org/apache/spark/streaming/status/api/v1/JacksonMessageWriter.scala b/streaming/src/main/scala/org/apache/spark/streaming/status/api/v1/JacksonMessageWriter.scala
new file mode 100644
index 000..690e2d8
--- /dev/null
+++ b/streaming/src/main/scala/org/apache/spark/streaming/status/api/v1/JacksonMessageWriter.scala
@@ -0,0 +1,68 @@
+package org.apache.spark.streaming.status.api.v1
+
+import java.io.OutputStream
+import java.lang.annotation.Annotation
+import java.lang.reflect.Type
+import java.text.SimpleDateFormat
+import java.util.{Calendar, SimpleTimeZone}
+import javax.ws.rs.Produces
+import javax.ws.rs.core.{MediaType, MultivaluedMap}
+import javax.ws.rs.ext.{MessageBodyWriter, Provider}
+
+import com.fasterxml.jackson.annotation.JsonInclude
+import com.fasterxml.jackson.databind.{ObjectMapper, SerializationFeature}
+
+@Provider
+@Produces(Array(MediaType.APPLICATION_JSON))
+private[v1] class JacksonMessageWriter extends MessageBodyWriter[Object]{
+
+  val mapper = new ObjectMapper() {
+override def writeValueAsString(t: Any): String = {
+  super.writeValueAsString(t)
+}
+  }
+  mapper.registerModule(com.fasterxml.jackson.module.scala.DefaultScalaModule)
+  mapper.enable(SerializationFeature.INDENT_OUTPUT)
+  mapper.setSerializationInclusion(JsonInclude.Include.NON_NULL)
+  mapper.setDateFormat(JacksonMessageWriter.makeISODateFormat)
+
+  override def isWriteable(
+  aClass: Class[_],
+  `type`: Type,
+  annotations: Array[Annotation],
+  mediaType: MediaType): Boolean = {
+  true
+  }
+
+  override def writeTo(
+  t: Object,
+  aClass: Class[_],
+  `type`: Type,
+  annotations: Array[Annotation],
+  mediaType: MediaType,
+  multivaluedMap: MultivaluedMap[String, AnyRef],
+  outputStream: OutputStream): Unit = {
+t match {
+  //case ErrorWrapper(err) => outputStream.write(err.getBytes("utf-8"))
+  case _ => mapper.writeValue(outputStream, t)
+}
+  }
+
+  override def getSize(
+  t: Object,
+  aClass: Class[_],
+  `type`: Type,
+  annotations: Array[Annotation],
+  mediaType: MediaType): Long = {
+-1L
+  }
+}
+
+private[spark] object JacksonMessageWriter {
+  def makeISODateFormat: SimpleDateFormat = {
+val iso8601 = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS'GMT'")
+val cal = Calendar.getInstance(new SimpleTimeZone(0, "GMT"))
+iso8601.setCalendar(cal)
+iso8601
+  }
+}
diff --git a/streaming/src/main/scala/org/apache/spark/streaming/status/api/v1/StreamingApiRootResource.scala b/streaming/src/main/scala/org/apache/spark/streaming/status/api/v1/StreamingApiRootResource.scala
new file mode 100644
index 000..f4e43dd

Hidden Markov Model or Bayes Networks in Spark - MS Thesis theme

2016-12-01 Thread Alex153
As part of my MS thesis project (in computer science), I am looking for a chance
to implement some machine learning or data mining algorithms. Are there good
ideas for this? Are there unimplemented algorithms that would be a great
contribution to the project?

I am thinking about Hidden Markov Models and/or Bayes Networks. Would there be
interest in them? I can consider ideas for other projects or algorithms as
well.

A,




-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org