Re: Develop custom Estimator / Transformer for pipeline

2016-11-17 Thread Georg Heiler
Looking forward to the blog post.
Thanks for pointing me to some of the simpler classes.
Nick Pentreath  wrote on Fri, 18 Nov 2016 at
02:53:

> @Holden look forward to the blog post - I think a user guide PR based on
> it would also be super useful :)
>
>
> On Fri, 18 Nov 2016 at 05:29 Holden Karau  wrote:
>
> I've been working on a blog post around this and hope to have it published
> early next month 
>
> On Nov 17, 2016 10:16 PM, "Joseph Bradley"  wrote:
>
> Hi Georg,
>
> It's true we need better documentation for this.  I'd recommend checking
> out simple algorithms within Spark for examples:
> ml.feature.Tokenizer
> ml.regression.IsotonicRegression
>
> You should not need to put your library in Spark's namespace.  The shared
> Params in SPARK-7146 are not necessary to create a custom algorithm; they
> are just niceties.
>
> Though there aren't great docs yet, you should be able to follow existing
> examples.  And I'd like to add more docs in the future!
>
> Good luck,
> Joseph
>
> On Wed, Nov 16, 2016 at 6:29 AM, Georg Heiler 
> wrote:
>
> Hi,
>
> I want to develop a library with custom Estimators / Transformers for
> Spark. So far I have not found much documentation, apart from
> http://stackoverflow.com/questions/37270446/how-to-roll-a-custom-estimator-in-pyspark-mllib
>
> which suggests that:
> Generally speaking, there is no documentation because, as of Spark 1.6 /
> 2.0, most of the related API is not intended to be public. It should
> change in Spark 2.1.0 (see SPARK-7146).
>
> Where can I already find documentation today?
> Is it true that my library would need to reside in Spark's namespace,
> similar to https://github.com/collectivemedia/spark-ext, to utilize all
> the handy functionality?
>
> Kind Regards,
> Georg
>
>
>
>


Re: Develop custom Estimator / Transformer for pipeline

2016-11-17 Thread Nick Pentreath
@Holden look forward to the blog post - I think a user guide PR based on it
would also be super useful :)

On Fri, 18 Nov 2016 at 05:29 Holden Karau  wrote:

> I've been working on a blog post around this and hope to have it published
> early next month 
>
> On Nov 17, 2016 10:16 PM, "Joseph Bradley"  wrote:
>
> Hi Georg,
>
> It's true we need better documentation for this.  I'd recommend checking
> out simple algorithms within Spark for examples:
> ml.feature.Tokenizer
> ml.regression.IsotonicRegression
>
> You should not need to put your library in Spark's namespace.  The shared
> Params in SPARK-7146 are not necessary to create a custom algorithm; they
> are just niceties.
>
> Though there aren't great docs yet, you should be able to follow existing
> examples.  And I'd like to add more docs in the future!
>
> Good luck,
> Joseph
>
> On Wed, Nov 16, 2016 at 6:29 AM, Georg Heiler 
> wrote:
>
> Hi,
>
> I want to develop a library with custom Estimators / Transformers for
> Spark. So far I have not found much documentation, apart from
> http://stackoverflow.com/questions/37270446/how-to-roll-a-custom-estimator-in-pyspark-mllib
>
> which suggests that:
> Generally speaking, there is no documentation because, as of Spark 1.6 /
> 2.0, most of the related API is not intended to be public. It should
> change in Spark 2.1.0 (see SPARK-7146).
>
> Where can I already find documentation today?
> Is it true that my library would need to reside in Spark's namespace,
> similar to https://github.com/collectivemedia/spark-ext, to utilize all
> the handy functionality?
>
> Kind Regards,
> Georg
>
>
>
>


Re: [build system] massive jenkins infrastructure changes forthcoming

2016-11-17 Thread Reynold Xin
Thanks for the headsup, Shane.


On Thu, Nov 17, 2016 at 2:33 PM, shane knapp  wrote:

> TL;DR:  amplab is becoming riselab, and is much more C++ oriented.
> centos 6 is so far behind, and i'm already having to roll C++
> compilers and various libraries by hand.  centos 7 is an absolute
> no-go, so we'll be moving the jenkins workers over to a recent (TBD)
> version of ubuntu server.  also, we'll finally get jenkins upgraded to
> the latest LTS version, as well as our insanely out of date plugins.
> riselab (me) will still run the build system, btw.
>
> oh, we'll also have a macOS worker!
>
> well, that was still pretty long.  :)
>
> anyways, you have the gist of it.  this is something we're going to do
> slowly, so as to not interrupt any spark, alluxio or lab builds.
>
> i'll be spinning up a master and two worker ubuntu nodes, and then
> port a couple of builds over and get the major kinks worked out.
> then, early next year, we can point the new master at the old workers,
> and one-by-one reinstall and deploy them w/ubuntu.
>
> i'll be reaching out to some individuals (you know who you are) as
> things progress.
>
> if we do this right, we'll have minimal service interruptions and end
> up w/a clean and fresh jenkins.  this is the opposite of our current
> jenkins, which is at least 4 years old and is super-glued and
> duct-taped together.
>
> the ubuntu staging servers should be ready early next week, but i
> don't foresee much work happening until after thanksgiving.
>


[build system] massive jenkins infrastructure changes forthcoming

2016-11-17 Thread shane knapp
TL;DR:  amplab is becoming riselab, and is much more C++ oriented.
centos 6 is so far behind, and i'm already having to roll C++
compilers and various libraries by hand.  centos 7 is an absolute
no-go, so we'll be moving the jenkins workers over to a recent (TBD)
version of ubuntu server.  also, we'll finally get jenkins upgraded to
the latest LTS version, as well as our insanely out of date plugins.
riselab (me) will still run the build system, btw.

oh, we'll also have a macOS worker!

well, that was still pretty long.  :)

anyways, you have the gist of it.  this is something we're going to do
slowly, so as to not interrupt any spark, alluxio or lab builds.

i'll be spinning up a master and two worker ubuntu nodes, and then
port a couple of builds over and get the major kinks worked out.
then, early next year, we can point the new master at the old workers,
and one-by-one reinstall and deploy them w/ubuntu.

i'll be reaching out to some individuals (you know who you are) as
things progress.

if we do this right, we'll have minimal service interruptions and end
up w/a clean and fresh jenkins.  this is the opposite of our current
jenkins, which is at least 4 years old and is super-glued and
duct-taped together.

the ubuntu staging servers should be ready early next week, but i
don't foresee much work happening until after thanksgiving.




Re: Green dot in web UI DAG visualization

2016-11-17 Thread Nicholas Chammas
https://issues.apache.org/jira/browse/SPARK-18495

On Thu, Nov 17, 2016 at 12:23 PM Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> Nice catch Suhas, and thanks for the reference. Sounds like we need a
> tweak to the UI so this little feature is self-documenting.
>
> Will file a JIRA, unless someone else wants to take this one and file the
> JIRA themselves.
>
> On Thu, Nov 17, 2016 at 12:21 PM Suhas Gaddam 
> wrote:
>
> "Second, one of the RDDs is cached in the first stage (denoted by the
> green highlight). Since the enclosing operation involves reading from HDFS,
> caching this RDD means future computations on this RDD can access at least
> a subset of the original file from memory instead of from HDFS."
>
> from
> https://databricks.com/blog/2015/06/22/understanding-your-spark-application-through-visualization.html
>
> On Thu, Nov 17, 2016 at 9:19 AM, Reynold Xin  wrote:
>
> Ha funny. Never noticed that.
>
>
> On Thursday, November 17, 2016, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
> Hmm... somehow the image didn't show up.
>
> How about now?
>
> [image: Screen Shot 2016-11-17 at 11.57.14 AM.png]
>
> On Thu, Nov 17, 2016 at 12:14 PM Herman van Hövell tot Westerflier <
> hvanhov...@databricks.com> wrote:
>
> Should I be able to see something?
>
> On Thu, Nov 17, 2016 at 9:10 AM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
> Some questions about this DAG visualization:
>
> [image: Screen Shot 2016-11-17 at 11.57.14 AM.png]
>
> 1. What's the meaning of the green dot?
> 2. Should this be documented anywhere (if it isn't already)? Preferably a
> tooltip or something directly in the UI would explain the significance.
>
> Nick
>
>
>
>


Re: issues with github pull request notification emails missing

2016-11-17 Thread Xiao Li
Just FYI: normally, when we ping a person, GitHub shows their full
name after we type the GitHub ID. Below is an example:

[image: inline image 2]

Starting from last week, Reynold's full name is no longer shown. Did GitHub
update their hash functions?

[image: inline image 1]

Thanks,

Xiao Li



2016-11-16 23:30 GMT-08:00 Holden Karau :

> +1 it seems like I'm missing a number of my GitHub email notifications
> lately (although since I run my own mail server and forward I've been
> assuming it's my own fault).
>
>
> I've also had issues with having greatly delayed notifications on some of
> my own pull requests but that might be unrelated.
>
> On Thu, Nov 17, 2016 at 8:20 AM Reynold Xin  wrote:
>
>> I've noticed that a lot of github pull request notifications no longer
>> come to my inbox. In the past I'd get an email for every reply to a pull
>> request that I subscribed to (i.e. commented on). Lately I noticed for a
>> lot of them I didn't get any emails, but if I opened the pull requests
>> directly on github, I'd see the new replies. I've looked at the spam folder and
>> none of the missing notifications are there. So it's either github not
>> sending the notifications, or the emails are lost in transit.
>>
>> The way it manifests is that I often comment on a pull request, and then
>> I don't know whether the contributor (author) has updated it or not. From
>> the contributor's point of view, it looks like I've been ignoring the pull
>> request.
>>
>> I think this started happening when github switched over to the new code
>> review mode ( https://github.com/blog/2256-a-whole-new-github-universe-announcing-new-tools-forums-and-features )
>>
>>
>> Did anybody else notice this issue?
>>
>>
>>


Re: Develop custom Estimator / Transformer for pipeline

2016-11-17 Thread Holden Karau
I've been working on a blog post around this and hope to have it published
early next month 

On Nov 17, 2016 10:16 PM, "Joseph Bradley"  wrote:

Hi Georg,

It's true we need better documentation for this.  I'd recommend checking
out simple algorithms within Spark for examples:
ml.feature.Tokenizer
ml.regression.IsotonicRegression

You should not need to put your library in Spark's namespace.  The shared
Params in SPARK-7146 are not necessary to create a custom algorithm; they
are just niceties.

Though there aren't great docs yet, you should be able to follow existing
examples.  And I'd like to add more docs in the future!

Good luck,
Joseph

On Wed, Nov 16, 2016 at 6:29 AM, Georg Heiler 
wrote:

> Hi,
>
> I want to develop a library with custom Estimators / Transformers for
> Spark. So far I have not found much documentation, apart from
> http://stackoverflow.com/questions/37270446/how-to-roll-a-custom-estimator-in-pyspark-mllib
>
> which suggests that:
> Generally speaking, there is no documentation because, as of Spark 1.6 /
> 2.0, most of the related API is not intended to be public. It should
> change in Spark 2.1.0 (see SPARK-7146).
>
> Where can I already find documentation today?
> Is it true that my library would need to reside in Spark's namespace,
> similar to https://github.com/collectivemedia/spark-ext, to utilize all
> the handy functionality?
>
> Kind Regards,
> Georg
>


Re: Develop custom Estimator / Transformer for pipeline

2016-11-17 Thread Joseph Bradley
Hi Georg,

It's true we need better documentation for this.  I'd recommend checking
out simple algorithms within Spark for examples:
ml.feature.Tokenizer
ml.regression.IsotonicRegression

You should not need to put your library in Spark's namespace.  The shared
Params in SPARK-7146 are not necessary to create a custom algorithm; they
are just niceties.

Though there aren't great docs yet, you should be able to follow existing
examples.  And I'd like to add more docs in the future!

Good luck,
Joseph

On Wed, Nov 16, 2016 at 6:29 AM, Georg Heiler 
wrote:

> Hi,
>
> I want to develop a library with custom Estimators / Transformers for
> Spark. So far I have not found much documentation, apart from
> http://stackoverflow.com/questions/37270446/how-to-roll-a-custom-estimator-in-pyspark-mllib
>
> which suggests that:
> Generally speaking, there is no documentation because, as of Spark 1.6 /
> 2.0, most of the related API is not intended to be public. It should
> change in Spark 2.1.0 (see SPARK-7146).
>
> Where can I already find documentation today?
> Is it true that my library would need to reside in Spark's namespace,
> similar to https://github.com/collectivemedia/spark-ext, to utilize all
> the handy functionality?
>
> Kind Regards,
> Georg
>
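
A minimal sketch of the kind of custom Transformer Joseph describes, modeled
on ml.feature.Tokenizer. It is an illustration only: the class name, column
handling and behaviour are invented, and it assumes Spark 2.0's public
(DeveloperApi) UnaryTransformer base class is usable from a library outside
Spark's namespace.

import org.apache.spark.ml.UnaryTransformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.types.{DataType, StringType}

// Illustrative Transformer living outside Spark's namespace: upper-cases a string column.
class UpperCaser(override val uid: String)
    extends UnaryTransformer[String, String, UpperCaser] {

  def this() = this(Identifiable.randomUID("upperCaser"))

  // The function applied to every value of the input column.
  override protected def createTransformFunc: String => String = _.toUpperCase

  // Reject input columns that are not strings.
  override protected def validateInputType(inputType: DataType): Unit =
    require(inputType == StringType, s"Input type must be StringType but got $inputType.")

  override protected def outputDataType: DataType = StringType

  override def copy(extra: ParamMap): UpperCaser = defaultCopy(extra)
}

It can then be used like any other pipeline stage, e.g.
new UpperCaser().setInputCol("text").setOutputCol("shouted").transform(df).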


Re: How do I convert json_encoded_blob_column into a data frame? (This may be a feature request)

2016-11-17 Thread Reynold Xin
Adding a new data type is an enormous undertaking and very invasive. I
don't think it is worth it in this case given there are clear, simple
workarounds.


On Thu, Nov 17, 2016 at 12:24 PM, kant kodali  wrote:

> Can we have a JSONType for Spark SQL?
>
> On Wed, Nov 16, 2016 at 8:41 PM, Nathan Lande 
> wrote:
>
>> If you are dealing with a bunch of different schemas in one field, figuring
>> out a strategy to deal with that will depend on your data and does not
>> really have anything to do with Spark, since mapping your JSON payloads to
>> tractable data structures will depend on business logic.
>>
>> The strategy of pulling a blob out into its own RDD and feeding it into
>> the JSON loader should work for any data source once you have your data
>> strategy figured out.
>>
>> On Wed, Nov 16, 2016 at 4:39 PM, kant kodali  wrote:
>>
>>> 1. I have a Cassandra table where one of the columns is a blob, and this
>>> blob contains a JSON-encoded string. However, not all the blobs across the
>>> Cassandra table for that column are the same (some blobs have different JSON
>>> structures than others), so in that case what is the best way to approach it? Do we
>>> need to group all the JSON blobs that have the same structure (same keys)
>>> into individual data frames? For example, if I have 5 JSON blobs
>>> that share one structure and another 3 JSON blobs that belong to some
>>> other structure, do I need to create two data frames in this case? (Attached
>>> is a screenshot of 2 rows showing what my JSON looks like.)
>>> 2. In my case, toJSON on the RDD doesn't seem to help a lot. Attached a
>>> screenshot. It looks like I got the same data frame as my original one.
>>>
>>> Thanks much for these examples.
>>>
>>>
>>>
>>> On Wed, Nov 16, 2016 at 2:54 PM, Nathan Lande 
>>> wrote:
>>>
 I'm looking forward to 2.1 but, in the meantime, you can pull out the
 specific column into an RDD of JSON objects, pass this RDD into the
 read.json() and then join the results back onto your initial DF.

 Here is an example of what we do to unpack headers from Avro log data:

 from pyspark.sql.functions import col, concat, decode, lit

 def jsonLoad(path):
     #
     # load in the df
     raw = (sqlContext.read
            .format('com.databricks.spark.avro')
            .load(path)
            )
     #
     # define json blob, add primary key elements (hi and lo)
     #
     JSONBlob = concat(
         lit('{'),
         concat(lit('"lo":'), col('header.eventId.lo').cast('string'),
                lit(',')),
         concat(lit('"hi":'), col('header.eventId.hi').cast('string'),
                lit(',')),
         concat(lit('"response":'), decode('requestResponse.response',
                'UTF-8')),
         lit('}')
     )
     #
     # extract the JSON blob as a string
     rawJSONString = raw.select(JSONBlob).rdd.map(lambda x: str(x[0]))
     #
     # transform the JSON string into a DF struct object
     structuredJSON = sqlContext.read.json(rawJSONString)
     #
     # join the structured JSON back onto the initial DF using the hi and
     # lo join keys
     final = (raw.join(structuredJSON,
                       ((raw['header.eventId.lo'] == structuredJSON['lo']) &
                        (raw['header.eventId.hi'] == structuredJSON['hi'])),
                       'left_outer')
              .drop('hi')
              .drop('lo')
              )
     #
     # win
     return final

 On Wed, Nov 16, 2016 at 10:50 AM, Michael Armbrust <
 mich...@databricks.com> wrote:

> On Wed, Nov 16, 2016 at 2:49 AM, Hyukjin Kwon 
> wrote:
>
>> Maybe it sounds like you are looking for from_json/to_json functions
>> after en/decoding properly.
>>
>
> Which are new built-in functions that will be released with Spark 2.1.
>


>>>
>>
>
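
One concrete shape of the Spark 2.1 from_json route mentioned above, as a
spark-shell style sketch. Everything here is assumed for illustration: the
column name json_blob, the schema fields, and that the blob has already been
decoded to a UTF-8 string.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

// Hypothetical schema for one family of JSON blobs found in the column.
val blobSchema = new StructType()
  .add("user", StringType)
  .add("count", IntegerType)

// df is assumed to have a string column "json_blob"; blobs with a different
// structure would come back as nulls here, or need a second schema / pass.
def parseBlobs(df: DataFrame): DataFrame =
  df.withColumn("parsed", from_json(col("json_blob"), blobSchema))
    .select(col("parsed.user"), col("parsed.count"))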


Re: How do I convert json_encoded_blob_column into a data frame? (This may be a feature request)

2016-11-17 Thread kant kodali
Can we have a JSONType for Spark SQL?

On Wed, Nov 16, 2016 at 8:41 PM, Nathan Lande  wrote:

> If you are dealing with a bunch of different schemas in one field, figuring
> out a strategy to deal with that will depend on your data and does not
> really have anything to do with Spark, since mapping your JSON payloads to
> tractable data structures will depend on business logic.
>
> The strategy of pulling a blob out into its own RDD and feeding it into the
> JSON loader should work for any data source once you have your data
> strategy figured out.
>
> On Wed, Nov 16, 2016 at 4:39 PM, kant kodali  wrote:
>
>> 1. I have a Cassandra table where one of the columns is a blob, and this
>> blob contains a JSON-encoded string. However, not all the blobs across the
>> Cassandra table for that column are the same (some blobs have different JSON
>> structures than others), so in that case what is the best way to approach it? Do we
>> need to group all the JSON blobs that have the same structure (same keys)
>> into individual data frames? For example, if I have 5 JSON blobs
>> that share one structure and another 3 JSON blobs that belong to some
>> other structure, do I need to create two data frames in this case? (Attached
>> is a screenshot of 2 rows showing what my JSON looks like.)
>> 2. In my case, toJSON on the RDD doesn't seem to help a lot. Attached a
>> screenshot. It looks like I got the same data frame as my original one.
>>
>> Thanks much for these examples.
>>
>>
>>
>> On Wed, Nov 16, 2016 at 2:54 PM, Nathan Lande 
>> wrote:
>>
>>> I'm looking forward to 2.1 but, in the meantime, you can pull out the
>>> specific column into an RDD of JSON objects, pass this RDD into the
>>> read.json() and then join the results back onto your initial DF.
>>>
>>> Here is an example of what we do to unpack headers from Avro log data:
>>>
>>> from pyspark.sql.functions import col, concat, decode, lit
>>>
>>> def jsonLoad(path):
>>>     #
>>>     # load in the df
>>>     raw = (sqlContext.read
>>>            .format('com.databricks.spark.avro')
>>>            .load(path)
>>>            )
>>>     #
>>>     # define json blob, add primary key elements (hi and lo)
>>>     #
>>>     JSONBlob = concat(
>>>         lit('{'),
>>>         concat(lit('"lo":'), col('header.eventId.lo').cast('string'),
>>>                lit(',')),
>>>         concat(lit('"hi":'), col('header.eventId.hi').cast('string'),
>>>                lit(',')),
>>>         concat(lit('"response":'), decode('requestResponse.response',
>>>                'UTF-8')),
>>>         lit('}')
>>>     )
>>>     #
>>>     # extract the JSON blob as a string
>>>     rawJSONString = raw.select(JSONBlob).rdd.map(lambda x: str(x[0]))
>>>     #
>>>     # transform the JSON string into a DF struct object
>>>     structuredJSON = sqlContext.read.json(rawJSONString)
>>>     #
>>>     # join the structured JSON back onto the initial DF using the hi and
>>>     # lo join keys
>>>     final = (raw.join(structuredJSON,
>>>                       ((raw['header.eventId.lo'] == structuredJSON['lo']) &
>>>                        (raw['header.eventId.hi'] == structuredJSON['hi'])),
>>>                       'left_outer')
>>>              .drop('hi')
>>>              .drop('lo')
>>>              )
>>>     #
>>>     # win
>>>     return final
>>>
>>> On Wed, Nov 16, 2016 at 10:50 AM, Michael Armbrust <
>>> mich...@databricks.com> wrote:
>>>
 On Wed, Nov 16, 2016 at 2:49 AM, Hyukjin Kwon 
 wrote:

> Maybe it sounds like you are looking for from_json/to_json functions
> after en/decoding properly.
>

 Which are new built-in functions that will be released with Spark 2.1.

>>>
>>>
>>
>


Jackson Spark/app incompatibility and how to resolve it

2016-11-17 Thread Michael Allman
Hello,

I'm running into an issue with a Spark app I'm building, which depends on a 
library that depends on Jackson 2.8, and which fails at runtime because Spark 
brings in Jackson 2.6. I'm looking for a solution. As a workaround, I've 
patched our build of Spark to use Jackson 2.8. That's working; however, given 
all the trouble associated with attempting a Jackson upgrade in the past (see 
https://issues.apache.org/jira/browse/SPARK-14989 and 
https://github.com/apache/spark/pull/13417), I'm wondering if I should submit 
a PR for that. Is shading Spark's Jackson deps another option? Any other 
suggestions for an acceptable way to fix this incompatibility with apps using a 
newer version of Jackson?

FWIW, Jackson claims to support backward compatibility within minor releases 
(https://github.com/FasterXML/jackson-docs#on-jackson-versioning). So in 
theory, apps that depend on an upgraded Spark version should work even if they 
ask for an older version.

Cheers,

Michael
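
On the shading question: one common way to attack this from the application
side, instead of patching Spark, is to relocate the app's own Jackson 2.8
under a different package so it no longer collides with the Jackson 2.6 on
Spark's classpath. A rough sketch, assuming the app is packaged with
sbt-assembly 0.14+; the relocation prefix is arbitrary:

// build.sbt fragment: shade this application's Jackson classes (and the
// copies pulled in by its dependencies) so they don't clash with Spark's 2.6.
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("com.fasterxml.jackson.**" -> "shadedjackson.@1").inAll
)

A maven-shade-plugin relocation does the same job. It only helps when it is
the application, not Spark itself, that needs the newer Jackson, which is
where the SPARK-14989 discussion comes in.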

Re: Green dot in web UI DAG visualization

2016-11-17 Thread Nicholas Chammas
Nice catch Suhas, and thanks for the reference. Sounds like we need a tweak
to the UI so this little feature is self-documenting.

Will file a JIRA, unless someone else wants to take this one and file the
JIRA themselves.

On Thu, Nov 17, 2016 at 12:21 PM Suhas Gaddam 
wrote:

> "Second, one of the RDDs is cached in the first stage (denoted by the
> green highlight). Since the enclosing operation involves reading from HDFS,
> caching this RDD means future computations on this RDD can access at least
> a subset of the original file from memory instead of from HDFS."
>
> from
> https://databricks.com/blog/2015/06/22/understanding-your-spark-application-through-visualization.html
>
> On Thu, Nov 17, 2016 at 9:19 AM, Reynold Xin  wrote:
>
> Ha funny. Never noticed that.
>
>
> On Thursday, November 17, 2016, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
> Hmm... somehow the image didn't show up.
>
> How about now?
>
> [image: Screen Shot 2016-11-17 at 11.57.14 AM.png]
>
> On Thu, Nov 17, 2016 at 12:14 PM Herman van Hövell tot Westerflier <
> hvanhov...@databricks.com> wrote:
>
> Should I be able to see something?
>
> On Thu, Nov 17, 2016 at 9:10 AM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
> Some questions about this DAG visualization:
>
> [image: Screen Shot 2016-11-17 at 11.57.14 AM.png]
>
> 1. What's the meaning of the green dot?
> 2. Should this be documented anywhere (if it isn't already)? Preferably a
> tooltip or something directly in the UI would explain the significance.
>
> Nick
>
>
>
>


Re: Green dot in web UI DAG visualization

2016-11-17 Thread Suhas Gaddam
"Second, one of the RDDs is cached in the first stage (denoted by the green
highlight). Since the enclosing operation involves reading from HDFS,
caching this RDD means future computations on this RDD can access at least
a subset of the original file from memory instead of from HDFS."

from
https://databricks.com/blog/2015/06/22/understanding-your-spark-application-through-visualization.html

On Thu, Nov 17, 2016 at 9:19 AM, Reynold Xin  wrote:

> Ha funny. Never noticed that.
>
>
> On Thursday, November 17, 2016, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> Hmm... somehow the image didn't show up.
>>
>> How about now?
>>
>> [image: Screen Shot 2016-11-17 at 11.57.14 AM.png]
>>
>> On Thu, Nov 17, 2016 at 12:14 PM Herman van Hövell tot Westerflier <
>> hvanhov...@databricks.com> wrote:
>>
>>> Should I be able to see something?
>>>
>>> On Thu, Nov 17, 2016 at 9:10 AM, Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>>
 Some questions about this DAG visualization:

 [image: Screen Shot 2016-11-17 at 11.57.14 AM.png]

 1. What's the meaning of the green dot?
 2. Should this be documented anywhere (if it isn't already)? Preferably
 a tooltip or something directly in the UI would explain the significance.

 Nick


>>>


Re: Green dot in web UI DAG visualization

2016-11-17 Thread Reynold Xin
Ha funny. Never noticed that.

On Thursday, November 17, 2016, Nicholas Chammas 
wrote:

> Hmm... somehow the image didn't show up.
>
> How about now?
>
> [image: Screen Shot 2016-11-17 at 11.57.14 AM.png]
>
> On Thu, Nov 17, 2016 at 12:14 PM Herman van Hövell tot Westerflier <
> hvanhov...@databricks.com
> > wrote:
>
>> Should I be able to see something?
>>
>> On Thu, Nov 17, 2016 at 9:10 AM, Nicholas Chammas <
>> nicholas.cham...@gmail.com
>> > wrote:
>>
>>> Some questions about this DAG visualization:
>>>
>>> [image: Screen Shot 2016-11-17 at 11.57.14 AM.png]
>>>
>>> 1. What's the meaning of the green dot?
>>> 2. Should this be documented anywhere (if it isn't already)? Preferably
>>> a tooltip or something directly in the UI would explain the significance.
>>>
>>> Nick
>>>
>>>
>>


Re: Green dot in web UI DAG visualization

2016-11-17 Thread Nicholas Chammas
Hmm... somehow the image didn't show up.

How about now?

[image: Screen Shot 2016-11-17 at 11.57.14 AM.png]

On Thu, Nov 17, 2016 at 12:14 PM Herman van Hövell tot Westerflier <
hvanhov...@databricks.com> wrote:

Should I be able to see something?

On Thu, Nov 17, 2016 at 9:10 AM, Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

Some questions about this DAG visualization:

[image: Screen Shot 2016-11-17 at 11.57.14 AM.png]

1. What's the meaning of the green dot?
2. Should this be documented anywhere (if it isn't already)? Preferably a
tooltip or something directly in the UI would explain the significance.

Nick


Re: Green dot in web UI DAG visualization

2016-11-17 Thread Herman van Hövell tot Westerflier
Should I be able to see something?

On Thu, Nov 17, 2016 at 9:10 AM, Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> Some questions about this DAG visualization:
>
> [image: Screen Shot 2016-11-17 at 11.57.14 AM.png]
>
> 1. What's the meaning of the green dot?
> 2. Should this be documented anywhere (if it isn't already)? Preferably a
> tooltip or something directly in the UI would explain the significance.
>
> Nick
>
>


Green dot in web UI DAG visualization

2016-11-17 Thread Nicholas Chammas
Some questions about this DAG visualization:

[image: Screen Shot 2016-11-17 at 11.57.14 AM.png]

1. What's the meaning of the green dot?
2. Should this be documented anywhere (if it isn't already)? Preferably a
tooltip or something directly in the UI would explain the significance.

Nick


Re: structured streaming and window functions

2016-11-17 Thread HENSLEE, AUSTIN L
Forgive the slight tangent…

For anyone following this thread who may be wondering about a quick, simple 
solution they can apply (and a walk-through of how) for more straightforward 
sessionization needs:

There’s a nice section on sessionization in “Advanced Analytics with Spark”, by 
Ryza, Laserson, Owen, and Wills (starts on p.167).

It was excellent for my job, which needed to take events, get them in time order, 
and calculate the time between them (that particular job’s definition of a “session”).

I used their groupByKeyAndSortValues() function.

As the authors state, “Work is progressing on Spark JIRA SPARK-3655 to add a 
transformation like this to core Spark.”
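
For readers without the book at hand, the underlying trick is the classic
secondary sort: key by (user, time), partition by user only, and sort within
partitions so each user's events can be walked in time order. A rough Scala
sketch of that idea; the Event type, field names and helper are invented here,
and this is not the book's code:

import org.apache.spark.rdd.RDD
import org.apache.spark.{HashPartitioner, Partitioner}

// Illustrative event type; the field names are made up.
case class Event(user: String, time: Long)

// Partition by the user part of a (user, time) key, so each user's events land
// in one partition while the full key still drives the within-partition sort.
class UserPartitioner(partitions: Int) extends Partitioner {
  private val delegate = new HashPartitioner(partitions)
  override def numPartitions: Int = partitions
  override def getPartition(key: Any): Int = key match {
    case (user: String, _) => delegate.getPartition(user)
  }
}

object SessionGaps {
  // Time between consecutive events per user, computed in event-time order.
  def gapsPerUser(events: RDD[Event], partitions: Int = 200): RDD[(String, Long)] =
    events
      .map(e => ((e.user, e.time), e)) // composite key: (user, time)
      .repartitionAndSortWithinPartitions(new UserPartitioner(partitions))
      .mapPartitions { iter =>
        var prev: Option[Event] = None // per-partition running state
        iter.flatMap { case ((user, _), e) =>
          val gap = prev.collect { case p if p.user == user => user -> (e.time - p.time) }
          prev = Some(e)
          gap
        }
      }
}

This is roughly the shape of computation a groupByKeyAndSortValues-style
helper packages up; SPARK-3655 tracks adding such a transformation to core
Spark.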

From: Ofir Manor 
Date: Thursday, November 17, 2016 at 9:57 AM
To: "assaf.mendelson" 
Cc: dev 
Subject: Re: structured streaming and window functions

I agree with you. I think that once we have sessionization, we could aim 
for richer processing capabilities per session. As far as I imagine it, a session 
is an ordered sequence of data that we could apply computation to (like 
CEP).



Ofir Manor

Co-Founder & CTO | Equalum

Mobile: +972-54-7801286 | Email: 
ofir.ma...@equalum.io

On Thu, Nov 17, 2016 at 5:16 PM, assaf.mendelson 
> wrote:
It is true that this is sessionizing, but I brought it as an example of finding 
an ordered pattern in the data.
In general, using a simple window (e.g. 24 hours) in structured streaming is 
explained in the grouping by time and is very clear.
What I was trying to figure out is how to handle streaming cases where you 
actually need some sorting to find patterns, especially when some of 
the data may come in late.
I was trying to figure out whether there is a plan to support this and, if so, 
what the performance implications would be.
Assaf.

From: Ofir Manor [via Apache Spark Developers List] 
[mailto:ml-node+[hidden 
email]]
Sent: Thursday, November 17, 2016 5:13 PM
To: Mendelson, Assaf
Subject: Re: structured streaming and window functions

Assaf, I think what you are describing is actually sessionizing, by user, where 
a session is ended by a successful login event.
For each session, you want to count the number of failed login events.
If so, this is tracked by https://issues.apache.org/jira/browse/SPARK-10816 
(didn't start yet)


Ofir Manor

Co-Founder & CTO | Equalum

Mobile: +972-54-7801286 | Email: [hidden 
email]

On Thu, Nov 17, 2016 at 2:52 PM, assaf.mendelson <[hidden 
email]> wrote:
Is there a plan to support sql window functions?
I will give an example of use: Let’s say we have login logs. What we want to do 
is for each user we would want to add the number of failed logins for each 
successful login. How would you do it with structured streaming?
As this is currently not supported, is there a plan on how to support it in the 
future?
Assaf.

From: Herman van Hövell tot Westerflier-2 [via Apache Spark Developers List] 
[mailto:[hidden 
email][hidden 
email]]
Sent: Thursday, November 17, 2016 1:27 PM
To: Mendelson, Assaf
Subject: Re: structured streaming and window functions

What kind of window functions are we talking about? Structured streaming only 
supports time window aggregates, not the more general sql window function 
(sum(x) over (partition by ... order by ...)) aggregates.

The basic idea is that you use incremental aggregation and store the 
aggregation buffer (not the end result) in a state store after each increment. 
When a new batch comes in, you perform aggregation on that batch, merge the 
result of that aggregation with the buffer in the state store, update the state 
store and return the new result.

This is much harder than it sounds, because you need to maintain state in a 
fault tolerant way and you need to have some eviction policy (watermarks for 
instance) for aggregation buffers to prevent the state store from reaching an 
infinite size.

On Thu, Nov 17, 2016 at 12:19 AM, assaf.mendelson <[hidden 
email]> wrote:
Hi,
I have been trying to figure out how structured streaming handles window 
functions efficiently.
The portion I understand is that whenever new data arrives, it is grouped by 
the time and the aggregated data is added to the state.
However, unlike operations like sum etc. window functions need the original 
data and can change when data arrives late.
So if I understand correctly, this would mean that we would have to save the 
original data and rerun on it to calculate the window function 

Fwd: SparkILoop doesn't run

2016-11-17 Thread Mohit Jaggi
I am trying to use SparkILoop to write some tests (shown below), but the test
hangs with the following stack trace. Any idea what is going on?


import org.apache.log4j.{Level, LogManager}
import org.apache.spark.repl.SparkILoop
import org.scalatest.{BeforeAndAfterAll, FunSuite}

class SparkReplSpec extends FunSuite with BeforeAndAfterAll {

  override def beforeAll(): Unit = {
  }

  override def afterAll(): Unit = {
  }

  test("yay!") {
val rootLogger = LogManager.getRootLogger
val logLevel = rootLogger.getLevel
rootLogger.setLevel(Level.ERROR)

val output = SparkILoop.run(
  """
|println("hello")
  """.stripMargin)

println(s" $output ")

  }
}


/Library/Java/JavaVirtualMachines/jdk1.8.0_66.jdk/Contents/Home/bin/java
-Dspark.master=local[*] -Didea.launcher.port=7532
"-Didea.launcher.bin.path=/Applications/IntelliJ IDEA CE.app/Contents/bin"
-Dfile.encoding=UTF-8 -classpath "/Users/mohit/Library/Application
Support/IdeaIC2016.2/Scala/lib/scala-plugin-runners.jar:/Library/Java/
JavaVirtualMachines/jdk1.8.0_66.jdk/Contents/Home/jre/lib/
charsets.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_
66.jdk/Contents/Home/jre/lib/deploy.jar:/Library/Java/
JavaVirtualMachines/jdk1.8.0_66.jdk/Contents/Home/jre/lib/
ext/cldrdata.jar:/Library/Java/JavaVirtualMachines/jdk1.
8.0_66.jdk/Contents/Home/jre/lib/ext/dnsns.jar:/Library/
Java/JavaVirtualMachines/jdk1.8.0_66.jdk/Contents/Home/jre/
lib/ext/jaccess.jar:/Library/Java/JavaVirtualMachines/jdk1.
8.0_66.jdk/Contents/Home/jre/lib/ext/jfxrt.jar:/Library/
Java/JavaVirtualMachines/jdk1.8.0_66.jdk/Contents/Home/jre/
lib/ext/localedata.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_
66.jdk/Contents/Home/jre/lib/ext/nashorn.jar:/Library/Java/
JavaVirtualMachines/jdk1.8.0_66.jdk/Contents/Home/jre/lib/
ext/sunec.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_
66.jdk/Contents/Home/jre/lib/ext/sunjce_provider.jar:/Library/Java/
JavaVirtualMachines/jdk1.8.0_66.jdk/Contents/Home/jre/lib/
ext/sunpkcs11.jar:/Library/Java/JavaVirtualMachines/jdk1.
8.0_66.jdk/Contents/Home/jre/lib/ext/zipfs.jar:/Library/
Java/JavaVirtualMachines/jdk1.8.0_66.jdk/Contents/Home/jre/
lib/javaws.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_
66.jdk/Contents/Home/jre/lib/jce.jar:/Library/Java/
JavaVirtualMachines/jdk1.8.0_66.jdk/Contents/Home/jre/lib/
jfr.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_
66.jdk/Contents/Home/jre/lib/jfxswt.jar:/Library/Java/
JavaVirtualMachines/jdk1.8.0_66.jdk/Contents/Home/jre/lib/
jsse.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_
66.jdk/Contents/Home/jre/lib/management-agent.jar:/Library/
Java/JavaVirtualMachines/jdk1.8.0_66.jdk/Contents/Home/jre/
lib/plugin.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_
66.jdk/Contents/Home/jre/lib/resources.jar:/Library/Java/
JavaVirtualMachines/jdk1.8.0_66.jdk/Contents/Home/jre/lib/
rt.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_
66.jdk/Contents/Home/lib/ant-javafx.jar:/Library/Java/
JavaVirtualMachines/jdk1.8.0_66.jdk/Contents/Home/lib/dt.jar:/Library/Java/
JavaVirtualMachines/jdk1.8.0_66.jdk/Contents/Home/lib/
javafx-mx.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_
66.jdk/Contents/Home/lib/jconsole.jar:/Library/Java/
JavaVirtualMachines/jdk1.8.0_66.jdk/Contents/Home/lib/
packager.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_
66.jdk/Contents/Home/lib/sa-jdi.jar:/Library/Java/
JavaVirtualMachines/jdk1.8.0_66.jdk/Contents/Home/lib/
tools.jar:/Users/mohit/code/datagears-play/target/scala-2.
11/test-classes:/Users/mohit/code/datagears-play/target/
scala-2.11/classes:/Users/mohit/code/datagears-play/
macros/target/scala-2.11/classes:/Users/mohit/.ivy2/cache/org.xerial.snappy/
snappy-java/bundles/snappy-java-1.1.2.4.jar:/Users/mohit/
.ivy2/cache/org.apache.spark/spark-unsafe_2.11/jars/spark-
unsafe_2.11-2.0.0.jar:/Users/mohit/.ivy2/cache/org.apache.
spark/spark-tags_2.11/jars/spark-tags_2.11-2.0.0.jar:/
Users/mohit/.ivy2/cache/org.apache.spark/spark-streaming_
2.11/jars/spark-streaming_2.11-2.0.0.jar:/Users/mohit/.
ivy2/cache/org.apache.spark/spark-sql_2.11/jars/spark-sql_
2.11-2.0.0.jar:/Users/mohit/.ivy2/cache/org.apache.spark/
spark-sketch_2.11/jars/spark-sketch_2.11-2.0.0.jar:/Users/
mohit/.ivy2/cache/org.apache.spark/spark-repl_2.11/jars/
spark-repl_2.11-2.0.0.jar:/Users/mohit/.ivy2/cache/org.
apache.spark/spark-network-shuffle_2.11/jars/spark-
network-shuffle_2.11-2.0.0.jar:/Users/mohit/.ivy2/cache/
org.apache.spark/spark-network-common_2.11/jars/
spark-network-common_2.11-2.0.0.jar:/Users/mohit/.ivy2/
cache/org.apache.spark/spark-mllib_2.11/jars/spark-mllib_2.
11-2.0.0.jar:/Users/mohit/.ivy2/cache/org.apache.spark/
spark-mllib-local_2.11/jars/spark-mllib-local_2.11-2.0.0.
jar:/Users/mohit/.ivy2/cache/org.apache.spark/spark-
launcher_2.11/jars/spark-launcher_2.11-2.0.0.jar:/
Users/mohit/.ivy2/cache/org.apache.spark/spark-graphx_2.
11/jars/spark-graphx_2.11-2.0.0.jar:/Users/mohit/.ivy2/
cache/org.apache.spark/spark-core_2.11/jars/spark-core_2.

Re: structured streaming and window functions

2016-11-17 Thread Ofir Manor
I agree with you. I think that once we have sessionization, we could
aim for richer processing capabilities per session. As far as I imagine it, a
session is an ordered sequence of data that we could apply computation to
(like CEP).


Ofir Manor

Co-Founder & CTO | Equalum

Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io

On Thu, Nov 17, 2016 at 5:16 PM, assaf.mendelson 
wrote:

> It is true that this is sessionizing, but I brought it as an example of
> finding an ordered pattern in the data.
>
> In general, using a simple window (e.g. 24 hours) in structured streaming is
> explained in the grouping by time and is very clear.
>
> What I was trying to figure out is how to handle streaming cases where you
> actually need some sorting to find patterns, especially when some
> of the data may come in late.
>
> I was trying to figure out whether there is a plan to support this and, if
> so, what the performance implications would be.
>
> Assaf.
>
>
>
> *From:* Ofir Manor [via Apache Spark Developers List] [mailto:ml-node+[hidden
> email] ]
> *Sent:* Thursday, November 17, 2016 5:13 PM
> *To:* Mendelson, Assaf
> *Subject:* Re: structured streaming and window functions
>
>
>
> Assaf, I think what you are describing is actually sessionizing, by user,
> where a session is ended by a successful login event.
>
> For each session, you want to count the number of failed login events.
>
> If so, this is tracked by https://issues.apache.org/
> jira/browse/SPARK-10816 (didn't start yet)
>
>
> Ofir Manor
>
> Co-Founder & CTO | Equalum
>
> Mobile: +972-54-7801286 | Email: [hidden
> email] 
>
>
>
> On Thu, Nov 17, 2016 at 2:52 PM, assaf.mendelson <[hidden email]
> > wrote:
>
> Is there a plan to support SQL window functions?
>
> I will give an example of use: let’s say we have login logs. For each user,
> we want to add the number of failed logins for each successful login. How
> would you do it with structured streaming?
>
> As this is currently not supported, is there a plan for how to support it
> in the future?
>
> Assaf.
>
>
>
> *From:* Herman van Hövell tot Westerflier-2 [via Apache Spark Developers
> List] [mailto:[hidden email]
> [hidden email]
> ]
> *Sent:* Thursday, November 17, 2016 1:27 PM
> *To:* Mendelson, Assaf
> *Subject:* Re: structured streaming and window functions
>
>
>
> What kind of window functions are we talking about? Structured streaming
> only supports time window aggregates, not the more general sql window
> function (sum(x) over (partition by ... order by ...)) aggregates.
>
>
>
> The basic idea is that you use incremental aggregation and store the
> aggregation buffer (not the end result) in a state store after each
> increment. When a new batch comes in, you perform aggregation on that
> batch, merge the result of that aggregation with the buffer in the state
> store, update the state store and return the new result.
>
>
>
> This is much harder than it sounds, because you need to maintain state in
> a fault tolerant way and you need to have some eviction policy (watermarks
> for instance) for aggregation buffers to prevent the state store from
> reaching an infinite size.
>
>
>
> On Thu, Nov 17, 2016 at 12:19 AM, assaf.mendelson <[hidden email]
> > wrote:
>
> Hi,
>
> I have been trying to figure out how structured streaming handles window
> functions efficiently.
>
> The portion I understand is that whenever new data arrives, it is grouped
> by the time and the aggregated data is added to the state.
>
> However, unlike operations like sum etc. window functions need the
> original data and can change when data arrives late.
>
> So if I understand correctly, this would mean that we would have to save
> the original data and rerun on it to calculate the window function every
> time new data arrives.
>
> Is this correct? Are there ways to go around this issue?
>
>
>
> Assaf.
>
>
>
>
>
>

RE: structured streaming and window functions

2016-11-17 Thread assaf.mendelson
It is true that this is sessionizing, but I brought it as an example of finding 
an ordered pattern in the data.
In general, using a simple window (e.g. 24 hours) in structured streaming is 
explained in the grouping by time and is very clear.
What I was trying to figure out is how to handle streaming cases where you 
actually need some sorting to find patterns, especially when some of 
the data may come in late.
I was trying to figure out whether there is a plan to support this and, if so, 
what the performance implications would be.
Assaf.

From: Ofir Manor [via Apache Spark Developers List] 
[mailto:ml-node+s1001551n19935...@n3.nabble.com]
Sent: Thursday, November 17, 2016 5:13 PM
To: Mendelson, Assaf
Subject: Re: structured streaming and window functions

Assaf, I think what you are describing is actually sessionizing, by user, where 
a session is ended by a successful login event.
For each session, you want to count the number of failed login events.
If so, this is tracked by https://issues.apache.org/jira/browse/SPARK-10816 
(didn't start yet)


Ofir Manor

Co-Founder & CTO | Equalum

Mobile: +972-54-7801286 | Email: [hidden 
email]

On Thu, Nov 17, 2016 at 2:52 PM, assaf.mendelson <[hidden 
email]> wrote:
Is there a plan to support SQL window functions?
I will give an example of use: let’s say we have login logs. For each user, we 
want to add the number of failed logins for each successful login. How would 
you do it with structured streaming?
As this is currently not supported, is there a plan for how to support it in the 
future?
Assaf.

From: Herman van Hövell tot Westerflier-2 [via Apache Spark Developers List] 
[mailto:[hidden email][hidden 
email]]
Sent: Thursday, November 17, 2016 1:27 PM
To: Mendelson, Assaf
Subject: Re: structured streaming and window functions

What kind of window functions are we talking about? Structured streaming only 
supports time window aggregates, not the more general sql window function 
(sum(x) over (partition by ... order by ...)) aggregates.

The basic idea is that you use incremental aggregation and store the 
aggregation buffer (not the end result) in a state store after each increment. 
When a new batch comes in, you perform aggregation on that batch, merge the 
result of that aggregation with the buffer in the state store, update the state 
store and return the new result.

This is much harder than it sounds, because you need to maintain state in a 
fault tolerant way and you need to have some eviction policy (watermarks for 
instance) for aggregation buffers to prevent the state store from reaching an 
infinite size.

On Thu, Nov 17, 2016 at 12:19 AM, assaf.mendelson <[hidden 
email]> wrote:
Hi,
I have been trying to figure out how structured streaming handles window 
functions efficiently.
The portion I understand is that whenever new data arrives, it is grouped by 
the time and the aggregated data is added to the state.
However, unlike operations like sum etc. window functions need the original 
data and can change when data arrives late.
So if I understand correctly, this would mean that we would have to save the 
original data and rerun on it to calculate the window function every time new 
data arrives.
Is this correct? Are there ways to go around this issue?

Assaf.



Re: structured streaming and window functions

2016-11-17 Thread Ofir Manor
Assaf, I think what you are describing is actually sessionizing, by user,
where a session is ended by a successful login event.
For each session, you want to count the number of failed login events.
If so, this is tracked by https://issues.apache.org/jira/browse/SPARK-10816
(didn't start yet)

Ofir Manor

Co-Founder & CTO | Equalum

Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io

On Thu, Nov 17, 2016 at 2:52 PM, assaf.mendelson 
wrote:

> Is there a plan to support SQL window functions?
>
> I will give an example of use: let’s say we have login logs. For each user,
> we want to add the number of failed logins for each successful login. How
> would you do it with structured streaming?
>
> As this is currently not supported, is there a plan for how to support it
> in the future?
>
> Assaf.
>
>
>
> *From:* Herman van Hövell tot Westerflier-2 [via Apache Spark Developers
> List] [mailto:ml-node+[hidden email]
> ]
> *Sent:* Thursday, November 17, 2016 1:27 PM
> *To:* Mendelson, Assaf
> *Subject:* Re: structured streaming and window functions
>
>
>
> What kind of window functions are we talking about? Structured streaming
> only supports time window aggregates, not the more general sql window
> function (sum(x) over (partition by ... order by ...)) aggregates.
>
>
>
> The basic idea is that you use incremental aggregation and store the
> aggregation buffer (not the end result) in a state store after each
> increment. When a new batch comes in, you perform aggregation on that
> batch, merge the result of that aggregation with the buffer in the state
> store, update the state store and return the new result.
>
>
>
> This is much harder than it sounds, because you need to maintain state in
> a fault tolerant way and you need to have some eviction policy (watermarks
> for instance) for aggregation buffers to prevent the state store from
> reaching an infinite size.
>
>
>
> On Thu, Nov 17, 2016 at 12:19 AM, assaf.mendelson <[hidden email]
> > wrote:
>
> Hi,
>
> I have been trying to figure out how structured streaming handles window
> functions efficiently.
>
> The portion I understand is that whenever new data arrives, it is grouped
> by the time and the aggregated data is added to the state.
>
> However, unlike operations like sum etc. window functions need the
> original data and can change when data arrives late.
>
> So if I understand correctly, this would mean that we would have to save
> the original data and rerun on it to calculate the window function every
> time new data arrives.
>
> Is this correct? Are there ways to go around this issue?
>
>
>
> Assaf.
>
>
>


RE: structured streaming and window functions

2016-11-17 Thread assaf.mendelson
Is there a plan to support SQL window functions?
I will give an example of use: let’s say we have login logs. For each user, we 
want to add the number of failed logins for each successful login. How would 
you do it with structured streaming?
As this is currently not supported, is there a plan for how to support it in the 
future?
Assaf.

From: Herman van Hövell tot Westerflier-2 [via Apache Spark Developers List] 
[mailto:ml-node+s1001551n19933...@n3.nabble.com]
Sent: Thursday, November 17, 2016 1:27 PM
To: Mendelson, Assaf
Subject: Re: structured streaming and window functions

What kind of window functions are we talking about? Structured streaming only 
supports time window aggregates, not the more general sql window function 
(sum(x) over (partition by ... order by ...)) aggregates.

The basic idea is that you use incremental aggregation and store the 
aggregation buffer (not the end result) in a state store after each increment. 
When a new batch comes in, you perform aggregation on that batch, merge the 
result of that aggregation with the buffer in the state store, update the state 
store and return the new result.

This is much harder than it sounds, because you need to maintain state in a 
fault tolerant way and you need to have some eviction policy (watermarks for 
instance) for aggregation buffers to prevent the state store from reaching an 
infinite size.

On Thu, Nov 17, 2016 at 12:19 AM, assaf.mendelson <[hidden 
email]> wrote:
Hi,
I have been trying to figure out how structured streaming handles window 
functions efficiently.
The portion I understand is that whenever new data arrives, it is grouped by 
the time and the aggregated data is added to the state.
However, unlike operations like sum etc. window functions need the original 
data and can change when data arrives late.
So if I understand correctly, this would mean that we would have to save the 
original data and rerun on it to calculate the window function every time new 
data arrives.
Is this correct? Are there ways to go around this issue?

Assaf.



Re: structured streaming and window functions

2016-11-17 Thread Herman van Hövell tot Westerflier
What kind of window functions are we talking about? Structured streaming
only supports time window aggregates, not the more general sql window
function (sum(x) over (partition by ... order by ...)) aggregates.

The basic idea is that you use incremental aggregation and store the
aggregation buffer (not the end result) in a state store after each
increment. When a new batch comes in, you perform aggregation on that
batch, merge the result of that aggregation with the buffer in the state
store, update the state store and return the new result.

This is much harder than it sounds, because you need to maintain state in a
fault tolerant way and you need to have some eviction policy (watermarks
for instance) for aggregation buffers to prevent the state store from
reaching an infinite size.
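
To make the "time window aggregates" part concrete, here is a minimal
spark-shell style sketch of what structured streaming does support today for
the login example. The schema, path and sink are placeholders, and it assumes
an existing SparkSession named spark:

import org.apache.spark.sql.functions.window
import org.apache.spark.sql.types.{BooleanType, StringType, StructType, TimestampType}
import spark.implicits._

// Invented login-event schema and source path, just to make the shape concrete.
val eventSchema = new StructType()
  .add("ts", TimestampType)
  .add("user", StringType)
  .add("success", BooleanType)

val logins = spark.readStream
  .schema(eventSchema)
  .json("/path/to/login/events")

// Supported: a time-window aggregate (failed logins per user per hour).
// Not supported: sql window functions such as count(*) over (partition by user order by ts).
val failedPerHour = logins
  .filter(!$"success")
  .groupBy(window($"ts", "1 hour"), $"user")
  .count()

failedPerHour.writeStream
  .outputMode("complete")
  .format("console")
  .start()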

On Thu, Nov 17, 2016 at 12:19 AM, assaf.mendelson 
wrote:

> Hi,
>
> I have been trying to figure out how structured streaming handles window
> functions efficiently.
>
> The portion I understand is that whenever new data arrives, it is grouped
> by the time and the aggregated data is added to the state.
>
> However, unlike operations like sum etc. window functions need the
> original data and can change when data arrives late.
>
> So if I understand correctly, this would mean that we would have to save
> the original data and rerun on it to calculate the window function every
> time new data arrives.
>
> Is this correct? Are there ways to go around this issue?
>
>
>
> Assaf.
>
>


Re: Another Interesting Question on SPARK SQL

2016-11-17 Thread Herman van Hövell tot Westerflier
The diagram you have included is a depiction of the steps Catalyst (the
Spark optimizer) takes to create an executable plan. Tungsten mainly comes
into play during code generation and the actual execution.

A datasource is represented by a LogicalRelation during analysis &
optimization. The Spark planner takes such a LogicalRelation and plans it
as either a RowDataSourceScanExec or a BatchedDataSourceScanExec, depending
on the datasource. Both scan nodes support whole-stage code generation.

HTH
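
A quick way to see those phases, and where the datasource scan node shows up,
is to print the plans off queryExecution. A spark-shell style sketch, assuming
a SparkSession named spark and, purely for illustration, the Cassandra
connector's format with a made-up keyspace, table and column:

// Any datasource-backed DataFrame will do; the Cassandra options are placeholders.
val df = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "ks", "table" -> "t"))
  .load()
  .filter("x > 1")

// The Catalyst phases: parsed, analyzed and optimized logical plans (the
// datasource appears as a LogicalRelation), then the physical plan in which
// the planner has produced a *DataSourceScanExec node.
println(df.queryExecution.logical)
println(df.queryExecution.analyzed)
println(df.queryExecution.optimizedPlan)
println(df.queryExecution.executedPlan)

// Or all phases at once:
df.explain(true)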


On Thu, Nov 17, 2016 at 1:28 AM, kant kodali  wrote:

>
> ​
> Which parts in the diagram above are executed by DataSource connectors and
> which parts are executed by Tungsten? Or, to put it another way, in which
> phase of the diagram above does Tungsten leverage the DataSource
> connectors (such as, say, the Cassandra connector)?
>
> My understanding so far is that connectors come in during the physical
> planning phase, but I am not sure whether the connectors take the logical
> plan as an input.
>
> Thanks,
> kant
>


Another Interesting Question on SPARK SQL

2016-11-17 Thread kant kodali
​
Which parts in the diagram above are executed by DataSource connectors and
which parts are executed by Tungsten? Or, to put it another way, in which
phase of the diagram above does Tungsten leverage the DataSource
connectors (such as, say, the Cassandra connector)?

My understanding so far is that connectors come in during the physical planning
phase, but I am not sure whether the connectors take the logical plan as an input.

Thanks,
kant


structured streaming and window functions

2016-11-17 Thread assaf.mendelson
Hi,
I have been trying to figure out how structured streaming handles window 
functions efficiently.
The portion I understand is that whenever new data arrives, it is grouped by 
the time and the aggregated data is added to the state.
However, unlike operations like sum etc. window functions need the original 
data and can change when data arrives late.
So if I understand correctly, this would mean that we would have to save the 
original data and rerun on it to calculate the window function every time new 
data arrives.
Is this correct? Are there ways to go around this issue?

Assaf.



