Re: [k8s] Spark operator (the Java one)

2019-10-10 Thread Stavros Kontopoulos
Hi all,

I also left a comment on the PR with more details. I don't see why the Java
operator should be maintained by the Spark project.
This is an interesting project and could thrive on its own as an external
operator project.
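
For readers less familiar with the operator pattern described in the proposal
below, here is a conceptual sketch of a single reconcile step (purely
illustrative Scala; the actual operator is written in Java and talks to the
Kubernetes API through the fabric8 client, which this sketch stubs out):

case class SparkClusterSpec(name: String, workers: Int)            // desired state, from a custom resource
case class SparkClusterStatus(name: String, runningWorkers: Int)   // observed state, from the cluster

object ReconcileSketch {
  // Stand-ins for real K8s API calls (watching custom resources, listing pods, scaling).
  def observe(spec: SparkClusterSpec): SparkClusterStatus =
    SparkClusterStatus(spec.name, runningWorkers = 1)
  def scaleTo(name: String, workers: Int): Unit =
    println(s"scaling cluster $name to $workers workers")

  // One reconcile step: compare desired vs. observed state and act on any difference.
  def reconcile(spec: SparkClusterSpec): Unit = {
    val status = observe(spec)
    if (status.runningWorkers != spec.workers) scaleTo(spec.name, spec.workers)
  }

  def main(args: Array[String]): Unit =
    reconcile(SparkClusterSpec("my-cluster", workers = 3))
}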

Best,
Stavros

On Thu, Oct 10, 2019 at 7:51 PM Sean Owen  wrote:

> I'd have the same question on the PR - why does this need to be in the
> Apache Spark project vs where it is now? Yes, it's not a Spark package
> per se, but it seems like this is a tool for K8S to use Spark rather
> than a core Spark tool.
>
> Yes of course all the packages, licenses, etc have to be overhauled,
> but that kind of underscores that this is a dump of a third party tool
> that works fine on its own?
>
> On Thu, Oct 10, 2019 at 9:30 AM Jiri Kremser  wrote:
> >
> > Hello,
> >
> >
> > Spark Operator is a tool that can deploy/scale and help with monitoring
> of Spark clusters on Kubernetes. It follows the operator pattern [1]
> introduced by CoreOS: it watches for changes in custom resources
> representing the desired state of the clusters and takes the steps needed
> to achieve that state in Kubernetes by using the K8s client. It’s written
> in Java and there is an overlap with the Spark dependencies (logging, k8s
> client, apache-commons-*, fasterxml-jackson, etc.). The operator also
> contains metadata that allows it to be deployed smoothly via operatorhub.io
> [2]. For very basic info, check the readme on the project page, including
> the gif :) Another feature unique to this operator is the (optional) ability
> to compile itself to a native image using the GraalVM compiler, so that it
> starts fast and has a very low memory footprint.
> >
> >
> > We would like to contribute this project to Spark’s code base. It can’t
> be distributed as a Spark package, because it’s not a library that can be
> used from the Spark environment. So if you are interested, the directory
> under resource-managers/kubernetes/spark-operator/ could be a suitable
> destination.
> >
> >
> > The current repository is radanalyticsio/spark-operator [3] on GitHub and
> it also contains a test suite [4] that verifies that the operator works
> well on K8s (using minikube) and also on OpenShift. I am not sure how to
> transfer those tests in case you are interested in those as well.
> >
> >
> > I’ve already opened the PR [5], but it got closed, so I am opening the
> discussion here first. The PR contained old package names from our
> organisation, radanalytics.io, but we are willing to change that to
> anything more aligned with the existing Spark conventions; the same holds
> for the license headers in all the source files.
> >
> >
> > jk
> >
> >
> >
> > [1]: https://kubernetes.io/docs/concepts/extend-kubernetes/operator/
> >
> > [2]: https://operatorhub.io/operator/radanalytics-spark
> >
> > [3]: https://github.com/radanalyticsio/spark-operator
> >
> > [4]: https://travis-ci.org/radanalyticsio/spark-operator
> >
> > [5]: https://github.com/apache/spark/pull/26075
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Thoughts on Spark 3 release, or a preview release

2019-09-13 Thread Stavros Kontopoulos
+1 as a contributor and as a user. Given the amount of testing required for
all the new cool stuff like Java 11 support, major
refactorings/deprecations etc., a preview version would help the community
a lot, making adoption smoother in the long term. I would also add Scala
2.13 support to the list of issues (
https://issues.apache.org/jira/browse/SPARK-25075), assuming things
move forward faster in the next few months.

On Fri, Sep 13, 2019 at 11:08 AM Driesprong, Fokko 
wrote:

> Michael Heuer, that's an interesting issue.
>
> 1.8.2 to 1.9.0 is almost binary compatible (94%):
> http://people.apache.org/~busbey/avro/1.9.0-RC4/1.8.2_to_1.9.0RC4_compat_report.html.
> Most of the changes are removing the Jackson and Netty APIs from Avro's public
> API and deprecating the Joda library. I would strongly advise moving to
> 1.9.1 since there are some regression issues; for Java the most important one is:
> https://jira.apache.org/jira/browse/AVRO-2400
>
> I'd love to dive into the issue that you describe and I'm curious if the
> issue is still there with Avro 1.9.1. I'm a bit busy at the moment but
> might have some time this weekend to dive into it.
>
> Cheers, Fokko Driesprong
>
>
> Op vr 13 sep. 2019 om 02:32 schreef Reynold Xin :
>
>> +1! Long due for a preview release.
>>
>>
>> On Thu, Sep 12, 2019 at 5:26 PM, Holden Karau 
>> wrote:
>>
>>> I like the idea from the PoV of giving folks something to start testing
>>> against and exploring so they can raise issues with us earlier in the
>>> process and we have more time to make calls around this.
>>>
>>> On Thu, Sep 12, 2019 at 4:15 PM John Zhuge  wrote:
>>>
>>> +1  Like the idea as a user and a DSv2 contributor.
>>>
>>> On Thu, Sep 12, 2019 at 4:10 PM Jungtaek Lim  wrote:
>>>
>>> +1 (as a contributor) from me on having a preview release of Spark 3, as it
>>> would help with testing the features. When to cut the preview release is
>>> debatable, as the major work should ideally be done before that - if we
>>> intend to introduce new features before the official release, that should
>>> work regardless of this, but if the intent is to have the opportunity to
>>> test earlier, then ideally it should.
>>>
>>> As one of the contributors in the structured streaming area, I'd like to add
>>> some items for Spark 3.0, both "must be done" and "better to have". For
>>> "better to have", I picked some new-feature items which committers
>>> reviewed for a couple of rounds and then dropped off without a soft reject
>>> (no valid reason to stop). For Spark 2.4 users, the only added feature for
>>> structured streaming is Kafka delegation tokens (counting the Kafka
>>> consumer pool revision as an improvement). I hope we provide some gifts for
>>> structured streaming users in the Spark 3.0 envelope.
>>>
>>> > must be done
>>> * SPARK-26154 Stream-stream joins - left outer join gives inconsistent
>>> output
>>> It's a correctness issue reported by multiple users, dating back to
>>> Nov. 2018. There's a way to reproduce it consistently, and a patch to fix
>>> it has been up since Jan. 2019.
>>>
>>> > better to have
>>> * SPARK-23539 Add support for Kafka headers in Structured Streaming
>>> * SPARK-26848 Introduce new option to Kafka source - specify timestamp
>>> to start and end offset
>>> * SPARK-20568 Delete files after processing in structured streaming
>>>
>>> There are some more new feature/improvement items in SS, but given
>>> we're talking about ramping down, the above list is probably a realistic one.
>>>
>>>
>>>
>>> On Thu, Sep 12, 2019 at 9:53 AM Jean Georges Perrin  wrote:
>>>
>>> As a user/non committer, +1
>>>
>>> I love the idea of an early 3.0.0 so we can test current dev against it.
>>> I know the final 3.x will probably need another round of testing when it
>>> gets out, but less for sure... I know I could check out and compile, but
>>> having a “packaged” preview version is great if it does not take too much
>>> of the team's time...
>>>
>>> jg
>>>
>>>
>>> On Sep 11, 2019, at 20:40, Hyukjin Kwon  wrote:
>>>
>>> +1 from me too but I would like to know what other people think too.
>>>
>>> On Thu, Sep 12, 2019 at 9:07 AM, Dongjoon Hyun wrote:
>>>
>>> Thank you, Sean.
>>>
>>> I'm also +1 for the following three.
>>>
>>> 1. Start to ramp down (by the official branch-3.0 cut)
>>> 2. Apache Spark 3.0.0-preview in 2019
>>> 3. Apache Spark 3.0.0 in early 2020
>>>
>>> For JDK11 clean-up, it will meet the timeline and `3.0.0-preview` helps
>>> it a lot.
>>>
>>> After this discussion, can we have some timeline for `Spark 3.0 Release
>>> Window` in our versioning-policy page?
>>>
>>> - https://spark.apache.org/versioning-policy.html
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>>
>>> On Wed, Sep 11, 2019 at 11:54 AM Michael Heuer 
>>> wrote:
>>>
>>> I would love to see Spark + Hadoop + Parquet + Avro compatibility
>>> problems resolved, e.g.
>>>
>>> https://issues.apache.org/jira/browse/SPARK-25588
>>> https://issues.apache.org/jira/browse/SPARK-27781
>>>
>>> Note that Avro is now at 1.9.1, binary-incompatible with 1.8.x.  As far
>>> as I know, Parquet has 

Re: Python API for mapGroupsWithState

2019-09-11 Thread Stavros Kontopoulos
+1. I was looking at this today; any idea why this was not added before?
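
For context, the Scala API being asked about looks roughly like the sketch
below (a minimal running-count example; the socket source, local master and
names are only illustrative):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

object MapGroupsWithStateSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("mgws-sketch").master("local[2]").getOrCreate()
    import spark.implicits._

    // Hypothetical input: one word per line from a local socket.
    val words = spark.readStream
      .format("socket").option("host", "localhost").option("port", 9999)
      .load().as[String]

    // Keep a running count per word as arbitrary per-group state.
    val counts = words
      .groupByKey(identity)
      .mapGroupsWithState(GroupStateTimeout.NoTimeout) {
        (word: String, batch: Iterator[String], state: GroupState[Long]) =>
          val total = state.getOption.getOrElse(0L) + batch.size
          state.update(total)
          (word, total)
      }

    counts.writeStream
      .outputMode(OutputMode.Update())
      .format("console")
      .start()
      .awaitTermination()
  }
}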

On Sat, Aug 3, 2019 at 1:57 AM Nicholas Chammas 
wrote:

> Can someone succinctly describe the challenge in adding the
> `mapGroupsWithState()` API to PySpark?
>
> I was hoping for some suboptimal but nonetheless working solution to be
> available in Python, as there are with Python UDFs for example, but that
> doesn't seem to be the case. The JIRA ticket for arbitrary stateful
> operations in Structured Streaming
> <https://issues.apache.org/jira/browse/SPARK-19067> doesn't give any
> indication that a Python version of the API is coming.
>
> Is this something that will likely be added in the near future, or is it a
> major undertaking? Can someone briefly describe the problem?
>
> Nick
>
>

-- 
Stavros Kontopoulos
*Principal Engineer*
*Lightbend Platform <https://www.lightbend.com/lightbend-platform>*
*mob: +30 6977967274*


Re: Welcoming some new committers and PMC members

2019-09-10 Thread Stavros Kontopoulos
Congrats! Well deserved.

On Tue, Sep 10, 2019 at 1:20 PM Driesprong, Fokko 
wrote:

> Congrats all, well deserved!
>
>
> Cheers, Fokko
>
> Op di 10 sep. 2019 om 10:21 schreef Gabor Somogyi <
> gabor.g.somo...@gmail.com>:
>
>> Congrats Guys!
>>
>> G
>>
>>
>> On Tue, Sep 10, 2019 at 2:32 AM Matei Zaharia 
>> wrote:
>>
>>> Hi all,
>>>
>>> The Spark PMC recently voted to add several new committers and one PMC
>>> member. Join me in welcoming them to their new roles!
>>>
>>> New PMC member: Dongjoon Hyun
>>>
>>> New committers: Ryan Blue, Liang-Chi Hsieh, Gengliang Wang, Yuming Wang,
>>> Weichen Xu, Ruifeng Zheng
>>>
>>> The new committers cover lots of important areas including ML, SQL, and
>>> data sources, so it’s great to have them here. All the best,
>>>
>>> Matei and the Spark PMC
>>>
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>

--


Re: Release Apache Spark 2.4.4 before 3.0.0

2019-07-16 Thread Stavros Kontopoulos
Hi Dongjoon,

Should we also consider fixing
https://issues.apache.org/jira/browse/SPARK-27812 before the cut?

Best,
Stavros

On Mon, Jul 15, 2019 at 7:04 PM Dongjoon Hyun 
wrote:

> Hi, Apache Spark PMC members.
>
> Can we cut Apache Spark 2.4.4 next Monday (22nd July)?
>
> Bests,
> Dongjoon.
>
>
> On Fri, Jul 12, 2019 at 3:18 PM Dongjoon Hyun 
> wrote:
>
>> Thank you, Jacek.
>>
>> BTW, I added `@private` since we need PMC's help to make an Apache Spark
>> release.
>>
>> Can I get more feedbacks from the other PMC members?
>>
> >> Please let me know if you have any concerns (e.g. release date or release
>> manager?)
>>
> >> As one of the community members, I assumed the following (if we are on
>> schedule).
>>
>> - 2.4.4 at the end of July
>> - 2.3.4 at the end of August (since 2.3.0 was released at the end of
>> February 2018)
> >> - 3.0.0 (possibly September?)
>> - 3.1.0 (January 2020?)
>>
>> Bests,
>> Dongjoon.
>>
>>
>> On Thu, Jul 11, 2019 at 1:30 PM Jacek Laskowski  wrote:
>>
>>> Hi,
>>>
>>> Thanks Dongjoon Hyun for stepping up as a release manager!
>>> Much appreciated.
>>>
> >>> If there's a volunteer to cut a release, I'm always happy to support it.
>>>
> >>> In addition, the more frequent the releases, the better for end users: they
> >>> have a choice to upgrade and get all the latest fixes, or to wait. It's their
> >>> call, not ours (rather than us keeping them waiting).
>>>
>>> My big 2 yes'es for the release!
>>>
>>> Jacek
>>>
>>>
>>> On Tue, 9 Jul 2019, 18:15 Dongjoon Hyun, 
>>> wrote:
>>>
 Hi, All.

 Spark 2.4.3 was released two months ago (8th May).

 As of today (9th July), there exist 45 fixes in `branch-2.4` including
 the following correctness or blocker issues.

 - SPARK-26038 Decimal toScalaBigInt/toJavaBigInteger not work for
 decimals not fitting in long
 - SPARK-26045 Error in the spark 2.4 release package with the
 spark-avro_2.11 dependency
 - SPARK-27798 from_avro can modify variables in other rows in local
 mode
 - SPARK-27907 HiveUDAF should return NULL in case of 0 rows
 - SPARK-28157 Make SHS clear KVStore LogInfo for the blacklist
 entries
 - SPARK-28308 CalendarInterval sub-second part should be padded
 before parsing

 It would be great if we could have Spark 2.4.4 before we get busier with
 3.0.0.
 If it's okay, I'd like to volunteer as the 2.4.4 release manager and roll
 it next Monday (15th July).
 What do you think?

 Bests,
 Dongjoon.

>>>


Re: Contribution help needed for sub-tasks of an umbrella JIRA - port *.sql tests to improve coverage of Python, Pandas, Scala UDF cases

2019-07-09 Thread Stavros Kontopoulos
I can try one and see how it goes, although I'm not familiar with the area.

Stavros

On Tue, Jul 9, 2019 at 6:17 AM Hyukjin Kwon  wrote:

> Hi all,
>
> I am currently targeting improving the Python, Pandas UDF and Scala UDF test
> cases by integrating our existing *.sql files at
> https://issues.apache.org/jira/browse/SPARK-27921
>
> I would appreciate it if anyone who's interested in contributing to Spark
> took on some sub-tasks. There are too many for me to do alone :-). I am
> doing them one by one for now.
>
> I wrote some guides specifically for this umbrella JIRA, so if you're
> able to follow them closely, one by one, I think the process itself isn't
> that difficult.
>
> The most important guide, which should be followed carefully, is:
> > 7. If there are diff, analyze it, file or find the JIRA, skip the tests
> with comments.
>
> Thanks!
>


Re: Support SqlStreaming in spark

2019-06-03 Thread Stavros Kontopoulos
Hi all,
From what I read there is an effort to globally standardize SQL
Streaming (Flink people, Google and others are working with the SQL
standardization body): https://arxiv.org/abs/1905.12133v1
Should the Spark community be part of it?

Best,
Stavros

On Thu, Mar 28, 2019 at 12:03 PM uncleGen  wrote:

> Hi all,
>
> I have rewritten the design doc based on the previous discussion.
>
> https://docs.google.com/document/d/19degwnIIcuMSELv6BQ_1VQI5AIVcvGeqOm5xE2-aRA0
>
> Would be interested to hear what others think.
>
> Regards,
> Genmao Yu
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: dynamic allocation manager in SS

2019-05-27 Thread Stavros Kontopoulos
Sure, I'm not talking about k8s here.
The discussion is about the heuristics and their drawbacks.

On Mon, 27 May 2019 at 2:04 PM, Gabor Somogyi <
gabor.g.somo...@gmail.com> wrote:

> K8s is a different story; please take a look at the "Future Work" part of the doc.
>
> On Fri, May 24, 2019 at 9:40 PM Stavros Kontopoulos <
> stavros.kontopou...@lightbend.com> wrote:
>
>> Btw the heuristics for batch mode (
>> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L289)
>> vs
>> streaming (
>> https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/scheduler/ExecutorAllocationManager.scala#L91-L92)
>> are different. In batch mode you care about the numRunningOrPendingTasks 
>> while
>> for streaming about the ratio: averageBatchProcTime.toDouble /
>> batchDurationMs so there are some concerns beyond scaling down when
>> idle.
>> A scenario where things might not work for batch dynamic allocation with SS is
>> as follows. I start with a query that reads x kafka partitions and the data
>> arriving is low and all tasks (1 per partition) are running since there are
>> enough resources anyway.
>> At some point the data increases per partition (maxOffsetsPerTrigger is
>> high enough) and so processing takes more time. AFAIK SS will wait for a
>> batch to finish before running the next (waits for the trigger to finish,
>> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/TriggerExecutor.scala#L46
>> ).
>> In this case I suspect there is no scaling up with the batch dynamic
>> allocation mode as there are no pending tasks, only processing time
>> changed. In this case the streaming dynamic heuristics I think are better.
>> Batch mode heuristics could work, if not mistaken, if you have multiple
>> streaming queries and there are batches waiting (using fair-scheduling etc).
>>
>> PS. this has been discussed, not in depth, in the past on the list (
>> https://mail-archives.apache.org/mod_mbox/spark-user/201708.mbox/%3c1503626484779-29104.p...@n3.nabble.com%3E
>> )
>>
>>
>>
>>
>> On Fri, May 24, 2019 at 9:22 PM Stavros Kontopoulos <
>> stavros.kontopou...@lightbend.com> wrote:
>>
>>> I am on k8s where there is no support yet afaik, there is wip wrt the
>>> shuffle service. So from your experience there are no issues with using the
>>> batch dynamic allocation version like there was before with dstreams as
>>> described in the related jira?
>>>
>>> On Fri, 24 May 2019 at 8:28 PM, Gabor Somogyi <
>>> gabor.g.somo...@gmail.com> wrote:
>>>
>>>> It scales down with yarn. Not sure how you've tested.
>>>>
>>>> On Fri, 24 May 2019, 19:10 Stavros Kontopoulos, <
>>>> stavros.kontopou...@lightbend.com> wrote:
>>>>
>>>>> Yes nothing happens. In this case it could propagate info to the
>>>>> resource manager to scale down the number of executors no? Just a thought.
>>>>>
>>>>> On Fri, 24 May 2019 at 7:17 PM, Gabor Somogyi <
>>>>> gabor.g.somo...@gmail.com> wrote:
>>>>>
>>>>>> Structured Streaming works differently. If no data arrives no tasks
>>>>>> are executed (just had a case in this area).
>>>>>>
>>>>>> BR,
>>>>>> G
>>>>>>
>>>>>>
>>>>>> On Fri, 24 May 2019, 18:14 Stavros Kontopoulos, <
>>>>>> stavros.kontopou...@lightbend.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Some while ago the streaming dynamic allocation part was added in
>>>>>>> DStreams(https://issues.apache.org/jira/browse/SPARK-12133)  to
>>>>>>> improve the issues with the batch based one. Should this be ported
>>>>>>> to structured streaming? Thoughts?
>>>>>>> AFAIK there is no support in SS for it.
>>>>>>>
>>>>>>> Best,
>>>>>>> Stavros
>>>>>>>
>>>>>>>
>>
>> --
>> Stavros Kontopoulos
>> *Principal Engineer*
>> *Lightbend Platform <https://www.lightbend.com/lightbend-platform>*
>> *mob: +30 6977967274*
>>
>>


Re: dynamic allocation manager in SS

2019-05-24 Thread Stavros Kontopoulos
Btw the heuristics for batch mode (
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L289)
vs
streaming (
https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/scheduler/ExecutorAllocationManager.scala#L91-L92)
are different. In batch mode you care about numRunningOrPendingTasks, while
for streaming you care about the ratio averageBatchProcTime.toDouble /
batchDurationMs, so there are some concerns beyond scaling down when idle.
A scenario where things might not work for batch dynamic allocation with SS
is as follows. I start with a query that reads x Kafka partitions, the data
arriving is low, and all tasks (1 per partition) are running since there are
enough resources anyway.
At some point the data per partition increases (maxOffsetsPerTrigger is
high enough) and so processing takes more time. AFAIK SS will wait for a
batch to finish before running the next one (it waits for the trigger to finish,
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/TriggerExecutor.scala#L46
).
In this case I suspect there is no scaling up with the batch dynamic
allocation mode, as there are no pending tasks; only the processing time
changed. In this case I think the streaming dynamic allocation heuristics are
better. Batch mode heuristics could work, if I'm not mistaken, if you have
multiple streaming queries and there are batches waiting (using fair
scheduling etc.).
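
To make the difference concrete, the streaming-side heuristic essentially
boils down to something like the sketch below (not Spark's actual code; the
threshold names mirror spark.streaming.dynamicAllocation.scalingUpRatio /
scalingDownRatio, and the default values shown are only indicative):

object StreamingAllocationHeuristicSketch {
  // Decide on scaling from the ratio of average batch processing time to batch duration.
  def decide(avgBatchProcTimeMs: Long,
             batchDurationMs: Long,
             scalingUpRatio: Double = 0.9,
             scalingDownRatio: Double = 0.3): String = {
    val ratio = avgBatchProcTimeMs.toDouble / batchDurationMs
    if (ratio >= scalingUpRatio) "request more executors"       // batches barely keep up
    else if (ratio <= scalingDownRatio) "release an executor"   // plenty of headroom
    else "do nothing"
  }

  def main(args: Array[String]): Unit = {
    // Processing takes 95% of the batch interval -> this heuristic scales up,
    // even though there are no pending tasks for the batch-mode heuristic to see.
    println(decide(avgBatchProcTimeMs = 9500, batchDurationMs = 10000))
  }
}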

PS. this has been discussed, not in depth, in the past on the list (
https://mail-archives.apache.org/mod_mbox/spark-user/201708.mbox/%3c1503626484779-29104.p...@n3.nabble.com%3E
)




On Fri, May 24, 2019 at 9:22 PM Stavros Kontopoulos <
stavros.kontopou...@lightbend.com> wrote:

> I am on k8s where there is no support yet afaik, there is wip wrt the
> shuffle service. So from your experience there are no issues with using the
> batch dynamic allocation version like there was before with dstreams as
> described in the related jira?
>
> On Fri, 24 May 2019 at 8:28 PM, Gabor Somogyi <
> gabor.g.somo...@gmail.com> wrote:
>
>> It scales down with yarn. Not sure how you've tested.
>>
>> On Fri, 24 May 2019, 19:10 Stavros Kontopoulos, <
>> stavros.kontopou...@lightbend.com> wrote:
>>
>>> Yes nothing happens. In this case it could propagate info to the
>>> resource manager to scale down the number of executors no? Just a thought.
>>>
>>> On Fri, 24 May 2019 at 7:17 PM, Gabor Somogyi <
>>> gabor.g.somo...@gmail.com> wrote:
>>>
>>>> Structured Streaming works differently. If no data arrives no tasks are
>>>> executed (just had a case in this area).
>>>>
>>>> BR,
>>>> G
>>>>
>>>>
>>>> On Fri, 24 May 2019, 18:14 Stavros Kontopoulos, <
>>>> stavros.kontopou...@lightbend.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Some while ago the streaming dynamic allocation part was added in
>>>>> DStreams(https://issues.apache.org/jira/browse/SPARK-12133)  to
>>>>> improve the issues with the batch based one. Should this be ported to
>>>>> structured streaming? Thoughts?
>>>>> AFAIK there is no support in SS for it.
>>>>>
>>>>> Best,
>>>>> Stavros
>>>>>
>>>>>

-- 
Stavros Kontopoulos
*Principal Engineer*
*Lightbend Platform <https://www.lightbend.com/lightbend-platform>*
*mob: +30 6977967274*


Re: dynamic allocation manager in SS

2019-05-24 Thread Stavros Kontopoulos
I am on k8s, where there is no support yet AFAIK; there is WIP w.r.t. the
shuffle service. So from your experience there are no issues with using the
batch dynamic allocation version, like there were before with DStreams as
described in the related JIRA?

On Fri, 24 May 2019 at 8:28 PM, Gabor Somogyi <
gabor.g.somo...@gmail.com> wrote:

> It scales down with yarn. Not sure how you've tested.
>
> On Fri, 24 May 2019, 19:10 Stavros Kontopoulos, <
> stavros.kontopou...@lightbend.com> wrote:
>
>> Yes nothing happens. In this case it could propagate info to the resource
>> manager to scale down the number of executors no? Just a thought.
>>
>> On Fri, 24 May 2019 at 7:17 PM, Gabor Somogyi <
>> gabor.g.somo...@gmail.com> wrote:
>>
>>> Structured Streaming works differently. If no data arrives no tasks are
>>> executed (just had a case in this area).
>>>
>>> BR,
>>> G
>>>
>>>
>>> On Fri, 24 May 2019, 18:14 Stavros Kontopoulos, <
>>> stavros.kontopou...@lightbend.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> Some while ago the streaming dynamic allocation part was added in
>>>> DStreams(https://issues.apache.org/jira/browse/SPARK-12133)  to
>>>> improve the issues with the batch based one. Should this be ported to
>>>> structured streaming? Thoughts?
>>>> AFAIK there is no support in SS for it.
>>>>
>>>> Best,
>>>> Stavros
>>>>
>>>>


Re: dynamic allocation manager in SS

2019-05-24 Thread Stavros Kontopoulos
Yes, nothing happens. In this case it could propagate info to the resource
manager to scale down the number of executors, no? Just a thought.

On Fri, 24 May 2019 at 7:17 PM, Gabor Somogyi <
gabor.g.somo...@gmail.com> wrote:

> Structured Streaming works differently. If no data arrives no tasks are
> executed (just had a case in this area).
>
> BR,
> G
>
>
> On Fri, 24 May 2019, 18:14 Stavros Kontopoulos, <
> stavros.kontopou...@lightbend.com> wrote:
>
>> Hi,
>>
>> Some while ago the streaming dynamic allocation part was added in
>> DStreams(https://issues.apache.org/jira/browse/SPARK-12133)  to improve
>> the issues with the batch based one. Should this be ported to structured
>> streaming? Thoughts?
>> AFAIK there is no support in SS for it.
>>
>> Best,
>> Stavros
>>
>>


dynamic allocation manager in SS

2019-05-24 Thread Stavros Kontopoulos
Hi,

A while ago streaming dynamic allocation was added to DStreams (
https://issues.apache.org/jira/browse/SPARK-12133) to improve on the issues
with the batch-based one. Should this be ported to Structured Streaming?
Thoughts?
AFAIK there is no support for it in SS.

Best,
Stavros


Re: [METRICS] Metrics names inconsistent between executions

2019-05-07 Thread Stavros Kontopoulos
Hi,

With jmx_exporter and Prometheus you can always re-write the metric name
patterns on the fly. Btw, if you use Grafana it's easy to filter things even
without the re-write.
If this is a custom dashboard you can always group metrics based on the
spark.app.id prefix, no? Also, I think sometimes it's good to know if some
executor failed and why, and to report execution-specific metrics, for
example if you have skewed data that caused JVM issues etc.
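
As a concrete example of the namespace part: the app-id prefix can already be
made stable across executions via spark.metrics.namespace (the executor-id
piece is the part that cannot, as noted below). A minimal sketch with an
illustrative app name:

import org.apache.spark.sql.SparkSession

object MetricsNamespaceSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("billing-pipeline")
      // Driver metrics are then reported under a stable prefix, e.g.
      // billing_pipeline.driver.jvm.heap.used instead of <app-id>.driver.jvm.heap.used.
      .config("spark.metrics.namespace", "billing_pipeline")
      .getOrCreate()

    spark.range(1000).count()  // some dummy work so metrics get emitted
    spark.stop()
  }
}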

Stavros
On Mon, May 6, 2019 at 11:29 PM Anton Kirillov 
wrote:

> Hi everyone!
>
> We are currently working on building a unified monitoring/alerting
> solution for Spark and would like to rely on Spark's own metrics to avoid
> divergence from the upstream. One of the challenges is to support metrics
> coming from multiple Spark applications running on a cluster: scheduled
> jobs, long-running streaming applications etc.
>
> Original problem:
> Spark assigns metrics names using *spark.app.id* and *spark.executor.id*
> as a part of them.
> Thus the number of metrics is continuously growing because those IDs are
> unique between executions whereas the metrics themselves report the same
> thing. Another issue which arises here is how to use constantly changing
> metric names in dashboards.
>
> For example, *jvm_heap_used* reported by all Spark instances (components):
> - <app_id>_driver_jvm_heap_used (Driver)
> - <app_id>_<executor_id>_jvm_heap_used (Executors)
>
> While *spark.app.id* can be overridden with *spark.metrics.namespace*,
> there's no such option for *spark.executor.id*, which makes it impossible
> to build a reusable dashboard because (given the uniqueness of IDs)
> differently named metrics are emitted for each execution.
>
> One of the possible solutions would be to make executor metric names
> follow the driver's metric name pattern, e.g.:
> - <namespace>_driver_jvm_heap_used (Driver)
> - <namespace>_executor_jvm_heap_used (Executors)
>
> and distinguish executors based on tags (tags should be configured in
> metric reporters in this case). Not sure if this could potentially break
> Driver UI though.
>
> I'd really appreciate any feedback on this issue and would be happy to
> create a Jira issue/PR if this change looks sane for the community.
>
> Thanks in advance.
>
> --
> *Anton Kirillov*
> Senior Software Engineer, Mesosphere
>


Re: queryable state & streaming

2019-04-24 Thread Stavros Kontopoulos
Michael,
I have listed use cases above; should we proceed with a design doc?
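
For reference, the "memory" sink workaround mentioned further down in the
thread looks roughly like the sketch below (illustrative names and sources; it
materializes streaming results into a driver-side table rather than exposing
the state store itself, which is also why it does not scale to very large
states):

import org.apache.spark.sql.SparkSession

object MemorySinkSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("memory-sink-sketch").master("local[2]").getOrCreate()
    import spark.implicits._

    val words = spark.readStream
      .format("socket").option("host", "localhost").option("port", 9999)
      .load().as[String]

    // A stateful aggregation whose latest results we want to query ad hoc.
    val counts = words.groupBy("value").count()

    val query = counts.writeStream
      .outputMode("complete")
      .format("memory")          // keeps the sink's output in an in-memory table on the driver
      .queryName("word_counts")  // table name to query with SQL
      .start()

    // Any code on the driver can now run SQL against the latest snapshot
    // (in a real app this would happen periodically or from another thread).
    spark.sql("SELECT * FROM word_counts ORDER BY count DESC").show()

    query.awaitTermination()
  }
}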

Best,
Stavros

On Mon, 18 Mar 2019 at 12:21 PM, Stavros Kontopoulos <
stavros.kontopou...@lightbend.com> wrote:

> Not really, if we agree that we want this, I can put together a design
> document and take it from there. There was also a discussion in another
> thread about adding RocksDB as a memory storage, which is related to this task.
>
> Best,
> Stavros
>
> On Sun, Mar 17, 2019 at 4:42 AM kant kodali  wrote:
>
>> Any update on this?
>>
>> On Wed, Oct 24, 2018 at 4:26 PM Arun Mahadevan  wrote:
>>
>>> I don't think separate API or RPCs etc might be necessary for queryable
>>> state if the state can be exposed as just another datasource. Then the sql
>>> queries can be issued against it just like executing sql queries against
>>> any other data source.
>>>
>>> For now I think the "memory" sink could be used  as a sink and run
>>> queries against it but I agree it does not scale for large states.
>>>
>>> On Sun, 21 Oct 2018 at 21:24, Jungtaek Lim  wrote:
>>>
>>>> It doesn't seem Spark has workarounds other than storing output into
>>>> external storages, so +1 on having this.
>>>>
>>>> My major concern on implementing queryable state in structured
>>>> streaming is "Are all states available on executors at any time while query
>>>> is running?" Querying state shouldn't affect the running query. Given that
>>>> state is huge and default state provider is loading state in memory, we may
>>>> not want to load one more redundant snapshot of state: we want to always
>>>> load "current state" which query is also using. (For sure, Queryable state
>>>> should be read-only.)
>>>>
>>>> Regarding improvement of local state, I guess it is ideal to leverage
>>>> embedded db, like Kafka and Flink are doing. The difference will not be
>>>> only reading state from non-heap, but also how to take a snapshot and store
>>>> delta. We may want to check snapshotting works well with small batch
>>>> interval, and find alternative approach when it doesn't. Sounds like it is
>>>> a huge item and can be handled individually.
>>>>
>>>> - Jungtaek Lim (HeartSaVioR)
>>>>
>>>> On Sat, Dec 9, 2017 at 10:51 PM, Stavros Kontopoulos <
>>>> st.kontopou...@gmail.com> wrote:
>>>>
>>>>> Nice I was looking for a jira. So I agree we should justify why we are
>>>>> building something. Now to that direction here is what I have seen from my
>>>>> experience.
>>>>> People quite often use state within their streaming app and may have
>>>>> large states (TBs). Shortening the pipeline by not having to copy data (to
>>>>> Cassandra for example for serving) is an advantage, in terms of at least
>>>>> latency and complexity.
>>>>> This can be true if we take advantage of state checkpointing (locally could
>>>>> be RocksDB or in general HDFS the latter is currently supported)  along
>>>>> with an API to efficiently query data.
>>>>> Some use cases I see:
>>>>>
>>>>> - real-time dashboards and real-time reporting, the faster the better
>>>>> - monitoring of state for operational reasons, app health etc...
>>>>> - integrating with external services via an API eg. making accessible
>>>>>  aggregations over time windows to some third party service within your
>>>>> system
>>>>>
>>>>> Regarding requirements here are some of them:
>>>>> - support of an API to expose state (could be done at the spark
>>>>> driver), like rest.
>>>>> - supporting dynamic allocation (not sure how it affects state
>>>>> management)
>>>>> - an efficient way to talk to executors to get the state (rpc?)
>>>>> - making local state more efficient and easier accessible with an
>>>>> embedded db (I dont think this is supported from what I see, maybe wrong)?
>>>>> Some people are already working with such techs and some stuff could
>>>>> be re-used: https://issues.apache.org/jira/browse/SPARK-20641
>>>>>
>>>>> Best,
>>>>> Stavros
>>>>>
>>>>>
>>>>> On Fri, Dec 8, 2017 at 10:32 PM, Michael Armbrust <
>>>>> mich...@databricks.com> wrote:

Re: JDK vs JRE in Docker Images

2019-04-17 Thread Stavros Kontopoulos
Hi Rob,

We are using registry.redhat.io/redhat-openjdk-18/openjdk18-openshift (
https://docs.openshift.com/online/using_images/s2i_images/java.html).
It looks the most convenient, as Red Hat leads the OpenJDK updates, which is
even more important from now on, and also from a security point of view.
There are some tools you might want to use at runtime, like jstack and jps
when debugging apps, so it might be more convenient to have a JDK, but it
shouldn't be a requirement unless Spark does any compilation on the fly
behind the scenes (besides its use of Janino) or you need to use a tool like
keytool at container startup.

Best,
Stavros

On Wed, Apr 17, 2019 at 4:49 PM Rob Vesse  wrote:

> Folks
>
>
>
> For those using the Kubernetes support and building custom images are you
> using a JDK or a JRE in the container images?
>
>
>
> Using a JRE saves a reasonable chunk of image size (about 50MB with our
> preferred Linux distro) but I didn’t want to make this change if there was
> a reason to have a JDK available.  Certainly the official project
> integration tests run just fine with a JRE based image
>
>
>
> Currently the project’s official Docker files use openjdk:8-alpine as a
> base, which includes a full JDK, so I didn’t know if that was intentional or
> just convenience?
>
>
>
> Thanks,
>
>
>
> Rob
>


Re: [VOTE] Release Apache Spark 2.4.1 (RC8)

2019-03-20 Thread Stavros Kontopoulos
+1  (non-binding)

On Wed, Mar 20, 2019 at 8:33 AM Sean Owen  wrote:

> (Only the PMC can veto a release)
> That doesn't look like a regression. I get that it's important, but I
> don't see that it should block this release.
>
> On Tue, Mar 19, 2019 at 11:00 PM Darcy Shen 
> wrote:
> >
> > -1
> >
> > please backport SPARK-27160, a correctness issue in the ORC native
> reader.
> >
> > see https://github.com/apache/spark/pull/24092
> >
> >
> >  On Wed, 20 Mar 2019 06:21:29 +0800 DB Tsai 
> wrote 
> >
> > Please vote on releasing the following candidate as Apache Spark version
> 2.4.1.
> >
> > The vote is open until March 23 PST and passes if a majority +1 PMC
> votes are cast, with
> > a minimum of 3 +1 votes.
> >
> > [ ] +1 Release this package as Apache Spark 2.4.1
> > [ ] -1 Do not release this package because ...
> >
> > To learn more about Apache Spark, please see http://spark.apache.org/
> >
> > The tag to be voted on is v2.4.1-rc8 (commit
> 746b3ddee6f7ad3464e326228ea226f5b1f39a41):
> > https://github.com/apache/spark/tree/v2.4.1-rc8
> >
> > The release files, including signatures, digests, etc. can be found at:
> > https://dist.apache.org/repos/dist/dev/spark/v2.4.1-rc8-bin/
> >
> > Signatures used for Spark RCs can be found in this file:
> > https://dist.apache.org/repos/dist/dev/spark/KEYS
> >
> > The staging repository for this release can be found at:
> > https://repository.apache.org/content/repositories/orgapachespark-1318/
> >
> > The documentation corresponding to this release can be found at:
> > https://dist.apache.org/repos/dist/dev/spark/v2.4.1-rc8-docs/
> >
> > The list of bug fixes going into 2.4.1 can be found at the following URL:
> > https://issues.apache.org/jira/projects/SPARK/versions/2.4.1
> >
> > FAQ
> >
> > =
> > How can I help test this release?
> > =
> >
> > If you are a Spark user, you can help us test this release by taking
> > an existing Spark workload and running on this release candidate, then
> > reporting any regressions.
> >
> > If you're working in PySpark you can set up a virtual env and install
> > the current RC and see if anything important breaks, in the Java/Scala
> > you can add the staging repository to your projects resolvers and test
> > with the RC (make sure to clean up the artifact cache before/after so
> > you don't end up building with a out of date RC going forward).
> >
> > ===
> > What should happen to JIRA tickets still targeting 2.4.1?
> > ===
> >
> > The current list of open tickets targeted at 2.4.1 can be found at:
> > https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 2.4.1
> >
> > Committers should look at those and triage. Extremely important bug
> > fixes, documentation, and API tweaks that impact compatibility should
> > be worked on immediately. Everything else please retarget to an
> > appropriate release.
> >
> > ==
> > But my bug isn't fixed?
> > ==
> >
> > In order to make timely releases, we will typically not hold the
> > release unless the bug in question is a regression from the previous
> > release. That being said, if there is something which is a regression
> > that has not been correctly targeted please ping me or a committer to
> > help target the issue.
> >
> >
> > DB Tsai | Siri Open Source Technologies [not a contribution] |  Apple,
> Inc
> >
> >
> > -
> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >
> >
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>

-- 
Stavros Kontopoulos

*Senior Software Engineer*
*Lightbend, Inc.*

*p: +30 6977967274*
*e: stavros.kontopou...@lightbend.com* 


Re: queryable state & streaming

2019-03-18 Thread Stavros Kontopoulos
Not really; if we agree that we want this, I can put together a design
document and take it from there. There was also a discussion in another
thread about adding RocksDB as a memory storage, which is related to this task.

Best,
Stavros

On Sun, Mar 17, 2019 at 4:42 AM kant kodali  wrote:

> Any update on this?
>
> On Wed, Oct 24, 2018 at 4:26 PM Arun Mahadevan  wrote:
>
>> I don't think separate API or RPCs etc might be necessary for queryable
>> state if the state can be exposed as just another datasource. Then the sql
>> queries can be issued against it just like executing sql queries against
>> any other data source.
>>
>> For now I think the "memory" sink could be used  as a sink and run
>> queries against it but I agree it does not scale for large states.
>>
>> On Sun, 21 Oct 2018 at 21:24, Jungtaek Lim  wrote:
>>
>>> It doesn't seem Spark has workarounds other than storing output into
>>> external storages, so +1 on having this.
>>>
>>> My major concern on implementing queryable state in structured streaming
>>> is "Are all states available on executors at any time while query is
>>> running?" Querying state shouldn't affect the running query. Given that
>>> state is huge and default state provider is loading state in memory, we may
>>> not want to load one more redundant snapshot of state: we want to always
>>> load "current state" which query is also using. (For sure, Queryable state
>>> should be read-only.)
>>>
>>> Regarding improvement of local state, I guess it is ideal to leverage
>>> embedded db, like Kafka and Flink are doing. The difference will not be
>>> only reading state from non-heap, but also how to take a snapshot and store
>>> delta. We may want to check snapshotting works well with small batch
>>> interval, and find alternative approach when it doesn't. Sounds like it is
>>> a huge item and can be handled individually.
>>>
>>> - Jungtaek Lim (HeartSaVioR)
>>>
>>> On Sat, Dec 9, 2017 at 10:51 PM, Stavros Kontopoulos wrote:
>>>
>>>> Nice I was looking for a jira. So I agree we should justify why we are
>>>> building something. Now to that direction here is what I have seen from my
>>>> experience.
>>>> People quite often use state within their streaming app and may have
>>>> large states (TBs). Shortening the pipeline by not having to copy data (to
>>>> Cassandra for example for serving) is an advantage, in terms of at least
>>>> latency and complexity.
>>>> This can be true if we take advantage of state checkpointing (locally could
>>>> be RocksDB or in general HDFS the latter is currently supported)  along
>>>> with an API to efficiently query data.
>>>> Some use cases I see:
>>>>
>>>> - real-time dashboards and real-time reporting, the faster the better
>>>> - monitoring of state for operational reasons, app health etc...
>>>> - integrating with external services via an API eg. making accessible
>>>>  aggregations over time windows to some third party service within your
>>>> system
>>>>
>>>> Regarding requirements here are some of them:
>>>> - support of an API to expose state (could be done at the spark
>>>> driver), like rest.
>>>> - supporting dynamic allocation (not sure how it affects state
>>>> management)
>>>> - an efficient way to talk to executors to get the state (rpc?)
>>>> - making local state more efficient and easier accessible with an
>>>> embedded db (I dont think this is supported from what I see, maybe wrong)?
>>>> Some people are already working with such techs and some stuff could be
>>>> re-used: https://issues.apache.org/jira/browse/SPARK-20641
>>>>
>>>> Best,
>>>> Stavros
>>>>
>>>>
>>>> On Fri, Dec 8, 2017 at 10:32 PM, Michael Armbrust <
>>>> mich...@databricks.com> wrote:
>>>>
>>>>> https://issues.apache.org/jira/browse/SPARK-16738
>>>>>
>>>>> I don't believe anyone is working on it yet.  I think the most useful
>>>>> thing is to start enumerating requirements and use cases and then we can
>>>>> talk about how to build it.
>>>>>
>>>>> On Fri, Dec 8, 2017 at 10:47 AM, Stavros Kontopoulos <
>>>>> st.kontopou...@gmail.com> wrote:
>>>>>
>>>>>> Cool Burak do you have a poi

Re: Spark job status on Kubernetes

2019-03-13 Thread Stavros Kontopoulos
AFAIK, Completed can happen in case of failures as well; check here:
https://github.com/kubernetes/kubernetes/blob/7f23a743e8c23ac6489340bbb34fa6f1d392db9d/pkg/client/conditions/conditions.go#L61

The phase of the pod should be `Succeeded` to draw that conclusion. This is
how the Spark operator uses that info to deduce the application status:
https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/pkg/controller/sparkapplication/sparkapp_util.go#L75
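
A minimal sketch of that check using the fabric8 client (the namespace and pod
name are illustrative, and error handling is omitted):

import io.fabric8.kubernetes.client.DefaultKubernetesClient

object DriverPodPhaseSketch {
  def main(args: Array[String]): Unit = {
    val client = new DefaultKubernetesClient()  // picks up the usual kubeconfig / in-cluster config
    try {
      val pod = client.pods().inNamespace("spark-jobs").withName("my-app-driver").get()
      val phase = Option(pod).map(_.getStatus.getPhase).getOrElse("Unknown")
      // Only the pod phase "Succeeded" is a safe signal that the application finished OK;
      // "Failed" (or a missing pod) should be treated as a failure.
      val succeeded = phase == "Succeeded"
      println(s"driver pod phase: $phase, success = $succeeded")
    } finally {
      client.close()
    }
  }
}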

Stavros

On Wed, Mar 13, 2019 at 5:48 PM Chandu Kavar  wrote:

> Hi,
>
> We are running Spark jobs to Kubernetes (using Spark 2.4.0 and cluster
> mode). To get the status of the spark job we check the status of the driver
> pod (using Kubernetes REST API).
>
> Is it okay to assume that spark job is successful if the status of the
> driver pod is COMPLETED?
>
> Thanks,
> Chandu
>
>


Re: [VOTE] Release Apache Spark 2.4.1 (RC2)

2019-03-06 Thread Stavros Kontopoulos
Yes, it's a tough decision, and as we discussed today (
https://docs.google.com/document/d/1pnF38NF6N5eM8DlK088XUW85Vms4V2uTsGZvSp8MNIA
),
"the Kubernetes support window is 9 months, Spark's is two years". So we may
end up with old client versions on still-supported branches like 2.4.x in the
future.
That gives us no choice but to upgrade, if we want to be on the safe side.
We have tested 3.0.0 with 1.11 internally and it works, but I don't know what
it means to run with old clients.


On Wed, Mar 6, 2019 at 7:54 PM Sean Owen  wrote:

> If the old client is basically unusable with the versions of K8S
> people mostly use now, and the new client still works with older
> versions, I could see including this in 2.4.1.
>
> Looking at
> https://github.com/fabric8io/kubernetes-client#compatibility-matrix
> it seems like the 4.1.1 client is needed for 1.10 and above. However
> it no longer supports 1.7 and below.
> We have 3.0.x, and versions through 4.0.x of the client support the
> same K8S versions, so no real middle ground here.
>
> 1.7.0 came out June 2017, it seems. 1.10 was March 2018. Minor release
> branches are maintained for 9 months per
> https://kubernetes.io/docs/setup/version-skew-policy/
>
> Spark 2.4.0 came in Nov 2018. I suppose we could say it should have
> used the newer client from the start as at that point (?) 1.7 and
> earlier were already at least 7 months past EOL.
> If we update the client in 2.4.1, versions of K8S as recently
> 'supported' as a year ago won't work anymore. I'm guessing there are
> still 1.7 users out there? That wasn't that long ago but if the
> project and users generally move fast, maybe not.
>
> Normally I'd say, that's what the next minor release of Spark is for;
> update if you want later infra. But there is no Spark 2.5.
> I presume downstream distros could modify the dependency easily (?) if
> needed and maybe already do. It wouldn't necessarily help end users.
>
> Does the 3.0.x client not work at all with 1.10+, or is it just unsupported?
> If it 'basically works but no guarantees' I'd favor not updating. If
> it doesn't work at all, hm. That's tough. I think I'd favor updating
> the client but think it's a tough call both ways.
>
>
>
> On Wed, Mar 6, 2019 at 11:14 AM Stavros Kontopoulos
>  wrote:
> >
> > Yes Shane Knapp has done the work for that already,  and also tests
> pass, I am working on a PR now, I could submit it for the 2.4 branch .
> > I understand that this is a major dependency update, but the problem I
> see is that the client version is so old that I dont think it makes
> > much sense for current users who are on k8s 1.10, 1.11 etc(
> https://github.com/fabric8io/kubernetes-client#compatibility-matrix,
> 3.0.0 does not even exist in there).
> > I dont know what it means to use that old version with current k8s
> clusters in terms of bugs etc.
>


Re: [VOTE] Release Apache Spark 2.4.1 (RC2)

2019-03-06 Thread Stavros Kontopoulos
Yes, Shane Knapp has done the work for that already, and the tests pass. I
am working on a PR now; I could submit it for the 2.4 branch.
I understand that this is a major dependency update, but the problem I see
is that the client version is so old that I don't think it makes
much sense for current users who are on k8s 1.10, 1.11 etc. (
https://github.com/fabric8io/kubernetes-client#compatibility-matrix, 3.0.0
does not even exist in there).
I don't know what it means to use that old version with current k8s clusters
in terms of bugs etc.

On Wed, Mar 6, 2019 at 6:32 PM shane knapp  wrote:

> On Wed, Mar 6, 2019 at 7:17 AM Sean Owen  wrote:
>
>> The problem is that that's a major dependency upgrade in a maintenance
>> release. It didn't seem to work when we applied it to master. I don't
>> think it would block a release.
>>
>> i tested the k8s client 4.1.2 against master a couple of weeks back and
> it worked fine.  i will doubly confirm when i get in to the office today.
>
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


Re: [VOTE] Release Apache Spark 2.4.1 (RC2)

2019-03-06 Thread Stavros Kontopoulos
We need to resolve this https://issues.apache.org/jira/browse/SPARK-26742
as well for 2.4.1, to make k8s support meaningful as many people are now on
1.11+

Stavros

On Tue, Mar 5, 2019 at 3:12 PM Saisai Shao  wrote:

> Hi DB,
>
> I saw that we already have 6 RCs, but the vote I can search by now was
> RC2, were they all canceled?
>
> Thanks
> Saisai
>
> DB Tsai  于2019年2月22日周五 上午4:51写道:
>
>> I am cutting a new rc4 with fix from Felix. Thanks.
>>
>> Sincerely,
>>
>> DB Tsai
>> --
>> Web: https://www.dbtsai.com
>> PGP Key ID: 0359BC9965359766
>>
>> On Thu, Feb 21, 2019 at 8:57 AM Felix Cheung 
>> wrote:
>> >
>> > I merged the fix to 2.4.
>> >
>> >
>> > 
>> > From: Felix Cheung 
>> > Sent: Wednesday, February 20, 2019 9:34 PM
>> > To: DB Tsai; Spark dev list
>> > Cc: Cesar Delgado
>> > Subject: Re: [VOTE] Release Apache Spark 2.4.1 (RC2)
>> >
>> > Could you hold for a bit - I have one more fix to get in
>> >
>> >
>> > 
>> > From: d_t...@apple.com on behalf of DB Tsai 
>> > Sent: Wednesday, February 20, 2019 12:25 PM
>> > To: Spark dev list
>> > Cc: Cesar Delgado
>> > Subject: Re: [VOTE] Release Apache Spark 2.4.1 (RC2)
>> >
>> > Okay. Let's fail rc2, and I'll prepare rc3 with SPARK-26859.
>> >
>> > DB Tsai | Siri Open Source Technologies [not a contribution] |  Apple,
>> Inc
>> >
>> > > On Feb 20, 2019, at 12:11 PM, Marcelo Vanzin
>>  wrote:
>> > >
>> > > Just wanted to point out that
>> > > https://issues.apache.org/jira/browse/SPARK-26859 is not in this RC,
>> > > and is marked as a correctness bug. (The fix is in the 2.4 branch,
>> > > just not in rc2.)
>> > >
>> > > On Wed, Feb 20, 2019 at 12:07 PM DB Tsai 
>> wrote:
>> > >>
>> > >> Please vote on releasing the following candidate as Apache Spark
>> version 2.4.1.
>> > >>
>> > >> The vote is open until Feb 24 PST and passes if a majority +1 PMC
>> votes are cast, with
>> > >> a minimum of 3 +1 votes.
>> > >>
>> > >> [ ] +1 Release this package as Apache Spark 2.4.1
>> > >> [ ] -1 Do not release this package because ...
>> > >>
>> > >> To learn more about Apache Spark, please see
>> http://spark.apache.org/
>> > >>
>> > >> The tag to be voted on is v2.4.1-rc2 (commit
>> 229ad524cfd3f74dd7aa5fc9ba841ae223caa960):
>> > >> https://github.com/apache/spark/tree/v2.4.1-rc2
>> > >>
>> > >> The release files, including signatures, digests, etc. can be found
>> at:
>> > >> https://dist.apache.org/repos/dist/dev/spark/v2.4.1-rc2-bin/
>> > >>
>> > >> Signatures used for Spark RCs can be found in this file:
>> > >> https://dist.apache.org/repos/dist/dev/spark/KEYS
>> > >>
>> > >> The staging repository for this release can be found at:
>> > >>
>> https://repository.apache.org/content/repositories/orgapachespark-1299/
>> > >>
>> > >> The documentation corresponding to this release can be found at:
>> > >> https://dist.apache.org/repos/dist/dev/spark/v2.4.1-rc2-docs/
>> > >>
>> > >> The list of bug fixes going into 2.4.1 can be found at the following
>> URL:
>> > >> https://issues.apache.org/jira/projects/SPARK/versions/2.4.1
>> > >>
>> > >> FAQ
>> > >>
>> > >> =
>> > >> How can I help test this release?
>> > >> =
>> > >>
>> > >> If you are a Spark user, you can help us test this release by taking
>> > >> an existing Spark workload and running on this release candidate,
>> then
>> > >> reporting any regressions.
>> > >>
>> > >> If you're working in PySpark you can set up a virtual env and install
>> > >> the current RC and see if anything important breaks, in the
>> Java/Scala
>> > >> you can add the staging repository to your projects resolvers and
>> test
>> > >> with the RC (make sure to clean up the artifact cache before/after so
>> > >> you don't end up building with a out of date RC going forward).
>> > >>
>> > >> ===
>> > >> What should happen to JIRA tickets still targeting 2.4.1?
>> > >> ===
>> > >>
>> > >> The current list of open tickets targeted at 2.4.1 can be found at:
>> > >> https://issues.apache.org/jira/projects/SPARK and search for
>> "Target Version/s" = 2.4.1
>> > >>
>> > >> Committers should look at those and triage. Extremely important bug
>> > >> fixes, documentation, and API tweaks that impact compatibility should
>> > >> be worked on immediately. Everything else please retarget to an
>> > >> appropriate release.
>> > >>
>> > >> ==
>> > >> But my bug isn't fixed?
>> > >> ==
>> > >>
>> > >> In order to make timely releases, we will typically not hold the
>> > >> release unless the bug in question is a regression from the previous
>> > >> release. That being said, if there is something which is a regression
>> > >> that has not been correctly targeted please ping me or a committer to
>> > >> help target the issue.
>> > >>
>> > >>
>> > >> DB Tsai | Siri Open Source 

Re: DataSourceV2 sync notes - 20 Feb 2019

2019-03-05 Thread Stavros Kontopoulos
Thanks Ryan!

On Tue, Mar 5, 2019 at 7:19 PM Ryan Blue  wrote:

> Everyone is welcome to join this discussion. Just send me an e-mail to get
> added to the invite.
>
> Stavros, I'll add you.
>
> rb
>
> On Tue, Mar 5, 2019 at 5:43 AM Stavros Kontopoulos <
> stavros.kontopou...@lightbend.com> wrote:
>
>> Thanks for the update, is this meeting open for other people to join?
>>
>> Stavros
>>
>> On Thu, Feb 21, 2019 at 10:56 PM Ryan Blue 
>> wrote:
>>
>>> Here are my notes from the DSv2 sync last night. As always, if you have
>>> corrections, please reply with them. And if you’d like to be included on
>>> the invite to participate in the next sync (6 March), send me an email.
>>>
>>> Here’s a quick summary of the topics where we had consensus last night:
>>>
>>>- The behavior of v1 sources needs to be documented to come up with
>>>a migration plan
>>>- Spark 3.0 should include DSv2, even if it would delay the release
>>>(pending community discussion and vote)
>>>- Design for the v2 Catalog plugin system
>>>- V2 catalog approach of separate TableCatalog, FunctionCatalog, and
>>>ViewCatalog interfaces
>>>- Common v2 Table metadata should be schema, partitioning, and
>>>string-map of properties; leaving out sorting for now. (Ready to vote on
>>>metadata SPIP.)
>>>
>>> *Topics*:
>>>
>>>- Issues raised by ORC v2 commit
>>>- Migration to v2 sources
>>>- Roadmap and current blockers
>>>- Catalog plugin system
>>>- Catalog API separate interfaces approach
>>>- Catalog API metadata (schema, partitioning, and properties)
>>>- Public catalog API proposal
>>>
>>> *Notes*:
>>>
>>>- Issues raised by ORC v2 commit
>>>   - Ryan: Disabled change to use v2 by default in PR for overwrite
>>>   plans: tests rely on CTAS, which is not implemented in v2.
>>>   - Wenchen: suggested using a StagedTable to work around not
>>>   having a CTAS finished. TableProvider could create a staged table.
>>>   - Ryan: Using StagedTable doesn’t make sense to me. It was
>>>   intended to solve a different problem (atomicity). Adding an 
>>> interface to
>>>   create a staged table either requires the same metadata as CTAS or 
>>> requires
>>>   a blank staged table, which isn’t the same concept: these staged 
>>> tables
>>>   would behave entirely differently than the ones for atomic operations.
>>>   Better to spend time getting CTAS done and work through the long-term 
>>> plan
>>>   than to hack around it.
>>>   - Second issue raised by the ORC work: how to support tables that
>>>   use different validations.
>>>   - Ryan: What Gengliang’s PRs are missing is a clear definition of
>>>   what tables require different validation and what that validation 
>>> should
>>>   be. In some cases, CTAS is validated against existing data [Ed: this 
>>> is
>>>   PreprocessTableCreation] and in some cases, Append has no validation
>>>   because the table doesn’t exist. What isn’t clear is when these 
>>> validations
>>>   are applied.
>>>   - Ryan: Without knowing exactly how v1 works, we can’t mirror
>>>   that behavior in v2. Building a way to turn off validation is going 
>>> to be
>>>   needed, but is insufficient without knowing when to apply it.
>>>   - Ryan: We also don’t know if it will make sense to maintain all
>>>   of these rules to mimic v1 behavior. In v1, CTAS and Append can both 
>>> write
>>>   to existing tables, but use different rules to validate. What are the
>>>   differences between them? It is unlikely that Spark will support both 
>>> as
>>>   options, if that is even possible. [Ed: see later discussion on 
>>> migration
>>>   that continues this.]
>>>   - Gengliang: Using SaveMode is an option.
>>>   - Ryan: Using SaveMode only appears to fix this, but doesn’t
>>>   actually test v2. Using SaveMode appears to work because it disables 
>>> all
>>>   validation and uses code from v1 that will “create” tables by 
>>> writing. But
>>>   this isn’t helpful for the v2 goal of having defined and reliable 
>>> behavior.
>>>   - Gengliang: SaveMode is not correctly transla

Re: DataSourceV2 sync notes - 20 Feb 2019

2019-03-05 Thread Stavros Kontopoulos
Thanks for the update. Is this meeting open for other people to join?

Stavros

On Thu, Feb 21, 2019 at 10:56 PM Ryan Blue 
wrote:

> Here are my notes from the DSv2 sync last night. As always, if you have
> corrections, please reply with them. And if you’d like to be included on
> the invite to participate in the next sync (6 March), send me an email.
>
> Here’s a quick summary of the topics where we had consensus last night:
>
>- The behavior of v1 sources needs to be documented to come up with a
>migration plan
>- Spark 3.0 should include DSv2, even if it would delay the release
>(pending community discussion and vote)
>- Design for the v2 Catalog plugin system
>- V2 catalog approach of separate TableCatalog, FunctionCatalog, and
>ViewCatalog interfaces
>- Common v2 Table metadata should be schema, partitioning, and
>string-map of properties; leaving out sorting for now. (Ready to vote on
>metadata SPIP.)
>
> *Topics*:
>
>- Issues raised by ORC v2 commit
>- Migration to v2 sources
>- Roadmap and current blockers
>- Catalog plugin system
>- Catalog API separate interfaces approach
>- Catalog API metadata (schema, partitioning, and properties)
>- Public catalog API proposal
>
> *Notes*:
>
>- Issues raised by ORC v2 commit
>   - Ryan: Disabled change to use v2 by default in PR for overwrite
>   plans: tests rely on CTAS, which is not implemented in v2.
>   - Wenchen: suggested using a StagedTable to work around not having
>   a CTAS finished. TableProvider could create a staged table.
>   - Ryan: Using StagedTable doesn’t make sense to me. It was intended
>   to solve a different problem (atomicity). Adding an interface to create 
> a
>   staged table either requires the same metadata as CTAS or requires a 
> blank
>   staged table, which isn’t the same concept: these staged tables would
>   behave entirely differently than the ones for atomic operations. Better 
> to
>   spend time getting CTAS done and work through the long-term plan than to
>   hack around it.
>   - Second issue raised by the ORC work: how to support tables that
>   use different validations.
>   - Ryan: What Gengliang’s PRs are missing is a clear definition of
>   what tables require different validation and what that validation should
>   be. In some cases, CTAS is validated against existing data [Ed: this is
>   PreprocessTableCreation] and in some cases, Append has no validation
>   because the table doesn’t exist. What isn’t clear is when these 
> validations
>   are applied.
>   - Ryan: Without knowing exactly how v1 works, we can’t mirror that
>   behavior in v2. Building a way to turn off validation is going to be
>   needed, but is insufficient without knowing when to apply it.
>   - Ryan: We also don’t know if it will make sense to maintain all of
>   these rules to mimic v1 behavior. In v1, CTAS and Append can both write 
> to
>   existing tables, but use different rules to validate. What are the
>   differences between them? It is unlikely that Spark will support both as
>   options, if that is even possible. [Ed: see later discussion on 
> migration
>   that continues this.]
>   - Gengliang: Using SaveMode is an option.
>   - Ryan: Using SaveMode only appears to fix this, but doesn’t
>   actually test v2. Using SaveMode appears to work because it disables all
>   validation and uses code from v1 that will “create” tables by writing. 
> But
>   this isn’t helpful for the v2 goal of having defined and reliable 
> behavior.
>   - Gengliang: SaveMode is not correctly translated. Append could
>   mean AppendData or CTAS.
>   - Ryan: This is why we need to focus on finishing the v2 plans: so
>   we can correctly translate the SaveMode into the right plan. That 
> depends
>   on having a catalog for CTAS and for checking whether a table exists. (A
>   sketch of this translation follows at the end of these notes.)
>   - Wenchen: Catalog doesn’t support path tables, so how does this
>   help?
>   - Ryan: The multi-catalog identifiers proposal includes a way to
>   pass paths as CatalogIdentifiers. [Ed: see PathIdentifier]. This allows 
> a
>   catalog implementation to handle path-based tables. The identifier will
>   also have a method to test whether the identifier is a path identifier 
> and
>   catalogs are not required to support path identifiers.
>- Migration to v2 sources
>   - Hyukjin: Once the ORC upgrade is done how will we move from v1 to
>   v2?
>   - Ryan: We will need to develop v1 and v2 in parallel. There are
>   many code paths in v1 and we don’t know exactly what they do. We first 
> need
>   to know what they do and make a migration plan after that.
>   - Hyukjin: What if there are many behavior differences? Will this
>   require an API to opt in for each one?
>   - Ryan: Without knowing how v1 
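
To illustrate the point about SaveMode.Append being ambiguous until a catalog can be consulted, here is a minimal Scala sketch. The types below are simplified stand-ins invented for this example; they only echo the proposed TableCatalog and the agreed table metadata (schema, partitioning, string-map properties) and are not Spark classes, so the signatures are assumptions rather than the final API.

import org.apache.spark.sql.types.StructType

// Simplified stand-ins for the proposed v2 abstractions (not Spark classes).
case class Identifier(namespace: Seq[String], name: String)

trait Table {
  def schema: StructType                  // agreed metadata: schema,
  def partitioning: Seq[String]           // partitioning (simplified here), and
  def properties: Map[String, String]     // a string map of properties
}

trait TableCatalog {
  def tableExists(ident: Identifier): Boolean
  def loadTable(ident: Identifier): Table
}

sealed trait WritePlan
case class AppendData(ident: Identifier) extends WritePlan
case class CreateTableAsSelect(ident: Identifier, schema: StructType) extends WritePlan

object SaveModeTranslationSketch {
  // SaveMode.Append from the v1 API is ambiguous on its own: only once a catalog
  // can answer "does this table exist?" can it be translated into a well-defined
  // v2 plan (AppendData vs. CTAS), which is the dependency discussed in the notes.
  def translateAppend(catalog: TableCatalog,
                      ident: Identifier,
                      querySchema: StructType): WritePlan =
    if (catalog.tableExists(ident)) AppendData(ident)
    else CreateTableAsSelect(ident, querySchema)
}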

Re: Welcome Jose Torres as a Spark committer

2019-01-30 Thread Stavros Kontopoulos
Congrats Jose!

On Wed, Jan 30, 2019 at 10:44 AM Gabor Somogyi 
wrote:

> Congrats Jose!
>
> BR,
> G
>
> On Wed, Jan 30, 2019 at 9:05 AM Nuthan Reddy 
> wrote:
>
>> Congrats Jose,
>>
>> Regards,
>> Nuthan Reddy
>>
>>
>>
>> On Wed, Jan 30, 2019 at 1:22 PM Marco Gaido 
>> wrote:
>>
>>> Congrats, Jose!
>>>
>>> Bests,
>>> Marco
>>>
>>> On Wed, Jan 30, 2019 at 3:17 AM, JackyLee  wrote:
>>>
 Congrats, Joe!

 Best,
 Jacky



 --
 Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

 -
 To unsubscribe e-mail: dev-unsubscr...@spark.apache.org




Re: [Discussion] Clarification regarding Stateful Aggregations over Structured Streaming

2018-12-16 Thread Stavros Kontopoulos
Hi,

The Databricks runtime, as you already know, has this enhancement, so it is
considered a good option if you want to decouple state from the JVM.
Some arguments for doing so, along with incremental snapshotting, are given in
the Flink paper: http://www.vldb.org/pvldb/vol10/p1718-carbone.pdf.
Also, timers implemented in RocksDB can give you higher scalability with very
large states (and many timers). I am not aware of the history behind the
FMGWS API (others could provide more info), but I was also looking at it
recently while thinking about an API for this:
https://issues.apache.org/jira/browse/SPARK-16738

Best,
Stavros

On Sun, Dec 16, 2018 at 7:58 PM Chitral Verma 
wrote:

> Hi Devs,
>
> For quite some time I've been looking at the structured streaming API to
> solve lots of use cases at my workplace, and I have some doubts I wanted to
> clarify regarding stateful aggregations over structured streaming.
>
> Currently, Spark provides the flatMapGroupsWithState (FMGWS) /
> mapGroupsWithState (MGWS) APIs to allow custom streaming aggregations by
> setting/updating intermediate `GroupState`, which may or may not expire.
> This GroupState is stored in the form of snapshots, and the latest snapshot is
> kept entirely in memory, which can be a memory-consuming approach and may
> result in OOMs.
>
> Other than this, in my opinion, FMGWS is not very flexible in terms of
> usage (the aggregation logic needs to be hand-written on Rows rather than
> expressed with Spark SQL built-in functions), and the timeouts require the
> query to make progress in order to expire keys.
>
> To remedy this I have contributed to this project
>  which basically moves the
> expiration logic to the state store (RocksDB), so the state store is no longer
> managed by the executor JVM, allowing true expiration of state with nanosecond
> precision.
>
> My question is: is there a specific reason the FMGWS API is designed the way
> it is, and are there any downsides to the approach I have mentioned above?
>
> Do let me know your thoughts.
>
> Thanks
>
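
To make the description above concrete, here is a minimal, self-contained sketch of the existing flatMapGroupsWithState API. The Event/RunningState/Update classes and the rate-source input are assumptions made up for illustration; this shows the current API surface (per-group state held by the state store, timeouts tied to query progress), not the RocksDB-backed changes proposed in the linked project.

import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

case class Event(user: String, value: Long)
case class RunningState(count: Long)
case class Update(user: String, count: Long, expired: Boolean)

object FmgwsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("fmgws-sketch").master("local[2]").getOrCreate()
    import spark.implicits._

    // A made-up streaming source, just so the example is runnable end to end.
    val events: Dataset[Event] = spark.readStream.format("rate").load()
      .select($"value").as[Long]
      .map(v => Event(s"user-${v % 10}", v))

    val updates = events
      .groupByKey(_.user)
      .flatMapGroupsWithState[RunningState, Update](
        OutputMode.Update, GroupStateTimeout.ProcessingTimeTimeout) {
        (user, values, state) =>
          if (state.hasTimedOut) {
            // Timeouts only fire when the query makes progress (a new micro-batch
            // runs), which is the limitation mentioned in the message above.
            val last = state.get
            state.remove()
            Iterator(Update(user, last.count, expired = true))
          } else {
            val updated = RunningState(state.getOption.map(_.count).getOrElse(0L) + values.size)
            state.update(updated)                    // per-group state kept by the state store
            state.setTimeoutDuration("30 minutes")
            Iterator(Update(user, updated.count, expired = false))
          }
      }

    updates.writeStream.outputMode("update").format("console").start()
    spark.streams.awaitAnyTermination()
  }
}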


Re: [ANNOUNCE] Announcing Apache Spark 2.4.0

2018-11-08 Thread Stavros Kontopoulos
Awesome!

On Thu, Nov 8, 2018 at 9:36 PM, Jules Damji  wrote:

> Indeed!
>
> Sent from my iPhone
> Pardon the dumb thumb typos :)
>
> On Nov 8, 2018, at 11:31 AM, Dongjoon Hyun 
> wrote:
>
> Finally, thank you all. Especially, thanks to the release manager, Wenchen!
>
> Bests,
> Dongjoon.
>
>
> On Thu, Nov 8, 2018 at 11:24 AM Wenchen Fan  wrote:
>
>> + user list
>>
>> On Fri, Nov 9, 2018 at 2:20 AM Wenchen Fan  wrote:
>>
>>> resend
>>>
>>> On Thu, Nov 8, 2018 at 11:02 PM Wenchen Fan  wrote:
>>>


 -- Forwarded message -
 From: Wenchen Fan 
 Date: Thu, Nov 8, 2018 at 10:55 PM
 Subject: [ANNOUNCE] Announcing Apache Spark 2.4.0
 To: Spark dev list 


 Hi all,

 Apache Spark 2.4.0 is the fifth release in the 2.x line. This release
 adds Barrier Execution Mode for better integration with deep learning
 frameworks, introduces 30+ built-in and higher-order functions to make it
 easier to work with complex data types, and improves the K8s integration,
 along with experimental Scala 2.12 support. Other major updates include the
 built-in Avro data source, the Image data source, flexible streaming sinks,
 elimination of the 2GB block size limitation during transfer, and Pandas UDF
 improvements.
 In addition, this release continues to focus on usability, stability, and
 polish while resolving around 1100 tickets.

 We'd like to thank our contributors and users for their contributions
 and early feedback to this release. This release would not have been
 possible without you.

 To download Spark 2.4.0, head over to the download page:
 http://spark.apache.org/downloads.html

 To view the release notes: https://spark.apache.org/
 releases/spark-release-2-4-0.html

 Thanks,
 Wenchen

 PS: If you see any issues with the release notes, webpage or published
 artifacts, please contact me directly off-list.

>>>
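
As a small illustration of the higher-order functions mentioned in the announcement, here is a self-contained Scala sketch that uses them through SQL expressions; the sample data and names are made up for this example.

import org.apache.spark.sql.SparkSession

object HigherOrderFunctionsExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("hof-example")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val df = Seq((1, Seq(1, 2, 3)), (2, Seq(4, 5))).toDF("id", "xs")

    // transform/filter/aggregate are among the higher-order functions added in 2.4;
    // they take a lambda over the array elements instead of requiring a UDF.
    df.selectExpr(
      "id",
      "transform(xs, x -> x + 1) AS incremented",
      "filter(xs, x -> x % 2 = 0) AS evens",
      "aggregate(xs, 0, (acc, x) -> acc + x) AS total"
    ).show(truncate = false)

    spark.stop()
  }
}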


Re: Test and support only LTS JDK release?

2018-11-07 Thread Stavros Kontopoulos
Red Hat:
https://access.redhat.com/articles/1299013#OpenJDK_Lifecycle_Dates_and_RHEL_versions

Stavros

On Wed, Nov 7, 2018 at 12:13 PM, Kazuaki Ishizaki 
wrote:

> This entry includes a good figure for support lifecycle.
> https://www.azul.com/products/zulu-and-zulu-enterprise/zulu-
> enterprise-java-support-options/
>
> Kazuaki Ishizaki,
>
>
>
> From: Marcelo Vanzin
> To: Felix Cheung
> Cc: Ryan Blue , sn...@snazy.de, dev <
> dev@spark.apache.org>, Cesar Delgado
> Date: 2018/11/07 08:29
> Subject: Re: Test and support only LTS JDK release?
> --
>
>
>
> https://www.oracle.com/technetwork/java/javase/eol-135779.html
> On Tue, Nov 6, 2018 at 2:56 PM Felix Cheung 
> wrote:
> >
> > Is there a list of LTS release that I can reference?
> >
> >
> > 
> > From: Ryan Blue 
> > Sent: Tuesday, November 6, 2018 1:28 PM
> > To: sn...@snazy.de
> > Cc: Spark Dev List; cdelg...@apple.com
> > Subject: Re: Test and support only LTS JDK release?
> >
> > +1 for supporting LTS releases.
> >
> > On Tue, Nov 6, 2018 at 11:48 AM Robert Stupp  wrote:
> >>
> >> +1 on supporting LTS releases.
> >>
> >> VM distributors (RedHat, Azul - to name two) want to provide patches to
> LTS versions (i.e. into http://hg.openjdk.java.net/jdk-updates/jdk11u/).
> How that will play out in reality ... I don't know. Whether Oracle will
> contribute to that repo for 8 after it's EOL and 11 after the 6 month cycle
> ... we will see. Most Linux distributions promised(?) long-term support for
> Java 11 in their LTS releases (e.g. Ubuntu 18.04). I am not sure what that
> exactly means ... whether they will actively provide patches to OpenJDK or
> whether they just build from source.
> >>
> >> But considering that, I think it's definitely worth to at least keep an
> eye on Java 12 and 13 - even if those are just EA. Java 12 for example does
> already forbid some "dirty tricks" that are still possible in Java 11.
> >>
> >>
> >> On 11/6/18 8:32 PM, DB Tsai wrote:
> >>
> >> OpenJDK will follow Oracle's release cycle, https://openjdk.java.net/
> projects/jdk/, a strict six-month model. I'm not familiar with other
> non-Oracle VMs and Redhat support.
> >>
> >> DB Tsai  |  Siri Open Source Technologies [not a contribution]  |  
> Apple, Inc
> >>
> >> On Nov 6, 2018, at 11:26 AM, Reynold Xin  wrote:
> >>
> >> What does OpenJDK do and other non-Oracle VMs? I know there was a lot
> of discussions from Redhat etc to support.
> >>
> >>
> >> On Tue, Nov 6, 2018 at 11:24 AM DB Tsai  wrote:
> >>>
> >>> Given Oracle's new 6-month release model, I feel the only realistic
> option is to test and support only LTS JDK releases such as JDK 11 and future
> LTS releases. I would like to have a discussion on this in the Spark community.
> >>>
> >>> Thanks,
> >>>
> >>> DB Tsai  |  Siri Open Source Technologies [not a contribution]  |  
> Apple, Inc
> >>>
> >>
> >> --
> >> Robert Stupp
> >> @snazy
> >
> >
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix
>
>
>
> --
> Marcelo
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>
>
>
>


Re: Plan on Structured Streaming in next major/minor release?

2018-10-30 Thread Stavros Kontopoulos
@Michael any update about queryable state?

Stavros

On Tue, Oct 30, 2018 at 10:43 PM, Michael Armbrust 
wrote:

> Thanks for bringing up some possible future directions for streaming. Here
> are some thoughts:
>  - I personally view all of the activity on Spark SQL also as activity on
> Structured Streaming. The great thing about building streaming on catalyst
> / tungsten is that continued improvement to these components improves
> streaming use cases as well.
>  - I think the biggest on-going project is DataSourceV2, whose goal is to
> provide a stable / performant API for streaming and batch data sources to
> plug in.  I think connectivity to many different systems is one of the most
> powerful aspects of Spark and right now there is no stable public API for
> streaming. A lot of committer / PMC time is being spent here at the moment.
>  - As you mention, 2.4.0 significantly improves the built in connectivity
> for Kafka, giving us the ability to read exactly once from a topic being
> written to transactional producers. I think projects to extend this
> guarantee to the Kafka Sink and also to improve authentication with Kafka
> are a great idea (and it seems like there is a lot of review activity on
> the latter).
>
> You bring up some other possible projects like session window support.
> This is an interesting project, but as far as I can tell there is
> still a lot of work that would need to be done before this feature could be
> merged.  We'd need to understand how it works with update mode amongst
> other things. Additionally, a 3000+ line patch is really time consuming to
> review. This coupled with the fact that all the users that I have
> interacted with need "session windows + some custom business logic"
> (usually implemented with flatMapGroupsWithState), mean that I'm more
> inclined to direct limited review bandwidth to incremental improvements in
> that feature than to something large/new. This is not to say that this
> feature isn't useful / shouldn't be merge, just a bit of explanation as to
> why there might be less activity here than you would hope.
>
> Similarly, multiple aggregations are an often requested feature.  However,
> fundamentally, this is going to be a fairly large investment (I think we'd
> need to combine the unsupported operation checker and the query planner and
> also create a high performance (i.e. whole stage code-gened) aggregation
> operator that understands negation).
>
> Thanks again for starting the discussion, and looking forward to hearing
> about what features are most requested!
>
> On Tue, Oct 30, 2018 at 12:23 AM Jungtaek Lim  wrote:
>
>> Adding more: again, it doesn't mean they're feasible to do. Just a kind
>> of brainstorming.
>>
>> * SPARK-20568: Delete files after processing in structured streaming
>>   * There hasn't been consensus regarding supporting this: there were
>> voices for both YES and NO.
>> * Support multiple levels of aggregations in structured streaming
>>   * There are plenty of questions on SO regarding this. While I don't
>> think it makes sense for structured streaming if it requires an additional
>> shuffle, there might be another case: group by keys, apply an aggregation,
>> then apply another aggregation on the aggregated result (the grouped keys
>> don't change; a sketch of this query shape follows at the end of this message)
>>
>> On Mon, Oct 22, 2018 at 12:25 PM, Jungtaek Lim  wrote:
>>
>>> Yeah, the main intention of this thread is to collect interest in a
>>> possible feature list for structured streaming. From what I can see in the
>>> Spark community, most of the discussions as well as contributions are about
>>> SQL, and I'd like to see similar activity and effort on structured
>>> streaming.
>>> (Unfortunately there is less effort spent reviewing others' work - design
>>> docs as well as pull requests - most effort seems to go into people's own
>>> work.)
>>>
>>> I respect the role of PMC members, so the final decision would be up to
>>> them, but contributors as well as end users could show interest and
>>> discuss requirements on a SPIP, which could be good background to persuade
>>> PMC members.
>>>
>>> Before going deep, I guess we could use this thread to discuss possible
>>> use cases, and if we would like to move forward on an individual item we
>>> could initiate (or resurrect) its own discussion thread.
>>>
>>> For queryable state, at least, there seems to be no workaround in Spark that
>>> provides something similar, especially as state gets bigger. I may have some
>>> concerns about the details, but I'll add my thoughts on the discussion thread.
>>>
>>> - Jungtaek Lim
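
Here is a sketch of the two-level aggregation shape described in the bullet above, to make the request concrete. The column names and the rate-source input are assumptions for illustration; as the comments note, Spark's unsupported-operation checker rejects this plan today, which is exactly the gap being discussed.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object MultiAggSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("multi-agg-sketch").master("local[2]").getOrCreate()
    import spark.implicits._

    // Made-up streaming input with a timestamp, a key, and a value.
    val events = spark.readStream.format("rate").load()
      .select($"timestamp".as("ts"), ($"value" % 10).as("key"), $"value")

    // First streaming aggregation: per key and per minute.
    val perMinute = events
      .withWatermark("ts", "10 minutes")
      .groupBy($"key", window($"ts", "1 minute"))
      .agg(sum($"value").as("per_minute_sum"))

    // Second aggregation on the aggregated result, same grouping key.
    // At the time of this thread, Spark's unsupported-operation checker rejects
    // multiple streaming aggregations in one query, so start() below fails at
    // analysis; supporting this shape is what the bullet above asks about.
    val reAggregated = perMinute
      .groupBy($"key")
      .agg(max($"per_minute_sum").as("max_per_minute"))

    reAggregated.writeStream.outputMode("complete").format("console").start()
    spark.streams.awaitAnyTermination()
  }
}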

Re: What if anything to fix about k8s for the 2.4.0 RC5?

2018-10-26 Thread Stavros Kontopoulos
Sean,

Yes, I updated the PR and re-ran it.

On Fri, Oct 26, 2018 at 2:54 AM, Sean Owen  wrote:

> Yep, we're going to merge a change to separate the k8s tests into a
> separate profile, and fix up the Scala 2.12 thing. While non-critical those
> are pretty nice to have for 2.4. I think that's doable within the next 12
> hours even.
>
> @skonto I think there's one last minor thing needed on this PR?
> https://github.com/apache/spark/pull/22838/files#r228363727
>
> On Thu, Oct 25, 2018 at 6:42 PM Wenchen Fan  wrote:
>
>> Any updates on this topic? https://github.com/apache/spark/pull/22827 is
>> merged and 2.4 is unblocked.
>>
>> I'll cut RC5 shortly after the weekend, and it will be great to include
>> the change proposed here.
>>
>> Thanks,
>> Wenchen
>>
>> On Fri, Oct 26, 2018 at 12:55 AM Stavros Kontopoulos <
>> stavros.kontopou...@lightbend.com> wrote:
>>
>>> I think it's worth getting in a change to just not enable this module,
>>>> which ought to be entirely safe, and avoid two of the issues we
>>>> identified.
>>>>
>>>
>>> Besides disabling it, when someone wants to run the tests with 2.12 he
>>> should be able to do so. So propagating the Scala profile still makes sense
>>> but it is not related to the release other than making sure things work
>>> fine.
>>>
>>> On Thu, Oct 25, 2018 at 7:02 PM, Sean Owen  wrote:
>>>
>>>> I think it's worth getting in a change to just not enable this module,
>>>> which ought to be entirely safe, and avoid two of the issues we
>>>> identified.
>>>> that said it didn't block RC4 so need not block RC5.
>>>> But should happen today if we're doing it.
>>>> On Thu, Oct 25, 2018 at 10:47 AM Xiao Li  wrote:
>>>> >
>>>> > Hopefully, this will not delay RC5. Since this is not a blocker
>>>> ticket, RC5 will start if all the blocker tickets are resolved.
>>>> >
>>>> > Thanks,
>>>> >
>>>> > Xiao
>>>> >
>>>> > On Thu, Oct 25, 2018 at 8:44 AM, Sean Owen  wrote:
>>>> >>
>>>> >> Yes, I agree, and perhaps you are best placed to do that for 2.4.0
>>>> RC5 :)
>>>> >>
>>>> >> On Thu, Oct 25, 2018 at 10:41 AM Stavros Kontopoulos
>>>> >>  wrote:
>>>> >> >
>>>> >> > I agree these tests should be manual for now but should be run
>>>> somehow before a release to make sure things are working right?
>>>> >> >
>>>> >> > For the other issue: https://issues.apache.org/
>>>> jira/browse/SPARK-25835 .
>>>> >> >
>>>> >> >
>>>> >> > On Thu, Oct 25, 2018 at 6:29 PM, Stavros Kontopoulos <
>>>> stavros.kontopou...@lightbend.com> wrote:
>>>> >> >>
>>>> >> >> I will open a jira for the profile propagation issue and have a
>>>> look to fix it.
>>>> >> >>
>>>> >> >> Stavros
>>>> >> >>
>>>> >> >> On Thu, Oct 25, 2018 at 6:16 PM, Erik Erlandson <
>>>> eerla...@redhat.com> wrote:
>>>> >> >>>
>>>> >> >>>
>>>> >> >>> I would be comfortable making the integration testing manual for
>>>> now.  A JIRA for ironing out how to make it reliable for automatic as a
>>>> goal for 3.0 seems like a good idea.
>>>> >> >>>
>>>> >> >>> On Thu, Oct 25, 2018 at 8:11 AM Sean Owen 
>>>> wrote:
>>>> >> >>>>
>>>> >> >>>> Forking this thread.
>>>> >> >>>>
>>>> >> >>>> Because we'll have another RC, we could possibly address these
>>>> two
>>>> >> >>>> issues. Only if we have a reliable change of course.
>>>> >> >>>>
>>>> >> >>>> Is it easy enough to propagate the -Pscala-2.12 profile? can't
>>>> hurt.
>>>> >> >>>>
>>>> >> >>>> And is it reasonable to essentially 'disable'
>>>> >> >>>> kubernetes/integration-tests by removing it from the kubernetes
>>>> >> >>>> profile? it doesn't mean it goes away, just means it's run
>>>> 

Re: What if anything to fix about k8s for the 2.4.0 RC5?

2018-10-25 Thread Stavros Kontopoulos
>
> I think it's worth getting in a change to just not enable this module,
> which ought to be entirely safe, and avoid two of the issues we
> identified.
>

Besides disabling it, when someone wants to run the tests with 2.12 he
should be able to do so. So propagating the Scala profile still makes sense
but it is not related to the release other than making sure things work
fine.

On Thu, Oct 25, 2018 at 7:02 PM, Sean Owen  wrote:

> I think it's worth getting in a change to just not enable this module,
> which ought to be entirely safe, and avoid two of the issues we
> identified.
> that said it didn't block RC4 so need not block RC5.
> But should happen today if we're doing it.
> On Thu, Oct 25, 2018 at 10:47 AM Xiao Li  wrote:
> >
> > Hopefully, this will not delay RC5. Since this is not a blocker ticket,
> RC5 will start if all the blocker tickets are resolved.
> >
> > Thanks,
> >
> > Xiao
> >
> >> On Thu, Oct 25, 2018 at 8:44 AM, Sean Owen  wrote:
> >>
> >> Yes, I agree, and perhaps you are best placed to do that for 2.4.0 RC5
> :)
> >>
> >> On Thu, Oct 25, 2018 at 10:41 AM Stavros Kontopoulos
> >>  wrote:
> >> >
> >> > I agree these tests should be manual for now but should be run
> somehow before a release to make sure things are working right?
> >> >
> >> > For the other issue: https://issues.apache.org/
> jira/browse/SPARK-25835 .
> >> >
> >> >
> >> > On Thu, Oct 25, 2018 at 6:29 PM, Stavros Kontopoulos <
> stavros.kontopou...@lightbend.com> wrote:
> >> >>
> >> >> I will open a jira for the profile propagation issue and have a look
> to fix it.
> >> >>
> >> >> Stavros
> >> >>
> >> >> On Thu, Oct 25, 2018 at 6:16 PM, Erik Erlandson 
> wrote:
> >> >>>
> >> >>>
> >> >>> I would be comfortable making the integration testing manual for
> now.  A JIRA for ironing out how to make it reliable for automatic as a
> goal for 3.0 seems like a good idea.
> >> >>>
> >> >>> On Thu, Oct 25, 2018 at 8:11 AM Sean Owen  wrote:
> >> >>>>
> >> >>>> Forking this thread.
> >> >>>>
> >> >>>> Because we'll have another RC, we could possibly address these two
> >> >>>> issues. Only if we have a reliable change of course.
> >> >>>>
> >> >>>> Is it easy enough to propagate the -Pscala-2.12 profile? can't
> hurt.
> >> >>>>
> >> >>>> And is it reasonable to essentially 'disable'
> >> >>>> kubernetes/integration-tests by removing it from the kubernetes
> >> >>>> profile? it doesn't mean it goes away, just means it's run
> manually,
> >> >>>> not automatically. Is that actually how it's meant to be used
> anyway?
> >> >>>> in the short term? given the discussion around its requirements and
> >> >>>> minikube and all that?
> >> >>>>
> >> >>>> (Actually, this would also 'solve' the Scala 2.12 build problem
> too)
> >> >>>>
> >> >>>> On Tue, Oct 23, 2018 at 2:45 PM Sean Owen 
> wrote:
> >> >>>> >
> >> >>>> > To be clear I'm currently +1 on this release, with much
> commentary.
> >> >>>> >
> >> >>>> > OK, the explanation for kubernetes tests makes sense. Yes I
> think we need to propagate the scala-2.12 build profile to make it work. Go
> for it, if you have a lead on what the change is.
> >> >>>> > This doesn't block the release as it's an issue for tests, and
> only affects 2.12. However if we had a clean fix for this and there were
> another RC, I'd include it.
> >> >>>> >
> >> >>>> > Dongjoon has a good point about the 
> >> >>>> > spark-kubernetes-integration-tests
> artifact. That doesn't sound like it should be published in this way,
> though, of course, we publish the test artifacts from every module already.
> This is only a bit odd in being a non-test artifact meant for testing. But
> it's special testing! So I also don't think that needs to block a release.
> >> >>>> >
> >> >>>> > This happens because the integration tests module is enabled
> with the 'kubernetes' profile too, and also this output is copied into the
> release tarball at kubernetes/integration-tests/tests. Do we need that in
> a binary release?
> >> >>>> >
> >> >>>> > If these integration tests are meant to be run ad hoc, manually,
> not part of a normal test cycle, then I think we can just not enable it
> with -Pkubernetes. If it is meant to run every time, then it sounds like we
> need a little extra work shown in recent PRs to make that easier, but then,
> this test code should just be the 'test' artifact parts of the kubernetes
> module, no?
> >> >>>>
> >> >>>> 
> -
> >> >>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >> >>>>
> >> >>
> >> >>
> >> >
> >>
> >> -
> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >>
>



-- 
Stavros Kontopoulos

*Senior Software Engineer*
*Lightbend, Inc.*

*p:  +30 6977967274*
*e: stavros.kontopou...@lightbend.com* 


Re: What if anything to fix about k8s for the 2.4.0 RC5?

2018-10-25 Thread Stavros Kontopoulos
I agree these tests should be manual for now but should be run somehow
before a release to make sure things are working right?

For the other issue: https://issues.apache.org/jira/browse/SPARK-25835 .


On Thu, Oct 25, 2018 at 6:29 PM, Stavros Kontopoulos <
stavros.kontopou...@lightbend.com> wrote:

> I will open a jira for the profile propagation issue and have a look to
> fix it.
>
> Stavros
>
> On Thu, Oct 25, 2018 at 6:16 PM, Erik Erlandson 
> wrote:
>
>>
>> I would be comfortable making the integration testing manual for now.  A
>> JIRA for ironing out how to make it reliable for automatic as a goal for
>> 3.0 seems like a good idea.
>>
>> On Thu, Oct 25, 2018 at 8:11 AM Sean Owen  wrote:
>>
>>> Forking this thread.
>>>
>>> Because we'll have another RC, we could possibly address these two
>>> issues. Only if we have a reliable change of course.
>>>
>>> Is it easy enough to propagate the -Pscala-2.12 profile? can't hurt.
>>>
>>> And is it reasonable to essentially 'disable'
>>> kubernetes/integration-tests by removing it from the kubernetes
>>> profile? it doesn't mean it goes away, just means it's run manually,
>>> not automatically. Is that actually how it's meant to be used anyway?
>>> in the short term? given the discussion around its requirements and
>>> minikube and all that?
>>>
>>> (Actually, this would also 'solve' the Scala 2.12 build problem too)
>>>
>>> On Tue, Oct 23, 2018 at 2:45 PM Sean Owen  wrote:
>>> >
>>> > To be clear I'm currently +1 on this release, with much commentary.
>>> >
>>> > OK, the explanation for kubernetes tests makes sense. Yes I think we
>>> need to propagate the scala-2.12 build profile to make it work. Go for it,
>>> if you have a lead on what the change is.
>>> > This doesn't block the release as it's an issue for tests, and only
>>> affects 2.12. However if we had a clean fix for this and there were another
>>> RC, I'd include it.
>>> >
>>> > Dongjoon has a good point about the spark-kubernetes-integration-tests
>>> artifact. That doesn't sound like it should be published in this way,
>>> though, of course, we publish the test artifacts from every module already.
>>> This is only a bit odd in being a non-test artifact meant for testing. But
>>> it's special testing! So I also don't think that needs to block a release.
>>> >
>>> > This happens because the integration tests module is enabled with the
>>> 'kubernetes' profile too, and also this output is copied into the release
>>> tarball at kubernetes/integration-tests/tests. Do we need that in a
>>> binary release?
>>> >
>>> > If these integration tests are meant to be run ad hoc, manually, not
>>> part of a normal test cycle, then I think we can just not enable it with
>>> -Pkubernetes. If it is meant to run every time, then it sounds like we need
>>> a little extra work shown in recent PRs to make that easier, but then, this
>>> test code should just be the 'test' artifact parts of the kubernetes
>>> module, no?
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>
>
>


Re: What if anything to fix about k8s for the 2.4.0 RC5?

2018-10-25 Thread Stavros Kontopoulos
I will open a jira for the profile propagation issue and have a look to fix
it.

Stavros

On Thu, Oct 25, 2018 at 6:16 PM, Erik Erlandson  wrote:

>
> I would be comfortable making the integration testing manual for now.  A
> JIRA for ironing out how to make it reliable for automatic as a goal for
> 3.0 seems like a good idea.
>
> On Thu, Oct 25, 2018 at 8:11 AM Sean Owen  wrote:
>
>> Forking this thread.
>>
>> Because we'll have another RC, we could possibly address these two
>> issues. Only if we have a reliable change of course.
>>
>> Is it easy enough to propagate the -Pscala-2.12 profile? can't hurt.
>>
>> And is it reasonable to essentially 'disable'
>> kubernetes/integration-tests by removing it from the kubernetes
>> profile? it doesn't mean it goes away, just means it's run manually,
>> not automatically. Is that actually how it's meant to be used anyway?
>> in the short term? given the discussion around its requirements and
>> minikube and all that?
>>
>> (Actually, this would also 'solve' the Scala 2.12 build problem too)
>>
>> On Tue, Oct 23, 2018 at 2:45 PM Sean Owen  wrote:
>> >
>> > To be clear I'm currently +1 on this release, with much commentary.
>> >
>> > OK, the explanation for kubernetes tests makes sense. Yes I think we
>> need to propagate the scala-2.12 build profile to make it work. Go for it,
>> if you have a lead on what the change is.
>> > This doesn't block the release as it's an issue for tests, and only
>> affects 2.12. However if we had a clean fix for this and there were another
>> RC, I'd include it.
>> >
>> > Dongjoon has a good point about the spark-kubernetes-integration-tests
>> artifact. That doesn't sound like it should be published in this way,
>> though, of course, we publish the test artifacts from every module already.
>> This is only a bit odd in being a non-test artifact meant for testing. But
>> it's special testing! So I also don't think that needs to block a release.
>> >
>> > This happens because the integration tests module is enabled with the
>> 'kubernetes' profile too, and also this output is copied into the release
>> tarball at kubernetes/integration-tests/tests. Do we need that in a
>> binary release?
>> >
>> > If these integration tests are meant to be run ad hoc, manually, not
>> part of a normal test cycle, then I think we can just not enable it with
>> -Pkubernetes. If it is meant to run every time, then it sounds like we need
>> a little extra work shown in recent PRs to make that easier, but then, this
>> test code should just be the 'test' artifact parts of the kubernetes
>> module, no?
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-23 Thread Stavros Kontopoulos
+1 (non-binding). Ran the k8s tests with Scala 2.12. Also included the
RTestsSuite (mentioned by Ilan) although not part of the 2.4 rc tag:

[INFO] --- scalatest-maven-plugin:1.0:test (integration-test) @
spark-kubernetes-integration-tests_2.12 ---
Discovery starting.
Discovery completed in 239 milliseconds.
Run starting. Expected test count is: 15
KubernetesSuite:
- Run SparkPi with no resources
- Run SparkPi with a very long application name.
- Use SparkLauncher.NO_RESOURCE
- Run SparkPi with a master URL without a scheme.
- Run SparkPi with an argument.
- Run SparkPi with custom labels, annotations, and environment variables.
- Run extraJVMOptions check on driver
- Run SparkRemoteFileTest using a remote data file
- Run SparkPi with env and mount secrets.
- Run PySpark on simple pi.py example
- Run PySpark with Python2 to test a pyfiles example
- Run PySpark with Python3 to test a pyfiles example
- Run PySpark with memory customization
- Run in client mode.
- Run SparkR on simple dataframe.R example
Run completed in 6 minutes, 32 seconds.
Total number of tests run: 15
Suites: completed 2, aborted 0
Tests: succeeded 15, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
[INFO]

[INFO] Reactor Summary:
[INFO]
[INFO] Spark Project Parent POM 2.4.0 . SUCCESS [
4.480 s]
[INFO] Spark Project Tags . SUCCESS [
3.898 s]
[INFO] Spark Project Local DB . SUCCESS [
2.773 s]
[INFO] Spark Project Networking ... SUCCESS [
5.063 s]
[INFO] Spark Project Shuffle Streaming Service  SUCCESS [
2.651 s]
[INFO] Spark Project Unsafe ... SUCCESS [
2.662 s]
[INFO] Spark Project Launcher . SUCCESS [
5.103 s]
[INFO] Spark Project Core . SUCCESS [
25.703 s]
[INFO] Spark Project Kubernetes Integration Tests 2.4.0 ... SUCCESS [06:51
min]
[INFO]

[INFO] BUILD SUCCESS
[INFO]

[INFO] Total time: 07:44 min
[INFO] Finished at: 2018-10-23T19:09:41Z
[INFO]


Stavros

On Tue, Oct 23, 2018 at 9:46 PM, Dongjoon Hyun 
wrote:

> BTW, for that integration suite, I saw the related artifacts in the RC4
> staging directory.
>
> Does Spark 2.4.0 need to start to release these `spark-kubernetes
> -integration-tests` artifacts?
>
>- https://repository.apache.org/content/repositories/
>orgapachespark-1290/org/apache/spark/spark-kubernetes-
>integration-tests_2.11/
>
> <https://repository.apache.org/content/repositories/orgapachespark-1290/org/apache/spark/spark-kubernetes-integration-tests_2.11/>
>- https://repository.apache.org/content/repositories/
>orgapachespark-1290/org/apache/spark/spark-kubernetes-
>integration-tests_2.12/
>
> <https://repository.apache.org/content/repositories/orgapachespark-1290/org/apache/spark/spark-kubernetes-integration-tests_2.12/>
>
> Historically, Spark released `spark-docker-integration-tests` at Spark
> 1.6.x era and stopped since Spark 2.0.0.
>
>- http://central.maven.org/maven2/org/apache/spark/spark-
>docker-integration-tests_2.10/
>- http://central.maven.org/maven2/org/apache/spark/spark-
>docker-integration-tests_2.11/
>
>
> Bests,
> Dongjoon.
>
> On Tue, Oct 23, 2018 at 11:43 AM Stavros Kontopoulos  lightbend.com> wrote:
>
>> Sean,
>>
>> Ok makes sense, im using a cloned repo. I built with Scala 2.12 profile
>> using the related tag v2.4.0-rc4:
>>
>> ./dev/change-scala-version.sh 2.12
>> ./dev/make-distribution.sh  --name test --r --tgz -Pscala-2.12 -Psparkr
>> -Phadoop-2.7 -Pkubernetes -Phive
>> Pushed images to dockerhub (previous email) since I didnt use the
>> minikube daemon (default behavior).
>>
>> Then run tests successfully against minikube:
>>
>> TGZ_PATH=$(pwd)/spark-2.4.0-bin-test.gz
>> cd resource-managers/kubernetes/integration-tests
>>
>> ./dev/dev-run-integration-tests.sh --spark-tgz $TGZ_PATH
>> --service-account default --namespace default --image-tag k8s-scala-12 
>> --image-repo
>> skonto
>>
>>
>> [INFO]
>> [INFO] --- scalatest-maven-plugin:1.0:test (integration-test) @
>> spark-kubernetes-integration-tests_2.12 ---
>> Discovery starting.
>> Discovery completed in 229 milliseconds.
>> Run starting. Expected test count is: 14
>> KubernetesSuite:
>> - Run SparkPi with no resources
>> - Run SparkPi with a very long application name.
>> - Use SparkLauncher.NO_RESO

Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-23 Thread Stavros Kontopoulos
it still needs to work in this context.
>>>> >
>>>> > I just added -Pkubernetes to my build and didn't do anything else. I
>>>> think the ideal is that a "mvn -P... -P... install" to work from a source
>>>> release; that's a good expectation and consistent with docs.
>>>> >
>>>> > Maybe these tests simply don't need to run with the normal suite of
>>>> tests, and can be considered tests run manually by developers running these
>>>> scripts? Basically, KubernetesSuite shouldn't run in a normal mvn install?
>>>> >
>>>> > I don't think this has to block the release even if so, just trying
>>>> to get to the bottom of it.
>>>>
>>>> -
>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>
>>>>


-- 
Stavros Kontopoulos

*Senior Software Engineer*
*Lightbend, Inc.*

*p:  +30 6977967274*
*e: stavros.kontopou...@lightbend.com* 


Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-23 Thread Stavros Kontopoulos
Sean,

I will try it against 2.12 shortly.

You're saying someone would have to first build a k8s distro from source
> too?


OK, I missed the error one line above; before the distro error there is
another one:

fatal: not a git repository (or any of the parent directories): .git


So that seems to come from here
<https://github.com/apache/spark/blob/master/resource-managers/kubernetes/integration-tests/scripts/setup-integration-test-env.sh#L19>.
It seems that the test root is not set up correctly. It should be the top
git dir from which you built Spark.

Now regarding the distro thing. dev-run-integration-tests.sh should run
from within the cloned project after the distro is built. The distro is
required
<https://github.com/apache/spark/blob/master/resource-managers/kubernetes/integration-tests/scripts/setup-integration-test-env.sh#L61>
, it should fail otherwise.

Integration tests run the setup-integration-test-env.sh script.
dev-run-integration-tests.sh
calls mvn
<https://github.com/apache/spark/blob/master/resource-managers/kubernetes/integration-tests/dev/dev-run-integration-tests.sh#L106>
which
in turn executes that setup script
<https://github.com/apache/spark/blob/master/resource-managers/kubernetes/integration-tests/pom.xml#L80>
.

How do you run the tests?

Stavros

On Tue, Oct 23, 2018 at 3:01 PM, Sean Owen  wrote:

> No, because the docs are built into the release too and released to
> the site too from the released artifact.
> As a practical matter, I think these docs are not critical for
> release, and can follow in a maintenance release. I'd retarget to
> 2.4.1 or untarget.
> I do know at times a release's docs have been edited after the fact,
> but that's bad form. We'd not go change a class in the release after
> it was released and call it the same release.
>
> I'd still like some confirmation that someone can build and pass tests
> with -Pkubernetes, maybe? It actually all passed with the 2.11 build.
> I don't think it's a 2.12 incompatibility, but rather than the K8S
> tests maybe don't quite work with the 2.12 build artifact naming. Or
> else something to do with my env.
>
> On Mon, Oct 22, 2018 at 9:08 PM Wenchen Fan  wrote:
> >
> > Regarding the doc tickets, I vaguely remember that we can merge doc PRs
> after release and publish doc to spark website later. Can anyone confirm?
> >
> > On Tue, Oct 23, 2018 at 8:30 AM Sean Owen  wrote:
> >>
> >> This is what I got from a straightforward build of the source distro
> >> here ... really, ideally, it builds as-is from source. You're saying
> >> someone would have to first build a k8s distro from source too?
> >> It's not a 'must' that this be automatic but nothing else fails out of
> the box.
> >> I feel like I might be misunderstanding the setup here.
> >> On Mon, Oct 22, 2018 at 7:25 PM Stavros Kontopoulos
> >>  wrote:
>



-- 
Stavros Kontopoulos

*Senior Software Engineer*
*Lightbend, Inc.*

*p:  +30 6977967274*
*e: stavros.kontopou...@lightbend.com* 


Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-22 Thread Stavros Kontopoulos
> tar (child): Error is not recoverable: exiting now
> tar: Child returned status 2
> tar: Error is not recoverable: exiting now
> scripts/setup-integration-test-env.sh: line 85:
> /home/srowen/spark-2.4.0/resource-managers/kubernetes/
> integration-tests/target/spark-dist-unpacked/bin/docker-image-tool.sh:


It seems you are missing the distro file... here is how I run it locally:

DOCKER_USERNAME=...
SPARK_K8S_IMAGE_TAG=...

./dev/make-distribution.sh --name test --tgz -Phadoop-2.7 -Pkubernetes
-Phive
tar -zxvf spark-2.4.0-SNAPSHOT-bin-test.tgz
cd spark-2.4.0-SNAPSHOT-bin-test
./bin/docker-image-tool.sh -r $DOCKER_USERNAME -t $SPARK_K8S_IMAGE_TAG build
cd ..
TGZ_PATH=$(pwd)/spark-2.4.0-SNAPSHOT-bin-test.tgz
cd resource-managers/kubernetes/integration-tests
./dev/dev-run-integration-tests.sh --image-tag $SPARK_K8S_IMAGE_TAG
--spark-tgz $TGZ_PATH --image-repo $DOCKER_USERNAME

Stavros

On Tue, Oct 23, 2018 at 1:54 AM, Sean Owen  wrote:

> Provisionally looking good to me, but I had a few questions.
>
> We have these open for 2.4, but I presume they aren't actually going
> to be in 2.4 and should be untargeted:
>
> SPARK-25507 Update documents for the new features in 2.4 release
> SPARK-25179 Document the features that require Pyarrow 0.10
> SPARK-25783 Spark shell fails because of jline incompatibility
> SPARK-25347 Document image data source in doc site
> SPARK-25584 Document libsvm data source in doc site
> SPARK-25346 Document Spark builtin data sources
> SPARK-24464 Unit tests for MLlib's Instrumentation
> SPARK-23197 Flaky test: spark.streaming.ReceiverSuite.
> "receiver_life_cycle"
> SPARK-22809 pyspark is sensitive to imports with dots
> SPARK-21030 extend hint syntax to support any expression for Python and R
>
> Comments in several of the doc issues suggest they are needed for 2.4
> though. How essential?
>
> (Brief digression: SPARK-21030 is an example of a pattern I see
> sometimes. Parent Epic A is targeted for version X. Children B and C
> are not. Epic A's description is basically "do X and Y". Is the parent
> helping? And now that Y is done, is there a point in tracking X with
> two JIRAs? can I just close the Epic?)
>
> I am not sure I've tried running K8S in my test runs before, but I get
> this on my Linux machine:
>
> [INFO] --- exec-maven-plugin:1.4.0:exec (setup-integration-test-env) @
> spark-kubernetes-integration-tests_2.12 ---
> fatal: not a git repository (or any of the parent directories): .git
> tar (child): --strip-components=1: Cannot open: No such file or directory
> tar (child): Error is not recoverable: exiting now
> tar: Child returned status 2
> tar: Error is not recoverable: exiting now
> scripts/setup-integration-test-env.sh: line 85:
> /home/srowen/spark-2.4.0/resource-managers/kubernetes/
> integration-tests/target/spark-dist-unpacked/bin/docker-image-tool.sh:
> No such file or directory
> /home/srowen/spark-2.4.0/resource-managers/kubernetes/integration-tests
> [INFO]
> [INFO] --- scalatest-maven-plugin:1.0:test (integration-test) @
> spark-kubernetes-integration-tests_2.12 ---
> Discovery starting.
> Discovery completed in 289 milliseconds.
> Run starting. Expected test count is: 14
> KubernetesSuite:
> org.apache.spark.deploy.k8s.integrationtest.KubernetesSuite *** ABORTED
> ***
>   java.lang.NullPointerException:
>   at org.apache.spark.deploy.k8s.integrationtest.
> KubernetesSuite.beforeAll(KubernetesSuite.scala:92)
>   at org.scalatest.BeforeAndAfterAll.liftedTree1$
> 1(BeforeAndAfterAll.scala:212)
>   at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
>   at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
>   at org.apache.spark.deploy.k8s.integrationtest.KubernetesSuite.org
> $scalatest$BeforeAndAfter$$super$run(KubernetesSuite.scala:39)
>   at org.scalatest.BeforeAndAfter.run(BeforeAndAfter.scala:258)
>   at org.scalatest.BeforeAndAfter.run$(BeforeAndAfter.scala:256)
>   at org.apache.spark.deploy.k8s.integrationtest.KubernetesSuite.run(
> KubernetesSuite.scala:39)
>   at org.scalatest.Suite.callExecuteOnSuite$1(Suite.scala:1210)
>   at org.scalatest.Suite.$anonfun$runNestedSuites$1(Suite.scala:1257)
>   ...
>
> Clearly it's expecting something about the env that isn't true, but I
> don't know if it's a problem with those expectations versus what is in
> the source release, or, just something to do with my env. This is with
> Scala 2.12.
>
>
>
> On Mon, Oct 22, 2018 at 12:42 PM Wenchen Fan  wrote:
> >
> > Please vote on releasing the following candidate as Apache Spark version
> 2.4.0.
> >
> > The vote is open until October 26 PST and passes if a majority +1 PMC
> votes are cast, with
> > a minimum of 3 +1 votes.
> >
> > [ ] +1 Release this package as Apache Spark 2.4.0
> > [ ] -1 Do not release this package because ...
> >
> > To learn more about Apache Spark, please see http://spark.apache.org/
> >
> > The tag to be voted on is v2.4.0-rc4 (commit
> e69e2bfa486d8d3b9d203b96ca9c0f37c2b6cabe):
> > 

Re: Plan on Structured Streaming in next major/minor release?

2018-10-21 Thread Stavros Kontopoulos
Hi Jungtaek,

I just tried to start the discussion on the dev list a long time ago.
I enumerated some use cases, as Michael proposed here
<http://mail-archives.apache.org/mod_mbox/spark-dev/201712.mbox/%3CCACTd3c_snT=y4r9vod+ebty1fdgtqsxzgjgubox-k8araur...@mail.gmail.com%3E>.
The discussion didn't go further.

If people find it useful we should start discussing it in detail again.

Stavros

On Sun, Oct 21, 2018 at 4:54 PM, Jungtaek Lim  wrote:

> Stavros, if my memory is right, you were trying to drive queryable state,
> right?
>
> Could you summarize the progress and the reason why it stopped?
>
> On Sun, Oct 21, 2018 at 10:27 PM, Stavros Kontopoulos <
> stavros.kontopou...@lightbend.com> wrote:
>
>> That is a very interesting list, thanks. I could create a design doc as a
>> starting point for discussion if this is a feature we would like to have.
>>
>> Regards,
>> Stavros
>>
>> On Sun, Oct 21, 2018 at 3:04 PM, JackyLee  wrote:
>>
>>> Thanks for raising them.
>>>
>>> FYI, I believe this open issue could also be considered:
>>>
>>> https://issues.apache.org/jira/browse/SPARK-24630
>>> <https://issues.apache.org/jira/browse/SPARK-24630>
>>>
>>> A new ability to express Structured Streaming in pure SQL.
>>>
>>>
>>>
>>> --
>>> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>
>>
>>
>>
>>


Re: Plan on Structured Streaming in next major/minor release?

2018-10-21 Thread Stavros Kontopoulos
That is a very interesting list, thanks. I could create a design doc as a
starting point for discussion if this is a feature we would like to have.

Regards,
Stavros

On Sun, Oct 21, 2018 at 3:04 PM, JackyLee  wrote:

> Thanks for raising them.
>
> FYI, I believe this open issue could also be considered:
>
> https://issues.apache.org/jira/browse/SPARK-24630
> 
>
> A new ability to express Structured Streaming in pure SQL.
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [DISCUSS][K8S] Local dependencies with Kubernetes

2018-10-05 Thread Stavros Kontopoulos
@Marcelo is correct. Mesos does not have something similar. Only YARN does,
due to its distributed cache. I have described most of the above in the JIRA;
there are also some other options there.

Best,
Stavros

On Fri, Oct 5, 2018 at 8:28 PM, Marcelo Vanzin 
wrote:

> On Fri, Oct 5, 2018 at 7:54 AM Rob Vesse  wrote:
> > Ideally this would all just be handled automatically for users in the
> way that all other resource managers do
>
> I think you're giving other resource managers too much credit. In
> cluster mode, only YARN really distributes local dependencies, because
> YARN has that feature (its distributed cache) and Spark just uses it.
>
> Standalone doesn't do it (see SPARK-4160) and I don't remember seeing
> anything similar on the Mesos side.
>
> There are things that could be done; e.g. if you have HDFS you could
> do a restricted version of what YARN does (upload files to HDFS, and
> change the "spark.jars" and "spark.files" URLs to point to HDFS
> instead). Or you could turn the submission client into a file server
> that the cluster-mode driver downloads files from - although that
> requires connectivity from the driver back to the client.
>
> Neither is great, but better than not having that feature.
>
> Just to be clear: in client mode things work right? (Although I'm not
> really familiar with how client mode works in k8s - never tried it.)
>
> --
> Marcelo
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>
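
To make the HDFS option described above concrete, here is a rough sketch of what a submission-time staging step could look like. This helper does not exist in Spark or spark-submit; the staging-directory layout and the names are assumptions for illustration only.

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical sketch of the "upload to HDFS, then rewrite spark.jars/spark.files"
// idea mentioned above.
object HdfsStagingSketch {
  def stageLocalDeps(localPaths: Seq[String], stagingDir: String): Seq[String] = {
    val fs = FileSystem.get(new URI(stagingDir), new Configuration())
    fs.mkdirs(new Path(stagingDir))
    localPaths.map { local =>
      val src = new Path(local)
      val dst = new Path(stagingDir, src.getName)
      // Keep the local copy, overwrite any previous upload.
      fs.copyFromLocalFile(false, true, src, dst)
      dst.toString // e.g. an hdfs:// URI the driver and executors can both read
    }
  }
}
// Usage idea: rewrite spark.jars / spark.files to the returned HDFS URIs
// before building the driver pod spec.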


Re: [DISCUSS][K8S] Local dependencies with Kubernetes

2018-10-05 Thread Stavros Kontopoulos
Hi Rob,

Interesting topic, and it affects UX a lot. I provided my thoughts in the
related JIRA.

Best,
Stavros

On Fri, Oct 5, 2018 at 5:53 PM, Rob Vesse  wrote:

> Folks
>
>
>
> One of the big limitations of the current Spark on K8S implementation is
> that it isn’t possible to use local dependencies (SPARK-23153 [1]) i.e.
> code, JARs, data etc that only lives on the submission client.  This
> basically leaves end users with several options on how to actually run
> their Spark jobs under K8S:
>
>
>
>1. Store local dependencies on some external distributed file system
>e.g. HDFS
>2. Build custom images with their local dependencies
>3. Mount local dependencies into volumes that are mounted by the K8S
>pods
>
>
>
> In all cases the onus is on the end user to do the prep work.  Option 1 is
> unfortunately rare in the environments we’re looking to deploy Spark and
> Option 2 tends to be a non-starter as many of our customers whitelist
> approved images i.e. custom images are not permitted.
>
>
>
> Option 3 is more workable but still requires the users to provide a bunch
> of extra config options to configure this for simple cases or rely upon the
> pending pod template feature for complex cases.
>
>
>
> Ideally this would all just be handled automatically for users in the way
> that all other resource managers do, the K8S backend even did this at one
> point in the downstream fork but after a long discussion [2] this got
> dropped in favour of using Spark standard mechanisms i.e. spark-submit.
> Unfortunately this apparently was never followed through upon as it doesn’t
> work with master as of today.  Moreover I am unclear how this would work in
> the case of Spark on K8S cluster mode where the driver itself is inside a
> pod since the spark-submit mechanism is based upon copying from the drivers
> filesystem to the executors via a file server on the driver, if the driver
> is inside a pod it won’t be able to see local files on the submission
> client.  I think this may work out of the box with client mode but I
> haven’t dug into that enough to verify yet.
>
>
>
> I would like to start work on addressing this problem but to be honest I
> am unclear where to start with this.  It seems using the standard
> spark-submit mechanism is the way to go but I’m not sure how to get around
> the driver pod issue.  I would appreciate any pointers from folks who’ve
> looked at this previously on how and where to start on this.
>
>
>
> Cheers,
>
>
>
> Rob
>
>
>
> [1] https://issues.apache.org/jira/browse/SPARK-23153
>
> [2] https://lists.apache.org/thread.html/82b4ae9a2eb5ddeb3f7240ebf154f0
> 6f19b830f8b3120038e5d687a1@%3Cdev.spark.apache.org%3E
>


Re: welcome a new batch of committers

2018-10-03 Thread Stavros Kontopoulos
Congrats!

On Wednesday, October 3, 2018, sujith chacko 
wrote:

> Great news Congrats all for achieving the feat !!!
>
> On Wed, 3 Oct 2018 at 2:29 PM, Reynold Xin  wrote:
>
>> Hi all,
>>
>> The Apache Spark PMC has recently voted to add several new committers to
>> the project, for their contributions:
>>
>> - Shane Knapp (contributor to infra)
>> - Dongjoon Hyun (contributor to ORC support and other parts of Spark)
>> - Kazuaki Ishizaki (contributor to Spark SQL)
>> - Xingbo Jiang (contributor to Spark Core and SQL)
>> - Yinan Li (contributor to Spark on Kubernetes)
>> - Takeshi Yamamuro (contributor to Spark SQL)
>>
>> Please join me in welcoming them!
>>
>>


Re: Python friendly API for Spark 3.0

2018-09-29 Thread Stavros Kontopoulos
Regarding the Python 3.x upgrade referenced earlier: some people have already
gone down that path:

https://blogs.dropbox.com/tech/2018/09/how-we-rolled-out-one-of-the-largest-python-3-migrations-ever

They describe some good reasons.

Stavros

On Tue, Sep 18, 2018 at 6:35 PM, Erik Erlandson  wrote:

> I like the notion of empowering cross platform bindings.
>
> The trend of computing frameworks seems to be that all APIs gradually
> converge on a stable attractor which could be described as "data frames and
> SQL"  Spark's early API design was RDD focused, but these days the center
> of gravity is all about DataFrame (Python's prevalence combined with its
> lack of a static type system substantially dilutes the benefits of DataSet,
> for any library development that aspires to both JVM and python support).
>
> I can imagine optimizing the developer layers of Spark APIs so that cross
> platform support and also 3rd-party support for new and existing Spark
> bindings would be maximized for "parallelizable dataframe+SQL"  Another of
> Spark's strengths is it's ability to federate heterogeneous data sources,
> and making cross platform bindings easy for that is desirable.
>
>
> On Sun, Sep 16, 2018 at 1:02 PM, Mark Hamstra 
> wrote:
>
>> It's not splitting hairs, Erik. It's actually very close to something
>> that I think deserves some discussion (perhaps on a separate thread.) What
>> I've been thinking about also concerns API "friendliness" or style. The
>> original RDD API was very intentionally modeled on the Scala parallel
>> collections API. That made it quite friendly for some Scala programmers,
>> but not as much so for users of the other language APIs when they
>> eventually came about. Similarly, the Dataframe API drew a lot from pandas
>> and R, so it is relatively friendly for those used to those abstractions.
>> Of course, the Spark SQL API is modeled closely on HiveQL and standard SQL.
>> The new barrier scheduling draws inspiration from MPI. With all of these
>> models and sources of inspiration, as well as multiple language targets,
>> there isn't really a strong sense of coherence across Spark -- I mean, even
>> though one of the key advantages of Spark is the ability to do within a
>> single framework things that would otherwise require multiple frameworks,
>> actually doing that is requiring more than one programming style or
>> multiple design abstractions more than what is strictly necessary even when
>> writing Spark code in just a single language.
>>
>> For me, that raises questions over whether we want to start designing,
>> implementing and supporting APIs that are designed to be more consistent,
>> friendly and idiomatic to particular languages and abstractions -- e.g. an
>> API covering all of Spark that is designed to look and feel as much like
>> "normal" code for a Python programmer, another that looks and feels more
>> like "normal" Java code, another for Scala, etc. That's a lot more work and
>> support burden than the current approach where sometimes it feels like you
>> are writing "normal" code for your prefered programming environment, and
>> sometimes it feels like you are trying to interface with something foreign,
>> but underneath it hopefully isn't too hard for those writing the
>> implementation code below the APIs, and it is not too hard to maintain
>> multiple language bindings that are each fairly lightweight.
>>
>> It's a cost-benefit judgement, of course, whether APIs that are heavier
>> (in terms of implementing and maintaining) and friendlier (for end users)
>> are worth doing, and maybe some of these "friendlier" APIs can be done
>> outside of Spark itself (imo, Frameless is doing a very nice job for the
>> parts of Spark that it is currently covering --
>> https://github.com/typelevel/frameless); but what we have currently is a
>> bit too ad hoc and fragmentary for my taste.
>>
>> On Sat, Sep 15, 2018 at 10:33 AM Erik Erlandson 
>> wrote:
>>
>>> I am probably splitting hairs to finely, but I was considering the
>>> difference between improvements to the jvm-side (py4j and the scala/java
>>> code) that would make it easier to write the python layer ("python-friendly
>>> api"), and actual improvements to the python layers ("friendly python api").
>>>
>>> They're not mutually exclusive of course, and both worth working on. But
>>> it's *possible* to improve either without the other.
>>>
>>> Stub files look like a great solution for type annotations, maybe even
>>> if only python 3 is supported.
>>>
>>> I definitely agree that any decision to drop python 2 should not be
>>> taken lightly. Anecdotally, I'm seeing an increase in python developers
>>> announcing that they are dropping support for python 2 (and loving it). As
>>> people have already pointed out, if we don't drop python 2 for spark 3.0,
>>> we're stuck with it until 4.0, which would place spark in a
>>> possibly-awkward position of supporting python 2 for some time after it
>>> goes EOL.
>>>
>>> 

Re: [VOTE] SPARK 2.4.0 (RC2)

2018-09-29 Thread Stavros Kontopoulos
+1

Stavros

On Sat, Sep 29, 2018 at 5:59 AM, Sean Owen  wrote:

> +1, with comments:
>
> There are 5 critical issues for 2.4, and no blockers:
> SPARK-25378 ArrayData.toArray(StringType) assume UTF8String in 2.4
> SPARK-25325 ML, Graph 2.4 QA: Update user guide for new features & APIs
> SPARK-25319 Spark MLlib, GraphX 2.4 QA umbrella
> SPARK-25326 ML, Graph 2.4 QA: Programming guide update and migration guide
> SPARK-25323 ML 2.4 QA: API: Python API coverage
>
> Xiangrui, is SPARK-25378 important enough we need to get it into 2.4?
>
> I found two issues resolved for 2.4.1 that got into this RC, so marked
> them as resolved in 2.4.0.
>
> I checked the licenses and notice and they look correct now in source
> and binary builds.
>
> The 2.12 artifacts are as I'd expect.
>
> I ran all tests for 2.11 and 2.12 and they pass with -Pyarn
> -Pkubernetes -Pmesos -Phive -Phadoop-2.7 -Pscala-2.12.
>
>
>
>
> On Thu, Sep 27, 2018 at 10:00 PM Wenchen Fan  wrote:
> >
> > Please vote on releasing the following candidate as Apache Spark version
> 2.4.0.
> >
> > The vote is open until October 1 PST and passes if a majority +1 PMC
> votes are cast, with
> > a minimum of 3 +1 votes.
> >
> > [ ] +1 Release this package as Apache Spark 2.4.0
> > [ ] -1 Do not release this package because ...
> >
> > To learn more about Apache Spark, please see http://spark.apache.org/
> >
> > The tag to be voted on is v2.4.0-rc2 (commit
> 42f25f309e91c8cde1814e3720099ac1e64783da):
> > https://github.com/apache/spark/tree/v2.4.0-rc2
> >
> > The release files, including signatures, digests, etc. can be found at:
> > https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc2-bin/
> >
> > Signatures used for Spark RCs can be found in this file:
> > https://dist.apache.org/repos/dist/dev/spark/KEYS
> >
> > The staging repository for this release can be found at:
> > https://repository.apache.org/content/repositories/orgapachespark-1287
> >
> > The documentation corresponding to this release can be found at:
> > https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc2-docs/
> >
> > The list of bug fixes going into 2.4.0 can be found at the following URL:
> > https://issues.apache.org/jira/projects/SPARK/versions/2.4.0
> >
> > FAQ
> >
> > =
> > How can I help test this release?
> > =
> >
> > If you are a Spark user, you can help us test this release by taking
> > an existing Spark workload and running on this release candidate, then
> > reporting any regressions.
> >
> > If you're working in PySpark you can set up a virtual env and install
> > the current RC and see if anything important breaks, in the Java/Scala
> > you can add the staging repository to your projects resolvers and test
> > with the RC (make sure to clean up the artifact cache before/after so
> > you don't end up building with a out of date RC going forward).
> >
> > ===
> > What should happen to JIRA tickets still targeting 2.4.0?
> > ===
> >
> > The current list of open tickets targeted at 2.4.0 can be found at:
> > https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 2.4.0
> >
> > Committers should look at those and triage. Extremely important bug
> > fixes, documentation, and API tweaks that impact compatibility should
> > be worked on immediately. Everything else please retarget to an
> > appropriate release.
> >
> > ==
> > But my bug isn't fixed?
> > ==
> >
> > In order to make timely releases, we will typically not hold the
> > release unless the bug in question is a regression from the previous
> > release. That being said, if there is something which is a regression
> > that has not been correctly targeted please ping me or a committer to
> > help target the issue.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [DISCUSS][K8S] Supporting advanced pod customisation

2018-09-19 Thread Stavros Kontopoulos
There is a design document that covers a lot of concerns:
https://docs.google.com/document/d/1pcyH5f610X2jyJW9WbWHnj8jktQPLlbbmmUwdeK4fJk,
validation included.

We had a discussion about validation (validating before we hit the API
server) and it was considered too much. In general, regarding Rob's options, I
prefer option 3, and that was the design's purpose.

For this feature to be completed, it needs to have some minimum
functionality implemented. What that might be depends on what you offer.

For example the current implementation does not allow users to customise
> the volumes used to back SPARK_LOCAL_DIRS to better suit the compute
> environment the K8S
>

It takes time to go through all the properties and align them with Spark, but
for me that is the correct way to do this at the end of the day. Is the above
documented at all, for example for the pod template? Are we missing anything
else? Are we going to cover similar cases with more PRs like
https://github.com/apache/spark/commit/da6fa3828bb824b65f50122a8a0a0d4741551257
?

Option 3 would take quite some time and would add some code to the backend;
I am not sure of its size, though it shouldn't be that much.

Also, I don't think using a pod template is an advanced use of Spark on K8s.
Pod templates are used elsewhere by default, for example in Kubeflow (for
replicas). In addition, there are key properties for normal use cases like
pod affinity configuration, sidecar containers, etc. These come up often in
real deployments. So it depends on who is experiencing what on the UX side.
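
To make the SPARK_LOCAL_DIRS example above concrete, here is a rough sketch of
what an option-3 style building step could look like. This is not the actual
Spark code path; the object name, the helper and the volume name are invented
for illustration. It only shows the idea of inspecting the user template
(fabric8 client model) before adding a volume, instead of blindly appending a
duplicate:

import io.fabric8.kubernetes.api.model.{Pod, PodBuilder}
import scala.collection.JavaConverters._

object LocalDirsStepSketch {
  // Illustrative name only; the real naming convention may differ.
  val localDirVolumeName = "spark-local-dir-1"

  def ensureLocalDirVolume(templatePod: Pod): Pod = {
    val existingVolumes = Option(templatePod.getSpec)
      .flatMap(spec => Option(spec.getVolumes))
      .map(_.asScala.map(_.getName).toSet)
      .getOrElse(Set.empty[String])
    if (existingVolumes.contains(localDirVolumeName)) {
      // The user already customised it (e.g. an SSD-backed volume): keep it as-is.
      templatePod
    } else {
      new PodBuilder(templatePod)
        .editOrNewSpec()
          .addNewVolume()
            .withName(localDirVolumeName)
            .withNewEmptyDir().endEmptyDir()
          .endVolume()
        .endSpec()
        .build()
    }
  }
}

The point is only that the building step becomes aware of a known customisation
point, which is what option 3 asks for.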


Stavros


On Wed, Sep 19, 2018 at 7:41 PM, Erik Erlandson  wrote:

> I can speak somewhat to the current design. Two of the goals for the
> design of this feature are that
> (1) its behavior is easy to reason about
> (2) its implementation in the back-end is light weight
>
> Option 1 was chosen partly because it's behavior is relatively simple to
> describe to a user: "Your template will be taken as the starting point.
> Spark may override a certain small set of fields (documented in a table)
> that are necessary for its internal functioning."
>
> This also keeps the actual back-end implementation relatively light
> weight. It can load the template (which also includes syntax validation)
> into a pod structure, then modify any fields it needs to (per above).
>
>
> On Wed, Sep 19, 2018 at 9:11 AM, Rob Vesse  wrote:
>
>> Hey all
>>
>>
>>
>> For those following the K8S backend you are probably aware of SPARK-24434
>> [1] (and PR 22416 [2]) which proposes a mechanism to allow for advanced pod
>> customisation via pod templates.  This is motivated by the fact that
>> introducing additional Spark configuration properties for each aspect of
>> pod specification a user might wish to customise was becoming unwieldy.
>>
>>
>>
>> However I am concerned that the current implementation doesn’t go far
>> enough and actually limits the utility of the proposed new feature.  The
>> problem stems from the fact that the implementation simply uses the pod
>> template as a base and then Spark attempts to build a pod spec on top of
>> that.  As the code that does this doesn’t do any kind of validation or
>> inspection of the incoming template it is possible to provide a template
>> that causes Spark to generate an invalid pod spec ultimately causing the
>> job to be rejected by Kubernetes.
>>
>>
>>
>> Now clearly Spark code cannot attempt to account for every possible
>> customisation that a user may attempt to make via pod templates nor should
>> it be responsible for ensuring that the user doesn’t start from an invalid
>> template in the first place.  However it seems like we could be more
>> intelligent in how we build our pod specs to avoid generating invalid specs
>> in cases where we have a clear use case for advanced customisation.  For
>> example the current implementation does not allow users to customise the
>> volumes used to back SPARK_LOCAL_DIRS to better suit the compute
>> environment the K8S cluster is running on and trying to do so with a pod
>> template will result in an invalid spec due to duplicate volumes.
>>
>>
>>
>> I think there are a few ways the community could address this:
>>
>>
>>
>>1. Status quo – provide the pod template feature as-is and simply
>>tell users that certain customisations are never supported and may result
>>in invalid pod specs
>>2. Provide the ability for advanced users to explicitly skip pod spec
>>building steps they know interfere with their pod templates via
>>configuration properties
>>3. Modify the pod spec building code to be aware of known desirable
>>user customisation points and avoid generating  invalid specs in those 
>> cases
>>
>>
>>
>> Currently committers seem to be going for Option 1.  Personally I would
>> like to see the community adopt option 3 but have already received
>> considerable pushback when I proposed that in one of my PRs hence the
>> suggestion 

Re: [VOTE] SPARK 2.4.0 (RC1)

2018-09-17 Thread Stavros Kontopoulos
>>>> >> >>
>>>> >> >> Last, when I check the staging repo I'll get my answer, but, were
>>>> you
>>>> >> >> able to build 2.12 artifacts as well?
>>>> >> >>
>>>> >> >> On Sun, Sep 16, 2018 at 9:48 PM Wenchen Fan 
>>>> wrote:
>>>> >> >> >
>>>> >> >> > Please vote on releasing the following candidate as Apache
>>>> Spark version 2.4.0.
>>>> >> >> >
>>>> >> >> > The vote is open until September 20 PST and passes if a
>>>> majority +1 PMC votes are cast, with
>>>> >> >> > a minimum of 3 +1 votes.
>>>> >> >> >
>>>> >> >> > [ ] +1 Release this package as Apache Spark 2.4.0
>>>> >> >> > [ ] -1 Do not release this package because ...
>>>> >> >> >
>>>> >> >> > To learn more about Apache Spark, please see
>>>> http://spark.apache.org/
>>>> >> >> >
>>>> >> >> > The tag to be voted on is v2.4.0-rc1 (commit
>>>> 1220ab8a0738b5f67dc522df5e3e77ffc83d207a):
>>>> >> >> > https://github.com/apache/spark/tree/v2.4.0-rc1
>>>> >> >> >
>>>> >> >> > The release files, including signatures, digests, etc. can be
>>>> found at:
>>>> >> >> > https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc1-bin/
>>>> >> >> >
>>>> >> >> > Signatures used for Spark RCs can be found in this file:
>>>> >> >> > https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>> >> >> >
>>>> >> >> > The staging repository for this release can be found at:
>>>> >> >> > https://repository.apache.org/content/repositories/
>>>> orgapachespark-1285/
>>>> >> >> >
>>>> >> >> > The documentation corresponding to this release can be found at:
>>>> >> >> > https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc1-docs/
>>>> >> >> >
>>>> >> >> > The list of bug fixes going into 2.4.0 can be found at the
>>>> following URL:
>>>> >> >> > https://issues.apache.org/jira/projects/SPARK/versions/2.4.0
>>>> >> >> >
>>>> >> >> > FAQ
>>>> >> >> >
>>>> >> >> > =
>>>> >> >> > How can I help test this release?
>>>> >> >> > =
>>>> >> >> >
>>>> >> >> > If you are a Spark user, you can help us test this release by
>>>> taking
>>>> >> >> > an existing Spark workload and running on this release
>>>> candidate, then
>>>> >> >> > reporting any regressions.
>>>> >> >> >
>>>> >> >> > If you're working in PySpark you can set up a virtual env and
>>>> install
>>>> >> >> > the current RC and see if anything important breaks, in the
>>>> Java/Scala
>>>> >> >> > you can add the staging repository to your projects resolvers
>>>> and test
>>>> >> >> > with the RC (make sure to clean up the artifact cache
>>>> before/after so
>>>> >> >> > you don't end up building with a out of date RC going forward).
>>>> >> >> >
>>>> >> >> > ===
>>>> >> >> > What should happen to JIRA tickets still targeting 2.4.0?
>>>> >> >> > ===
>>>> >> >> >
>>>> >> >> > The current list of open tickets targeted at 2.4.0 can be found
>>>> at:
>>>> >> >> > https://issues.apache.org/jira/projects/SPARK and search for
>>>> "Target Version/s" = 2.4.0
>>>> >> >> >
>>>> >> >> > Committers should look at those and triage. Extremely important
>>>> bug
>>>> >> >> > fixes, documentation, and API tweaks that impact compatibility
>>>> should
>>>> >> >> > be worked on immediately. Everything else please retarget to an
>>>> >> >> > appropriate release.
>>>> >> >> >
>>>> >> >> > ==
>>>> >> >> > But my bug isn't fixed?
>>>> >> >> > ==
>>>> >> >> >
>>>> >> >> > In order to make timely releases, we will typically not hold the
>>>> >> >> > release unless the bug in question is a regression from the
>>>> previous
>>>> >> >> > release. That being said, if there is something which is a
>>>> regression
>>>> >> >> > that has not been correctly targeted please ping me or a
>>>> committer to
>>>> >> >> > help target the issue.
>>>>
>>>>
>>>>
>>>> --
>>>> Marcelo
>>>>
>>>> -
>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>
>>>>


-- 
Stavros Kontopoulos
Senior Software Engineer
Lightbend, Inc.
p: +30 6977967274
e: stavros.kontopou...@lightbend.com


Re: [VOTE] SPARK 2.4.0 (RC1)

2018-09-17 Thread Stavros Kontopoulos
I am just following Wenchen Fan's comment (of course it is not merged yet, but
I wanted to bring this to the attention of the dev list):

"We should definitely merge it to branch 2.4, but I won't block the release
since it's not that critical and it's still in progress. After it's merged,
feel free to vote -1 on the RC voting email to include this change, if
necessary."


So if the vote is not valid, we can ignore it. But this should have been
in before 2.4 was cut, IMHO, anyway.


Stavros


On Mon, Sep 17, 2018 at 4:53 PM, Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> I believe -1 votes are merited only for correctness bugs and regressions
> since the previous release.
>
> Does SPARK-23200 count as either?
>
> On Mon, Sep 17, 2018 at 9:40 AM, Stavros Kontopoulos  wrote:
>
>> -1
>>
>> I would like to see: https://github.com/apache/spark/pull/22392 in, as
>> discussed here: https://issues.apache.org/jira/browse/SPARK-23200. It is
>> important IMHO for streaming on K8s.
>> I just started testing it btw.
>>
>> Also 2.12.7(https://contributors.scala-lang.org/t/2-12-7-release/2301,
>> https://github.com/scala/scala/milestone/73 is coming out (will be
>> staged this week), do we want to build the beta 2.12 build against it?
>>
>> Stavros
>>
>> On Mon, Sep 17, 2018 at 8:00 AM, Wenchen Fan  wrote:
>>
>>> I confirmed that https://repository.apache.org/content/
>>> repositories/orgapachespark-1285 is not accessible. I did it via
>>> ./dev/create-release/do-release-docker.sh -d /my/work/dir -s publish ,
>>> not sure what's going wrong. I didn't see any error message during it.
>>>
>>> Any insights are appreciated! So that I can fix it in the next RC.
>>> Thanks!
>>>
>>> On Mon, Sep 17, 2018 at 11:31 AM Sean Owen  wrote:
>>>
>>>> I think one build is enough, but haven't thought it through. The
>>>> Hadoop 2.6/2.7 builds are already nearly redundant. 2.12 is probably
>>>> best advertised as a 'beta'. So maybe publish a no-hadoop build of it?
>>>> Really, whatever's the easy thing to do.
>>>> On Sun, Sep 16, 2018 at 10:28 PM Wenchen Fan 
>>>> wrote:
>>>> >
>>>> > Ah I missed the Scala 2.12 build. Do you mean we should publish a
>>>> Scala 2.12 build this time? Current for Scala 2.11 we have 3 builds: with
>>>> hadoop 2.7, with hadoop 2.6, without hadoop. Shall we do the same thing for
>>>> Scala 2.12?
>>>> >
>>>> > On Mon, Sep 17, 2018 at 11:14 AM Sean Owen  wrote:
>>>> >>
>>>> >> A few preliminary notes:
>>>> >>
>>>> >> Wenchen for some weird reason when I hit your key in gpg --import, it
>>>> >> asks for a passphrase. When I skip it, it's fine, gpg can still
>>>> verify
>>>> >> the signature. No issue there really.
>>>> >>
>>>> >> The staging repo gives a 404:
>>>> >> https://repository.apache.org/content/repositories/
>>>> orgapachespark-1285/
>>>> >> 404 - Repository "orgapachespark-1285 (staging: open)"
>>>> >> [id=orgapachespark-1285] exists but is not exposed.
>>>> >>
>>>> >> The (revamped) licenses are OK, though there are some minor glitches
>>>> >> in the final release tarballs (my fault) : there's an extra
>>>> directory,
>>>> >> and the source release has both binary and source licenses. I'll fix
>>>> >> that. Not strictly necessary to reject the release over those.
>>>> >>
>>>> >> Last, when I check the staging repo I'll get my answer, but, were you
>>>> >> able to build 2.12 artifacts as well?
>>>> >>
>>>> >> On Sun, Sep 16, 2018 at 9:48 PM Wenchen Fan 
>>>> wrote:
>>>> >> >
>>>> >> > Please vote on releasing the following candidate as Apache Spark
>>>> version 2.4.0.
>>>> >> >
>>>> >> > The vote is open until September 20 PST and passes if a majority
>>>> +1 PMC votes are cast, with
>>>> >> > a minimum of 3 +1 votes.
>>>> >> >
>>>> >> > [ ] +1 Release this package as Apache Spark 2.4.0
>>>> >> > [ ] -1 Do not release this package because ...
>>>> >> >
>>>> >> > To learn more about Apache Spark, please see
>>>> http://spark.apache.or

Re: [VOTE] SPARK 2.4.0 (RC1)

2018-09-17 Thread Stavros Kontopoulos
-1

I would like to see: https://github.com/apache/spark/pull/22392 in, as
discussed here: https://issues.apache.org/jira/browse/SPARK-23200. It is
important IMHO for streaming on K8s.
I just started testing it btw.

Also, 2.12.7 (https://contributors.scala-lang.org/t/2-12-7-release/2301,
https://github.com/scala/scala/milestone/73) is coming out (it will be staged
this week); do we want to build the beta 2.12 build against it?

Stavros

On Mon, Sep 17, 2018 at 8:00 AM, Wenchen Fan  wrote:

> I confirmed that https://repository.apache.org/content/
> repositories/orgapachespark-1285 is not accessible. I did it via
> ./dev/create-release/do-release-docker.sh -d /my/work/dir -s publish ,
> not sure what's going wrong. I didn't see any error message during it.
>
> Any insights are appreciated! So that I can fix it in the next RC. Thanks!
>
> On Mon, Sep 17, 2018 at 11:31 AM Sean Owen  wrote:
>
>> I think one build is enough, but haven't thought it through. The
>> Hadoop 2.6/2.7 builds are already nearly redundant. 2.12 is probably
>> best advertised as a 'beta'. So maybe publish a no-hadoop build of it?
>> Really, whatever's the easy thing to do.
>> On Sun, Sep 16, 2018 at 10:28 PM Wenchen Fan  wrote:
>> >
>> > Ah I missed the Scala 2.12 build. Do you mean we should publish a Scala
>> 2.12 build this time? Current for Scala 2.11 we have 3 builds: with hadoop
>> 2.7, with hadoop 2.6, without hadoop. Shall we do the same thing for Scala
>> 2.12?
>> >
>> > On Mon, Sep 17, 2018 at 11:14 AM Sean Owen  wrote:
>> >>
>> >> A few preliminary notes:
>> >>
>> >> Wenchen for some weird reason when I hit your key in gpg --import, it
>> >> asks for a passphrase. When I skip it, it's fine, gpg can still verify
>> >> the signature. No issue there really.
>> >>
>> >> The staging repo gives a 404:
>> >> https://repository.apache.org/content/repositories/
>> orgapachespark-1285/
>> >> 404 - Repository "orgapachespark-1285 (staging: open)"
>> >> [id=orgapachespark-1285] exists but is not exposed.
>> >>
>> >> The (revamped) licenses are OK, though there are some minor glitches
>> >> in the final release tarballs (my fault) : there's an extra directory,
>> >> and the source release has both binary and source licenses. I'll fix
>> >> that. Not strictly necessary to reject the release over those.
>> >>
>> >> Last, when I check the staging repo I'll get my answer, but, were you
>> >> able to build 2.12 artifacts as well?
>> >>
>> >> On Sun, Sep 16, 2018 at 9:48 PM Wenchen Fan 
>> wrote:
>> >> >
>> >> > Please vote on releasing the following candidate as Apache Spark
>> version 2.4.0.
>> >> >
>> >> > The vote is open until September 20 PST and passes if a majority +1
>> PMC votes are cast, with
>> >> > a minimum of 3 +1 votes.
>> >> >
>> >> > [ ] +1 Release this package as Apache Spark 2.4.0
>> >> > [ ] -1 Do not release this package because ...
>> >> >
>> >> > To learn more about Apache Spark, please see
>> http://spark.apache.org/
>> >> >
>> >> > The tag to be voted on is v2.4.0-rc1 (commit
>> 1220ab8a0738b5f67dc522df5e3e77ffc83d207a):
>> >> > https://github.com/apache/spark/tree/v2.4.0-rc1
>> >> >
>> >> > The release files, including signatures, digests, etc. can be found
>> at:
>> >> > https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc1-bin/
>> >> >
>> >> > Signatures used for Spark RCs can be found in this file:
>> >> > https://dist.apache.org/repos/dist/dev/spark/KEYS
>> >> >
>> >> > The staging repository for this release can be found at:
>> >> > https://repository.apache.org/content/repositories/
>> orgapachespark-1285/
>> >> >
>> >> > The documentation corresponding to this release can be found at:
>> >> > https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc1-docs/
>> >> >
>> >> > The list of bug fixes going into 2.4.0 can be found at the following
>> URL:
>> >> > https://issues.apache.org/jira/projects/SPARK/versions/2.4.0
>> >> >
>> >> > FAQ
>> >> >
>> >> > =
>> >> > How can I help test this release?
>> >> > =
>> >> >
>> >> > If you are a Spark user, you can help us test this release by taking
>> >> > an existing Spark workload and running on this release candidate,
>> then
>> >> > reporting any regressions.
>> >> >
>> >> > If you're working in PySpark you can set up a virtual env and install
>> >> > the current RC and see if anything important breaks, in the
>> Java/Scala
>> >> > you can add the staging repository to your projects resolvers and
>> test
>> >> > with the RC (make sure to clean up the artifact cache before/after so
>> >> > you don't end up building with a out of date RC going forward).
>> >> >
>> >> > ===
>> >> > What should happen to JIRA tickets still targeting 2.4.0?
>> >> > ===
>> >> >
>> >> > The current list of open tickets targeted at 2.4.0 can be found at:
>> >> > https://issues.apache.org/jira/projects/SPARK and search for
>> "Target Version/s" = 2.4.0
>> >> >
>> >> > 

[DISCUSS][CORE] Exposing application status metrics via a source

2018-09-12 Thread Stavros Kontopoulos
Hi all,

I have a PR https://github.com/apache/spark/pull/22381 that exposes
application status
metrics (related jira: SPARK-25394).

So far metrics tooling needs to scrape the metrics REST API to get metrics
like job delay, stages failed, stages completed, etc.
From a devops perspective it is good to standardize on a unified way of
gathering metrics.
The need came up on the K8s side, where the JMX Prometheus exporter is
commonly used to scrape metrics for several components such as Kafka and
Cassandra, but the need is not limited there.

Check the comment here:
"The rest api is great for UI and consolidated analytics, but monitoring
through it is not as straightforward as when the data emits directly from
the source like this. There is all kinds of nice context that we get when
the data from this spark node is collected directly from the node itself,
and not proxied through another collector / reporter. It is easier to build
a monitoring data model across the cluster when node, jmx, pod, resource
manifests, and spark data all align by virtue of coming from the same
collector. Building a similar view of the cluster just from the rest api,
as a comparison, is simply harder and quite challenging to do in general
purpose terms."

The PR is OK to be merged, but the major concern here is the mirroring of
the metrics. I think that mirroring is OK, since people may not want to
check the UI and just want to integrate with JMX only (my use case)
and gather metrics in Grafana (a common case out there).
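
As a rough illustration of what the mirroring boils down to (this is not the
PR's code; the class, fields and metric names below are invented), the source
is essentially a set of Dropwizard gauges that a JMX reporter, and hence the
Prometheus JMX exporter, can pick up without touching the REST API:

import com.codahale.metrics.{Gauge, MetricRegistry}

class AppStatusSourceSketch(registry: MetricRegistry) {
  // In the real thing these values would be fed by the application status listener.
  @volatile var jobsCompleted: Long = 0L
  @volatile var stagesFailed: Long = 0L

  registry.register("appStatus.jobs.completedJobs", new Gauge[Long] {
    override def getValue: Long = jobsCompleted
  })
  registry.register("appStatus.stages.failedStages", new Gauge[Long] {
    override def getValue: Long = stagesFailed
  })
}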

Do any of the committers or the community have an opinion on this?
Is there agreement about moving on with this? Note that the addition
does not change much and can always be refactored if we come up with a new
plan for the metrics story in the future.

Thanks,
Stavros


custom sink & model transformation

2018-09-10 Thread Stavros Kontopoulos
Hi,

Just copying form users, since got no response.

Is it unsafe to do model prediction within a custom sink, e.g.
model.transform(df)?
I see that the only transformation done is adding a prediction column,
AFAIK; does that change the execution plan?

Thanks,
Stavros


Re: code freeze and branch cut for Apache Spark 2.4

2018-08-30 Thread Stavros Kontopoulos
+1, that would be great Sean; also, you put a lot of effort in there, so it
would make sense to wait a bit.

Stavros

On Fri, Aug 31, 2018 at 12:00 AM, Sean Owen  wrote:

> I know it's famous last words, but we really might be down to the last
> fix: https://github.com/apache/spark/pull/22264 More a question of making
> tests happy at this point I think than fundamental problems. My goal is to
> make sure we can release a usable, but beta-quality, 2.12 release of Spark
> in 2.4.
>
> On Thu, Aug 30, 2018 at 3:56 PM antonkulaga  wrote:
>
>> >There are a few PRs to fix Scala 2.12 issues. I think they will keep
>> coming
>> up and we don't need to block Spark 2.4 on this.
>>
>> I think it can be better to wait a bit for Scala 2.12 support in 2.4 than
>> to
>> suffer many months until Spark 2.5 with 2.12 support will be released.
>> Scala
>> 2.12 is not only about Spark but also about a lot of Scala libraries that
>> stopped supporting Scala 2.11, if Spark 2.4 will not support Scala 2.12,
>> then people will not be able to use them in their Zeppelin, Jupyter and
>> other notebooks together with Spark.
>>
>>


Re: Set up Scala 2.12 test build in Jenkins

2018-08-06 Thread Stavros Kontopoulos
The root cause for a case where the closure cleaner is involved is described
here: https://github.com/apache/spark/pull/22004/files#r207753682, but I am
also waiting for some feedback from Lukas Rytz on why this even worked in
2.11.
If it is something that needs a fix and can be fixed, we will fix it and add
test cases for sure. I do understand the UX issue and that is why I mentioned
this in the first place.
It is my concern too. Meanwhile, sometimes adoption requires changes. In the
best case only the implementation changes; in the worst case the way you use
something changes as well. Note that this is not the common scenario that
fails, and the user has options. I wouldn't say that it is detrimental, but
anyway.
I propose we move the discussion to
https://issues.apache.org/jira/browse/SPARK-25029 as this is an umbrella
JIRA for this and others.
Anyway, we are looking into this and also the Janino thing.

Stavros

On Mon, Aug 6, 2018 at 1:18 PM, Mridul Muralidharan 
wrote:

>
> A spark user’s expectation would be that any closure which worked in 2.11
> will continue to work in 2.12 (exhibiting same behavior wrt functionality,
> serializability, etc).
> If there are behavioral changes, we will need to understand what they are
> - but expection would be that they are minimal (if any) source changes for
> users/libraries - requiring otherwise would be very detrimental to adoption.
>
> Do we know the root cause here ? I am not sure how well we test the
> cornercases in cleaner- if this was not caught by suite, perhaps we should
> augment it ...
>
> Regards
> Mridul
>
> On Mon, Aug 6, 2018 at 1:08 AM Stavros Kontopoulos  lightbend.com> wrote:
>
>> Closure cleaner's initial purpose AFAIK is to clean the dependencies
>> brought in with outer pointers (compiler's side effect). With LMFs in
>> Scala 2.12 there are no outer pointers, that is why in the new design
>> document we kept the implementation minimal focusing on the return
>> statements (it was intentional). Also the majority of the generated
>> closures AFAIK are of type LMF.
>> Regarding references in the LMF body that was not part of the doc since
>> we expect the user not to point to non-serializable objects etc.
>> In all these cases you know you are adding references you shouldn't.
>> If users were used to another UX we can try fix it, not sure how well
>> this worked in the past though and if covered all cases.
>>
>> Regards,
>> Stavros
>>
>> On Mon, Aug 6, 2018 at 8:36 AM, Mridul Muralidharan 
>> wrote:
>>
>>> I agree, we should not work around the testcase but rather understand
>>> and fix the root cause.
>>> Closure cleaner should have null'ed out the references and allowed it
>>> to be serialized.
>>>
>>> Regards,
>>> Mridul
>>>
>>> On Sun, Aug 5, 2018 at 8:38 PM Wenchen Fan  wrote:
>>> >
>>> > It seems to me that the closure cleaner fails to clean up something.
>>> The failed test case defines a serializable class inside the test case, and
>>> the class doesn't refer to anything in the outer class. Ideally it can be
>>> serialized after cleaning up the closure.
>>> >
>>> > This is somehow a very weird way to define a class, so I'm not sure
>>> how serious the problem is.
>>> >
>>> > On Mon, Aug 6, 2018 at 3:41 AM Stavros Kontopoulos <
>>> stavros.kontopou...@lightbend.com> wrote:
>>> >>
>>> >> Makes sense, not sure if closure cleaning is related to the last one
>>> for example or others. The last one is a bit weird, unless I am missing
>>> something about the LegacyAccumulatorWrapper logic.
>>> >>
>>> >> Stavros
>>> >>
>>> >> On Sun, Aug 5, 2018 at 10:23 PM, Sean Owen  wrote:
>>> >>>
>>> >>> Yep that's what I did. There are more failures with different
>>> resolutions. I'll open a JIRA and PR and ping you, to make sure that the
>>> changes are all reasonable, and not an artifact of missing something about
>>> closure cleaning in 2.12.
>>> >>>
>>> >>> In the meantime having a 2.12 build up and running for master will
>>> just help catch these things.
>>> >>>
>>> >>> On Sun, Aug 5, 2018 at 2:16 PM Stavros Kontopoulos <
>>> stavros.kontopou...@lightbend.com> wrote:
>>> >>>>
>>> >>>> Hi Sean,
>>> >>>>
>>> >>>> I run a quick build so the failing tests seem to be:
>>> >>>>
>>> >>>> - SPARK-17644: After one stage is aborted for t

Re: Set up Scala 2.12 test build in Jenkins

2018-08-06 Thread Stavros Kontopoulos
The closure cleaner's initial purpose, AFAIK, is to clean the dependencies
brought in with outer pointers (a compiler side effect). With LMFs in Scala
2.12 there are no outer pointers; that is why in the new design document we
kept the implementation minimal, focusing on the return statements (it was
intentional). Also, the majority of the generated closures, AFAIK, are of the
LMF type.
References in the LMF body were not part of the doc, since we expect the user
not to point to non-serializable objects, etc.
In all these cases you know you are adding references you shouldn't.
If users were used to another UX we can try to fix it; I am not sure how well
this worked in the past, though, or whether it covered all cases.
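
For readers who have not seen the return-statement case, here is a plain,
non-Spark illustration (the object and method names are made up):

object ReturnInClosureExample {
  // The `return` below is a non-local return: the compiler implements it with a
  // NonLocalReturnControl exception that unwinds back to the enclosing method.
  // This is the kind of construct inside a closure body that the cleaner has to
  // detect.
  def firstNegative(xs: Seq[Int]): Option[Int] = {
    xs.foreach { x =>
      if (x < 0) return Some(x) // return statement inside the closure passed to foreach
    }
    None
  }
}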

Regards,
Stavros

On Mon, Aug 6, 2018 at 8:36 AM, Mridul Muralidharan 
wrote:

> I agree, we should not work around the testcase but rather understand
> and fix the root cause.
> Closure cleaner should have null'ed out the references and allowed it
> to be serialized.
>
> Regards,
> Mridul
>
> On Sun, Aug 5, 2018 at 8:38 PM Wenchen Fan  wrote:
> >
> > It seems to me that the closure cleaner fails to clean up something. The
> failed test case defines a serializable class inside the test case, and the
> class doesn't refer to anything in the outer class. Ideally it can be
> serialized after cleaning up the closure.
> >
> > This is somehow a very weird way to define a class, so I'm not sure how
> serious the problem is.
> >
> > On Mon, Aug 6, 2018 at 3:41 AM Stavros Kontopoulos  lightbend.com> wrote:
> >>
> >> Makes sense, not sure if closure cleaning is related to the last one
> for example or others. The last one is a bit weird, unless I am missing
> something about the LegacyAccumulatorWrapper logic.
> >>
> >> Stavros
> >>
> >> On Sun, Aug 5, 2018 at 10:23 PM, Sean Owen  wrote:
> >>>
> >>> Yep that's what I did. There are more failures with different
> resolutions. I'll open a JIRA and PR and ping you, to make sure that the
> changes are all reasonable, and not an artifact of missing something about
> closure cleaning in 2.12.
> >>>
> >>> In the meantime having a 2.12 build up and running for master will
> just help catch these things.
> >>>
> >>> On Sun, Aug 5, 2018 at 2:16 PM Stavros Kontopoulos <
> stavros.kontopou...@lightbend.com> wrote:
> >>>>
> >>>> Hi Sean,
> >>>>
> >>>> I run a quick build so the failing tests seem to be:
> >>>>
> >>>> - SPARK-17644: After one stage is aborted for too many failed
> attempts, subsequent stagesstill behave correctly on fetch failures ***
> FAILED ***
> >>>>   A job with one fetch failure should eventually succeed
> (DAGSchedulerSuite.scala:2422)
> >>>>
> >>>>
> >>>> - LegacyAccumulatorWrapper with AccumulatorParam that has no
> equals/hashCode *** FAILED ***
> >>>>   java.io.NotSerializableException: org.scalatest.Assertions$
> AssertionsHelper
> >>>> Serialization stack:
> >>>> - object not serializable (class: 
> >>>> org.scalatest.Assertions$AssertionsHelper,
> value: org.scalatest.Assertions$AssertionsHelper@3bc5fc8f)
> >>>>
> >>>>
> >>>> The last one can be fixed easily if you set class `MyData(val i: Int)
> extends Serializable `outside of the test suite. For some reason outers
> (not removed) are capturing
> >>>> the Scalatest stuff in 2.12.
> >>>>
> >>>> Let me know if we see the same failures.
> >>>>
> >>>> Stavros
> >>>>
> >>>> On Sun, Aug 5, 2018 at 5:10 PM, Sean Owen  wrote:
> >>>>>
> >>>>> Shane et al - could we get a test job in Jenkins to test the Scala
> 2.12 build? I don't think I have the access or expertise for it, though I
> could probably copy and paste a job. I think we just need to clone the,
> say, master Maven Hadoop 2.7 job, and add two steps: run
> "./dev/change-scala-version.sh 2.12" first, then add "-Pscala-2.12" to the
> profiles that are enabled.
> >>>>>
> >>>>> I can already see two test failures for the 2.12 build right now and
> will try to fix those, but this should help verify whether the failures are
> 'real' and detect them going forward.
> >>>>>
> >>>>>
> >>>>
> >>
> >>
> >>
>


Re: Set up Scala 2.12 test build in Jenkins

2018-08-05 Thread Stavros Kontopoulos
Makes sense; I am not sure if closure cleaning is related to the last one,
for example, or to the others. The last one is a bit weird, unless I am
missing something about the LegacyAccumulatorWrapper logic.

Stavros

On Sun, Aug 5, 2018 at 10:23 PM, Sean Owen  wrote:

> Yep that's what I did. There are more failures with different resolutions.
> I'll open a JIRA and PR and ping you, to make sure that the changes are all
> reasonable, and not an artifact of missing something about closure cleaning
> in 2.12.
>
> In the meantime having a 2.12 build up and running for master will just
> help catch these things.
>
> On Sun, Aug 5, 2018 at 2:16 PM Stavros Kontopoulos  lightbend.com> wrote:
>
>> Hi Sean,
>>
>> I run a quick build so the failing tests seem to be:
>>
>> - SPARK-17644: After one stage is aborted for too many failed attempts, 
>> subsequent stagesstill behave correctly on fetch failures *** FAILED ***
>>   A job with one fetch failure should eventually succeed 
>> (DAGSchedulerSuite.scala:2422)
>>
>>
>> - LegacyAccumulatorWrapper with AccumulatorParam that has no equals/hashCode 
>> *** FAILED ***
>>   java.io.NotSerializableException: org.scalatest.Assertions$AssertionsHelper
>> Serialization stack:
>>  - object not serializable (class: 
>> org.scalatest.Assertions$AssertionsHelper, value: 
>> org.scalatest.Assertions$AssertionsHelper@3bc5fc8f)
>>
>>
>> The last one can be fixed easily if you set class `MyData(val i: Int)
>> extends Serializable `outside of the test suite. For some reason outers
>> (not removed) are capturing
>> the Scalatest stuff in 2.12.
>>
>> Let me know if we see the same failures.
>>
>> Stavros
>>
>> On Sun, Aug 5, 2018 at 5:10 PM, Sean Owen  wrote:
>>
>>> Shane et al - could we get a test job in Jenkins to test the Scala 2.12
>>> build? I don't think I have the access or expertise for it, though I could
>>> probably copy and paste a job. I think we just need to clone the, say,
>>> master Maven Hadoop 2.7 job, and add two steps: run
>>> "./dev/change-scala-version.sh 2.12" first, then add "-Pscala-2.12" to the
>>> profiles that are enabled.
>>>
>>> I can already see two test failures for the 2.12 build right now and
>>> will try to fix those, but this should help verify whether the failures are
>>> 'real' and detect them going forward.
>>>
>>>
>>>
>>


Re: Set up Scala 2.12 test build in Jenkins

2018-08-05 Thread Stavros Kontopoulos
Hi Sean,

I ran a quick build, so the failing tests seem to be:

- SPARK-17644: After one stage is aborted for too many failed
attempts, subsequent stagesstill behave correctly on fetch failures
*** FAILED ***
  A job with one fetch failure should eventually succeed
(DAGSchedulerSuite.scala:2422)


- LegacyAccumulatorWrapper with AccumulatorParam that has no
equals/hashCode *** FAILED ***
  java.io.NotSerializableException: org.scalatest.Assertions$AssertionsHelper
Serialization stack:
- object not serializable (class:
org.scalatest.Assertions$AssertionsHelper, value:
org.scalatest.Assertions$AssertionsHelper@3bc5fc8f)


The last one can be fixed easily if you define class `MyData(val i: Int)
extends Serializable` outside of the test suite. For some reason the outers
(not removed) are capturing the ScalaTest stuff in 2.12.
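
As a toy illustration of that kind of capture (not the actual suite code; it
uses a member class for simplicity, while the test defines the class inside
the test body, but the failure mode is the same flavor):

import java.io.{ByteArrayOutputStream, ObjectOutputStream}

class NotSerializableSuite { // stands in for the ScalaTest suite instance
  class MyInnerData(val i: Int) extends Serializable // member class: keeps an $outer reference

  def serialize(): Unit = {
    val oos = new ObjectOutputStream(new ByteArrayOutputStream())
    // Throws NotSerializableException: the $outer (the enclosing suite) gets dragged in.
    oos.writeObject(new MyInnerData(1))
  }
}

// Defined at the top level there is no enclosing instance to capture, so this
// serializes fine.
class MyTopLevelData(val i: Int) extends Serializable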

Let me know if we see the same failures.

Stavros

On Sun, Aug 5, 2018 at 5:10 PM, Sean Owen  wrote:

> Shane et al - could we get a test job in Jenkins to test the Scala 2.12
> build? I don't think I have the access or expertise for it, though I could
> probably copy and paste a job. I think we just need to clone the, say,
> master Maven Hadoop 2.7 job, and add two steps: run
> "./dev/change-scala-version.sh 2.12" first, then add "-Pscala-2.12" to the
> profiles that are enabled.
>
> I can already see two test failures for the 2.12 build right now and will
> try to fix those, but this should help verify whether the failures are
> 'real' and detect them going forward.
>
>
>


Re: code freeze and branch cut for Apache Spark 2.4

2018-07-31 Thread Stavros Kontopoulos
I have a PR out for SPARK-14540 (Support Scala 2.12 closures and Java 8
lambdas in ClosureCleaner).
This should allow us to add support for Scala 2.12; I think we can resolve
this long-standing issue with 2.4.

Best,
Stavros

On Tue, Jul 31, 2018 at 4:07 PM, Tomasz Gawęda 
wrote:

> Hi,
>
> what is the status of Continuous Processing + Aggregations? As far as I
> remember, Jose Torres said it should  be easy to perform aggregations if
> coalesce(1) work. IIRC it's already merged to master.
>
> Is this work in progress? If yes, it would be great to have full
> aggregation/join support in Spark 2.4 in CP.
>
> Pozdrawiam / Best regards,
>
> Tomek
>
>
> On 2018-07-31 10:43, Petar Zečević wrote:
> > This one is important to us: https://issues.apache.org/
> jira/browse/SPARK-24020 (Sort-merge join inner range optimization) but I
> think it could be useful to others too.
> >
> > It is finished and is ready to be merged (was ready a month ago at
> least).
> >
> > Do you think you could consider including it in 2.4?
> >
> > Petar
> >
> >
> > Wenchen Fan @ 1970-01-01 01:00 CET:
> >
> >> I went through the open JIRA tickets and here is a list that we should
> consider for Spark 2.4:
> >>
> >> High Priority:
> >> SPARK-24374: Support Barrier Execution Mode in Apache Spark
> >> This one is critical to the Spark ecosystem for deep learning. It only
> has a few remaining works and I think we should have it in Spark 2.4.
> >>
> >> Middle Priority:
> >> SPARK-23899: Built-in SQL Function Improvement
> >> We've already added a lot of built-in functions in this release, but
> there are a few useful higher-order functions in progress, like
> `array_except`, `transform`, etc. It would be great if we can get them in
> Spark 2.4.
> >>
> >> SPARK-14220: Build and test Spark against Scala 2.12
> >> Very close to finishing, great to have it in Spark 2.4.
> >>
> >> SPARK-4502: Spark SQL reads unnecessary nested fields from Parquet
> >> This one is there for years (thanks for your patience Michael!), and is
> also close to finishing. Great to have it in 2.4.
> >>
> >> SPARK-24882: data source v2 API improvement
> >> This is to improve the data source v2 API based on what we learned
> during this release. From the migration of existing sources and design of
> new features, we found some problems in the API and want to address them. I
> believe this should be
> >> the last significant API change to data source v2, so great to have in
> Spark 2.4. I'll send a discuss email about it later.
> >>
> >> SPARK-24252: Add catalog support in Data Source V2
> >> This is a very important feature for data source v2, and is currently
> being discussed in the dev list.
> >>
> >> SPARK-24768: Have a built-in AVRO data source implementation
> >> Most of it is done, but date/timestamp support is still missing. Great
> to have in 2.4.
> >>
> >> SPARK-23243: Shuffle+Repartition on an RDD could lead to incorrect
> answers
> >> This is a long-standing correctness bug, great to have in 2.4.
> >>
> >> There are some other important features like the adaptive execution,
> streaming SQL, etc., not in the list, since I think we are not able to
> finish them before 2.4.
> >>
> >> Feel free to add more things if you think they are important to Spark
> 2.4 by replying to this email.
> >>
> >> Thanks,
> >> Wenchen
> >>
> >> On Mon, Jul 30, 2018 at 11:00 PM Sean Owen  wrote:
> >>
> >>   In theory releases happen on a time-based cadence, so it's pretty
> much wrap up what's ready by the code freeze and ship it. In practice, the
> cadence slips frequently, and it's very much a negotiation about what
> features should push the
> >>   code freeze out a few weeks every time. So, kind of a hybrid approach
> here that works OK.
> >>
> >>   Certainly speak up if you think there's something that really needs
> to get into 2.4. This is that discuss thread.
> >>
> >>   (BTW I updated the page you mention just yesterday, to reflect the
> plan suggested in this thread.)
> >>
> >>   On Mon, Jul 30, 2018 at 9:51 AM Tom Graves
>  wrote:
> >>
> >>   Shouldn't this be a discuss thread?
> >>
> >>   I'm also happy to see more release managers and agree the time is
> getting close, but we should see what features are in progress and see how
> close things are and propose a date based on that.  Cutting a branch to
> soon just creates
> >>   more work for committers to push to more branches.
> >>
> >>http://spark.apache.org/versioning-policy.html mentioned the code
> freeze and release branch cut mid-august.
> >>
> >>   Tom
> >
> > -
> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >
>
>


Re: code freeze and branch cut for Apache Spark 2.4

2018-07-29 Thread Stavros Kontopoulos
+1. That would be great!

Thanks,
Stavros

On Sun, Jul 29, 2018 at 5:05 PM, Wenchen Fan  wrote:

> If no one objects, how about we make the code freeze one week later(Aug
> 8th)?
>
> BTW I'd like to volunteer to serve as the release manager for Spark 2.4.
> I'm familiar with most of the major features targeted for the 2.4 release.
> I also have a lot of free time during this release timeframe and should be
> able to figure out problems that may appear during the release.
>
> Thanks,
> Wenchen
>
> On Fri, Jul 27, 2018 at 11:27 PM Stavros Kontopoulos  lightbend.com> wrote:
>
>> Extending code freeze date would be great for me too, I am working on a
>> PR for supporting scala 2.12, I am close but need some more time.
>> We could get it into 2.4.
>>
>> Stavros
>>
>> On Fri, Jul 27, 2018 at 9:27 AM, Wenchen Fan  wrote:
>>
>>> This seems fine to me.
>>>
>>> BTW Ryan Blue and I are working on some data source v2 stuff and
>>> hopefully we can get more things done with one more week.
>>>
>>> Thanks,
>>> Wenchen
>>>
>>> On Thu, Jul 26, 2018 at 1:14 PM Xingbo Jiang 
>>> wrote:
>>>
>>>> Xiangrui and I are leading an effort to implement a highly desirable
>>>> feature, Barrier Execution Mode. https://issues.apache.org/
>>>> jira/browse/SPARK-24374. This introduces a new scheduling model to
>>>> Apache Spark so users can properly embed distributed DL training as a Spark
>>>> stage to simplify the distributed training workflow. The prototype has been
>>>> demoed in the Spark Summit keynote. This new feature got a very positive
>>>> feedback from the whole community. The design doc and pull requests got
>>>> more comments than we initially anticipated. We want to finish this feature
>>>> in the upcoming release, Spark 2.4. Would it be possible to have an
>>>> extension of code freeze for a week?
>>>>
>>>> Thanks,
>>>>
>>>> Xingbo
>>>>
>>>> 2018-07-07 0:47 GMT+08:00 Reynold Xin :
>>>>
>>>>> FYI 6 mo is coming up soon since the last release. We will cut the
>>>>> branch and code freeze on Aug 1st in order to get 2.4 out on time.
>>>>>
>>>>>
>>>>
>>
>>
>> --
>> Stavros Kontopoulos
>>
>> *Senior Software Engineer*
>> *Lightbend, Inc.*
>>
>> *p:  +30 6977967274 <%2B1%20650%20678%200020>*
>> *e: stavros.kontopou...@lightbend.com* 
>>
>>
>>


-- 
Stavros Kontopoulos
Senior Software Engineer
Lightbend, Inc.
p: +30 6977967274
e: stavros.kontopou...@lightbend.com


Re: code freeze and branch cut for Apache Spark 2.4

2018-07-27 Thread Stavros Kontopoulos
Extending the code freeze date would be great for me too; I am working on a PR
for supporting Scala 2.12, and I am close but need some more time.
We could get it into 2.4.

Stavros

On Fri, Jul 27, 2018 at 9:27 AM, Wenchen Fan  wrote:

> This seems fine to me.
>
> BTW Ryan Blue and I are working on some data source v2 stuff and hopefully
> we can get more things done with one more week.
>
> Thanks,
> Wenchen
>
> On Thu, Jul 26, 2018 at 1:14 PM Xingbo Jiang 
> wrote:
>
>> Xiangrui and I are leading an effort to implement a highly desirable
>> feature, Barrier Execution Mode. https://issues.apache.org/
>> jira/browse/SPARK-24374. This introduces a new scheduling model to
>> Apache Spark so users can properly embed distributed DL training as a Spark
>> stage to simplify the distributed training workflow. The prototype has been
>> demoed in the Spark Summit keynote. This new feature got a very positive
>> feedback from the whole community. The design doc and pull requests got
>> more comments than we initially anticipated. We want to finish this feature
>> in the upcoming release, Spark 2.4. Would it be possible to have an
>> extension of code freeze for a week?
>>
>> Thanks,
>>
>> Xingbo
>>
>> 2018-07-07 0:47 GMT+08:00 Reynold Xin :
>>
>>> FYI 6 mo is coming up soon since the last release. We will cut the
>>> branch and code freeze on Aug 1st in order to get 2.4 out on time.
>>>
>>>
>>


-- 
Stavros Kontopoulos
Senior Software Engineer
Lightbend, Inc.
p: +30 6977967274
e: stavros.kontopou...@lightbend.com


Re: Time for 2.3.2?

2018-06-28 Thread Stavros Kontopoulos
+1 makes sense.

On Thu, Jun 28, 2018 at 12:07 PM, Marco Gaido 
wrote:

> +1 too, I'd consider also to include SPARK-24208 if we can solve it
> timely...
>
> 2018-06-28 8:28 GMT+02:00 Takeshi Yamamuro :
>
>> +1, I heard some Spark users have skipped v2.3.1 because of these bugs.
>>
>> On Thu, Jun 28, 2018 at 3:09 PM Xingbo Jiang 
>> wrote:
>>
>>> +1
>>>
>>> On Thu, Jun 28, 2018 at 2:06 PM, Wenchen Fan wrote:
>>>
>>>> Hi Saisai, that's great! please go ahead!
>>>>
>>>> On Thu, Jun 28, 2018 at 12:56 PM Saisai Shao 
>>>> wrote:
>>>>
>>>>> +1, like mentioned by Marcelo, these issues seems quite severe.
>>>>>
>>>>> I can work on the release if short of hands :).
>>>>>
>>>>> Thanks
>>>>> Jerry
>>>>>
>>>>>
>>>>> On Thu, Jun 28, 2018 at 11:40 AM, Marcelo Vanzin  wrote:
>>>>>
>>>>>> +1. SPARK-24589 / SPARK-24552 are kinda nasty and we should get fixes
>>>>>> for those out.
>>>>>>
>>>>>> (Those are what delayed 2.2.2 and 2.1.3 for those watching...)
>>>>>>
>>>>>> On Wed, Jun 27, 2018 at 7:59 PM, Wenchen Fan 
>>>>>> wrote:
>>>>>> > Hi all,
>>>>>> >
>>>>>> > Spark 2.3.1 was released just a while ago, but unfortunately we
>>>>>> discovered
>>>>>> > and fixed some critical issues afterward.
>>>>>> >
>>>>>> > SPARK-24495: SortMergeJoin may produce wrong result.
>>>>>> > This is a serious correctness bug, and is easy to hit: have
>>>>>> duplicated join
>>>>>> > key from the left table, e.g. `WHERE t1.a = t2.b AND t1.a = t2.c`,
>>>>>> and the
>>>>>> > join is a sort merge join. This bug is only present in Spark 2.3.
>>>>>> >
>>>>>> > SPARK-24588: stream-stream join may produce wrong result
>>>>>> > This is a correctness bug in a new feature of Spark 2.3: the
>>>>>> stream-stream
>>>>>> > join. Users can hit this bug if one of the join side is partitioned
>>>>>> by a
>>>>>> > subset of the join keys.
>>>>>> >
>>>>>> > SPARK-24552: Task attempt numbers are reused when stages are retried
>>>>>> > This is a long-standing bug in the output committer that may
>>>>>> introduce data
>>>>>> > corruption.
>>>>>> >
>>>>>> > SPARK-24542: UDFXPath allow users to pass carefully crafted XML
>>>>>> to
>>>>>> > access arbitrary files
>>>>>> > This is a potential security issue if users build access control
>>>>>> module upon
>>>>>> > Spark.
>>>>>> >
>>>>>> > I think we need a Spark 2.3.2 to address these issues(especially the
>>>>>> > correctness bugs) ASAP. Any thoughts?
>>>>>> >
>>>>>> > Thanks,
>>>>>> > Wenchen
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Marcelo
>>>>>>
>>>>>> -
>>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>>>
>>>>>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>
>


-- 
Stavros Kontopoulos
Senior Software Engineer
Lightbend, Inc.
p: +30 6977967274
e: stavros.kontopou...@lightbend.com


Re: Scala 2.12 support

2018-06-21 Thread Stavros Kontopoulos
Hi all,

The Scala team @Lightbend (Lukas, Adriaan, Jason) and I have worked for a
couple of days now on this.

We have captured the current status and possible solutions for the remaining
two issues here:
https://docs.google.com/document/d/1fbkjEL878witxVQpOCbjlvOvadHtVjYXeB-2mgzDTvk

Please review the work so we can move forward with this long-standing issue.

PS. I think my previous msg didn't reach the list...

Best,
Stavros


On Thu, Jun 21, 2018 at 3:37 PM, Stavros Kontopoulos <
stavros.kontopou...@lightbend.com> wrote:

> Hi all,
>
> Scala team @Lightbend (Lukas, Adriaan, Jason) and I, have worked for a
> couple of days now on this.
>
> We have captured current status and possible solutions for the remaining
> two issues here: https://docs.google.com/document/d/
> 1fbkjEL878witxVQpOCbjlvOvadHtVjYXeB-2mgzDTvk
>
> Please review the work so we can move forward with this long-standing
> issue.
>
> Best,
> Stavros
>
> On Fri, Jun 8, 2018 at 4:55 AM, Sean Owen  wrote:
>
>> When I updated for Scala 2.12, I was able to remove almost all the
>> 2.11-2.12 differences. There are still already two source trees for 2.11 vs
>> 2.12. I mean that if it's necessary to accommodate differences between the
>> two, it's already set up for that, and there aren't a mess of differences
>> to patch over. Probably quite possible if necessary.
>>
>> On Thu, Jun 7, 2018 at 8:50 PM DB Tsai  wrote:
>>
>>> It is from the most recent 2.11
>>>
>>> I don’t try it yet on 2.12, but I expect to get the same result.
>>>
>>> On Thu, Jun 7, 2018 at 6:28 PM Wenchen Fan  wrote:
>>>
>>>> One more point: There was a time that we maintain 2 Spark REPL codebase
>>>> for Scala 2.10 and 2.11, maybe we can do the same for Scala 2.11 and 2.12?
>>>> if it's too hard to find a common way to do that between different Scala
>>>> versions.
>>>>
>>>
>
>
> --
> Stavros Kontopoulos
>
> *Senior Software Engineer*
> *Lightbend, Inc.*
>
> *p:  +30 6977967274 <%2B1%20650%20678%200020>*
> *e: stavros.kontopou...@lightbend.com* 
>
>
>


-- 
Stavros Kontopoulos
Senior Software Engineer
Lightbend, Inc.
p: +30 6977967274
e: stavros.kontopou...@lightbend.com


Re: queryable state & streaming

2017-12-09 Thread Stavros Kontopoulos
Nice, I was looking for a JIRA. So I agree we should justify why we are
building something. In that direction, here is what I have seen from my
experience.
People quite often use state within their streaming app and may have large
states (TBs). Shortening the pipeline by not having to copy data (to
Cassandra, for example, for serving) is an advantage, in terms of at least
latency and complexity.
This can be true if we take advantage of state checkpointing (locally that
could be RocksDB, or in general HDFS; the latter is currently supported),
along with an API to efficiently query the data.
Some use cases I see:

- real-time dashboards and real-time reporting, the faster the better
- monitoring of state for operational reasons, app health, etc.
- integrating with external services via an API, e.g. making aggregations
over time windows accessible to some third-party service within your system

Regarding requirements, here are some of them (a rough sketch of the API idea
follows below):
- support for an API to expose state (could be done at the Spark driver),
like REST
- supporting dynamic allocation (not sure how it affects state management)
- an efficient way to talk to executors to get the state (RPC?)
- making local state more efficient and more easily accessible with an
embedded DB (I don't think this is supported, from what I see, but maybe I am
wrong)
Some people are already working with such techs and some stuff could be
re-used: https://issues.apache.org/jira/browse/SPARK-20641
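
None of this exists in Spark today, but to make the API requirement concrete,
a driver-side query interface could be as small as something like the
following (all names invented, purely for discussion):

// Hypothetical read-only view over the keyed state of a running streaming
// query, served from the driver (e.g. behind a REST endpoint).
trait QueryableState[K, V] {
  def get(queryName: String, key: K): Option[V]  // point lookup
  def scan(queryName: String): Iterator[(K, V)]  // full scan, e.g. for a dashboard
}

Whether the driver proxies such calls to the executors over RPC or reads the
checkpointed state directly is exactly the trade-off the requirements above
try to capture.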

Best,
Stavros


On Fri, Dec 8, 2017 at 10:32 PM, Michael Armbrust <mich...@databricks.com>
wrote:

> https://issues.apache.org/jira/browse/SPARK-16738
>
> I don't believe anyone is working on it yet.  I think the most useful
> thing is to start enumerating requirements and use cases and then we can
> talk about how to build it.
>
> On Fri, Dec 8, 2017 at 10:47 AM, Stavros Kontopoulos <
> st.kontopou...@gmail.com> wrote:
>
>> Cool Burak do you have a pointer, should I take the initiative for a
>> first design document or Databricks is working on it?
>>
>> Best,
>> Stavros
>>
>> On Fri, Dec 8, 2017 at 8:40 PM, Burak Yavuz <brk...@gmail.com> wrote:
>>
>>> Hi Stavros,
>>>
>>> Queryable state is definitely on the roadmap! We will revamp the
>>> StateStore API a bit, and a queryable StateStore is definitely one of the
>>> things we are thinking about during that revamp.
>>>
>>> Best,
>>> Burak
>>>
>>> On Dec 8, 2017 9:57 AM, "Stavros Kontopoulos" <st.kontopou...@gmail.com>
>>> wrote:
>>>
>>>> Just to re-phrase my question: Would query-able state make a viable
>>>> SPIP?
>>>>
>>>> Regards,
>>>> Stavros
>>>>
>>>> On Thu, Dec 7, 2017 at 1:34 PM, Stavros Kontopoulos <
>>>> st.kontopou...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Maybe this has been discussed before. Given the fact that many
>>>>> streaming apps out there use state extensively, could be a good idea to
>>>>> make Spark expose streaming state with an external API like other
>>>>> systems do (Kafka streams, Flink etc), in order to facilitate
>>>>> interactive queries?
>>>>>
>>>>> Regards,
>>>>> Stavros
>>>>>
>>>>
>>>>
>>
>


Re: queryable state & streaming

2017-12-08 Thread Stavros Kontopoulos
Just to re-phrase my question: Would query-able state make a viable SPIP?

Regards,
Stavros

On Thu, Dec 7, 2017 at 1:34 PM, Stavros Kontopoulos <
st.kontopou...@gmail.com> wrote:

> Hi,
>
> Maybe this has been discussed before. Given the fact that many streaming
> apps out there use state extensively, could be a good idea to make Spark
> expose streaming state with an external API like other systems do (Kafka
> streams, Flink etc), in order to facilitate interactive queries?
>
> Regards,
> Stavros
>


queryable state & streaming

2017-12-07 Thread Stavros Kontopoulos
Hi,

Maybe this has been discussed before. Given the fact that many streaming
apps out there use state extensively, could it be a good idea to make Spark
expose streaming state through an external API, like other systems do (Kafka
Streams, Flink, etc.), in order to facilitate interactive queries?

Regards,
Stavros


Re: SparkOscope: Enabling Spark Optimization through Cross-stack Monitoring and Visualization

2016-02-17 Thread Stavros Kontopoulos
Cool work! I will have a look at the project.

Cheers

On Fri, Feb 5, 2016 at 11:09 AM, Pete Robbins  wrote:

> Yiannis,
>
> I'm interested in what you've done here as I was looking for ways to allow
> the Spark UI to display custom metrics in a pluggable way without having to
> modify the Spark source code. It would be good to see if we could have
> modify your code to add extension points into the UI so we could configure
> sources of the additional metrics. So for instance rather than creating
> events from your HDFS files I would like to have a module that is pulling
> in system/jvm metrics that are in eg Elasticsearch.
>
> Do any of the Spark committers have any thoughts on this?
>
> Cheers,
>
>
> On 3 February 2016 at 15:26, Yiannis Gkoufas  wrote:
>
>> Hi all,
>>
>> I just wanted to introduce some of my recent work in IBM Research around
>> Spark and especially its Metric System and Web UI.
>> As a quick overview of our contributions:
>> We have a created a new type of Sink for the metrics ( HDFSSink ) which
>> captures the metrics into HDFS,
>> We have extended the metrics reported by the Executors to include
>> OS-level metrics regarding CPU, RAM, Disk IO, Network IO utilizing the
>> Hyperic Sigar library
>> We have extended the Web UI for the completed applications to visualize
>> any of the above metrics the user wants to.
>> The above functionalities can be configured in the metrics.properties and
>> spark-defaults.conf files.
>> We have recorded a small demo that shows those capabilities which you can
>> find here : https://ibm.app.box.com/s/vyaedlyb444a4zna1215c7puhxliqxdg
>> There is a blog post which gives more details on the functionality here:
>> *www.spark.tc/sparkoscope-enabling-spark-optimization-through-cross-stack-monitoring-and-visualization-2/*
>> 
>> and also there is a public repo where anyone can try it:
>> *https://github.com/ibm-research-ireland/sparkoscope*
>> 
>>
>> I would really appreciate any feedback or advice regarding this work.
>> Especially if you think it's worth it to upstream to the official Spark
>> repository.
>>
>> Thanks a lot!
>>
>
>


-- 






Re: Using spark MLlib without installing Spark

2015-11-25 Thread Stavros Kontopoulos
You can even use it without Spark as well (besides local mode). For example, I
have used the following algorithm in a web app:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala

Essentially, some algorithms (I haven't checked them all) will have to
run the same steps in each partition, so if you overlook the
distribution-oriented (Spark-specific) parts of the code there is a lot of
reusable stuff.
You just have to use the API where it is public and conform to its
input/output contract.

There used to be some dependencies, like Breeze for example, in the API that
are hidden now (e.g.
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala),
but of course this is a hint, not a list of what is available for your use
case. Mind, though, that this may not be the cleanest way to implement your
use case and might sound like a hack ;)

As an alternative choice, besides Spark local mode, you could use a job server
(https://github.com/spark-jobserver/spark-jobserver) to integrate with your
app as a proxy for Spark, or have a Spark service there to respond back with
results. From a design point of view it's best to separate the concerns, for
several reasons: scaling, utilization, etc.
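
As a minimal, self-contained sketch of the local-mode route (a hypothetical
example using the RDD-based MLlib API; it only assumes the spark-core and
spark-mllib jars on the classpath, with no cluster installation):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

object LocalMllibSketch {
  def main(args: Array[String]): Unit = {
    // Local mode: everything runs inside the web app's JVM, no cluster needed.
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("mllib-local"))
    val training = sc.parallelize(Seq(
      LabeledPoint(0.0, Vectors.dense(1.0, 0.0)),
      LabeledPoint(1.0, Vectors.dense(0.0, 1.0))
    ))
    val model = NaiveBayes.train(training, lambda = 1.0)
    println(model.predict(Vectors.dense(0.0, 1.0))) // expected: 1.0
    sc.stop()
  }
}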

On Sun, Nov 22, 2015 at 5:03 AM, Reynold Xin <r...@databricks.com> wrote:

> You can use MLlib and Spark directly without "installing anything". Just
> run Spark in local mode.
>
>
> On Sat, Nov 21, 2015 at 4:05 PM, Rad Gruchalski <ra...@gruchalski.com>
> wrote:
>
>> Bowen,
>>
>> What Andy is doing in the notebook is a slightly different thing. He’s
>> using sbt to bring all spark jars (core, mllib, repl, what have you). You
>> could use maven for that. He then creates a repl and submits all the spark
>> code into it.
>> Pretty sure spark unit tests cover similar uses cases. Maybe not mllib
>> per se but this kind of submission.
>>
>> Kind regards,
>> Radek Gruchalski
>> ra...@gruchalski.com <ra...@gruchalski.com>
>> de.linkedin.com/in/radgruchalski/
>>
>>
>> *Confidentiality:*This communication is intended for the above-named
>> person and may be confidential and/or legally privileged.
>> If it has come to you in error you must take no action based on it, nor
>> must you copy or show it to anyone; please delete/destroy and inform the
>> sender immediately.
>>
>> On Sunday, 22 November 2015 at 01:01, bowen zhang wrote:
>>
>> Thanks Rad for info. I looked into the repo and see some .snb file using
>> spark mllib. Can you give me a more specific place to look for when
>> invoking the mllib functions? What if I just want to invoke some of the ML
>> functions in my HelloWorld.java?
>>
>> --
>> *From:* Rad Gruchalski <ra...@gruchalski.com>
>> *To:* bowen zhang <bowenzhang...@yahoo.com>
>> *Cc:* "dev@spark.apache.org" <dev@spark.apache.org>
>> *Sent:* Saturday, November 21, 2015 3:43 PM
>> *Subject:* Re: Using spark MLlib without installing Spark
>>
>> Bowen,
>>
>> One project to look at could be spark-notebook:
>> https://github.com/andypetrella/spark-notebook
>> It uses Spark you in the way you intend to use it.
>> Kind regards,
>> Radek Gruchalski
>> ra...@gruchalski.com <ra...@gruchalski.com>
>> de.linkedin.com/in/radgruchalski/
>>
>>
>> *Confidentiality:*This communication is intended for the above-named
>> person and may be confidential and/or legally privileged.
>> If it has come to you in error you must take no action based on it, nor
>> must you copy or show it to anyone; please delete/destroy and inform the
>> sender immediately.
>>
>>
>> On Sunday, 22 November 2015 at 00:38, bowen zhang wrote:
>>
>> Hi folks,
>> I am a big fan of Spark's Mllib package. I have a java web app where I
>> want to run some ml jobs inside the web app. My question is: is there a way
>> to just import spark-core and spark-mllib jars to invoke my ML jobs without
>> installing the entire Spark package? All the tutorials related Spark seems
>> to indicate installing Spark is a pre-condition for this.
>>
>> Thanks,
>> Bowen
>>
>>
>>
>>
>>
>>
>


-- 

Stavros Kontopoulos
<http://www.typesafe.com>


[VOTE] Release Apache Spark 1.5.2 (RC1)

2015-10-29 Thread Stavros Kontopoulos
+1 (non-binding)

I tested several of the examples on the latest Mesos version (fine-grained and
coarse-grained modes) and they work fine. I hope it's not too late, though...

-- 

Stavros Kontopoulos
<http://www.typesafe.com>