Jenkins build is back to normal : beam_Release_NightlySnapshot #359

2017-03-16 Thread Apache Jenkins Server
See 




Re: Apache Beam (virtual) contributor meeting @ Tue Mar 7, 2017

2017-03-16 Thread Davor Bonaci
I'd like to thank everyone for coming -- notes and summary of the
discussion are below.

If there's any feedback, ideas for improvement, requests to do this again
at some point, etc. -- please comment!

---

Attendees:
* Jason
* Etienne
* Kenn
* Neelesh
* Pramod
* Raghu
* Sergio
* Amit
* Aviem
* Stas
* Koby
* Thomas
* Mingmin
* Ismael
* JB
* Kai
* Frances
* Ahmet
* Robert
* Stephen
* Davor

Discussion topics:
* First stable release
* Upcoming conferences

First stable release:
* The next big milestone for the project -- it means Beam is ready for
prime time
* Timeline: April.
* JB: encourage people to test Beam; test different scenarios; get more
feedback from community
* Davor: more experiences deploying Beam; more polish around user experience
* Thomas: a few kinks to be worked out; documentation; easier-to-understand
examples, HDFS in particular
* Mingmin: better documentation / examples.
* Aviem: run pipelines on clusters; had to use Dataflow documentation at
times
* Ismael: examples use too many GCP IOs
* Sergio: Docker can help with the elaborate setup for examples, including IOs

General:
* Amit: we should position our batch offering better: agility and IOs are
major advantages
* Sergio: performance benchmarking -- it will push the needle because
everyone wants to be on top
* Davor: versioning -- how to support multiple versions of the system we
interconnect with?
* Mingmin: SQL interface in the first version?
* Neelesh: pursue more use case blog posts
* Etienne: should we pursue other projects to maintain their connectors
with Beam?
* Raghu: usage of coders in IO?

Upcoming conferences:
* ApacheCon coming up in May -- schedule to be published shortly
* There'll be Beam talks as well as social gatherings -- everyone's invited!

Action items:
* [Davor] Compare Dataflow and Beam documentation, and report back
* [all] Examples vs. GCP IO
* [JB] Blog: Talend use case
* [Amit] Blog: PayPal use case
* [all] Investigate Docker usage in examples

On Tue, Mar 7, 2017 at 1:46 AM, Sergio Fernández  wrote:

> Thanks, Davor!
>
> On Tue, Mar 7, 2017 at 3:20 AM, Davor Bonaci  wrote:
>
> > Link: https://hangouts.google.com/hangouts/_/google.com/beam-dev-mtg
> >
> > I'll try to be available on Slack shortly before the meeting, just in
> > case someone has trouble connecting.
> >
> > On Mon, Mar 6, 2017 at 9:27 AM, Amit Sela  wrote:
> >
> > > The PayPal team will be there, joining together.
> > >
> > > On Mon, Mar 6, 2017 at 7:23 PM Davor Bonaci  wrote:
> > >
> > > > Just a reminder that this is happening in about 22 hours from now.
> > > > Hope to see all of you there.
> > > >
> > > > On Thu, Mar 2, 2017 at 4:22 PM, Davor Bonaci  wrote:
> > > >
> > > > > I'd prefer not to record the video; just to keep things informal.
> > > > > We'll, however, keep the notes and share anything that may be
> > > > > relevant.
> > > > >
> > > > > On Thu, Mar 2, 2017 at 2:24 PM, Amit Sela  wrote:
> > > > >
> > > > >> I'll be there!
> > > > >>
> > > > >> On Thu, Mar 2, 2017 at 1:06 PM Aljoscha Krettek <aljos...@apache.org>
> > > > >> wrote:
> > > > >>
> > > > >> > Shoot, I can't because I already have another meeting scheduled.
> > > > >> > Don't mind me, though. Will you also maybe produce a video of the
> > > > >> > meeting?
> > > > >> >
> > > > >> > On Wed, 1 Mar 2017 at 21:50 Davor Bonaci  wrote:
> > > > >> >
> > > > >> > > Hi everyone,
> > > > >> > > Based on the high demand [1], let's try to organize a virtual
> > > > >> > > contributor meeting on Tuesday, March 7, 2017 at 15:00 UTC. For
> > > > >> > > convenience, a calendar link [2] and an .ics file are attached.
> > > > >> > >
> > > > >> > > I tried to accommodate as many time zones as possible, but I
> > > > >> > > know it might be hard for some of us at 7 AM on the US west
> > > > >> > > coast or 11 PM in China. Sorry about that.
> > > > >> > >
> > > > >> > > Let's use Google Hangouts as the video conferencing technology.
> > > > >> > > I think we may be limited to something like 30 participants, so
> > > > >> > > I'd encourage any co-located contributors to consider joining
> > > > >> > > together (if appropriate). Joining the meeting should be
> > > > >> > > straightforward -- please find the link within. No special
> > > > >> > > requirements that I'm aware of.
> > > > >> > >
> > > > >> > > Just to re-state the expectations:
> > > > >> > > * This is totally optional and informal.
> > > > >> > > * It is simply a chance for everyone to meet others and see the
> > > > >> > > faces of people we share a common passion with.
> > > > >> > > * No specific agenda.
> > > > >> > > * An open discussion on any topic of interest to the contributor
> > > > >> > > community is welcome -- please 

Re: Beam spark 2.x runner status

2017-03-16 Thread Jean-Baptiste Onofré

Hi guys,

Yes, I started to experiment with the profiles a bit, and Amit and I plan to
discuss that over the weekend.


Give me some time to move forward a bit and I will get back to you with more 
details.


Regards
JB

On 03/16/2017 05:15 PM, amarouni wrote:

Yeah, maintaining 2 RDD branches (master + a 2.x branch) is doable but will
add more maintenance/merge work.

The Maven profiles solution is worth investigating, with Spark 1.6 RDD
as the default profile and an additional Spark 2.x profile.

As JBO mentioned CarbonData, I had a quick look and it looks like a good
solution:
https://github.com/apache/incubator-carbondata/blob/master/pom.xml#L347
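
For illustration only, here is a minimal sketch of what such profiles could
look like in the runner's pom.xml (the profile ids and versions below are
assumptions for the example, not copied from the Beam or CarbonData poms):

    <!-- Hypothetical sketch: select the Spark major version via Maven profiles. -->
    <profiles>
      <!-- Spark 1.6 as the default profile, active unless another is selected. -->
      <profile>
        <id>spark-1.6</id>
        <activation>
          <activeByDefault>true</activeByDefault>
        </activation>
        <properties>
          <spark.version>1.6.3</spark.version>
        </properties>
      </profile>
      <!-- Build against Spark 2.x with: mvn clean install -Pspark-2.x -->
      <profile>
        <id>spark-2.x</id>
        <properties>
          <spark.version>2.1.0</spark.version>
        </properties>
      </profile>
    </profiles>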

What do you think ?

Abbass,

On 16/03/2017 07:00, Cody Innowhere wrote:

I'm personally in favor of maintaining one single branch, e.g.,
spark-runner, which supports both Spark 1.6 & 2.1.
Since there's currently no DataFrame support in the Spark 1.x runner, there
should be no conflicts if we put two versions of Spark into one runner.

I'm also +1 for adding adapters in the branch to support both Spark
versions.

Also, we can have two translators, say, a 1.x translator which translates
into RDDs & DStreams and a 2.x translator which translates into Datasets.
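
As a rough sketch (interface and class names here are made up for
illustration, not taken from the actual Spark runner code), the
two-translator idea could look like this:

    // Hypothetical sketch of the "two translators" idea -- not actual Beam code.
    // A common interface, with one implementation per Spark API generation.
    public interface SparkPipelineTranslator {
      // Translate a Beam pipeline into constructs of the target Spark API.
      void translate(org.apache.beam.sdk.Pipeline pipeline);
    }

    // 1.x translator: targets RDDs for batch and DStreams for streaming.
    class Spark1Translator implements SparkPipelineTranslator {
      @Override
      public void translate(org.apache.beam.sdk.Pipeline pipeline) {
        // ... visit the pipeline's transforms and emit RDD / DStream operations ...
      }
    }

    // 2.x translator: targets the Dataset API.
    class Spark2Translator implements SparkPipelineTranslator {
      @Override
      public void translate(org.apache.beam.sdk.Pipeline pipeline) {
        // ... visit the pipeline's transforms and emit Dataset operations ...
      }
    }

The runner would then pick one translator or the other at translation time,
for example based on the Spark version found on the classpath or on the
active Maven profile.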

On Thu, Mar 16, 2017 at 9:33 AM, Jean-Baptiste Onofré 
wrote:


Hi guys,

sorry, due to the time zone shift, I'm answering a bit late ;)

I think we can have the same runner dealing with the two major Spark
versions, introducing some adapters. For instance, in CarbonData, we created
some adapters to work with Spark 1.5, Spark 1.6 and Spark 2.1. The
dependencies come from Maven profiles. Of course, it's easier there as it's
more "user" code.

My proposal is just that it's worth a try ;)

I just created a branch to experiment a bit and have more details.

Regards
JB


On 03/16/2017 02:31 AM, Amit Sela wrote:


I answered inline to Abbass' comment, but I think he hit something - how
about we have a branch with those adaptations ? same RDD implementation,
but depending on the latest 2.x version with the minimal changes required.
I'd be happy to do that, or guide anyone who wants to (I did most of it on
my branch for Spark 2 anyway) but since it's a branch and not on master (I
don't believe it "deserves" a place on master), it would always be a bit
behind since we would have to rebase and merge once in a while.

How does that sound ?

On Wed, Mar 15, 2017 at 7:49 PM amarouni  wrote:

+1 for Spark runners based on different APIs RDD/Dataset and keeping the
Spark versions as a deployment dependency.

The RDD API is stable & mature enough so it makes sense to have it on
master; the Dataset API still has some work to do and, from our own
experience, it has only just reached performance comparable to the RDD
API. The community is clearly heading in the Dataset API direction but
the RDD API is still a viable option for most use cases.

Just one quick question: today on master, can we swap Spark 1.x for Spark
2.x and compile and use the Spark Runner ?

Good question!

I think this is the root cause of this problem - Spark 2 not only
introduced a new API, but also broke a few existing ones: context is now
session, Accumulators are AccumulatorV2, and that is what I recall right
now. I don't think it's too hard to adapt those, and anyone who wants to
could see how I did it on my branch:
https://github.com/amitsela/beam/commit/8a1cf889d14d2b47e9e35bae742d78a290cbbdc9
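
For the sake of illustration, a tiny sketch of what such an adaptation could
look like on the Spark 2.x side (a hypothetical class, not code from the
branch above; a Maven profile would swap in a Spark 1.6 twin backed by
SparkContext and the old Accumulator API):

    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.util.LongAccumulator;

    // Hypothetical Spark 2.x adapter: the runner codes against this facade,
    // so the rest of the code stays version-agnostic.
    public class Spark2Adapter {
      private final SparkSession session;

      public Spark2Adapter(String master, String appName) {
        // Spark 2.x: SparkSession is the new unified entry point.
        this.session =
            SparkSession.builder().master(master).appName(appName).getOrCreate();
      }

      public JavaSparkContext javaContext() {
        // A JavaSparkContext can still be wrapped around the session's SparkContext.
        return new JavaSparkContext(session.sparkContext());
      }

      public LongAccumulator longAccumulator(String name) {
        // Spark 2.x replaced Accumulator with AccumulatorV2; LongAccumulator
        // is the built-in implementation for long counters.
        return session.sparkContext().longAccumulator(name);
      }
    }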




Thanks,

Abbass,


On 15/03/2017 17:57, Amit Sela wrote:


So you're suggesting we copy-paste the current runner and adapt whatever is
necessary so it runs with Spark 2 ?
This also means any bug-fix / improvement would have to be maintained in
two runners, and I wouldn't wanna do that.

I don't like to think in terms of Spark1/2 but in terms of RDD/Dataset API.

Since the RDD API is mature, it should be the runner in master (not
preventing another runner once Dataset API is mature enough) and the
version (1.6.3 or 2.x) should be determined by the common installation.

That's why I believe we still need to leave things as they are, but start
working on the Dataset API runner.
Otherwise, we'll have the current runner, another RDD API runner with Spark
2, and a third one for the Dataset API. I don't want to maintain all of
them. It's a mess.

On Wed, Mar 15, 2017 at 6:39 PM Ismaël Mejía  wrote:

However, I do feel that we should use the Dataset API, starting with
batch support first. WDYT ?

Well, this is the exact current status quo, and it will take us some
time to have something as complete as what we have with the Spark 1
runner for Spark 2.

The other proposal has two advantages:

One is that we can leverage the existing implementation (with the
needed adjustments) to run Beam pipelines on Spark 2; in the end, final
users don’t care so much if pipelines are translated via RDD/DStream
or Dataset, they just want to know that with Beam they can run their
code in their favorite data processing framework.


Re: Beam spark 2.x runner status

2017-03-16 Thread amarouni
Yeah, maintaining 2 RDD branches (master + a 2.x branch) is doable but will
add more maintenance/merge work.

The Maven profiles solution is worth investigating, with Spark 1.6 RDD
as the default profile and an additional Spark 2.x profile.

As JBO mentioned CarbonData, I had a quick look and it looks like a good
solution:
https://github.com/apache/incubator-carbondata/blob/master/pom.xml#L347

What do you think ?

Abbass,

On 16/03/2017 07:00, Cody Innowhere wrote:
> I'm personally in favor of maintaining one single branch, e.g.,
> spark-runner, which supports both Spark 1.6 & 2.1.
> Since there's currently no DataFrame support in the Spark 1.x runner, there
> should be no conflicts if we put two versions of Spark into one runner.
>
> I'm also +1 for adding adapters in the branch to support both Spark
> versions.
>
> Also, we can have two translators, say, a 1.x translator which translates
> into RDDs & DStreams and a 2.x translator which translates into Datasets.
>
> On Thu, Mar 16, 2017 at 9:33 AM, Jean-Baptiste Onofré 
> wrote:
>
>> Hi guys,
>>
>> sorry, due to the time zone shift, I'm answering a bit late ;)
>>
>> I think we can have the same runner dealing with the two major Spark
>> versions, introducing some adapters. For instance, in CarbonData, we created
>> some adapters to work with Spark 1.5, Spark 1.6 and Spark 2.1. The
>> dependencies come from Maven profiles. Of course, it's easier there as it's
>> more "user" code.
>>
>> My proposal is just that it's worth a try ;)
>>
>> I just created a branch to experiment a bit and have more details.
>>
>> Regards
>> JB
>>
>>
>> On 03/16/2017 02:31 AM, Amit Sela wrote:
>>
>>> I answered inline to Abbass' comment, but I think he hit something - how
>>> about we have a branch with those adaptations ? same RDD implementation,
>>> but depending on the latest 2.x version with the minimal changes required.
>>> I'd be happy to do that, or guide anyone who wants to (I did most of it on
>>> my branch for Spark 2 anyway) but since it's a branch and not on master (I
>>> don't believe it "deserves" a place on master), it would always be a bit
>>> behind since we would have to rebase and merge once in a while.
>>>
>>> How does that sound ?
>>>
>>> On Wed, Mar 15, 2017 at 7:49 PM amarouni  wrote:
>>>
>>>> +1 for Spark runners based on different APIs RDD/Dataset and keeping the
>>>> Spark versions as a deployment dependency.
>>>>
>>>> The RDD API is stable & mature enough so it makes sense to have it on
>>>> master; the Dataset API still has some work to do and, from our own
>>>> experience, it has only just reached performance comparable to the RDD
>>>> API. The community is clearly heading in the Dataset API direction but
>>>> the RDD API is still a viable option for most use cases.
>>>>
>>>> Just one quick question: today on master, can we swap Spark 1.x for Spark
>>>> 2.x and compile and use the Spark Runner ?
>>>
>>> Good question!
>>> I think this is the root cause of this problem - Spark 2 not only
>>> introduced a new API, but also broke a few existing ones: context is now
>>> session, Accumulators are AccumulatorV2, and that is what I recall right
>>> now. I don't think it's too hard to adapt those, and anyone who wants to
>>> could see how I did it on my branch:
>>> https://github.com/amitsela/beam/commit/8a1cf889d14d2b47e9e35bae742d78a290cbbdc9
>>>
>>>> Thanks,
>>>>
>>>> Abbass,


 On 15/03/2017 17:57, Amit Sela wrote:

> So you're suggesting we copy-paste the current runner and adapt whatever
> is necessary so it runs with Spark 2 ?
> This also means any bug-fix / improvement would have to be maintained in
> two runners, and I wouldn't wanna do that.
>
> I don't like to think in terms of Spark1/2 but in terms of RDD/Dataset
> API.
>
> Since the RDD API is mature, it should be the runner in master (not
> preventing another runner once Dataset API is mature enough) and the
> version (1.6.3 or 2.x) should be determined by the common installation.
>
> That's why I believe we still need to leave things as they are, but
> start working on the Dataset API runner.
> Otherwise, we'll have the current runner, another RDD API runner with
> Spark 2, and a third one for the Dataset API. I don't want to maintain
> all of them. It's a mess.
>
> On Wed, Mar 15, 2017 at 6:39 PM Ismaël Mejía  wrote:
>
>>> However, I do feel that we should use the Dataset API, starting with
>>> batch support first. WDYT ?
>>
>> Well, this is the exact current status quo, and it will take us some
>> time to have something as complete as what we have with the Spark 1
>> runner for Spark 2.
>>
>> The other proposal has two advantages:
>>
>> One is that we can leverage the existing implementation (with the
>> needed adjustments) to run Beam pipelines on Spark 

Re: Performance Testing Next Steps

2017-03-16 Thread Ismaël Mejía
> .. if the provider we are bringing up also
> provides the data store, we can just omit the data store for that benchmark
> and use what we've already brought up. Does that answer your question, or
> have I misunderstood?

Yes, and it is a perfect approach for the case, great idea.

> Great point -- I neglected to include the DirectRunner in the plans here.
> I'll add it to the doc and file a JIRA.

Excellent.

This work is super interesting, so don’t hesitate to ask anything of us,
the rest of the community; I think there are many of us interested and we
can give a hand if needed.


On Thu, Mar 16, 2017 at 9:17 AM, Jason Kuster
 wrote:
> Thanks Ismael for the comments! Replied inline.
>
> On Wed, Mar 15, 2017 at 8:18 AM, Ismaël Mejía  wrote:
>
>> Excellent proposal, sorry to jump into this discussion so late; this
>> was in my to-read list for almost two weeks, and I finally got the time
>> to read the document. I have two minor comments:
>>
>> I have the impression that the strict separation of Providers (the
>> data-processing systems) and Resources (the concrete Data Stores)
>> makes sense for the general case, but is lacking if what we want to
>> test are things in the Hadoop ecosystem, where the data stores commonly
>> co-exist on the same group of machines as the data-processing
>> systems (the Providers), e.g. HDFS, HBase + YARN. This is important,
>> for example, to test that data locality works correctly. Have
>> you considered such a case?
>>
>
> Definitely interesting to think about, and I don't think I added provisions
> for this in the doc. My impression, though, is that since the providers and
> the data stores are not coupled, if the provider we are bringing up also
> provides the data store, we can just omit the data store for that benchmark
> and use what we've already brought up. Does that answer your question, or
> have I misunderstood?
>
>>
>> Another thing I noticed is that in the list of runners supporting PKB
>> the Direct Runner is not included; is there any particular reason for
>> this? I think that even if performance is not the main goal of the
>> direct runner, it would be nice to have it there too to catch any
>> performance regressions -- or is it because it is already ready for it?
>> What do you think?
>>
>>
> Great point -- I neglected to include the DirectRunner in the plans here.
> I'll add it to the doc and file a JIRA.
>
>
>> Thanks,
>> Ismaël
>>
>> On Thu, Mar 2, 2017 at 11:49 PM, Amit Sela  wrote:
>> > Looks great, and I'll be sure to follow this. Ping me if I can assist in
>> > any way!
>> >
>> > On Fri, Mar 3, 2017 at 12:09 AM Ahmet Altay 
>> > wrote:
>> >
>> >> Sounds great, thank you!
>> >>
>> >> On Thu, Mar 2, 2017 at 1:41 PM, Jason Kuster > >> .invalid
>> >> > wrote:
>> >>
>> >> > D'oh, my bad Ahmet. I've opened BEAM-1610, which handles support for Python
>> >> > in PKB against the Dataflow runner. Once the Fn API progresses some more we
>> >> > can add some work items for the other runners too. Let's chat about this
>> >> > more, maybe next week?
>> >> >
>> >> > On Thu, Mar 2, 2017 at 1:31 PM, Ahmet Altay > >
>> >> > wrote:
>> >> >
>> >> > > Thank you Jason, this is great.
>> >> > >
>> >> > > Which one of these issues fall into the land of sdk-py?
>> >> > >
>> >> > > Ahmet
>> >> > >
>> >> > > On Thu, Mar 2, 2017 at 12:34 PM, Jason Kuster <
>> >> > > jasonkus...@google.com.invalid> wrote:
>> >> > >
>> >> > > > Glad to hear the excitement. :)
>> >> > > >
>> >> > > > Filed BEAM-1595 - 1609 to track work items. Some of these fall under runner
>> >> > > > components, please feel free to reach out to me if you have any questions
>> >> > > > about how to accomplish these.
>> >> > > >
>> >> > > > Best,
>> >> > > >
>> >> > > > Jason
>> >> > > >
>> >> > > > On Wed, Mar 1, 2017 at 5:50 AM, Aljoscha Krettek <
>> >> aljos...@apache.org>
>> >> > > > wrote:
>> >> > > >
>> >> > > > > Thanks for writing this and taking care of this, Jason!
>> >> > > > >
>> >> > > > > I'm afraid I also cannot add anything except that I'm excited to see some
>> >> > > > > results from this.
>> >> > > > >
>> >> > > > > On Wed, 1 Mar 2017 at 03:28 Kenneth Knowles
>> > >> >
>> >> > > > > wrote:
>> >> > > > >
>> >> > > > > Just got a chance to look this over. I don't have anything to add, but I'm
>> >> > > > > pretty excited to follow this project. Have the JIRAs been filed since you
>> >> > > > > shared the doc?
>> >> > > > >
>> >> > > > > On Wed, Feb 22, 2017 at 10:38 AM, Jason Kuster <
>> >> > > > > jasonkus...@google.com.invalid> wrote:
>> >> > > > >
>> >> > > > > > Hey all, just wanted to pop this up again for people -- if anyone has
>> >> > > > > > thoughts on performance testing please feel welcome 

Build failed in Jenkins: beam_Release_NightlySnapshot #358

2017-03-16 Thread Apache Jenkins Server
See 


Changes:

[jbonofre] [BEAM-1660] Update JdbcIO JavaDoc about withCoder() use to ensure the

[altay] [BEAM-547] Version should be accessed from pom file

[tgroh] Add Create.TimestampedValues.withType

[davor] Add ValueProvider options for DatastoreIO

[altay] Add ValueProvider class for FileBasedSource I/O Transforms

[davor] Increment shade-plugin version back to 3.0.0

[altay] Revert "[BEAM-547] Version should be accessed from pom file"

[altay] Revert "Add ValueProvider class for FileBasedSource I/O Transforms"

[tgroh] A few comment fixups in BigQueryIO

[tgroh] Disable Guava Shading in Google Cloud Platform IOs

--
[...truncated 794.99 KB...]
[JENKINS] Archiving disabled
[JENKINS] Archiving disabled
[JENKINS] Archiving disabled
[JENKINS] Archiving disabled
[JENKINS] Archiving disabled
[JENKINS] Archiving disabled
[JENKINS] Archiving disabled
[JENKINS] Archiving disabled
[JENKINS] Archiving disabled
[JENKINS] Archiving disabled
[JENKINS] Archiving disabled
[JENKINS] Archiving disabled
[JENKINS] Archiving disabled
[JENKINS] Archiving disabled
[JENKINS] Archiving disabled
[JENKINS] Archiving disabled
[JENKINS] Archiving disabled
[JENKINS] Archiving disabled
[JENKINS] Archiving disabled
[JENKINS] Archiving disabled
[JENKINS] Archiving disabled
[JENKINS] Archiving disabled
[JENKINS] Archiving disabled
[JENKINS] Archiving disabled
[JENKINS] Archiving disabled
[JENKINS] Archiving disabled
[JENKINS] Archiving disabled
[JENKINS] Archiving disabled
[JENKINS] Archiving disabled
[JENKINS] Archiving disabled
[JENKINS] Archiving disabled
[JENKINS] Archiving disabled
2017-03-16T07:11:28.248 [INFO] 

2017-03-16T07:11:28.248 [INFO] Reactor Summary:
2017-03-16T07:11:28.248 [INFO] 
2017-03-16T07:11:28.248 [INFO] Apache Beam :: Parent .............................. SUCCESS [ 23.588 s]
2017-03-16T07:11:28.248 [INFO] Apache Beam :: SDKs :: Java :: Build Tools ......... SUCCESS [ 14.524 s]
2017-03-16T07:11:28.248 [INFO] Apache Beam :: SDKs ................................ SUCCESS [  7.059 s]
2017-03-16T07:11:28.248 [INFO] Apache Beam :: SDKs :: Common ...................... SUCCESS [  3.333 s]
2017-03-16T07:11:28.248 [INFO] Apache Beam :: SDKs :: Common :: Fn API ............ SUCCESS [ 19.933 s]
2017-03-16T07:11:28.248 [INFO] Apache Beam :: SDKs :: Common :: Runner API ........ SUCCESS [ 17.019 s]
2017-03-16T07:11:28.248 [INFO] Apache Beam :: SDKs :: Java ........................ SUCCESS [  3.541 s]
2017-03-16T07:11:28.248 [INFO] Apache Beam :: SDKs :: Java :: Core ................ SUCCESS [02:41 min]
2017-03-16T07:11:28.248 [INFO] Apache Beam :: Runners ............................. SUCCESS [  3.613 s]
2017-03-16T07:11:28.248 [INFO] Apache Beam :: Runners :: Core Java Construction ... SUCCESS [ 17.816 s]
2017-03-16T07:11:28.248 [INFO] Apache Beam :: Runners :: Core Java ................ SUCCESS [ 53.869 s]
2017-03-16T07:11:28.248 [INFO] Apache Beam :: Runners :: Direct Java .............. SUCCESS [02:30 min]
2017-03-16T07:11:28.249 [INFO] Apache Beam :: SDKs :: Java :: IO .................. SUCCESS [  3.211 s]
2017-03-16T07:11:28.249 [INFO] Apache Beam :: SDKs :: Java :: IO :: Elasticsearch . SUCCESS [ 38.728 s]
2017-03-16T07:11:28.249 [INFO] Apache Beam :: Runners :: Google Cloud Dataflow .... SUCCESS [ 34.025 s]
2017-03-16T07:11:28.249 [INFO] Apache Beam :: SDKs :: Java :: IO :: Google Cloud Platform FAILURE [ 49.343 s]
2017-03-16T07:11:28.249 [INFO] Apache Beam :: SDKs :: Java :: IO :: Hadoop Common . SKIPPED
2017-03-16T07:11:28.249 [INFO] Apache Beam :: SDKs :: Java :: IO :: HBase ......... SKIPPED
2017-03-16T07:11:28.249 [INFO] Apache Beam :: SDKs :: Java :: IO :: HDFS .......... SKIPPED
2017-03-16T07:11:28.249 [INFO] Apache Beam :: SDKs :: Java :: IO :: JDBC .......... SKIPPED
2017-03-16T07:11:28.249 [INFO] Apache Beam :: SDKs :: Java :: IO :: JMS ........... SKIPPED
2017-03-16T07:11:28.249 [INFO] Apache Beam :: SDKs :: Java :: IO :: Kafka ......... SKIPPED
2017-03-16T07:11:28.249 [INFO] Apache Beam :: SDKs :: Java :: IO :: Kinesis ....... SKIPPED
2017-03-16T07:11:28.249 [INFO] Apache Beam :: SDKs :: Java :: IO :: MongoDB ....... SKIPPED
2017-03-16T07:11:28.249 [INFO] Apache Beam :: SDKs :: Java :: IO :: MQTT .......... SKIPPED
2017-03-16T07:11:28.249 [INFO] Apache Beam :: SDKs :: Java :: Maven Archetypes .... SKIPPED
2017-03-16T07:11:28.249 [INFO] Apache Beam :: SDKs :: Java :: Maven Archetypes :: Starter SKIPPED
2017-03-16T07:11:28.249 [INFO] Apache Beam :: SDKs :: Java :: Maven Archetypes :: Examples SKIPPED
2017-03-16T07:11:28.249 [INFO] Apache Beam :: SDKs :: Java :: Maven Archetypes :: Examples - Java 8 SKIPPED
2017-03-16T07:11:28.249 [INFO] Apache Beam :: SDKs :: Java :: Extensions .......... SKIPPED
2017-03-16T07:11:28.249 [INFO] Apache Beam :: SDKs :: Java :: Extensions 

Re: Beam spark 2.x runner status

2017-03-16 Thread Cody Innowhere
I'm personally in favor of maintaining one single branch, e.g.,
spark-runner, which supports both Spark 1.6 & 2.1.
Since there's currently no DataFrame support in the Spark 1.x runner, there
should be no conflicts if we put two versions of Spark into one runner.

I'm also +1 for adding adapters in the branch to support both Spark
versions.

Also, we can have two translators, say, a 1.x translator which translates
into RDDs & DStreams and a 2.x translator which translates into Datasets.

On Thu, Mar 16, 2017 at 9:33 AM, Jean-Baptiste Onofré 
wrote:

> Hi guys,
>
> sorry, due to the time zone shift, I'm answering a bit late ;)
>
> I think we can have the same runner dealing with the two major Spark
> versions, introducing some adapters. For instance, in CarbonData, we created
> some adapters to work with Spark 1.5, Spark 1.6 and Spark 2.1. The
> dependencies come from Maven profiles. Of course, it's easier there as it's
> more "user" code.
>
> My proposal is just that it's worth a try ;)
>
> I just created a branch to experiment a bit and have more details.
>
> Regards
> JB
>
>
> On 03/16/2017 02:31 AM, Amit Sela wrote:
>
>> I answered inline to Abbass' comment, but I think he hit something - how
>> about we have a branch with those adaptations ? same RDD implementation,
>> but depending on the latest 2.x version with the minimal changes required.
>> I'd be happy to do that, or guide anyone who wants to (I did most of it on
>> my branch for Spark 2 anyway) but since it's a branch and not on master (I
>> don't believe it "deserves" a place on master), it would always be a bit
>> behind since we would have to rebase and merge once in a while.
>>
>> How does that sound ?
>>
>> On Wed, Mar 15, 2017 at 7:49 PM amarouni  wrote:
>>
>> +1 for Spark runners based on different APIs RDD/Dataset and keeping the
>>> Spark versions as a deployment dependency.
>>>
>>> The RDD API is stable & mature enough so it makes sense to have it on
>>> master; the Dataset API still has some work to do and, from our own
>>> experience, it has only just reached performance comparable to the RDD
>>> API. The community is clearly heading in the Dataset API direction but
>>> the RDD API is still a viable option for most use cases.
>>>
>>> Just one quick question: today on master, can we swap Spark 1.x for Spark
>>> 2.x and compile and use the Spark Runner ?
>>>
>>> Good question!
>> I think this is the root cause of this problem - Spark 2 not only
>> introduced a new API, but also broke a few existing ones: context is now
>> session, Accumulators are AccumulatorV2, and that is what I recall right
>> now. I don't think it's too hard to adapt those, and anyone who wants to
>> could see how I did it on my branch:
>> https://github.com/amitsela/beam/commit/8a1cf889d14d2b47e9e35bae742d78a290cbbdc9
>> 5bae742d78a290cbbdc9
>>
>>
>>
>>> Thanks,
>>>
>>> Abbass,
>>>
>>>
>>> On 15/03/2017 17:57, Amit Sela wrote:
>>>
>>>> So you're suggesting we copy-paste the current runner and adapt whatever
>>>> is necessary so it runs with Spark 2 ?
>>>> This also means any bug-fix / improvement would have to be maintained in
>>>> two runners, and I wouldn't wanna do that.
>>>>
>>>> I don't like to think in terms of Spark1/2 but in terms of RDD/Dataset
>>>> API.
>>>>
>>>> Since the RDD API is mature, it should be the runner in master (not
>>>> preventing another runner once Dataset API is mature enough) and the
>>>> version (1.6.3 or 2.x) should be determined by the common installation.
>>>>
>>>> That's why I believe we still need to leave things as they are, but
>>>> start working on the Dataset API runner.
>>>> Otherwise, we'll have the current runner, another RDD API runner with
>>>> Spark 2, and a third one for the Dataset API. I don't want to maintain
>>>> all of them. It's a mess.
>>>>
>>>> On Wed, Mar 15, 2017 at 6:39 PM Ismaël Mejía  wrote:
>>>>
>>>>>> However, I do feel that we should use the Dataset API, starting with
>>>>>> batch support first. WDYT ?
>>>>>
>>>>> Well, this is the exact current status quo, and it will take us some
>>>>> time to have something as complete as what we have with the Spark 1
>>>>> runner for Spark 2.
>>>>>
>>>>> The other proposal has two advantages:
>>>>>
>>>>> One is that we can leverage the existing implementation (with the
>>>>> needed adjustments) to run Beam pipelines on Spark 2; in the end, final
>>>>> users don’t care so much if pipelines are translated via RDD/DStream
>>>>> or Dataset, they just want to know that with Beam they can run their
>>>>> code in their favorite data processing framework.
>>>>>
>>>>> The other advantage is that we can base the work on the latest Spark
>>>>> version and advance simultaneously in translators for both APIs, and
>>>>> once we consider that the Dataset one is mature enough we can stop
>>>>> maintaining the RDD one and make it the official one.
>>>>>
>>>>> The only missing piece is backporting new developments on the RDD