[RESULT][VOTE] Release 2.1.0, release candidate #3

2017-08-21 Thread Jean-Baptiste Onofré
Hi

This vote passed with only +1 votes.

I'm promoting the artifacts to Maven Central and updating Jira.

As I'm on vacation, can a committer deal with the tag and the website merge?

Sorry for this very short e-mail. Thanks all for your votes.

Regards
JB


On Aug 18, 2017, at 18:43, "Jean-Baptiste Onofré" wrote:

> Hi
>
> I'm on vacation, so I'm looking for a decent Internet connection to
> finalize the release.
>
> I'll keep you posted.
>
> Regards
> JB
>
> On Aug 18, 2017, at 17:48, Eugene Kirpichov wrote:
>
>> Hi JB,
>>
>> Any updates on finalizing the release?
>>
>> Thanks.
>>
>> On Thu, Aug 17, 2017 at 5:42 AM Aljoscha Krettek wrote:
>>
>>> (Belated) +1
>>>
>>>  * verified signatures
>>>  * verified that Quickstart works with Flink Runner
>>>
>>> On 16. Aug 2017, at 20:41, Robert Bradshaw wrote:
>>>
>>>> +1 binding
>>>>
>>>> (I've been on vacation as well.)
>>>>
>>>> On Wed, Aug 16, 2017 at 8:50 AM, Lukasz Cwik wrote:
>>>>
>>>>> Back from vacation.
>>>>>
>>>>> +1 binding
>>>>>
>>>>> BEAM-2671 has been marked for the 2.2.0 release.
>>>>>
>>>>> On Wed, Aug 16, 2017 at 2:08 AM, Kobi Salant wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> The Spark runner was tested with the word count example and a more
>>>>>> complex session-based application on a YARN cluster.
>>>>>> Both applications ran successfully, so we can say that the Spark
>>>>>> runner passed the needed sanity tests.
>>>>>>
>>>>>> Still, there is an open ticket,
>>>>>> https://issues.apache.org/jira/browse/BEAM-2671, which Stas is
>>>>>> working on, and its implications should be taken into consideration
>>>>>> regarding the release.
>>>>>>
>>>>>> Regards
>>>>>> Kobi
>>>>>>
>>>>>> 2017-08-16 5:02 GMT+03:00 Eugene Kirpichov:
>>>>>>
>>>>>>> Hey all,
>>>>>>>
>>>>>>> Seems like we're missing one more affirmative vote from a PMC
>>>>>>> member (so far we have JB and Ahmet) to proceed with the release.
>>>>>>>
>>>>>>> On Mon, Aug 14, 2017 at 9:30 AM Ahmet Altay wrote:
>>>>>>>
>>>>>>>> On Mon, Aug 14, 2017 at 6:32 AM, Ismaël Mejía wrote:
>>>>>>>>
>>>>>>>>> +1 (non-binding)
>>>>>>>>>
>>>>>>>>> - Validated signatures OK
>>>>>>>>> - mvn clean verify -Prelease on both OpenJDK 1.7 and Oracle JDK 8
>>>>>>>>>   with the docker development images (WIP), both OK
>>>>>>>>> - Ran WordCount on local Flink and Spark runners OK
>>>>>>>>>
>>>>>>>>> Everything looks nice; only one minor thing (not blocking at
>>>>>>>>> all): the proto-generated files for Python are not cleaned
>>>>>>>>> correctly, and this causes the validation to complain because
>>>>>>>>> the Maven rat plugin does not find the Apache headers on the
>>>>>>>>> files (this happens if you execute mvn clean verify -Prelease
>>>>>>>>> immediately after the validation).
>>>>>>>>
>>>>>>>> Ismaël, could you create a JIRA issue for this (to be fixed in a
>>>>>>>> future release)?
>>>>>>>>
>>>>>>>>> On Sun, Aug 13, 2017 at 6:52 AM, Jean-Baptiste Onofré <
>>>>>>>>> j...@nanthrax.net> wrote:
>>>>>>>>>
>>>>>>>>>> +1 (binding)
>>>>>>>>>>
>>>>>>>>>> I did my own tests and am casting my own vote ;)
>>>>>>>>>>
>>>>>>>>>> Regards
>>>>>>>>>> JB
>>>>>>>>>>
>>>>>>>>>> On 08/09/2017 07:08 AM, Jean-Baptiste Onofré wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>
>>>>>>>>>>> Please review and vote on the release candidate #3 for the
>>>>>>>>>>> version 2.1.0, as follows:
>>>>>>>>>>>
>>>>>>>>>>> [ ] +1, Approve the release
>>>>>>>>>>> [ ] -1, Do not approve the release (please provide specific
>>>>>>>>>>> comments)
>>>>>>>>>>>
>>>>>>>>>>> The complete staging area is available for your review, which
>>>>>>>>>>> includes:
>>>>>>>>>>> * JIRA release notes [1],
>>>>>>>>>>> * the official Apache source release to be deployed to
>>>>>>>>>>>   dist.apache.org [2], which is signed with the key with
>>>>>>>>>>>   fingerprint C8282E76 [3],
>>>>>>>>>>> * all artifacts to be deployed to the Maven Central
>>>>>>>>>>>   Repository [4],
>>>>>>>>>>> * source code tag "v2.1.0-RC3" [5],
>>>>>>>>>>> * website pull request listing the release and publishing the
>>>>>>>>>>>   API reference manual [6].
>>>>>>>>>>> * Python artifacts are deployed along with the source release
>>>>>>>>>>>   to dist.apache.org [2].
>>>>>>>>>>>
>>>>>>>>>>> The vote will be open for at least 72 hours. It is adopted by
>>>>>>>>>>> majority approval, with at least 3 PMC affirmative votes.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> JB
>>>>>>>>>>>
>>>>>>>>>>> [1] https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12340528
>>>>>>>>>>> [2] https://dist.apache.org/repos/dist/dev/beam/2.1.0/
>>>>>>>>>>> [3] https://dist.apache.org/repos/dist/release/beam/KEYS
>>>>>>>>>>> [4] https://repository.apache.org/content/repositories/orgapachebeam-1020/

How to Retain File Name while using TextIO for pattern

2017-08-21 Thread Siddharth Mittal
Hi Team,

We want to retain the file name while reading a zip file using the TextIO API.

When we read a zip file using the TextIO API, we get a PCollection of all the
lines of all the files, but the file name is not present.

Say we have a zip file which contains four files, e.g. file1.csv, file2.csv,
file3.csv and file4.csv.

In the output we want a PCollection that pairs each line with the name of the
file it came from.

Please suggest.
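
One possible approach, sketched below under some assumptions: it requires a
Beam release that ships FileIO.readMatches() (added after the 2.1.0 release
discussed above), and it assumes the zip archive has already been expanded
into individual files, since TextIO itself does not expose file names; the
filepattern is hypothetical.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.channels.Channels;
    import java.nio.charset.StandardCharsets;

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.FileIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;

    public class LinesWithFileNames {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // Match the files ourselves instead of using TextIO, so the file
        // name stays available alongside every line we emit.
        PCollection<KV<String, String>> linesWithNames =
            p.apply(FileIO.match().filepattern("/path/to/unzipped/*.csv"))
             .apply(FileIO.readMatches())
             .apply(ParDo.of(new DoFn<FileIO.ReadableFile, KV<String, String>>() {
               @ProcessElement
               public void process(ProcessContext c) throws IOException {
                 FileIO.ReadableFile file = c.element();
                 String name = file.getMetadata().resourceId().getFilename();
                 try (BufferedReader reader = new BufferedReader(
                     Channels.newReader(file.open(), StandardCharsets.UTF_8.name()))) {
                   String line;
                   while ((line = reader.readLine()) != null) {
                     // e.g. KV.of("file1.csv", "a,b,c")
                     c.output(KV.of(name, line));
                   }
                 }
               }
             }));

        p.run().waitUntilFinish();
      }
    }

With Beam 2.1.0 itself, the usual answer was a custom FileBasedSource that
emits the file name along with each record.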

Thanks & Regards

Siddharth Mittal
Senior Associate | Sapient
Gurgaon SEZ | India
Mobile  : +91-987-391-9917



How to read files in distributed way from a pcollection

2017-08-21 Thread Siddharth Mittal
Hi Team,

I have a use case where I will get a PCollection of file names.

The files are present on NFS, and file sizes may vary from a few KB to a few GB.

We want to transform this PCollection of file names into a PCollection of the
files' contents, read in a distributed way.

Please suggest how to handle this type of use case.
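
Again assuming a Beam release with FileIO (newer than 2.1.0), one hedged
sketch is to treat each incoming file name as a filepattern, match it, and
read the matches; Beam then distributes the matched files across workers:

    import java.io.IOException;

    import org.apache.beam.sdk.io.FileIO;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;

    public class ReadFilesFromNames {
      // Turns a PCollection of file names (e.g. hypothetical NFS paths like
      // "/mnt/nfs/data/part-0001.csv") into (file name, contents) pairs.
      public static PCollection<KV<String, String>> readAll(PCollection<String> fileNames) {
        return fileNames
            .apply(FileIO.matchAll())     // each element is treated as a filepattern
            .apply(FileIO.readMatches())  // matched files are spread across workers
            .apply(ParDo.of(new DoFn<FileIO.ReadableFile, KV<String, String>>() {
              @ProcessElement
              public void process(ProcessContext c) throws IOException {
                FileIO.ReadableFile file = c.element();
                // readFullyAsUTF8String() loads the whole file into memory;
                // for the multi-GB files mentioned above, stream from
                // file.open() instead and emit records incrementally.
                c.output(KV.of(file.getMetadata().resourceId().toString(),
                               file.readFullyAsUTF8String()));
              }
            }));
      }
    }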

Thanks & Regards

Siddharth Mittal
Senior Associate | Sapient



Re: Beam spark 2.x runner status

2017-08-21 Thread Holden Karau
I'd love to take a look at the PR when it comes in (<3 BEAM + SPARK :)).

On Mon, Aug 21, 2017 at 11:33 AM, Jean-Baptiste Onofré wrote:

> Hi
>
> I created a new runner supporting Spark 2.1.x. I changed the code for that.
>
> I'm still on vacation this week. I will send an update when back.
>
> Regards
> JB
>
> On Aug 21, 2017, at 09:01, Pei HE wrote:
>
>> Any updates on upgrading to Spark 2.x?
>>
>> I tried to replace the dependency and found a compile error from
>> implementing a Scala trait:
>> org.apache.beam.runners.spark.io.SourceRDD.SourcePartition is not
>> abstract and does not override abstract method
>> org$apache$spark$Partition$$super$equals(java.lang.Object) in
>> org.apache.spark.Partition
>>
>> (The Spark-side change was introduced in
>> https://github.com/apache/spark/pull/12157.)
>>
>> Does anyone have ideas about this compile error?
>>
>> On Wed, May 3, 2017 at 1:32 PM, Jean-Baptiste Onofré wrote:
>>
>>> Hi Ted,
>>>
>>> My branch used Spark 2.1.0 and I just updated to 2.1.1.
>>>
>>> As discussed with Aviem, I should be able to create the pull request
>>> later today.
>>>
>>> Regards
>>> JB
>>>
>>> On 05/03/2017 02:50 AM, Ted Yu wrote:
>>>
>>>> Spark 2.1.1 has been released.
>>>>
>>>> Consider using the new release in this work.
>>>>
>>>> Thanks
>>>>
>>>> On Wed, Mar 29, 2017 at 5:43 AM, Jean-Baptiste Onofré wrote:
>>>>
>>>>> Cool for the PR merge, I will rebase my branch on it.
>>>>>
>>>>> Thanks !
>>>>> Regards
>>>>> JB
>>>>>
>>>>> On 03/29/2017 01:58 PM, Amit Sela wrote:
>>>>>
>>>>>> @Ted definitely makes sense.
>>>>>> @JB I'm merging https://github.com/apache/beam/pull/2354 soon, so
>>>>>> any deprecated Spark API issues should be resolved.
>>>>>>
>>>>>> On Wed, Mar 29, 2017 at 2:46 PM Ted Yu wrote:
>>>>>>
>>>>>>> This is what I did over HBASE-16179:
>>>>>>>
>>>>>>> -    f.call((asJavaIterator(it), conn)).iterator()
>>>>>>> +    // the return type is different in spark 1.x & 2.x, we handle both cases
>>>>>>> +    f.call(asJavaIterator(it), conn) match {
>>>>>>> +      // spark 1.x
>>>>>>> +      case iterable: Iterable[R] => iterable.iterator()
>>>>>>> +      // spark 2.x
>>>>>>> +      case iterator: Iterator[R] => iterator
>>>>>>> +    }
>>>>>>> )
>>>>>>>
>>>>>>> FYI
>>>>>>>
>>>>>>> On Wed, Mar 29, 2017 at 1:47 AM, Amit Sela wrote:
>>>>>>>
>>>>>>>> Just tried to replace dependencies and see what happens:
>>>>>>>>
>>>>>>>> Most required changes are about the runner using deprecated Spark
>>>>>>>> APIs, and after fixing them the only real issue is with the Java
>>>>>>>> API for Pair/FlatMapFunction, whose return value changed to
>>>>>>>> Iterator (in 1.6 it's Iterable).
>>>>>>>>
>>>>>>>> So I'm not sure that a profile that simply sets the dependency on
>>>>>>>> 1.6.3/2.1.0 is feasible.
>>>>>>>>
>>>>>>>> On Thu, Mar 23, 2017 at 10:22 AM Kobi Salant wrote:
>>>>>>>>
>>>>>>>>> So, if everything is in place in Spark 2.X and we use provided
>>>>>>>>> dependencies for Spark in Beam, theoretically you can run the
>>>>>>>>> same code in 2.X without any need for a branch?
>>>>>>>>>
>>>>>>>>> 2017-03-23 9:47 GMT+02:00 Amit Sela:
>>>>>>>>>
>>>>>>>>>> If StreamingContext is valid and we don't have to use
>>>>>>>>>> SparkSession, and Accumulators are valid as well and we don't
>>>>>>>>>> need AccumulatorsV2, I don't see a reason this shouldn't work
>>>>>>>>>> (which means there are still tons of reasons this could break,
>>>>>>>>>> but I can't think of them off the top of my head right now).
>>>>>>>>>>
>>>>>>>>>> @JB simply add a profile for the Spark dependencies and run the
>>>>>>>>>> tests - you'll have a very definitive answer ;-) .
>>>>>>>>>>
>>>>>>>>>> If this passes, try on a cluster running Spark 2 as well.
>>>>>>>>>>
>>>>>>>>>> Let me know if I can assist.
>>>>>>>>>>
>>>>>>>>>> On Thu, Mar 23, 2017 at 6:55 AM Jean-Baptiste Onofré <
>>>>>>>>>> j...@nanthrax.net> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi guys,
>>>>>>>>>>>
>>>>>>>>>>> Ismaël summarized well what I have in mind.
>>>>>>>>>>>
>>>>>>>>>>> I'm a bit late on the PoC around that (I started a branch
>>>>>>>>>>> already). I will move forward over the week end.
>>>>>>>>>>>
>>>>>>>>>>> Regards
>>>>>>>>>>> JB
>>>>>>>>>>>
>>>>>>>>>>> On 03/22/2017 11:42 PM, Ismaël Mejía wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Amit, I suppose JB is talking about the RDD based version, so
>>>>>>>>>>>> no need

Re: Beam spark 2.x runner status

2017-08-21 Thread Jean-Baptiste Onofré
Hi

I created a new runner supporting Spark 2.1.x. I changed the code for that.

I'm still on vacation this week. I will send an update when back.

Regards
JB

On Aug 21, 2017, at 09:01, Pei HE wrote:

> Any updates on upgrading to Spark 2.x?
>
> I tried to replace the dependency and found a compile error from
> implementing a Scala trait:
> org.apache.beam.runners.spark.io.SourceRDD.SourcePartition is not
> abstract and does not override abstract method
> org$apache$spark$Partition$$super$equals(java.lang.Object) in
> org.apache.spark.Partition
>
> (The Spark-side change was introduced in
> https://github.com/apache/spark/pull/12157.)
>
> Does anyone have ideas about this compile error?
>
> On Wed, May 3, 2017 at 1:32 PM, Jean-Baptiste Onofré wrote:
>
>> Hi Ted,
>>
>> My branch used Spark 2.1.0 and I just updated to 2.1.1.
>>
>> As discussed with Aviem, I should be able to create the pull request
>> later today.
>>
>> Regards
>> JB
>>
>> On 05/03/2017 02:50 AM, Ted Yu wrote:
>>
>>> Spark 2.1.1 has been released.
>>>
>>> Consider using the new release in this work.
>>>
>>> Thanks
>>>
>>> On Wed, Mar 29, 2017 at 5:43 AM, Jean-Baptiste Onofré wrote:
>>>
>>>> Cool for the PR merge, I will rebase my branch on it.
>>>>
>>>> Thanks !
>>>> Regards
>>>> JB
>>>>
>>>> On 03/29/2017 01:58 PM, Amit Sela wrote:
>>>>
>>>>> @Ted definitely makes sense.
>>>>> @JB I'm merging https://github.com/apache/beam/pull/2354 soon, so
>>>>> any deprecated Spark API issues should be resolved.
>>>>>
>>>>> On Wed, Mar 29, 2017 at 2:46 PM Ted Yu wrote:
>>>>>
>>>>>> This is what I did over HBASE-16179:
>>>>>>
>>>>>> -    f.call((asJavaIterator(it), conn)).iterator()
>>>>>> +    // the return type is different in spark 1.x & 2.x, we handle both cases
>>>>>> +    f.call(asJavaIterator(it), conn) match {
>>>>>> +      // spark 1.x
>>>>>> +      case iterable: Iterable[R] => iterable.iterator()
>>>>>> +      // spark 2.x
>>>>>> +      case iterator: Iterator[R] => iterator
>>>>>> +    }
>>>>>> )
>>>>>>
>>>>>> FYI
>>>>>>
>>>>>> On Wed, Mar 29, 2017 at 1:47 AM, Amit Sela wrote:
>>>>>>
>>>>>>> Just tried to replace dependencies and see what happens:
>>>>>>>
>>>>>>> Most required changes are about the runner using deprecated Spark
>>>>>>> APIs, and after fixing them the only real issue is with the Java
>>>>>>> API for Pair/FlatMapFunction, whose return value changed to
>>>>>>> Iterator (in 1.6 it's Iterable).
>>>>>>>
>>>>>>> So I'm not sure that a profile that simply sets the dependency on
>>>>>>> 1.6.3/2.1.0 is feasible.
>>>>>>>
>>>>>>> On Thu, Mar 23, 2017 at 10:22 AM Kobi Salant wrote:
>>>>>>>
>>>>>>>> So, if everything is in place in Spark 2.X and we use provided
>>>>>>>> dependencies for Spark in Beam, theoretically you can run the
>>>>>>>> same code in 2.X without any need for a branch?
>>>>>>>>
>>>>>>>> 2017-03-23 9:47 GMT+02:00 Amit Sela:
>>>>>>>>
>>>>>>>>> If StreamingContext is valid and we don't have to use
>>>>>>>>> SparkSession, and Accumulators are valid as well and we don't
>>>>>>>>> need AccumulatorsV2, I don't see a reason this shouldn't work
>>>>>>>>> (which means there are still tons of reasons this could break,
>>>>>>>>> but I can't think of them off the top of my head right now).
>>>>>>>>>
>>>>>>>>> @JB simply add a profile for the Spark dependencies and run the
>>>>>>>>> tests - you'll have a very definitive answer ;-) .
>>>>>>>>>
>>>>>>>>> If this passes, try on a cluster running Spark 2 as well.
>>>>>>>>>
>>>>>>>>> Let me know if I can assist.
>>>>>>>>>
>>>>>>>>> On Thu, Mar 23, 2017 at 6:55 AM Jean-Baptiste Onofré <
>>>>>>>>> j...@nanthrax.net> wrote:
>>>>>>>>>
>>>>>>>>>> Hi guys,
>>>>>>>>>>
>>>>>>>>>> Ismaël summarized well what I have in mind.
>>>>>>>>>>
>>>>>>>>>> I'm a bit late on the PoC around that (I started a branch
>>>>>>>>>> already). I will move forward over the week end.
>>>>>>>>>>
>>>>>>>>>> Regards
>>>>>>>>>> JB
>>>>>>>>>>
>>>>>>>>>> On 03/22/2017 11:42 PM, Ismaël Mejía wrote:
>>>>>>>>>>
>>>>>>>>>>> Amit, I suppose JB is talking about the RDD based version, so
>>>>>>>>>>> no need to worry about SparkSession or different incompatible
>>>>>>>>>>> APIs.
>>>>>>>>>>>
>>>>>>>>>>> Remember the idea we are discussing is to have in master both
>>>>>>>>>>> the spark 1 and spark 2 runners using the RDD based
>>>>>>>>>>> translation. At the same time we can have a feature branch to
>>>>>>>>>>> evolve the DataSet based translator (this one will replace
>>>>>>>>>>> the RDD based translator for sp

Re: [Proposal] Progress Reporting in Fn API

2017-08-21 Thread Vikas RK
Hi,

I have updated the proposal based on the comments received. The major change
is that the SDK no longer reports cumulative backlog, but includes more
details for each transform itself. This provides a Runner with more
information about each transform in the fused sub-graph, while still being
able to compute bundle progress and backlog. Thanks to Robert, Eugene and
Luke for some key insights.

Prototyping is in progress; I will share it with you all when I have
something concrete.

Regards,
Vikas

On 17 July 2017 at 13:24, Kenneth Knowles  wrote:

> This seems really well thought-out. A useful read for anyone interested in
> Splittable DoFn, too.
>
> On Thu, Jul 13, 2017 at 11:34 AM, Vikas RK  wrote:
>
> > Hi,
> >
> > I wanted to share a drafted proposal for Progress Reporting in Fn API.
> >
> >  https://s.apache.org/beam-fn-api-progress-reporting .
> >
> > Would like to get comments and feedback from the community.
> >
> > Regards,
> > Vikas
> >
>


Re: [DISCUSS] Capability Matrix revamp

2017-08-21 Thread Tyler Akidau
Is there any way we could add quantitative runner metrics to this as well,
for example by having some benchmarks that process X amount of data, and then
detailing in the matrix the latency, throughput, and (where possible) cost
numbers for each of the given runners? Semantic support is one thing,
but there are other differences between runners that aren't captured by
just checking feature boxes. I'd be curious if anyone has other ideas in
this vein as well. The benchmark idea might not be the best way to go about
it.

-Tyler

On Sun, Aug 20, 2017 at 9:43 AM Jesse Anderson wrote:

> It'd be awesome to see these updated. I'd add two more:
>
>    1. A plain-English summary of the runner's support in Beam. People who
>    are new to Beam won't understand the in-depth coverage and need a
>    general idea of how well it is supported.
>    2. The production readiness of the runner. Does the maintainer think
>    this runner is production ready?
>
>
>
> On Sun, Aug 20, 2017 at 8:03 AM Kenneth Knowles wrote:
>
> > Hi all,
> >
> > I want to revamp
> > https://beam.apache.org/documentation/runners/capability-matrix/
> >
> > When Beam first started, we didn't work on feature branches for the core
> > runners, and they had a lot more gaps compared to what goes on `master`
> > today, so this tracked our progress in a way that was easy for users to
> > read. Now it is still our best/only comparison page for users, but I
> > think we could improve its usefulness.
> >
> > For the benefit of the thread, let me inline all the capabilities fully
> > here:
> >
> > 
> >
> > "What is being computed?"
> >  - ParDo
> >  - GroupByKey
> >  - Flatten
> >  - Combine
> >  - Composite Transforms
> >  - Side Inputs
> >  - Source API
> >  - Splittable DoFn
> >  - Metrics
> >  - Stateful Processing
> >
> > "Where in event time?"
> >  - Global windows
> >  - Fixed windows
> >  - Sliding windows
> >  - Session windows
> >  - Custom windows
> >  - Custom merging windows
> >  - Timestamp control
> >
> > "When in processing time?"
> >  - Configurable triggering
> >  - Event-time triggers
> >  - Processing-time triggers
> >  - Count triggers
> >  - [Meta]data driven triggers
> >  - Composite triggers
> >  - Allowed lateness
> >  - Timers
> >
> > "How do refinements relate?"
> >  - Discarding
> >  - Accumulating
> >  - Accumulating & Retracting
> >
> > 
> >
> > Here are some issues I'd like to improve:
> >
> >  - Rows that are impossible to not support (ParDo)
> >  - Rows where "support" doesn't really make sense (Composite transforms)
> >  - Rows are actually the same model feature (non-merging windowfns)
> >  - Rows that represent optimizations (Combine)
> >  - Rows in the wrong place (Timers)
> >  - Rows have not been designed ([Meta]Data driven triggers)
> >  - Rows with names that appear no where else (Timestamp control)
> >  - No place to compare non-model differences between runners
> >
> > I'm still pondering how to improve this, but I thought I'd send the
> > notion out for discussion. Some imperfect ideas I've had:
> >
> > 1. Lump all the basic stuff (ParDo, GroupByKey, Read, Window) into one
> > row
> > 2. Make sections as users see them, like "ParDo" / "Side inputs", not
> > "What?" / "Side inputs"
> > 3. Add rows for non-model things, like portability framework support,
> > metrics backends, etc.
> > 4. Drop rows that are not informative, like Composite transforms, or
> > that have not been designed
> > 5. Reorganize the windowing section to be just support for merging /
> > non-merging windowing.
> > 6. Switch to a more distinct color scheme than the solid vs. faded
> > colors currently used.
> > 7. Find a web design to get short descriptions into the foreground to
> > make it easier to grok.
> >
> > These are just a few thoughts, and not necessarily compatible with each
> > other. What do you think?
> >
> > Kenn
> >
> --
> Thanks,
>
> Jesse
>


Re: Beam spark 2.x runner status

2017-08-21 Thread Pei HE
Any updates on upgrading to Spark 2.x?

I tried to replace the dependency and found a compile error from
implementing a Scala trait:
org.apache.beam.runners.spark.io.SourceRDD.SourcePartition is not abstract
and does not override abstract method
org$apache$spark$Partition$$super$equals(java.lang.Object) in
org.apache.spark.Partition

(The Spark-side change was introduced in
https://github.com/apache/spark/pull/12157.)

Does anyone have ideas about this compile error?
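
For context, a hedged reading of the error: with Scala 2.11's trait
encoding, Spark 2.x's Partition trait (which now defines equals via a super
call) leaves a synthetic abstract method on the compiled interface that
javac will not generate for a Java implementer. One conceivable workaround,
an assumption rather than what the runner necessarily ended up doing, is to
implement the synthetic method by hand; the class shape below is a
simplified stand-in for Beam's SourceRDD.SourcePartition:

    import org.apache.spark.Partition;

    // Against Spark 1.x this compiles as-is; against Spark 2.x javac
    // reports the missing synthetic method quoted in the message above.
    class SourcePartition implements Partition {
      private final int index;

      SourcePartition(int index) {
        this.index = index;
      }

      @Override
      public int index() {
        return index;
      }

      // Hypothetical workaround: '$' is a legal character in Java
      // identifiers, so the scalac-generated synthetic method can be
      // supplied manually. It defers to Object.equals, matching the
      // trait's own definition (super.equals).
      public boolean org$apache$spark$Partition$$super$equals(Object other) {
        return super.equals(other);
      }
    }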


On Wed, May 3, 2017 at 1:32 PM, Jean-Baptiste Onofré wrote:

> Hi Ted,
>
> My branch used Spark 2.1.0 and I just updated to 2.1.1.
>
> As discussed with Aviem, I should be able to create the pull request
> later today.
>
> Regards
> JB
>
> On 05/03/2017 02:50 AM, Ted Yu wrote:
>
>> Spark 2.1.1 has been released.
>>
>> Consider using the new release in this work.
>>
>> Thanks
>>
>> On Wed, Mar 29, 2017 at 5:43 AM, Jean-Baptiste Onofré wrote:
>>
>>> Cool for the PR merge, I will rebase my branch on it.
>>>
>>> Thanks !
>>> Regards
>>> JB
>>>
>>> On 03/29/2017 01:58 PM, Amit Sela wrote:
>>>
>>>> @Ted definitely makes sense.
>>>> @JB I'm merging https://github.com/apache/beam/pull/2354 soon, so
>>>> any deprecated Spark API issues should be resolved.
>>>>
>>>> On Wed, Mar 29, 2017 at 2:46 PM Ted Yu wrote:
>>>>
>>>>> This is what I did over HBASE-16179:
>>>>>
>>>>> -    f.call((asJavaIterator(it), conn)).iterator()
>>>>> +    // the return type is different in spark 1.x & 2.x, we handle both cases
>>>>> +    f.call(asJavaIterator(it), conn) match {
>>>>> +      // spark 1.x
>>>>> +      case iterable: Iterable[R] => iterable.iterator()
>>>>> +      // spark 2.x
>>>>> +      case iterator: Iterator[R] => iterator
>>>>> +    }
>>>>> )
>>>>>
>>>>> FYI
>>>>>
>>>>> On Wed, Mar 29, 2017 at 1:47 AM, Amit Sela wrote:
>>>>>
>>>>>> Just tried to replace dependencies and see what happens:
>>>>>>
>>>>>> Most required changes are about the runner using deprecated Spark
>>>>>> APIs, and after fixing them the only real issue is with the Java
>>>>>> API for Pair/FlatMapFunction, whose return value changed to
>>>>>> Iterator (in 1.6 it's Iterable).
>>>>>>
>>>>>> So I'm not sure that a profile that simply sets the dependency on
>>>>>> 1.6.3/2.1.0 is feasible.
>>>>>>
>>>>>> On Thu, Mar 23, 2017 at 10:22 AM Kobi Salant wrote:
>>>>>>
>>>>>>> So, if everything is in place in Spark 2.X and we use provided
>>>>>>> dependencies for Spark in Beam, theoretically you can run the
>>>>>>> same code in 2.X without any need for a branch?
>>>>>>>
>>>>>>> 2017-03-23 9:47 GMT+02:00 Amit Sela:
>>>>>>>
>>>>>>>> If StreamingContext is valid and we don't have to use
>>>>>>>> SparkSession, and Accumulators are valid as well and we don't
>>>>>>>> need AccumulatorsV2, I don't see a reason this shouldn't work
>>>>>>>> (which means there are still tons of reasons this could break,
>>>>>>>> but I can't think of them off the top of my head right now).
>>>>>>>>
>>>>>>>> @JB simply add a profile for the Spark dependencies and run the
>>>>>>>> tests - you'll have a very definitive answer ;-) .
>>>>>>>>
>>>>>>>> If this passes, try on a cluster running Spark 2 as well.
>>>>>>>>
>>>>>>>> Let me know if I can assist.
>>>>>>>>
>>>>>>>> On Thu, Mar 23, 2017 at 6:55 AM Jean-Baptiste Onofré <
>>>>>>>> j...@nanthrax.net> wrote:
>>>>>>>>
>>>>>>>>> Hi guys,
>>>>>>>>>
>>>>>>>>> Ismaël summarized well what I have in mind.
>>>>>>>>>
>>>>>>>>> I'm a bit late on the PoC around that (I started a branch
>>>>>>>>> already). I will move forward over the week end.
>>>>>>>>>
>>>>>>>>> Regards
>>>>>>>>> JB
>>>>>>>>>
>>>>>>>>> On 03/22/2017 11:42 PM, Ismaël Mejía wrote:
>>>>>>>>>
>>>>>>>>>> Amit, I suppose JB is talking about the RDD based version, so
>>>>>>>>>> no need to worry about SparkSession or different incompatible
>>>>>>>>>> APIs.
>>>>>>>>>>
>>>>>>>>>> Remember the idea we are discussing is to have in master both
>>>>>>>>>> the spark 1 and spark 2 runners using the RDD based
>>>>>>>>>> translation. At the same time we can have a feature branch to
>>>>>>>>>> evolve the DataSet based translator (this one will replace the
>>>>>>>>>> RDD based translator for spark 2 once it is mature).
>>>>>>>>>>
>>>>>>>>>> The advantages have been already discussed as well as the
>>>>>>>>>> possible issues, so I think we have to see now if JB's idea is
>>>>>>>>>> feasible and how hard it would be to live with this while the
>>>>>>>>>> DataSet version