Re: [DISCUSS] [BEAM-4126] Deleting Maven build files (pom.xml) grace period?

2018-07-09 Thread Jean-Baptiste Onofré
The PR has been merged; it looks good to me.

Thanks !

Regards
JB

On 10/07/2018 01:18, Lukasz Cwik wrote:
> Actually, it hasn't been reviewed yet. Here is the
> PR: https://github.com/apache/beam/pull/5571
> 
> On Mon, Jul 9, 2018 at 4:16 PM Lukasz Cwik wrote:
> 
> That's great; I have a PR here that has been
> reviewed: https://github.com/apache/beam/pull/5571
> 
> Will merge.
> 
> On Mon, Jul 9, 2018 at 4:15 PM Mark Liu wrote:
> 
> The work of moving the Dataflow build off the Maven poms is done
> internally at Google, so this should no longer be a blocker for pom
> deletion in Beam.
> 
> Considering it's been two weeks since the 2.5 release, I propose
> resuming the Maven pom cleanup if there are no known blockers.
> 
> Mark
> 
> 
> On Tue, Jun 19, 2018 at 9:52 AM Kenneth Knowles wrote:
> 
> -user@ for misc. progress reporting
> 
> I took a look at it yesterday. Learned a lot, made some
> progress that turns out to just be general goodness, and
> left a comment on the JIRA.
> 
> On Mon, Jun 18, 2018 at 6:37 PM Lukasz Cwik <lc...@google.com> wrote:
> 
> Any updates on BEAM-4512?
> 
> On Mon, Jun 11, 2018 at 1:42 PM Lukasz Cwik <lc...@google.com> wrote:
> 
> Thanks all, it seems as though only Google needs the
> grace period. I'll wait for the shorter of BEAM-4512
> or two weeks before
> merging https://github.com/apache/beam/pull/5571
> 
> 
> On Wed, Jun 6, 2018 at 8:29 PM Kenneth Knowles <k...@google.com> wrote:
> 
> +1
> 
> Definitely a good opportunity to decouple your
> build tools from your dependencies' build tools.
> 
> On Wed, Jun 6, 2018 at 2:42 PM Ted Yu wrote:
> 
> +1 on this effort
> 
>  Original message 
> From: Chamikara Jayalath
> Date: 6/6/18 2:09 PM (GMT-08:00)
> To: dev@beam.apache.org, u...@beam.apache.org
> Subject: Re: [DISCUSS] [BEAM-4126] Deleting
> Maven build files (pom.xml) grace period?
> 
> +1 for the overall effort. As Pablo
> mentioned, we need some time to migrate
> internal Dataflow build off of Maven build
> files. I created
> https://issues.apache.org/jira/browse/BEAM-4512 for this.
> 
> Thanks,
> Cham
> 
> On Wed, Jun 6, 2018 at 1:30 PM Eugene Kirpichov wrote:
> 
> Is it possible for Dataflow to just keep
> a copy of the pom.xmls and delete it as
> soon as Dataflow is migrated?
> 
> Overall +1, I've been using Gradle
> without issues for a while and almost
> forgot pom.xml's still existed.
> 
> On Wed, Jun 6, 2018, 1:13 PM Pablo Estrada wrote:
> 
> I agree that we should delete the
> pom.xml files soon, as they create a
> burden for maintainers. 
> 
> I'd like to be able to extend the
> grace period by a bit, to allow the
> internal build systems at Google to
> move away from using the Beam poms.
> 
> We use these pom files to build
> Dataflow workers, and thus it's
> critical for us that they are
> available for a few more weeks while
> we set up a gradle build. Perhaps 

Re: [PROPOSAL] Prepare Beam 2.6.0 release

2018-07-09 Thread Jean-Baptiste Onofré
+1

I planned to send the proposal as well ;)

Regards
JB

On 09/07/2018 23:16, Pablo Estrada wrote:
> Hello everyone!
> 
> As per the previously agreed-upon schedule for Beam releases, the
> process for the 2.6.0 Beam release should start on July 17th.
> 
> I volunteer to perform this release. 
> 
> Here is the schedule that I have in mind:
> 
> - We start triaging JIRA issues this week.
> - I will cut a release branch on July 17.
> - After July 17, any blockers will need to be cherry-picked into the
> release branch.
> - As soon as tests look good, and blockers have been addressed, I will
> perform the other release tasks.
> 
> Does that seem reasonable to the community?
> 
> Best
> -P.
> -- 
> Got feedback? go/pabloem-feedback

-- 
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


CODEOWNERS for apache/beam repo

2018-07-09 Thread Udi Meiri
Hi everyone,

I'm proposing to add auto-reviewer-assignment using Github's CODEOWNERS
mechanism.
Initial version is here: https://github.com/apache/beam/pull/5909/files

I need help from the community in determining owners for each component.
Feel free to directly edit the PR (if you have permission) or add a comment.


Background
The idea is to:
1. Document good review candidates for each component.
2. Help choose reviewers using the auto-assignment mechanism. The
suggestion is in no way binding.
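
For context, GitHub's CODEOWNERS file maps path patterns to default
reviewers, and the last matching pattern wins. A minimal illustrative
sketch only — the paths and handles below are placeholders, not the
actual entries proposed in the PR:

```
# Syntax: <path pattern> <one or more GitHub usernames or teams>.
# Placeholder entries for illustration; see the PR for the real list.
/sdks/java/core/   @example-java-owner
/sdks/python/      @example-python-owner
/runners/flink/    @example-flink-owner
```

When a PR touches a matching path, GitHub automatically requests review
from the listed owners, which is the auto-assignment mechanism referred
to above.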




[PROPOSAL] Prepare Beam 2.6.0 release

2018-07-09 Thread Pablo Estrada
Hello everyone!

As per the previously agreed-upon schedule for Beam releases, the process
for the 2.6.0 Beam release should start on July 17th.

I volunteer to perform this release.

Here is the schedule that I have in mind:

- We start triaging JIRA issues this week.
- I will cut a release branch on July 17.
- After July 17, any blockers will need to be cherry-picked into the
release branch.
- As soon as tests look good, and blockers have been addressed, I will
perform the other release tasks.

Does that seem reasonable to the community?

Best
-P.
-- 
Got feedback? go/pabloem-feedback


Re: Building the Java SDK container with Jib?

2018-07-09 Thread Andrew Pilloud
This sounds really cool! I spent a minute looking at our current container
code. We have a golang wrapper that bootstraps our Java SDK harness, which
isn't compatible with Jib's current feature set (you can't add custom files
or override the ENTRYPOINT). It might be quite a bit of work to move.

Andrew

On Mon, Jul 9, 2018 at 11:21 AM Eugene Kirpichov 
wrote:

> Hi,
>
> Apparently a new tool has come out that lets you build Java containers
> cheaply, without even having Docker installed:
> https://cloudplatform.googleblog.com/2018/07/introducing-jib-build-java-docker-images-better.html
>
>
> Anyone interested in giving it a shot, to have faster turnaround when
> making changes to the Java SDK harness?
>


Building the Java SDK container with Jib?

2018-07-09 Thread Eugene Kirpichov
Hi,

Apparently a new tool has come out that lets you build Java containers
cheaply, without even having Docker installed:
https://cloudplatform.googleblog.com/2018/07/introducing-jib-build-java-docker-images-better.html


Anyone interested in giving it a shot, to have faster turnaround when
making changes to the Java SDK harness?
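
For anyone giving it a shot, a minimal Jib Gradle configuration looks
roughly like the sketch below (plugin version and image names are
placeholders, not Beam's actual setup). Note that Jib builds the image
from the project's classpath, which is why the custom-files/ENTRYPOINT
limitation mentioned in the reply matters for the SDK harness container:

```groovy
// Hypothetical build.gradle sketch; version and image names are placeholders.
plugins {
  id 'com.google.cloud.tools.jib' version '0.9.10'
}

jib {
  from { image = 'openjdk:8-jre-slim' }                  // base image
  to   { image = 'gcr.io/my-project/java-sdk-harness' }  // target registry
}
// Run with: ./gradlew jib  -- builds and pushes without a Docker daemon.
```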


Re: Beam Dependency Ownership

2018-07-09 Thread Yifan Zou
If you haven't already, please take a look at the Beam SDK Dependency
Ownership spreadsheet and sign up for any dependencies that you are
familiar with. In case anyone missed it, there is a second tab for the
Python SDK.

Thanks.

Yifan

On Thu, Jun 28, 2018 at 6:37 AM Tim Robertson 
wrote:

> Thanks for this Yifan,
> I've added my name to all Hadoop-related dependencies, along with Solr
> and Elasticsearch.
>
>
>
> On Thu, Jun 28, 2018 at 3:28 PM, Etienne Chauchot 
> wrote:
>
>> I've added myself and @Tim Robertson on elasticsearchIO related deps.
>>
>> Etienne
>>
>> Le mercredi 27 juin 2018 à 14:05 -0700, Chamikara Jayalath a écrit :
>>
>> It's mentioned under "Dependency declarations may identify owners that
>> are responsible for upgrading respective dependencies". Feel free to update
>> if you think more details should be added to it. I think it'll be easier if
>> we transfer data in spreadsheet to comments close to dependency
>> declarations instead of maintaining the spreadsheet (after we collect the
>> data). Otherwise we'll have to put an extra effort to make sure that the
>> spreadsheet, BeamModulePlugin, and Python setup.py are in sync. We can
>> decide on the exact format of the comment to make sure that automated tool
>> can easily parse the comment.
>>
>> - Cham
>>
>> On Wed, Jun 27, 2018 at 1:45 PM Yifan Zou  wrote:
>>
>> Thanks Scott, I will add the missing packages to the spreadsheet.
>> And we expect this to be kept up to date as the Beam project grows.
>> Shall we mention this in the Dependency Guide page, @Chamikara Jayalath?
>>
>> On Wed, Jun 27, 2018 at 11:17 AM Scott Wegner  wrote:
>>
>> Thanks for kicking off this process Yifan-- I'll add my name to some
>> dependencies I'm familiar with.
>>
>> Do you expect this to be a one-time process, or will we maintain the
>> owners over time? If we will maintain this list, it would be easier to keep
>> it up-to-date if it was closer to the code. i.e. perhaps each dependency
>> registration in the Gradle BeamModulePlugin [1] should include a list of
>> owners.
>>
>> [1]
>> https://github.com/apache/beam/blob/master/buildSrc/src/main/groovy/org/apache/beam/gradle/BeamModulePlugin.groovy#L325
>>
>> On Wed, Jun 27, 2018 at 8:52 AM Yifan Zou  wrote:
>>
>> Hi all,
>>
>> We now have the automated detections for Beam dependency updates and
>> sending a weekly report to dev mailing list. In order to address the
>> updates in time, we want to find owners for all dependencies of Beam, and
>> finally, Jira bugs will be automatically created and assigned to the owners
>> if actions need to be taken. We also welcome non-owners to upgrade
>> dependency packages, but only owners will receive the Jira tickets.
>>
>> Please review the spreadsheet Beam SDK Dependency Ownership and
>> sign up if you are familiar with any Beam dependencies and are
>> willing to take charge of them. It is definitely fine for a single
>> package to have multiple owners. The more owners we have, the more
>> help we will get to keep Beam dependencies in a healthy state.
>>
>> Thank you :)
>>
>> Regards.
>> Yifan
>>
>>
>> https://docs.google.com/spreadsheets/d/12NN3vPqFTBQtXBc0fg4sFIb9c_mgst0IDePB_0Ui8kE/edit?ts=5b32bec1#gid=0
>>
>>
>
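
Following up on the idea of encoding ownership as machine-parseable
comments next to the dependency declarations in BeamModulePlugin, one
possible comment format could look like the sketch below (the
dependency, version, and handles are all placeholders, not an agreed
convention):

```groovy
// Hypothetical owner-annotation format inside BeamModulePlugin.groovy;
// an automated tool could grep for the "owners:" marker.
def library = [
    // owners: @example-owner1, @example-owner2
    guava: "com.google.guava:guava:20.0",
]
```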


Re: Invite to comment on the @RequiresStableInput design doc

2018-07-09 Thread Lukasz Cwik
I'm also thinking that it would be best for this to apply to the whole
transform: side inputs, main inputs, timers, and any future input constructs.



On Sat, Jul 7, 2018 at 2:00 PM Reuven Lax  wrote:

> I think the entire transform. There might be some use case for having only
> some inputs stable, but I can't think of any offhand.
>
> BTW, it so happens that with DataflowRunner all side inputs happen to be
> stable (though that's more of a side effect of implementation).
>
> On Tue, Jul 3, 2018 at 9:46 AM Lukasz Cwik  wrote:
>
>> Does it make sense to only have some inputs be stable for a transform or
>> for the entire transform to require stable inputs?
>>
>> On Tue, Jul 3, 2018 at 7:34 AM Kenneth Knowles  wrote:
>>
>>> Since we always assume ProcessElement could have arbitrary side effects
>>> (esp. randomization), the state and timers set up by a call to
>>> ProcessElement cannot be considered stable until they are persisted. It
>>> seems very similar to the cost of outputting to a downstream
>>> @RequiresStableInput transform, if not an identical implementation.
>>>
>>> The thing timers add is a way to loop which you can't do if it is an
>>> output.
>>>
>>> Adding @Pure annotations might help, if the input elements are stable
>>> and ProcessElement is pure.
>>>
>>> Kenn
>>>
>>> On Mon, Jul 2, 2018 at 7:05 PM Reuven Lax  wrote:
>>>
 The common use case for a timer is to read in data that was stored
 using the state API in processElement. There is no guarantee that is
 stable, and I believe no runner currently guarantees this. For example:

 class MyDoFn extends DoFn<ElementT, OutputT> {
   @StateId("bag") private final StateSpec<BagState<ElementT>> buffer =
       StateSpecs.bag(ElementCoder.of());
   @TimerId("timer") private final TimerSpec timerSpec =
       TimerSpecs.timer(TimeDomain.PROCESSING_TIME);

   @ProcessElement public void processElement(@Element ElementT element,
       @StateId("bag") BagState<ElementT> bag, @TimerId("timer") Timer timer) {
     bag.add(element);
     timer.align(Duration.standardSeconds(30))
         .offset(Duration.standardSeconds(3)).setRelative();
   }

   @OnTimer("timer") public void onTimer(
       @StateId("bag") BagState<ElementT> bag) {
     sendToExternalSystem(bag.read());
   }
 }

 If you tagged onTimer with @RequiresStableInput, then you could
 guarantee that if the timer retried then it would read the same elements
 out of the bag. Today this is not guaranteed - the data written to the bag
 might not even be persisted yet when the timer fires (for example, both the
 processElement and the onTimer might be executed by the runner in the same
 bundle).

 This particular example is a simplistic one of course - you could
 accomplish the same thing with triggers. When Raghu worked on the
 exactly-once Kafka sink this was very problematic. The final solution used
 some specific details of Kafka to work, and is complicated and not portable
 to other sinks.

 BTW - you can of course just have OnTimer produce the output to another
 transform marked with RequiresStableInput. However this solution is very
 expensive - every element must be persisted to stable storage multiple
 times - and we tried hard to avoid doing this in the Kafka sink.

 Reuven
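
To make the retry hazard concrete, here is a small self-contained
simulation (plain Python, not the Beam API; all names are illustrative):
a retried "timer firing" must send the same data to the external system,
which only holds if what it reads is stable across retries.

```python
class UnstableBag:
    """Simulates reading state that was never persisted: each retry re-runs
    the nondeterministic write path and sees different contents."""
    def __init__(self):
        self.attempt = 0

    def read(self):
        self.attempt += 1
        return [self.attempt * i for i in (1, 2, 3)]


class StableBag:
    """Simulates @RequiresStableInput semantics: contents are checkpointed
    once and replayed identically on every retry."""
    def __init__(self, contents):
        self.contents = list(contents)

    def read(self):
        return list(self.contents)


def fire_timer_with_retry(bag, retries=3):
    # Each retry "sends to the external system" whatever the bag holds.
    return [tuple(bag.read()) for _ in range(retries)]


unstable_sends = fire_timer_with_retry(UnstableBag())
stable_sends = fire_timer_with_retry(StableBag([1, 2, 3]))

assert len(set(unstable_sends)) == 3  # retries sent different data: not exactly-once
assert len(set(stable_sends)) == 1    # every retry sent the same data
```

With the stable bag, the external system may see duplicates but never
divergent payloads, which is exactly the property the exactly-once Kafka
sink needed.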

 On Mon, Jul 2, 2018 at 6:24 PM Robert Bradshaw 
 wrote:

> Could you give an example of such a usecase? (I suppose I'm not quite
> following what it means for a timer to be unstable...)
>
> On Mon, Jul 2, 2018 at 6:20 PM Reuven Lax  wrote:
>
>> One issue: we definitely have some strong use cases where we want
>> this on ProcessTimer but not on ProcessElement. Since both are on the 
>> same
>> DoFn, I'm not sure how you would represent this as a separate transform.
>>
>> On Mon, Jul 2, 2018 at 5:05 PM Robert Bradshaw 
>> wrote:
>>
>>> Thanks for the writeup.
>>>
>>> I'm wondering with, rather than phrasing this as an annotation on
>>> DoFn methods that gets plumbed down through the portability 
>>> representation,
>>> if it would make more sense to introduce a new, primitive
>>> "EnsureStableInput" transform. For those runners whose reshuffle provide
>>> stable inputs, they could use that as an implementation, and other 
>>> runners
>>> could provide other suitable implementations.
>>>
>>>
>>>
>>> On Mon, Jul 2, 2018 at 3:26 PM Robin Qiu  wrote:
>>>
 Hi everyone,

 Thanks for your feedback on the doc. I have revamped it according
 to all of the comments. The major changes I have made are:
 * The problem description should be more general and accurate now.
 * I added more background information, such as details about
 Reshuffle, so it should be easier to understand now.
 * I made it clear what is the scope of my current project and what
 could be left to future 

Re: Performance issue in Beam 2.4 onwards

2018-07-09 Thread Lukasz Cwik
Instead of reverting/working around specific checks/tests that the
DirectRunner is doing, have you considered using one of the other runners
like Flink or Spark with a local execution cluster. You won't hit the
validation/verification bottlenecks that DirectRunner specifically imposes.

On Mon, Jul 9, 2018 at 8:46 AM Jean-Baptiste Onofré  wrote:

> Thanks for the update Eugene.
>
> @Vojta: do you mind creating a Jira? I will tackle a fix for that.
>
> Regards
> JB
>
> On 09/07/2018 17:33, Eugene Kirpichov wrote:
> > Hi -
> >
> > If I remember correctly, the reason for this change was to ensure that
> > the state is encodable at all. Prior to the change, there had been
> > situations where the coder specified on a state cell is buggy, absent or
> > set incorrectly (due to some issue in coder inference), but direct
> > runner did not detect this because it never tried to encode the state
> > cells - this would have blown up in any distributed runner.
> >
> > I think it should be possible to relax this and clone only values being
> > added to the state, rather than cloning the whole state on copy(). I
> > don't have time to work on this change myself, but I can review a PR if
> > someone else does.
> >
> > On Mon, Jul 9, 2018 at 8:28 AM Jean-Baptiste Onofré wrote:
> >
> > Hi Vojta,
> >
> > I fully agree, that's why it makes sense to wait Eugene's feedback.
> >
> > I remember we had some performance regression on the direct runner
> > identified thanks to Nexmark, but it has been addressed by reverting
> a
> > change.
> >
> > Good catch anyway !
> >
> > Regards
> > JB
> >
> > On 09/07/2018 17:20, Vojtech Janota wrote:
> > > Hi Reuven,
> > >
> > > I'm not really complaining about DirectRunner. In fact it seems to
> > me as
> > > if what previously was considered as part of the "expensive extra
> > > checks" done by the DirectRunner is now done within the
> > > beam-runners-core-java library. Considering that all objects
> involved
> > > are immutable (in our case at least) and simple assignment is
> > > sufficient, the serialization-deserialization really seems as
> unwanted
> > > and hugely expensive correctness check. If there was a problem with
> > > identity copy, wasn't DirectRunner supposed to reveal it?
> > >
> > > Regards,
> > > Vojta
> > >
> > > On Mon, Jul 9, 2018 at 4:46 PM, Reuven Lax wrote:
> > >
> > > Hi Vojita,
> > >
> > > One problem is that the DirectRunner is designed for testing,
> not
> > > for performance. The DirectRunner currently does many
> > > purposely-inefficient things, the point of which is to better
> > expose
> > > potential bugs in tests. For example, the DirectRunner will
> > randomly
> > > shuffle the order of PCollections to ensure that your code
> > does not
> > > rely on ordering.  All of this adds cost, because the current
> > runner
> > > is designed for testing. There have been requests in the past
> > for an
> > > "optimized" local runner, however we don't currently have such
> > a thing.
> > >
> > > In this case, using coders to clone values is more correct. In
> a
> > > distributed environment using encode/decode is the only way to
> > copy
> > > values, and the DirectRunner is trying to ensure that your
> code is
> > > correct in a distributed environment.
> > >
> > > Reuven
> > >
> > > On Mon, Jul 9, 2018 at 7:22 AM Vojtech Janota
> > > <vojta.jan...@gmail.com> wrote:
> > >
> > > Hi,
> > >
> > > We are using Apache Beam in our project for some time now.
> > Since
> > > our datasets are of modest size, we have so far used
> > > DirectRunner as the computation easily fits onto a single
> > > machine. Recently we upgraded Beam from 2.2 to 2.4 and
> > found out
> > > that performance of our pipelines drastically deteriorated.
> > > Pipelines that took ~3 minutes with 2.2 do not finish
> within
> > > hours now. We tried to isolate the change that causes the
> > > slowdown and came to the commits into the
> > > "InMemoryStateInternals" class:
> > >
> > > * https://github.com/apache/beam/commit/32a427c
> > > * https://github.com/apache/beam/commit/8151d82
> > >
> > > In a nutshell where previously the copy() method simply
> > assigned:
> > >
> > >   that.value = this.value

Re: Performance issue in Beam 2.4 onwards

2018-07-09 Thread Jean-Baptiste Onofré
Thanks for the update Eugene.

@Vojta: do you mind creating a Jira? I will tackle a fix for that.

Regards
JB

On 09/07/2018 17:33, Eugene Kirpichov wrote:
> Hi -
> 
> If I remember correctly, the reason for this change was to ensure that
> the state is encodable at all. Prior to the change, there had been
> situations where the coder specified on a state cell is buggy, absent or
> set incorrectly (due to some issue in coder inference), but direct
> runner did not detect this because it never tried to encode the state
> cells - this would have blown up in any distributed runner.
> 
> I think it should be possible to relax this and clone only values being
> added to the state, rather than cloning the whole state on copy(). I
> don't have time to work on this change myself, but I can review a PR if
> someone else does.
> 
> On Mon, Jul 9, 2018 at 8:28 AM Jean-Baptiste Onofré wrote:
> 
> Hi Vojta,
> 
> I fully agree, that's why it makes sense to wait Eugene's feedback.
> 
> I remember we had some performance regression on the direct runner
> identified thanks to Nexmark, but it has been addressed by reverting a
> change.
> 
> Good catch anyway !
> 
> Regards
> JB
> 
> On 09/07/2018 17:20, Vojtech Janota wrote:
> > Hi Reuven,
> >
> > I'm not really complaining about DirectRunner. In fact it seems to
> me as
> > if what previously was considered as part of the "expensive extra
> > checks" done by the DirectRunner is now done within the
> > beam-runners-core-java library. Considering that all objects involved
> > are immutable (in our case at least) and simple assignment is
> > sufficient, the serialization-deserialization really seems as unwanted
> > and hugely expensive correctness check. If there was a problem with
> > identity copy, wasn't DirectRunner supposed to reveal it? 
> >
> > Regards,
> > Vojta
> >
> > On Mon, Jul 9, 2018 at 4:46 PM, Reuven Lax wrote:
> >
> >     Hi Vojita,
> >
> >     One problem is that the DirectRunner is designed for testing, not
> >     for performance. The DirectRunner currently does many
> >     purposely-inefficient things, the point of which is to better
> expose
> >     potential bugs in tests. For example, the DirectRunner will
> randomly
> >     shuffle the order of PCollections to ensure that your code
> does not
> >     rely on ordering.  All of this adds cost, because the current
> runner
> >     is designed for testing. There have been requests in the past
> for an
> >     "optimized" local runner, however we don't currently have such
> a thing.
> >
> >     In this case, using coders to clone values is more correct. In a
> >     distributed environment using encode/decode is the only way to
> copy
> >     values, and the DirectRunner is trying to ensure that your code is
> >     correct in a distributed environment.
> >
> >     Reuven
> >
> >     On Mon, Jul 9, 2018 at 7:22 AM Vojtech Janota
> >     <vojta.jan...@gmail.com> wrote:
> >
> >         Hi,
> >
> >         We are using Apache Beam in our project for some time now.
> Since
> >         our datasets are of modest size, we have so far used
> >         DirectRunner as the computation easily fits onto a single
> >         machine. Recently we upgraded Beam from 2.2 to 2.4 and
> found out
> >         that performance of our pipelines drastically deteriorated.
> >         Pipelines that took ~3 minutes with 2.2 do not finish within
> >         hours now. We tried to isolate the change that causes the
> >         slowdown and came to the commits into the
> >         "InMemoryStateInternals" class:
> >
> >         * https://github.com/apache/beam/commit/32a427c
> >         * https://github.com/apache/beam/commit/8151d82
> >
> >         In a nutshell where previously the copy() method simply
> assigned:
> >
> >           that.value = this.value
> >
> >         There is now coder encode/decode combo hidden behind:
> >
> >           that.value = uncheckedClone(coder, this.value)
> >
> >         Can somebody explain the purpose of this change? Is it
> meant as
> >         an additional "enforcement" point, similar to DirectRunner's
> >         enforceImmutability and enforceEncodability? Or is it
> something
> >         that is genuinely needed to provide correct behaviour of the
> >         pipeline?
> >
> >         Any hints or thoughts are 

Re: Performance issue in Beam 2.4 onwards

2018-07-09 Thread Eugene Kirpichov
Hi -

If I remember correctly, the reason for this change was to ensure that the
state is encodable at all. Prior to the change, there had been situations
where the coder specified on a state cell is buggy, absent or set
incorrectly (due to some issue in coder inference), but direct runner did
not detect this because it never tried to encode the state cells - this
would have blown up in any distributed runner.

I think it should be possible to relax this and clone only values being
added to the state, rather than cloning the whole state on copy(). I don't
have time to work on this change myself, but I can review a PR if someone
else does.

On Mon, Jul 9, 2018 at 8:28 AM Jean-Baptiste Onofré  wrote:

> Hi Vojta,
>
> I fully agree, that's why it makes sense to wait Eugene's feedback.
>
> I remember we had some performance regression on the direct runner
> identified thanks to Nexmark, but it has been addressed by reverting a
> change.
>
> Good catch anyway !
>
> Regards
> JB
>
> On 09/07/2018 17:20, Vojtech Janota wrote:
> > Hi Reuven,
> >
> > I'm not really complaining about DirectRunner. In fact it seems to me as
> > if what previously was considered as part of the "expensive extra
> > checks" done by the DirectRunner is now done within the
> > beam-runners-core-java library. Considering that all objects involved
> > are immutable (in our case at least) and simple assignment is
> > sufficient, the serialization-deserialization really seems as unwanted
> > and hugely expensive correctness check. If there was a problem with
> > identity copy, wasn't DirectRunner supposed to reveal it?
> >
> > Regards,
> > Vojta
> >
> > On Mon, Jul 9, 2018 at 4:46 PM, Reuven Lax wrote:
> >
> > Hi Vojita,
> >
> > One problem is that the DirectRunner is designed for testing, not
> > for performance. The DirectRunner currently does many
> > purposely-inefficient things, the point of which is to better expose
> > potential bugs in tests. For example, the DirectRunner will randomly
> > shuffle the order of PCollections to ensure that your code does not
> > rely on ordering.  All of this adds cost, because the current runner
> > is designed for testing. There have been requests in the past for an
> > "optimized" local runner, however we don't currently have such a
> thing.
> >
> > In this case, using coders to clone values is more correct. In a
> > distributed environment using encode/decode is the only way to copy
> > values, and the DirectRunner is trying to ensure that your code is
> > correct in a distributed environment.
> >
> > Reuven
> >
> > On Mon, Jul 9, 2018 at 7:22 AM Vojtech Janota
> > <vojta.jan...@gmail.com> wrote:
> >
> > Hi,
> >
> > We are using Apache Beam in our project for some time now. Since
> > our datasets are of modest size, we have so far used
> > DirectRunner as the computation easily fits onto a single
> > machine. Recently we upgraded Beam from 2.2 to 2.4 and found out
> > that performance of our pipelines drastically deteriorated.
> > Pipelines that took ~3 minutes with 2.2 do not finish within
> > hours now. We tried to isolate the change that causes the
> > slowdown and came to the commits into the
> > "InMemoryStateInternals" class:
> >
> > * https://github.com/apache/beam/commit/32a427c
> > * https://github.com/apache/beam/commit/8151d82
> >
> > In a nutshell where previously the copy() method simply assigned:
> >
> >   that.value = this.value
> >
> > There is now coder encode/decode combo hidden behind:
> >
> >   that.value = uncheckedClone(coder, this.value)
> >
> > Can somebody explain the purpose of this change? Is it meant as
> > an additional "enforcement" point, similar to DirectRunner's
> > enforceImmutability and enforceEncodability? Or is it something
> > that is genuinely needed to provide correct behaviour of the
> > pipeline?
> >
> > Any hints or thoughts are appreciated.
> >
> > Regards,
> > Vojta
> >
> >
> >
> >
> >
> >
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: Performance issue in Beam 2.4 onwards

2018-07-09 Thread Jean-Baptiste Onofré
Hi Vojta,

I fully agree, that's why it makes sense to wait Eugene's feedback.

I remember we had some performance regression on the direct runner
identified thanks to Nexmark, but it has been addressed by reverting a
change.

Good catch anyway !

Regards
JB

On 09/07/2018 17:20, Vojtech Janota wrote:
> Hi Reuven,
> 
> I'm not really complaining about DirectRunner. In fact it seems to me as
> if what previously was considered as part of the "expensive extra
> checks" done by the DirectRunner is now done within the
> beam-runners-core-java library. Considering that all objects involved
> are immutable (in our case at least) and simple assignment is
> sufficient, the serialization-deserialization really seems as unwanted
> and hugely expensive correctness check. If there was a problem with
> identity copy, wasn't DirectRunner supposed to reveal it? 
> 
> Regards,
> Vojta
> 
> On Mon, Jul 9, 2018 at 4:46 PM, Reuven Lax wrote:
> 
> Hi Vojita,
> 
> One problem is that the DirectRunner is designed for testing, not
> for performance. The DirectRunner currently does many
> purposely-inefficient things, the point of which is to better expose
> potential bugs in tests. For example, the DirectRunner will randomly
> shuffle the order of PCollections to ensure that your code does not
> rely on ordering.  All of this adds cost, because the current runner
> is designed for testing. There have been requests in the past for an
> "optimized" local runner, however we don't currently have such a thing.
> 
> In this case, using coders to clone values is more correct. In a
> distributed environment using encode/decode is the only way to copy
> values, and the DirectRunner is trying to ensure that your code is
> correct in a distributed environment.
> 
> Reuven
> 
> On Mon, Jul 9, 2018 at 7:22 AM Vojtech Janota
> <vojta.jan...@gmail.com> wrote:
> 
> Hi,
> 
> We are using Apache Beam in our project for some time now. Since
> our datasets are of modest size, we have so far used
> DirectRunner as the computation easily fits onto a single
> machine. Recently we upgraded Beam from 2.2 to 2.4 and found out
> that performance of our pipelines drastically deteriorated.
> Pipelines that took ~3 minutes with 2.2 do not finish within
> hours now. We tried to isolate the change that causes the
> slowdown and came to the commits into the
> "InMemoryStateInternals" class:
> 
> * https://github.com/apache/beam/commit/32a427c
> * https://github.com/apache/beam/commit/8151d82
> 
> In a nutshell where previously the copy() method simply assigned:
> 
>   that.value = this.value
> 
> There is now coder encode/decode combo hidden behind:
> 
>   that.value = uncheckedClone(coder, this.value)
> 
> Can somebody explain the purpose of this change? Is it meant as
> an additional "enforcement" point, similar to DirectRunner's
> enforceImmutability and enforceEncodability? Or is it something
> that is genuinely needed to provide correct behaviour of the
> pipeline?
> 
> Any hints or thoughts are appreciated.
> 
> Regards,
> Vojta
> 
>  
> 
> 
> 
> 

-- 
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Performance issue in Beam 2.4 onwards

2018-07-09 Thread Vojtech Janota
Hi Reuven,

I'm not really complaining about DirectRunner. In fact it seems to me as if
what previously was considered as part of the "expensive extra checks" done
by the DirectRunner is now done within the beam-runners-core-java library.
Considering that all objects involved are immutable (in our case at least)
and simple assignment is sufficient, the serialization-deserialization
really seems as unwanted and hugely expensive correctness check. If there
was a problem with identity copy, wasn't DirectRunner supposed to reveal
it?

Regards,
Vojta

On Mon, Jul 9, 2018 at 4:46 PM, Reuven Lax  wrote:

> Hi Vojita,
>
> One problem is that the DirectRunner is designed for testing, not for
> performance. The DirectRunner currently does many purposely-inefficient
> things, the point of which is to better expose potential bugs in tests. For
> example, the DirectRunner will randomly shuffle the order of PCollections
> to ensure that your code does not rely on ordering.  All of this adds cost,
> because the current runner is designed for testing. There have been
> requests in the past for an "optimized" local runner, however we don't
> currently have such a thing.
>
> In this case, using coders to clone values is more correct. In a
> distributed environment using encode/decode is the only way to copy values,
> and the DirectRunner is trying to ensure that your code is correct in a
> distributed environment.
>
> Reuven
>
> On Mon, Jul 9, 2018 at 7:22 AM Vojtech Janota 
> wrote:
>
>> Hi,
>>
>> We have been using Apache Beam in our project for some time now. Since our
>> datasets are of modest size, we have so far used the DirectRunner, as the
>> computation easily fits onto a single machine. Recently we upgraded Beam
>> from 2.2 to 2.4 and found that the performance of our pipelines drastically
>> deteriorated. Pipelines that took ~3 minutes with 2.2 do not finish within
>> hours now. We tried to isolate the change that causes the slowdown and
>> traced it to these commits to the "InMemoryStateInternals" class:
>>
>> * https://github.com/apache/beam/commit/32a427c
>> * https://github.com/apache/beam/commit/8151d82
>>
>> In a nutshell, where previously the copy() method simply assigned:
>>
>>   that.value = this.value
>>
>> there is now a coder encode/decode combo hidden behind:
>>
>>   that.value = uncheckedClone(coder, this.value)
>>
>> Can somebody explain the purpose of this change? Is it meant as an
>> additional "enforcement" point, similar to DirectRunner's
>> enforceImmutability and enforceEncodability? Or is it something that is
>> genuinely needed to provide correct behaviour of the pipeline?
>>
>> Any hints or thoughts are appreciated.
>>
>> Regards,
>> Vojta
>>
>>
>>
>>
>>
>>


Re: Performance issue in Beam 2.4 onwards

2018-07-09 Thread Reuven Lax
Hi Vojta,

One problem is that the DirectRunner is designed for testing, not for
performance. The DirectRunner currently does many purposely inefficient
things, the point of which is to better expose potential bugs in tests. For
example, the DirectRunner will randomly shuffle the order of PCollections
to ensure that your code does not rely on ordering. All of this adds cost,
because the current runner is designed for testing. There have been
requests in the past for an "optimized" local runner; however, we don't
currently have such a thing.

In this case, using coders to clone values is more correct. In a
distributed environment using encode/decode is the only way to copy values,
and the DirectRunner is trying to ensure that your code is correct in a
distributed environment.
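To make the distinction concrete, here is a plain-Java sketch of clone-via-encoding, with JDK serialization standing in for a Beam Coder (illustrative only; this is not Beam's actual code path, and the class name is made up):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class CloneDemo {
    // Copy a value by round-tripping it through a byte encoding, analogous
    // to uncheckedClone(coder, value): the copy shares no state with the
    // original, and a value that cannot survive encode/decode fails here
    // instead of producing silent corruption on a distributed runner.
    static <T extends Serializable> T cloneViaEncoding(T value) throws Exception {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);  // "encode"
        }
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bytes.toByteArray()))) {
            @SuppressWarnings("unchecked")
            T copy = (T) in.readObject();  // "decode"
            return copy;
        }
    }

    public static void main(String[] args) throws Exception {
        int[] original = {1, 2, 3};
        int[] copy = cloneViaEncoding(original);
        copy[0] = 99;                     // mutate the copy...
        System.out.println(original[0]);  // ...the original is untouched: prints 1
        // With an identity copy (copy = original) the mutation would have
        // leaked into the original, which a single-machine run may never notice.
    }
}
```

An identity copy is a single pointer assignment; the round trip above allocates, serializes, and deserializes the entire value, which is where the cost Vojta observed comes from.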

Reuven

On Mon, Jul 9, 2018 at 7:22 AM Vojtech Janota 
wrote:

> Hi,
>
> We have been using Apache Beam in our project for some time now. Since our
> datasets are of modest size, we have so far used the DirectRunner, as the
> computation easily fits onto a single machine. Recently we upgraded Beam
> from 2.2 to 2.4 and found that the performance of our pipelines drastically
> deteriorated. Pipelines that took ~3 minutes with 2.2 do not finish within
> hours now. We tried to isolate the change that causes the slowdown and
> traced it to these commits to the "InMemoryStateInternals" class:
>
> * https://github.com/apache/beam/commit/32a427c
> * https://github.com/apache/beam/commit/8151d82
>
> In a nutshell, where previously the copy() method simply assigned:
>
>   that.value = this.value
>
> there is now a coder encode/decode combo hidden behind:
>
>   that.value = uncheckedClone(coder, this.value)
>
> Can somebody explain the purpose of this change? Is it meant as an
> additional "enforcement" point, similar to DirectRunner's
> enforceImmutability and enforceEncodability? Or is it something that is
> genuinely needed to provide correct behaviour of the pipeline?
>
> Any hints or thoughts are appreciated.
>
> Regards,
> Vojta
>
>
>
>
>
>


Re: Performance issue in Beam 2.4 onwards

2018-07-09 Thread Jean-Baptiste Onofré
Hi,

Do you use specific/complex coders in your pipeline?

I'm sure Eugene will offer some insight into this change: AFAIR, the purpose
is to make the use of coders cleaner and to identify identity copies.

Regards
JB

On 09/07/2018 16:22, Vojtech Janota wrote:
> Hi,
> 
> We have been using Apache Beam in our project for some time now. Since our
> datasets are of modest size, we have so far used the DirectRunner, as the
> computation easily fits onto a single machine. Recently we upgraded Beam
> from 2.2 to 2.4 and found that the performance of our pipelines drastically
> deteriorated. Pipelines that took ~3 minutes with 2.2 do not finish within
> hours now. We tried to isolate the change that causes the slowdown and
> traced it to these commits to the "InMemoryStateInternals" class:
> 
> * https://github.com/apache/beam/commit/32a427c
> * https://github.com/apache/beam/commit/8151d82
> 
> In a nutshell, where previously the copy() method simply assigned:
> 
>   that.value = this.value
> 
> there is now a coder encode/decode combo hidden behind:
> 
>   that.value = uncheckedClone(coder, this.value)
> 
> Can somebody explain the purpose of this change? Is it meant as an
> additional "enforcement" point, similar to DirectRunner's
> enforceImmutability and enforceEncodability? Or is it something that is
> genuinely needed to provide correct behaviour of the pipeline?
> 
> Any hints or thoughts are appreciated.
> 
> Regards,
> Vojta
> 
> 
> 
> 

-- 
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Performance issue in Beam 2.4 onwards

2018-07-09 Thread Vojtech Janota
Hi,

We have been using Apache Beam in our project for some time now. Since our
datasets are of modest size, we have so far used the DirectRunner, as the
computation easily fits onto a single machine. Recently we upgraded Beam
from 2.2 to 2.4 and found that the performance of our pipelines drastically
deteriorated. Pipelines that took ~3 minutes with 2.2 do not finish within
hours now. We tried to isolate the change that causes the slowdown and
traced it to these commits to the "InMemoryStateInternals" class:

* https://github.com/apache/beam/commit/32a427c
* https://github.com/apache/beam/commit/8151d82

In a nutshell, where previously the copy() method simply assigned:

  that.value = this.value

there is now a coder encode/decode combo hidden behind:

  that.value = uncheckedClone(coder, this.value)

Can somebody explain the purpose of this change? Is it meant as an
additional "enforcement" point, similar to DirectRunner's
enforceImmutability and enforceEncodability? Or is it something that is
genuinely needed to provide correct behaviour of the pipeline?

Any hints or thoughts are appreciated.

Regards,
Vojta


Beam Dependency Check Report (2018-07-09)

2018-07-09 Thread Apache Jenkins Server

High Priority Dependency Updates Of Beam Python SDK:


Dependency Name | Current Version | Latest Version | Release Date (Current) | Release Date (Latest)
dill | 0.2.6 | 0.2.8.2 | 2017-02-01 | 2018-06-25
google-cloud-bigquery | 0.25.0 | 1.3.0 | 2017-06-26 | 2018-06-08
google-cloud-core | 0.25.0 | 0.28.1 | 2018-06-07 | 2018-06-07
google-cloud-pubsub | 0.26.0 | 0.35.4 | 2017-06-26 | 2018-06-08
ply | 3.8 | 3.11 | 2018-06-07 | 2018-06-07


High Priority Dependency Updates Of Beam Java SDK:


Dependency Name | Current Version | Latest Version | Release Date (Current) | Release Date (Latest)
org.assertj:assertj-core | 2.5.0 | 3.10.0 | 2016-07-03 | 2018-05-11
com.google.auto.service:auto-service | 1.0-rc2 | 1.0-rc4 | 2018-06-25 | 2017-12-11
biz.aQute:bndlib | 1.43.0 | 2.0.0.20130123-133441 | 2018-06-25 | 2018-06-25
org.apache.cassandra:cassandra-all | 3.9 | 3.11.2 | 2016-09-26 | 2018-02-14
org.apache.commons:commons-dbcp2 | 2.1.1 | 2.4.0 | 2015-08-02 | 2018-06-18
de.flapdoodle.embed:de.flapdoodle.embed.mongo | 1.50.1 | 2.1.1 | 2015-12-11 | 2018-06-25
de.flapdoodle.embed:de.flapdoodle.embed.process | 1.50.1 | 2.0.5 | 2015-12-11 | 2018-06-25
org.apache.derby:derby | 10.12.1.1 | 10.14.2.0 | 2015-10-10 | 2018-05-03
org.apache.derby:derbyclient | 10.12.1.1 | 10.14.2.0 | 2015-10-10 | 2018-05-03
org.apache.derby:derbynet | 10.12.1.1 | 10.14.2.0 | 2015-10-10 | 2018-05-03
org.elasticsearch:elasticsearch | 5.6.3 | 6.3.1 | 2017-10-06 | 2018-07-09
org.elasticsearch:elasticsearch-hadoop | 5.0.0 | 6.3.1 | 2016-10-26 | 2018-07-09
org.elasticsearch.client:elasticsearch-rest-client | 5.6.3 | 6.3.1 | 2017-10-06 | 2018-07-09
com.alibaba:fastjson | 1.2.12 | 1.2.47 | 2016-05-21 | 2018-03-15
org.elasticsearch.test:framework | 5.6.3 | 6.3.1 | 2017-10-06 | 2018-07-09
org.freemarker:freemarker | 2.3.25-incubating | 2.3.28 | 2016-06-14 | 2018-03-30
net.ltgt.gradle:gradle-apt-plugin | 0.13 | 0.17 | 2017-11-01 | 2018-06-25
com.commercehub.gradle.plugin:gradle-avro-plugin | 0.11.0 | 0.14.2 | 2018-01-30 | 2018-06-06
gradle.plugin.com.palantir.gradle.docker:gradle-docker | 0.13.0 | 0.20.1 | 2017-04-05 | 2018-07-09
com.github.ben-manes:gradle-versions-plugin | 0.17.0 | 0.20.0 | 2018-06-06 | 2018-06-25
org.codehaus.groovy:groovy-all | 2.4.13 | 3.0.0-alpha-3 | 2017-11-22 | 2018-06-26
org.apache.hbase:hbase-common | 1.2.6 | 2.0.1 | 2017-05-29 | 2018-06-25
org.apache.hbase:hbase-hadoop-compat | 1.2.6 | 2.0.1 | 2017-05-29 | 2018-06-25
org.apache.hbase:hbase-hadoop2-compat | 1.2.6 | 2.0.1 | 2017-05-29 | 2018-06-25
org.apache.hbase:hbase-server | 1.2.6 | 2.0.1 | 2017-05-29 | 2018-06-25
org.apache.hbase:hbase-shaded-client | 1.2.6 | 2.0.1 | 2017-05-29 | 2018-06-25
org.apache.hbase:hbase-shaded-server | 1.2.6 | 2.0.0-alpha2 | 2017-05-29 | 2018-05-31
org.apache.hive:hive-cli | 2.1.0 | 3.1.0.3.0.0.0-1574 | 2016-06-16 | 2018-07-09
org.apache.hive:hive-common | 2.1.0 | 3.1.0.3.0.0.0-1574 | 2016-06-16 | 2018-07-09
org.apache.hive:hive-exec | 2.1.0 | 3.1.0.3.0.0.0-1574 | 2016-06-16 | 2018-07-09
org.apache.hive.hcatalog:hive-hcatalog-core | 2.1.0 | 3.1.0.3.0.0.0-1574 | 2016-06-16 | 2018-07-09
org.apache.httpcomponents:httpclient | 4.5.2 | 4.5.6 | 2016-02-21 | 2018-07-09
org.apache.httpcomponents:httpcore | 4.4.5 | 4.4.10 | 2016-06-08 | 2018-07-02



Re: Ability to read from UTF-16 or UTF-32 encoded files?

2018-07-09 Thread Etienne Chauchot
Hi,
Just a small clarification: TextIO actually already supports custom
multi-byte delimiters in place of newlines. See
TextIO#withDelimiter(byte[] delimiter).
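For a sense of what a multi-byte delimiter means for record splitting, here is a plain-Java sketch of the idea (not the actual TextIO implementation; the class and method names are illustrative):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DelimiterSplit {
    // Split a byte buffer into records on an arbitrary multi-byte delimiter,
    // the way TextIO#withDelimiter(byte[]) treats record boundaries.
    static List<byte[]> split(byte[] data, byte[] delim) {
        List<byte[]> records = new ArrayList<>();
        int start = 0;
        int i = 0;
        while (i + delim.length <= data.length) {
            if (Arrays.equals(Arrays.copyOfRange(data, i, i + delim.length), delim)) {
                records.add(Arrays.copyOfRange(data, start, i));
                start = i + delim.length;
                i = start;  // resume scanning just past the delimiter
            } else {
                i++;
            }
        }
        records.add(Arrays.copyOfRange(data, start, data.length));
        return records;
    }

    public static void main(String[] args) {
        // CRLF-delimited input split into three records.
        byte[] data = "a\r\nbb\r\nccc".getBytes();
        for (byte[] record : split(data, new byte[] {'\r', '\n'})) {
            System.out.println(new String(record));  // prints a, bb, ccc
        }
    }
}
```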
Etienne
On Saturday, July 7, 2018 at 16:15 -0700, Robert Bradshaw wrote:
> Currently TextIO scans for newline bytes to find line (record) boundaries,
> but those bytes can occur as part of a multi-byte character in UTF-16 or
> UTF-32. It could certainly be adapted to look for multi-byte patterns (with
> the right offset), but this would be more complicated.
> Fortunately, the default of UTF-8 handles non-western languages very well,
> but an option to support other encodings would be welcome.
> 
> On Sat, Jul 7, 2018 at 1:33 PM Harry Braviner  
> wrote:
> > Is there any reason that TextIO couldn't be expanded to read from UTF-16 or 
> > UTF-32 encoded files?
> > 
> > Certainly Python and Java strings support UTF-16, so there shouldn't be 
> > additional complication there.
> > 
> > This would make Beam able to process non-Latin character sets more easily.
> > It would also alleviate a bug I ran into while doing the MinimalWordCount
> > tutorial: the first dataset you find if you Google "shakespeare corpus"
> > (http://lexically.net/wordsmith/support/shakespeare.html) is UTF-16
> > encoded. The first byte of each UTF-16 character gets interpreted as a
> > non-letter UTF-8 character, and the pipeline gives a letter count instead
> > of a word count. However, I think being able to handle non-western
> > languages would be the far greater benefit from this.
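The mid-character hazard discussed in this thread is easy to see by dumping UTF-16 bytes; a plain-Java sketch (UTF-16BE chosen for illustration):

```java
import java.nio.charset.StandardCharsets;

public class Utf16Bytes {
    public static void main(String[] args) {
        // '\n' in UTF-16BE is the two bytes 00 0A, so a scanner that looks
        // for the single byte 0x0A also matches inside other characters:
        byte[] newline = "\n".getBytes(StandardCharsets.UTF_16BE);       // 00 0A
        byte[] cDotAbove = "\u010A".getBytes(StandardCharsets.UTF_16BE); // 01 0A ('Ċ')
        System.out.printf("newline: %02X %02X%n", newline[0], newline[1]);
        System.out.printf("U+010A:  %02X %02X%n", cDotAbove[0], cDotAbove[1]);
        // A byte-level newline scan would wrongly split a record at 'Ċ',
        // and every real newline also drags a stray 0x00 byte into a record.
    }
}
```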