Re: BEAM-6018: memory leak in thread pool instantiation

2018-11-08 Thread Dan Halperin
>
>
> On Thu, Nov 8, 2018 at 2:12 PM Udi Meiri  wrote:
>
>> Both options risk delaying worker shutdown if the executor's shutdown()
>> hasn't been called, which is I guess why the executor in GcsOptions.java
>> creates daemon threads.
>>
>
My guess (and it really is a guess at this point) is that this was a fix
for DirectRunner issues - want that to exit quickly!


>
>> On Thu, Nov 8, 2018 at 1:02 PM Lukasz Cwik  wrote:
>>
>>> Not certain, it looks like we should have been caching the executor
>>> within the GcsUtil as a static instance instead of creating one each time.
>>> Could have been missed during code review / slow code changes over time.
>>> GcsUtil is not well "loved".
>>>
>>> On Thu, Nov 8, 2018 at 11:00 AM Udi Meiri  wrote:
>>>
 Hi,
 I've identified a memory leak when GcsUtil.java instantiates a
 ThreadPoolExecutor (https://issues.apache.org/jira/browse/BEAM-6018).
 The code uses the getExitingExecutorService wrapper,
 which leaks memory. The question is, why is that wrapper necessary
 if executor.shutdown(); is later unconditionally called?

>>>
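
For context, a minimal sketch of the leaking pattern (assuming Guava's
MoreExecutors.getExitingExecutorService, which registers a JVM shutdown hook
that keeps a strong reference to each wrapped executor; the class and pool
sizes below are illustrative, not the actual GcsUtil code):

import com.google.common.util.concurrent.MoreExecutors;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class BatchHelper {
  void runBatch() {
    // A fresh executor per call...
    ThreadPoolExecutor raw = new ThreadPoolExecutor(
        32, 256, 60, TimeUnit.SECONDS, new LinkedBlockingQueue<>());
    // ...wrapped so its threads become daemons and a JVM shutdown hook
    // awaits its termination. That hook holds the executor until JVM exit,
    // which is why per-call instances accumulate: the leak.
    ExecutorService executor = MoreExecutors.getExitingExecutorService(raw);
    try {
      // ... submit batched RPCs to executor ...
    } finally {
      executor.shutdown(); // stops threads, but the hook still pins the object
    }
  }
}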


Re: BEAM-6018: memory leak in thread pool instantiation

2018-11-08 Thread Dan Halperin
Hey Udi,

Thanks for the commit comment. I'll try to dump any (old) mental context I have
left...

We were trying to find the right point in a space of:

* enough parallelism to speed things up
- more than 256 didn't seem to help with perf, and 256 was not close to
looking like a DDoS to GCS, so we were not in danger of quota limits being
imposed.
* enough batches to speed things up
- batches -> fewer RPCs from the job to GCS, more RPCs inside GCS. 100
was again good for perf and good for quotas.
* small enough batches to spread the load
- if you think about the multiple layers of fanout in RPC handling,
sending a batch RPC cuts out the first layer of load-balancing. The
endpoint that receives the batch itself handles all the individual
requests, and if that endpoint is slow or any individual request in the
batch is slow, the entire batch is slow. Prefer many batches in flight, so
as to not be limited by the performance of that single endpoint.

Coming to the question in the PR comment:
> Any reason not to use this.executorService?

I think the main reason to not use `this.executor` is that we didn't have
constraints on that executor in terms of either upper or lower bounds on
parallelism, so it seemed safer to use our own with known limits.

Thanks for catching the memory leak though - we didn't at the time! I'll
defer to you (and especially Luke ;) on a good solution to fix.
Dan

On Thu, Nov 8, 2018 at 2:12 PM Udi Meiri  wrote:

> My thought was to use 1 executor per GcsUtil instance (or 1 per process as
> you suggest), with a possible performance hit since I don't know how often
> these batch copy and remove operations are called.
> The other option is to leave things as they mostly are, and only remove
> the call to getExitingExecutorService.
>
> Both options risk delaying worker shutdown if the executor's shutdown()
> hasn't been called, which is I guess why the executor in GcsOptions.java
> creates daemon threads.
>
> On Thu, Nov 8, 2018 at 1:02 PM Lukasz Cwik  wrote:
>
>> Not certain, it looks like we should have been caching the executor
>> within the GcsUtil as a static instance instead of creating one each time.
>> Could have been missed during code review / slow code changes over time.
>> GcsUtil is not well "loved".
>>
>> On Thu, Nov 8, 2018 at 11:00 AM Udi Meiri  wrote:
>>
>>> Hi,
>>> I've identified a memory leak when GcsUtil.java instantiates a
>>> ThreadPoolExecutor (https://issues.apache.org/jira/browse/BEAM-6018).
>>> The code uses the getExitingExecutorService wrapper,
>>> which leaks memory. The question is, why is that wrapper necessary
>>> if executor.shutdown(); is later unconditionally called?
>>>
>>
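
A sketch of the direction Luke suggests above: one statically cached executor
with daemon threads, shared across GcsUtil instances, instead of a freshly
wrapped executor per call. Names and sizes are illustrative, not the actual
patch.

import com.google.common.util.concurrent.ThreadFactoryBuilder;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class GcsBatchExecutor {
  // Created once per process. Daemon threads avoid delaying worker shutdown,
  // which is why no exiting-executor wrapper (and no shutdown hook) is needed.
  private static final ExecutorService SHARED = Executors.newFixedThreadPool(
      256,
      new ThreadFactoryBuilder()
          .setDaemon(true)
          .setNameFormat("gcs-batch-%d")
          .build());

  static ExecutorService get() {
    return SHARED;
  }
}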


Re: [PROPOSAL] Move sorting to sdks-java-core

2018-10-17 Thread Dan Halperin
On Wed, Oct 17, 2018 at 3:44 PM Kenneth Knowles  wrote:

> The runner can always just depend on the sorter to do it the legacy way by
> class matching; it shouldn't incur other dependency penalties... but now
> that I look briefly, the sorter depends on Hadoop bits. That seems a heavy
> price to pay for a user in any event. Are those Hadoop deps reasonably
> self-contained?
>

Nice catch, Kenn! This is indeed why we didn't originally include the
Sorter in core. The Hadoop deps have an enormous surface, or did at the
time.

Dan


>
> Kenn
>
> On Wed, Oct 17, 2018 at 2:39 PM Lukasz Cwik  wrote:
>
>> Merging the sorter into sdks-java-core isn't needed for pipelines
>> executed via portability since the Runner will be able to perform
>> PTransform replacement and optimization based upon the URN of the transform
>> and its payload so it would never need to have the "Sorter" class in its
>> classpath.
>>
>> I'm ambivalent about whether merging it now is worth it.
>>
>> On Wed, Oct 17, 2018 at 2:31 PM David Morávek 
>> wrote:
>>
>>> We can always fall back to the external sorter in the case of merging
>>> windows. I reckon in this case, values usually fit in memory, so it would
>>> not be an issue.
>>>
>>> In the case of non-merging windows, a runner implementation would probably
>>> need to also group elements by window during the shuffle.
>>>
>>> On Wed, Oct 17, 2018 at 11:10 PM Reuven Lax  wrote:
>>>
 One concern would be merging windows. This happens after shuffle, so
 even if the shuffle were sorted you would need to do a sorted merge of two
 sorted buffers.

 On Wed, Oct 17, 2018 at 2:08 PM David Morávek 
 wrote:

> Hello,
>
> I want to summarize my thoughts on per-key value sorting.
>
> Currently we have a separate module for sorting extension. The
> extension contains *SortValues* transformation and implementations of
> different sorters.
>
> Performance-wise it would be great to be able* to delegate sorting to
> a runner* if it supports a sort-based shuffle. In order to do so, we
> should *move SortValues transformation to sdks-java-core*, so a
> runner can easily provide its own implementation.
>
> A robust implementation is needed mainly for building HFiles for
> the HBase bulk load. When using the external sorter, we often sort the whole
> data set twice (the shuffle may have already done the job).
>
> SortValues cannot use a custom comparator, because we want to be able
> to push the sorting logic down to a byte-based shuffle.
>
> The usage of the SortValues transformation is a little bit confusing. I
> think we should add a *SortValues.perKey* method, which accepts a
> secondary key extractor and coder, as the usage would be easier to
> understand. Also, this explicitly states that we sort values *perKey*
> only and that we sort using an *encoded secondary key*. Example usage:
>
>
> *PCollection<KV<String, KV<String, Long>>> input = ...;*
> *input.apply(SortValues.perKey(KV::getValue, BigEndianLongCoder.of()))*
>
> What do you think? Is this the right direction?
>
> Thanks for the comments!
>
> Links:
> -
> http://mail-archives.apache.org/mod_mbox/beam-dev/201805.mbox/%3Cl8D.1U3Hp.5IxQdKoVDzH.1R3dyk%40seznam.cz%3E
>
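
To make the proposal concrete, here is a hypothetical, naive sketch of what a
SortValues.perKey transform could look like: group by key, then sort each
key's values by the *encoded* secondary key, so a runner with a sort-based
shuffle could replace it wholesale. This API does not exist in Beam; it only
illustrates the shape being proposed, not the robust implementation the thread
asks for (a real version would also set output coders explicitly).

import com.google.common.primitives.UnsignedBytes;
import java.util.ArrayList;
import java.util.List;
import org.apache.beam.sdk.coders.Coder;
import org.apache.beam.sdk.coders.CoderException;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.SerializableFunction;
import org.apache.beam.sdk.util.CoderUtils;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

public class SortValuesPerKey<K, V, SecondaryKeyT>
    extends PTransform<PCollection<KV<K, V>>, PCollection<KV<K, List<V>>>> {

  private final SerializableFunction<V, SecondaryKeyT> secondaryKeyFn;
  private final Coder<SecondaryKeyT> secondaryKeyCoder;

  public SortValuesPerKey(
      SerializableFunction<V, SecondaryKeyT> secondaryKeyFn,
      Coder<SecondaryKeyT> secondaryKeyCoder) {
    this.secondaryKeyFn = secondaryKeyFn;
    this.secondaryKeyCoder = secondaryKeyCoder;
  }

  @Override
  public PCollection<KV<K, List<V>>> expand(PCollection<KV<K, V>> input) {
    return input
        .apply(GroupByKey.create())
        .apply(ParDo.of(new DoFn<KV<K, Iterable<V>>, KV<K, List<V>>>() {
          @ProcessElement
          public void process(ProcessContext c) {
            List<V> values = new ArrayList<>();
            c.element().getValue().forEach(values::add);
            // Sort by the encoded secondary key, as discussed above.
            values.sort((a, b) -> UnsignedBytes.lexicographicalComparator()
                .compare(encodeKey(a), encodeKey(b)));
            c.output(KV.of(c.element().getKey(), values));
          }
        }));
  }

  private byte[] encodeKey(V value) {
    try {
      return CoderUtils.encodeToByteArray(
          secondaryKeyCoder, secondaryKeyFn.apply(value));
    } catch (CoderException e) {
      throw new RuntimeException(e);
    }
  }
}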



Re: Donating the Dataflow Worker code to Apache Beam

2018-09-13 Thread Dan Halperin
From my perspective as a (non-Google) community member, huge +1.

I don't see anything bad for the community about open sourcing more of the
probably-most-used runner. While the DirectRunner is probably still the
closest thing to a reference implementation of Beam, it can't hurt to see more
working code. Other runners or runner implementors can refer to this code if they
want, and ignore it if they don't.

In terms of having more code and tests to support, well, that's par for the
course. Will this change make the things that need to be done to support
them more obvious? (E.g., "this PR is blocked because someone at Google on
the Dataflow team has to fix something" vs "this PR is blocked because the
Apache Beam code in foo/bar/baz is failing, and anyone who can see the code
can fix it"). The latter seems like a clear win for the community.

(As long as the code donation is handled properly, but that's completely
orthogonal and I have no reason to think it wouldn't be.)

Thanks,
Dan

On Thu, Sep 13, 2018 at 11:06 AM Lukasz Cwik  wrote:

> Yes, I'm specifically asking the community for opinions as to whether it
> should be accepted or not.
>
> On Thu, Sep 13, 2018 at 10:51 AM Raghu Angadi  wrote:
>
>> This is terrific!
>>
>> Is this thread asking for opinions from the community about whether it should
>> be accepted? Assuming the Google-side decision is made to contribute, big +1 from
>> me to include it next to other runners.
>>
>> On Thu, Sep 13, 2018 at 10:38 AM Lukasz Cwik  wrote:
>>
>>> At Google we have been importing the Apache Beam code base and
>>> integrating it with the Google portion of the codebase that supports the
>>> Dataflow worker. This process is painful, as we are regularly making
>>> breaking API changes to support libraries related to running portable
>>> pipelines (and sometimes in other places as well). This has sometimes made
>>> it difficult for PRs to make changes without either breaking
>>> something for Google or waiting for a Googler to make the change internally
>>> (e.g. dependency updates).
>>>
>>> This code is very similar to the other integrations that exist for
>>> runners such as Flink/Spark/Apex/Samza. It is an adaptation layer that sits
>>> on top of an execution engine. There is no super secret awesome stuff as
>>> this code was already publicly visible in the past when it was part of the
>>> Google Cloud Dataflow github repo[1].
>>>
>>> Process wise the code will need to get approval from Google to be
>>> donated and for it to go through the code donation process but before we
>>> attempt to do that, I was wondering whether the community would object to
>>> adding this code to the master branch?
>>>
>>> The up side is that people can make breaking changes and fix it for all
>>> runners. It will also help Googlers contribute more to the portability
>>> story as it will remove the burden of doing the code import (wasted time)
>>> and it will allow people to develop in master (can have the whole project
>>> loaded in a single IDE).
>>>
>>> The downsides are that this will represent more code and unit tests to
>>> support.
>>>
>>> 1:
>>> https://github.com/GoogleCloudPlatform/DataflowJavaSDK/tree/hotfix_v1.2/sdk/src/main/java/com/google/cloud/dataflow/sdk/runners/worker
>>>
>>


Re: Broken links to releases

2018-06-19 Thread Dan Halperin
Looks like JB removed these during release of 2.3.0:

svn commit: r25111 - in /release/beam: ./ 0.1.0-incubating/
0.2.0-incubating/ 0.3.0-incubating/ 0.4.0/ 0.5.0/ 0.6.0/ 2.0.0/ 2.1.0/
2.1.1/ 2.2.0/ 2.3.0/

Author: jbonofre
Date: Sat Feb 17 06:08:19 2018
New Revision: 25111

Log:
Publish 2.3.0 release

Added:
release/beam/2.3.0/
release/beam/2.3.0/apache-beam-2.3.0-python.zip   (with props)
release/beam/2.3.0/apache-beam-2.3.0-python.zip.asc
release/beam/2.3.0/apache-beam-2.3.0-python.zip.md5
release/beam/2.3.0/apache-beam-2.3.0-python.zip.sha1
release/beam/2.3.0/apache-beam-2.3.0-source-release.zip   (with props)
release/beam/2.3.0/apache-beam-2.3.0-source-release.zip.asc
release/beam/2.3.0/apache-beam-2.3.0-source-release.zip.md5
release/beam/2.3.0/apache-beam-2.3.0-source-release.zip.sha1
release/beam/latest   (with props)
Removed:
release/beam/0.1.0-incubating/
release/beam/0.2.0-incubating/
release/beam/0.3.0-incubating/
release/beam/0.4.0/
release/beam/0.5.0/
release/beam/0.6.0/
release/beam/2.0.0/
release/beam/2.1.0/
release/beam/2.1.1/
release/beam/2.2.0/


I don't think this is the intended release process -- don't we want them up
forever?

Suggest partial revert of that SVN commit to restore the old releases.

On Tue, Jun 19, 2018 at 1:39 PM Chamikara Jayalath 
wrote:

> I noticed that links to the source code of releases are broken in the following
> page for everything but the latest release.
>
> https://beam.apache.org/get-started/downloads/
>
> For example
> http://www-eu.apache.org/dist/beam/2.3.0/apache-beam-2.3.0-source-release.zip
>
> Anybody know how to fix this ?
>
> - Cham
>


Re: [VOTE] Code Review Process

2018-06-01 Thread Dan Halperin
+1 -- this is encoding what I previously thought the process was and what,
in practice, I think was often the behavior of committers anyway.

On Fri, Jun 1, 2018 at 12:21 PM, Yifan Zou  wrote:

> +1
>
> On Fri, Jun 1, 2018 at 12:10 PM Robert Bradshaw 
> wrote:
>
>> +1
>>
>> On Fri, Jun 1, 2018 at 12:06 PM Chamikara Jayalath 
>> wrote:
>>
>>> +1
>>>
>>> Thanks,
>>> Cham
>>>
>>> On Fri, Jun 1, 2018 at 11:36 AM Jason Kuster 
>>> wrote:
>>>
 +1

 On Fri, Jun 1, 2018 at 11:36 AM Ankur Goenka  wrote:

> +1
>
> On Fri, Jun 1, 2018 at 11:28 AM Charles Chen  wrote:
>
>> +1
>>
>> On Fri, Jun 1, 2018 at 11:20 AM Valentyn Tymofieiev <
>> valen...@google.com> wrote:
>>
>>> +1
>>>
>>> On Fri, Jun 1, 2018 at 10:40 AM, Ahmet Altay 
>>> wrote:
>>>
 +1

 On Fri, Jun 1, 2018 at 10:37 AM, Kenneth Knowles 
 wrote:

> +1
>
> On Fri, Jun 1, 2018 at 10:25 AM Thomas Groh 
> wrote:
>
>> As we seem to largely have consensus in "Reducing Committer Load
>> for Code Reviews"[1], this is a vote to change the Beam policy on 
>> Code
>> Reviews to require that
>>
>> (1) At least one committer is involved with the code review, as
>> either a reviewer or as the author
>> (2) A contributor has approved the change
>>
>> prior to merging any change.
>>
>> This changes our policy from its current requirement that at
>> least one committer *who is not the author* has approved the change 
>> prior
>> to merging. We believe that changing this process will improve code 
>> review
>> throughput, reduce committer load, and engage more of the community 
>> in the
>> code review process.
>>
>> Please vote:
>> [ ] +1: Accept the above proposal to change the Beam code
>> review/merge policy
>> [ ] -1: Leave the Code Review policy unchanged
>>
>> Thanks,
>>
>> Thomas
>>
>> [1] https://lists.apache.org/thread.html/7c1fde3884fbefacc25
>> 2b6d4b434f9a9c2cf024f381654aa3e47df18@%3Cdev.beam.apache.org%3E
>>
>

>>>

 --
 ---
 Jason Kuster
 Apache Beam / Google Cloud Dataflow

 See something? Say something. go/jasonkuster-feedback
 

>>>


Re: Gradle Status [April 6]

2018-04-09 Thread Dan Halperin
On Sat, Apr 7, 2018 at 12:43 Reuven Lax  wrote:

> So if I understand correctly, we've migrated all precommit, most
> postcommits, and we have a working release process using Gradle. There are
> a few bugs left, but at this pace it sounds like we're close to fully
> migrated.
>
> I know that multiple people put it long hours last getting this done last
> week (just look at the Slack messages!). This is awesome progress, and a
> hearty thank you to everyone who put in their time.
>
> Reuven
>
> On Fri, Apr 6, 2018 at 7:52 PM Scott Wegner  wrote:
>
>> Here's an end-of-day update on migration work:
>>
>> * Snapshot unsigned dailies and signed release builds are working (!!).
>> PR/5048 [1] merges changes from Luke's branch
>>   * python precommit failing... will investigate Monday
>> * All Precommits are gradle only
>> * All Postcommits except performance tests and Java_JDK_Versions_Test
>> use gradle (after PR/5047 [2] merged)
>> * Nightly snapshot release using gradle is ready; needs PR/5048 to be
>> merged before switching
>> * ValidatesRunner_Spark failing consistently; investigating
>>
>> Thanks for another productive day of hacking. I'll pick up again on
>> Monday.
>>
>> [1] https://github.com/apache/beam/pull/5048
>> [2] https://github.com/apache/beam/pull/5047
>>
>>
This is really phenomenal work, and a huge boost to the community. Thanks,
everyone who participated!
Dan


>
>> On Fri, Apr 6, 2018 at 11:24 AM Romain Manni-Bucau 
>> wrote:
>>
>>> Why not build a zip per runner with its stack, point at that zip, and let
>>> Beam lazy-load the runner:
>>>
>>> --runner=LazyRunner --lazyRunnerDir=... --lazyRunnerOptions=... (or the
>>> fromSystemProperties() if it gets merged one day ;))
>>>
>>> Le 6 avr. 2018 20:21, "Kenneth Knowles"  a écrit :
>>>
 I'm working on finding a solution for launching the Nexmark suite with
 each runner. This doesn't have to be done via Gradle, but we anyhow need
 built artifacts that don't require user classpath intervention.

 It looks to me like the examples are also missing this - they have
 separate configuration e.g. sparkRunnerPreCommit but that is overspecified
 compared to a free-form launching of a main() program with a runner 
 profile.

 On Fri, Apr 6, 2018 at 11:09 AM Lukasz Cwik  wrote:

> Romain, are you talking about the profiles that exist as part of the
> archetype examples?
>
> If so, then those still exist and haven't been changed. If not, can
> you provide a link to the profile in a pom file to be clearer?
>
> On Fri, Apr 6, 2018 at 12:40 PM Romain Manni-Bucau <
> rmannibu...@gmail.com> wrote:
>
>> Hi Scott,
>>
>> is it right that 2 doesn't handle the hierarchy anymore and that it
>> doesn't handle profiles for runners as it is currently with maven?
>>
>>
>> Romain Manni-Bucau
>>
>> 2018-04-06 18:32 GMT+02:00 Scott Wegner :
>>
>>> I wanted to start a thread to summarize the current state of Gradle
>>> migration. We've made lots of good progress so far this week. Here's the
>>> status from what I can tell-- please add or correct anything I missed:
>>>
>>> * Release artifacts can be built and published for Snapshot and
>>> officlal releases [1]
>>> * Gradle-generated releases have been validated with the Apache
>>> Beam archetype generation quickstart; still needs additional validation.
>>> * Generated release pom files have correct project metadata [2]
>>> * The python pre-commits are now working in Gradle [3]
>>> * Ismaël has started a collaborative doc of Gradle tips [4] as we
>>> all learn the new system-- please add your own. This will eventually 
>>> feed
>>> into official documentation on the website.
>>> * Łukasz Gajowy is working on migrating performance testing
>>> framework [5]
>>> * Daniel is working on updating documentation to refer to Gradle
>>> instead of maven
>>>
>>> If I missed anything, please add it to this thread.
>>>
>>> The general roadmap we're working towards is:
>>> (a) Publish release artifacts with Gradle (SNAPSHOT and signed
>>> releases)
>>> (b) Postcommits migrated to Gradle
>>> (c) Migrate documentation from maven to Gradle
>>> (d) Migrate perfkit suites to use Gradle
>>>
>>> For those of you that are hacking: thanks for your help so far!
>>> Progress is being roughly tracked on the Kanban 

Re: Gradle status

2018-03-22 Thread Dan Halperin
On Thu, Mar 22, 2018 at 11:19 AM, Chamikara Jayalath <chamik...@google.com>
wrote:

> I don't think incremental progress is a bad thing as long as we are making
> progress towards the goal. Do we need better metrics (a weekly email ?)
> about the progress towards moving everything to Gradle ? I agree with
> others who pointed out that there are many unresolved JIRAs and simply
> deleting Maven artifacts could break many things (for example, performance
> tests).
>

The problem does not seem to be incremental progress on its face, or a lack
of metrics.

The problem is that there are two build systems with separate features and
issues, doubling (or worse) Jenkins cycles, mental effort, maintenance
burden, etc. It hurts the community and casual contributors.

As Luke suggested,
> A process vote can happen if the in-between state is too painful to
maintain.

Given that the in-between state has lasted so long, it may be time.

Dan



>
> Thanks,
> Cham
>
>
> On Thu, Mar 22, 2018 at 10:56 AM Romain Manni-Bucau <rmannibu...@gmail.com>
> wrote:
>
>>
>>
>> Le 22 mars 2018 18:49, "Dan Halperin" <dhalp...@apache.org> a écrit :
>>
>> It seems that a few groups are talking past each other.
>>
>> * A sizable contingent is interested in a move to Gradle -- it shows
>> promise, but the work is incomplete.
>> * Another contingent noticing the large burden of maintaining multiple
>> build systems. FWICT, both test suites have been broken quite a lot
>> recently, mainly the gradle ones, which is a cost to the community. This is
>> creating a barrier to entry for new contributors – especially those who
>> don't work in the same room or do their primary development in a different
>> repository.
>>
>> I don't see this situation being resolved to anyone's satisfaction until
>> there's only one build system left. The onus is clearly on the Gradle
>> promoters to finish the work.
>>
>> Luke made a suggestion 2.5 months ago that we should have a process vote
>> if this situation is untenable. It seems like we're there.
>>
>>
>> Yes, but Beam voted to move to Gradle, so we should; we shouldn't
>> maintain 2 build systems for more than 2 months (we're way past that), and
>> therefore the vote should be cancelled or validated by an action.
>>
>> I understand you want Gradle but don't want to pay the cost of moving to
>> Gradle; that doesn't work for the project, so please pick another option
>> (rolling back Gradle or removing Maven; all other options are negative for
>> the project and a pain for committers and contributors, whatever you think).
>>
>>
>>
>> Thanks,
>> Dan
>>
>> On Thu, Mar 22, 2018 at 10:30 AM, Romain Manni-Bucau <
>> rmannibu...@gmail.com> wrote:
>>
>>> Ok so to be clear for any contributor (which is the goal of this
>>> thread): maven is still the main build system and no need to maintain
>>> gradle in PR then until beam switches.
>>>
>>> Im more than fine with that.
>>>
>>> Le 22 mars 2018 18:22, "Alan Myrvold" <amyrv...@google.com> a écrit :
>>>
>>>> I think the investment in gradle is worthwhile, and incrementally we
>>>> will continue to make progress. From what I've seen, gradle is a good fit
>>>> for this project and a path to a faster, more reliable build system.
>>>>
>>>> pull/4812 <https://github.com/apache/beam/pull/4812> creates the
>>>> release artifacts, although it is not hooked up yet with authentication.
>>>>
>>>> I expect to help make incremental progress over the next month
>>>> converting some of the integration tests, but welcome incremental
>>>> improvements from others.
>>>>
>>>>
>>>>
>>>> On Thu, Mar 22, 2018 at 9:57 AM Romain Manni-Bucau <
>>>> rmannibu...@gmail.com> wrote:
>>>>
>>>>>
>>>>>
>>>>> 2018-03-22 17:45 GMT+01:00 Lukasz Cwik <lc...@google.com>:
>>>>>
>>>>>> what do we do? "Gradle migration will happen incrementally."
>>>>>>
>>>>>> "the last months proved Beam can't maintain 2 systems; the easier path
>>>>>> from that state is to drop Gradle, since it is a zero investment compared
>>>>>> to the opposite"
>>>>>> Its unfortunate that you feel this way but many people do not share
>>>>>> your opinion.
>>>>>>
>>>>>
>>>>> And

Re: Gradle status

2018-03-22 Thread Dan Halperin
It seems that a few groups are talking past each other.

* A sizable contingent is interested in a move to Gradle -- it shows
promise, but the work is incomplete.
* Another contingent noticing the large burden of maintaining multiple
build systems. FWICT, both test suites have been broken quite a lot
recently, mainly the gradle ones, which is a cost to the community. This is
creating a barrier to entry for new contributors – especially those who
don't work in the same room or do their primary development in a different
repository.

I don't see this situation being resolved to anyone's satisfaction until
there's only one build system left. The onus is clearly on the Gradle
promoters to finish the work.

Luke made a suggestion 2.5 months ago that we should have a process vote if
this situation is untenable. It seems like we're there.

Thanks,
Dan

On Thu, Mar 22, 2018 at 10:30 AM, Romain Manni-Bucau 
wrote:

> Ok so to be clear for any contributor (which is the goal of this thread):
> maven is still the main build system and no need to maintain gradle in PR
> then until beam switches.
>
> Im more than fine with that.
>
> Le 22 mars 2018 18:22, "Alan Myrvold"  a écrit :
>
>> I think the investment in gradle is worthwhile, and incrementally we will
>> continue to make progress. From what I've seen, gradle is a good fit for
>> this project and a path to a faster, more reliable build system.
>>
>> pull/4812  creates the release
>> artifacts, although it is not hooked up yet with authentication.
>>
>> I expect to help make incremental progress over the next month converting
>> some of the integration tests, but welcome incremental improvements from
>> others.
>>
>>
>>
>> On Thu, Mar 22, 2018 at 9:57 AM Romain Manni-Bucau 
>> wrote:
>>
>>>
>>>
>>> 2018-03-22 17:45 GMT+01:00 Lukasz Cwik :
>>>
 what do we do? "Gradle migration will happen incrementally."

 "the last months proved Beam can't maintain 2 systems; the easier path
 from that state is to drop Gradle, since it is a zero investment compared
 to the opposite"
 Its unfortunate that you feel this way but many people do not share
 your opinion.

>>>
>>> And a lot do, so when a project is 50-50 it is time to act.
>>>
>>> Incrementally kind of means never (it's been 4 months and nothing has really
>>> changed in PRs and habits; the Gradle maintainer(s) are still alone)
>>>
>>>


 On Thu, Mar 22, 2018 at 9:32 AM Romain Manni-Bucau <
 rmannibu...@gmail.com> wrote:

> @Valentyn: concretely, any user can open a PR and be part of that process, so
> anyone can do it wrong (me first)
> @Lukasz, Henning: fine, but what do we do? The last months proved Beam
> can't maintain 2 systems; the easier path from that state is to drop Gradle,
> since it is a zero investment compared to the opposite
>
>
> Romain Manni-Bucau
>
> 2018-03-22 17:24 GMT+01:00 Lukasz Cwik :
>
>> Romain, from the previous discussions several people agreed that
>> running a fixit that migrated Maven to Gradle over a 1 or 2 day period 
>> was
>> worthwhile but there was nobody in the community with the time commitment
>> to organize and run it so the status quo plan remained where the Gradle
>> migration will happen incrementally.
>>
>>
>> On Thu, Mar 22, 2018 at 8:53 AM Henning Rohde 
>> wrote:
>>
>>> My understanding was the same as Ismaël's. I don't think breaking
>>> the build with large known gaps (but a not fully known cost) is practical.
>>> practical.
>>> Also, most items in the jira are not even assigned yet.
>>>
>>>
>>> On Thu, Mar 22, 2018 at 8:03 AM Romain Manni-Bucau <
>>> rmannibu...@gmail.com> wrote:
>>>
 Not really, Ismaël; this thread was about doing it at once and having
 1 day to fix it all.

 As mentioned at the very beginning, nobody maintains the 2 systems,
 so it must stop after months: either we drop Maven or Gradle *at once*,
 or we keep a state where each dev does what they want and the build
 system just doesn't work.
 system just doesn't work.

 2018-03-22 15:42 GMT+01:00 Ismaël Mejía :

> I don't think that removing all maven descriptors was the expected
> path, no ? Or even a good idea at this moment.
>
> I understood that what we were going to do was to incrementally replace
> the CI until we cover the whole Maven functionality

Re: "Radically modular data ingestion APIs in Apache Beam" @ Strata - slides available

2018-03-08 Thread Dan Halperin
Looks like it was a good talk! Why is it Google Confidential & Proprietary,
though?

Dan

On Thu, Mar 8, 2018 at 11:49 AM, Eugene Kirpichov 
wrote:

> Hey all,
>
> The slides for my talk yesterday at Strata San Jose,
> https://conferences.oreilly.com/strata/strata-ca/public/schedule/detail/63696, have been
> posted on the talk page. They may be of interest both to users and IO
> authors.
>
> Thanks.
>


Re: [INFO] Build fails on GCP IO (Spanner)

2017-05-29 Thread Dan Halperin
This looks like somewhere the unit tests are inferring a project from the
environment when they should not be doing so.

On Mon, May 29, 2017 at 8:38 AM, Jean-Baptiste Onofré 
wrote:

> Gonna try to purge my local m2 repo.
>
> Regards
> JB
>
>
> On 05/29/2017 08:05 AM, Jean-Baptiste Onofré wrote:
>
>> Hi team,
>>
>> Since last week, the build is broken due to tests failure on the
>> GCP/Spanner IO:
>>
>> java.lang.IllegalArgumentException: A project ID is required for this
>> service but could not be determined from the builder or the environment.
>> Please set a project ID using the builder.
>>
>> However, Jenkins seems OK on this. I checked and I don't see anything
>> special in the system variables or JVM arguments.
>>
>> I started a change on the SpannerIO to get the project ID in the code in
>> order to have the tests OK (fixing SpannerIO write). Depending on the
>> answers on this e-mail, I will create a pull request.
>>
>> Do you think it's reasonable? I don't see anything special in the
>> README about new prerequisites for SpannerIO.
>>
>> Does anyone else notice this tests failure ?
>>
>> Thanks,
>> Regards
>> JB
>>
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>
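
A sketch of the kind of explicit configuration Dan is pointing at: tests
should set a project ID on the pipeline options rather than letting the client
library infer one from the environment. The package name follows current Beam
and the project ID is a placeholder.

import org.apache.beam.sdk.extensions.gcp.options.GcpOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

class SpannerTestOptions {
  static GcpOptions create() {
    GcpOptions options = PipelineOptionsFactory.create().as(GcpOptions.class);
    // Set explicitly so the test never consults gcloud config or env vars.
    options.setProject("test-project");
    return options;
  }
}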


Re: graph generator?

2017-05-25 Thread Dan Halperin
I think that a util that converted from the Runner API definition of a
pipeline into some sort of graph format (like DOT?) would be generally
useful. By using the Runner API, the tool would be SDK- and
Runner-independent view of the pipeline.

On Thu, May 25, 2017 at 10:54 AM, Jean-Baptiste Onofré 
wrote:

> Hi,
>
> If you mean a graphical tool, no, it's up to each execution engine (it's
> what we showed last week at ApacheCon with Davor).
>
> Some tools can graphically generate the graph with the corresponding Beam
> pipeline.
>
> Regards
> JB
>
>
> On 05/25/2017 07:48 PM, Romain Manni-Bucau wrote:
>
>> Hello guys,
>>
>> does Beam have a graph generator for a pipeline? Not sure the current API
>> fully allows bypassing the runner to just get the Beam graph, but it would
>> help to have a small main generating a png/svg/ascii/ditaa (or a
>> maven/gradle plugin ;))
>>
>> Needed that in my hazelcast-jet work to visualize the graph; I have a quick
>> and dirty impl based on JUNG (BSD license :() built on the Jet graph, but I
>> think it should be pretty trivial to use the Beam graph directly via a
>> pipeline visitor.
>>
>> Mainly sending this mail to share it in case anyone needs it more than
>> anything else:
>> https://gist.github.com/rmannibucau/b5f4e310b40ce414f95f6e22530bbe6e
>>
>> Romain Manni-Bucau
>>
>>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>
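
A rough sketch of such a utility using the Java SDK's pipeline visitor (the
Runner API proto would be the SDK-independent route Dan describes; this sticks
to the Java SDK for brevity). It prints only nodes, not edges, just to show
the traversal hook:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.runners.TransformHierarchy;

public class DotPrinter {
  // Emits one DOT node per primitive transform in the pipeline.
  public static void print(Pipeline pipeline) {
    System.out.println("digraph pipeline {");
    pipeline.traverseTopologically(new Pipeline.PipelineVisitor.Defaults() {
      @Override
      public void visitPrimitiveTransform(TransformHierarchy.Node node) {
        System.out.println("  \"" + node.getFullName() + "\";");
      }
    });
    System.out.println("}");
  }
}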


Re: Behavior of Top.Largest

2017-05-21 Thread Dan Halperin
I think this is an unrealistic request -- Python and Java workflows are
completely different, and Python developer documentation is especially
abysmal.

(E.g., I had to have Robert sit with me to get the Python SDK to work at
all on my developer machine, and even then I gave up and chmod-ed my
machine-wide Python repos to be world-writable to get it to work.)

On Fri, May 19, 2017 at 4:50 PM, Ahmet Altay 
wrote:

> I mentioned this in the PR, I believe it is worth repeating here.
>
> It is important to keep the API consistent between Python and Java. It
> would help a lot, if changes are applied to both SDKs at the same time. If
> that is not possible, an easier alternative would be to file a JIRA issue
> so that the work could be tracked in the other SDK.
>
> Ahmet
>
> On Fri, May 19, 2017 at 4:22 PM, Robert Bradshaw <
> rober...@google.com.invalid> wrote:
>
> > I see this was implemented. Do we have a policy/guideline for when a
> > name is "bad enough" to merit renaming (and keeping a duplicate,
> > deprecated member around for a year or more).
> >
> > On Mon, May 15, 2017 at 9:25 AM, Robert Bradshaw 
> > wrote:
> > > On Sun, May 14, 2017 at 3:36 PM, Ben Chambers
> > 
> > > wrote:
> > >>
> > >> Exposing the CombineFn is necessary for use with composed combine or
> > >> combining value state. There may be other reasons we made this
> visible,
> > >> but
> > >> these continue to justify it.
> > >
> > >
> > > These are the CompareFns, not the CombineFns.
> > >
> > > It'd be nicer to use the Guava and/or Java8 natural ordering
> comparables,
> > > but they don't promise serializable.
> > >
> > > I agree the naming is unfortunate, but I don't know that it's bad
> enough
> > to
> > > introduce a new name and have duplication and deprecation in the API.
> It
> > > also goes deeper than this; Top.of(...) gives elements in *decreasing*
> > order
> > > while List.sort(...) gives elements in *increasing* order so using a
> > > comparator in one will always produce the opposite effect of using a
> > > comparator in the other.
> > >
> > >>
> > >> On Sun, May 14, 2017, 1:00 PM Reuven Lax 
> > wrote:
> > >>
> > >> > I believe the reason why this is called Top.largest, is that
> > originally
> > >> > it
> > >> > was simply the comparator used by Top.largest - i.e. and
> > implementation
> > >> > detail. At some point it was made public and used by other
> transforms
> > -
> > >> > maybe making an implementation detail a public class was the real
> > >> > mistake?
> > >> >
> > >> > On Sun, May 14, 2017 at 11:45 AM, Davor Bonaci 
> > wrote:
> > >> >
> > >> > > I agree this is an unfortunate name.
> > >> > >
> > >> > > Tangential: can we rename APIs now that the first stable release
> is
> > >> > nearly
> > >> > > done?
> > >> > > Of course -- the "rename" can be done by introducing a new API,
> and
> > >> > > deprecating, but not removing, the old one. Then, once we decide
> to
> > >> > > move
> > >> > to
> > >> > > the next major release, the deprecated API can be removed.
> > >> > >
> > >> > > I think we should probably do the "rename" at some point, but I'd
> > >> > > leave
> > >> > the
> > >> > > final call to the wider consensus.
> > >> > >
> > >> > > On Sat, May 13, 2017 at 5:16 PM, Wesley Tanaka
> > >> > >  > >> > >
> > >> > > wrote:
> > >> > >
> > >> > > > Using Top.Largest to sort a list of {2,1,3} produces {1,2,3}.
> > This
> > >> > > > matches the javadoc for the class, but seems counter-intuitive
> --
> > >> > > > one
> > >> > > might
> > >> > > > expect that a Comparator called Largest would give largest items
> > >> > > > first.
> > >> > > > I'm wondering if renaming the classes to Natural / Reversed
> would
> > >> > better
> > >> > > > match their behavior?
> > >> > > >
> > >> > > > ---
> > >> > > > Wesley Tanaka
> > >> > > > https://wtanaka.com/
> > >> > >
> > >> >
> > >
> > >
> >
>
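
A tiny plain-Java illustration of the confusion (no Beam dependency; the
comparator below has the same natural ordering the thread attributes to
Top.Largest):

import java.util.Arrays;
import java.util.List;

public class LargestNamingDemo {
  public static void main(String[] args) {
    List<Integer> xs = Arrays.asList(2, 1, 3);
    // A natural-order comparator sorts ascending, so a comparator named
    // "Largest" used with List.sort puts the largest element *last*.
    xs.sort(Integer::compareTo);
    System.out.println(xs); // [1, 2, 3]
  }
}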


Re: First stable release completed!

2017-05-17 Thread Dan Halperin
Great job, folks. What an amazing amount of work, and I'd like to
especially thank the community for participating in hackathons and
extensive release validation over the last few weeks! We caught some
crucial issues in time and really pushed a much better release as a result.

Thanks everyone!
Dan

On Wed, May 17, 2017 at 11:31 AM, Jesse Anderson 
wrote:

> Awesome!
>
> On Wed, May 17, 2017, 8:30 AM Ahmet Altay 
> wrote:
>
> > Congratulations everyone, this is great!
> >
> > On Wed, May 17, 2017 at 7:26 AM, Kenneth Knowles  >
> > wrote:
> >
> > > Awesome. A huge step.
> > >
> > > On Wed, May 17, 2017 at 6:30 AM, Andrew Psaltis <
> > psaltis.and...@gmail.com>
> > > wrote:
> > >
> > > > This is fantastic.  Great job!
> > > > On Wed, May 17, 2017 at 08:20 Jean-Baptiste Onofré 
> > > > wrote:
> > > >
> > > > > Huge congrats to everyone who helped reaching this important
> > milestone
> > > !
> > > > >
> > > > > Honestly, we are a great team, WE ROCK ! ;)
> > > > >
> > > > > Regards
> > > > > JB
> > > > >
> > > > > On 05/17/2017 01:28 PM, Davor Bonaci wrote:
> > > > > > The first stable release is now complete!
> > > > > >
> > > > > > Release artifacts are available through various repositories,
> > > including
> > > > > > dist.apache.org, Maven Central, and PyPI. The website is
> updated,
> > > and
> > > > > > announcements are published.
> > > > > >
> > > > > > Apache Software Foundation press release:
> > > > > >
> > > > > http://globenewswire.com/news-release/2017/05/17/986839/0/
> > > > en/The-Apache-Software-Foundation-Announces-Apache-Beam-v2-0-0.html
> > > > > >
> > > > > > Beam blog:
> > > > > > https://beam.apache.org/blog/2017/05/17/beam-first-stable-
> > > release.html
> > > > > >
> > > > > > Congratulations to everyone -- this is a really big milestone for
> > the
> > > > > > project, and I'm proud to be a part of this great community.
> > > > > >
> > > > > > Davor
> > > > > >
> > > > >
> > > > > --
> > > > > Jean-Baptiste Onofré
> > > > > jbono...@apache.org
> > > > > http://blog.nanthrax.net
> > > > > Talend - http://www.talend.com
> > > > >
> > > > --
> > > > Thanks,
> > > > Andrew
> > > >
> > > > Subscribe to my book: Streaming Data 
> > > > 
> > > > twiiter: @itmdata  user?screen_name=itmdata>
> > > >
> > >
> >
> --
> Thanks,
>
> Jesse
>


Re: [VOTE] First stable release: release candidate #4

2017-05-15 Thread Dan Halperin
+1

In addition to the review for RC2/RC3 notes in the Acceptance Criteria doc,
I've manually verified that RC4 releases staged on dist.apache.org do not
include binary artifacts :).

Thanks everyone!

On Mon, May 15, 2017 at 11:54 PM, Pei HE  wrote:

> +1
>
> Given users can run WordCount with input file path:
> "C:/Users/Pei/input.txt", the two JIRAs that I filed are not release blockers,
> IMO.
> Maybe this is worth a note in the Java SDK Quickstart?
>
> Thanks Luke for setting up tests on Windows OS. We can improve the dev
> experience on Windows over time.
>
> Looking forward to the stable release.
>
> --
> Pei
>
> On Tue, May 16, 2017 at 4:03 AM, Thomas Groh 
> wrote:
>
> > +1
> >
> > Since the last candidate, I've also run the game examples for a few hours
> > on the DirectRunner and all's well.
> >
> > On Mon, May 15, 2017 at 9:16 AM, Lukasz Cwik 
> > wrote:
> >
> > > +1 (binding)
> > >
> > > Pei, I filed https://issues.apache.org/jira/browse/BEAM-2283 about
> > using a
> > > consistent strategy when dealing with URIs/string like paths in our
> APIs
> > > and related the bugs that you filed to it.
> > >
> > >
> > > On Mon, May 15, 2017 at 6:56 AM, Pei HE  wrote:
> > >
> > > > Currently, several unit tests fail in Windows OS, and the beam repo
> > fails
> > > > to build. (tested in Windows 7)
> > > >
> > > > (Then, I built the jar manually in mac, and copied it to Windows OS)
> > > >
> > > > Found WordCount also doesn't work with Windows OS local files.
> > > >
> > > > Filed two jira:
> > > > https://issues.apache.org/jira/browse/BEAM-2298
> > > > https://issues.apache.org/jira/browse/BEAM-2299
> > > >
> > > >
> > > > On Mon, May 15, 2017 at 6:30 AM, Aljoscha Krettek <
> aljos...@apache.org
> > >
> > > > wrote:
> > > >
> > > > > +1
> > > > >
> > > > > Verified signatures
> > > > > Ran Quickstart code (WordCound, HourlyTeamScore, UserScore) on
> Flink
> > on
> > > > > YARN on Dataproc
> > > > >
> > > > > > On 14. May 2017, at 21:06, Chamikara Jayalath <
> > chamik...@apache.org>
> > > > > wrote:
> > > > > >
> > > > > > +1
> > > > > >
> > > > > > Verified Python SDK examples (WordCount, BigQuery tornadoes,
> > > UserScore,
> > > > > and
> > > > > > HourlyTeamScore)  on Windows for DirectRunner and DataflowRunner.
> > > > > >
> > > > > > Verified all checksums and signatures.
> > > > > >
> > > > > > Thanks,
> > > > > > Cham
> > > > > >
> > > > > >
> > > > > > On Sun, May 14, 2017 at 3:15 AM Ismaël Mejía 
> > > > wrote:
> > > > > >
> > > > > >> +1 (non-binding)
> > > > > >>
> > > > > >> Validated signatures OK
> > > > > >> Run mvn clean verify -Prelease OK
> > > > > >> Executed Nexmark with Direct/Spark/Flink/Apex runners in local
> > mode
> > > > > >> (temporally downgraded to 2.0.0 to validate the version). OK
> > > > > >>
> > > > > >> This is looking great now. As Robert said, a release to be proud
> > of.
> > > > > >>
> > > > > >> On Sun, May 14, 2017 at 8:25 AM, Robert Bradshaw
> > > > > >>  wrote:
> > > > > >>> +1
> > > > > >>>
> > > > > >>> Verified all the checksums and signatures. (Now that both md5
> and
> > > > sha1
> > > > > >>> are broken, we should probably provide sha-256 as well.)
> > > > > >>>
> > > > > >>> Spot checked the site and documentation, left comments on the
> PR.
> > > The
> > > > > >>> main landing page has nothing about the Beam stable release,
> and
> > > the
> > > > > >>> top blog entry (right in the center) mentions 0.6.0 which
> catches
> > > the
> > > > > >>> eye. I assume a 2.0.0 blog will be here shortly?
> > > > > >>>
> > > > > >>> Ran a couple of trivial but novel direct-runner pipelines
> (Python
> > > and
> > > > > >> Java).
> > > > > >>>
> > > > > >>> https://github.com/tensorflow/transform is pinning 0.6.0, so
> we
> > > > won't
> > > > > >>> break them (though hopefully they'll upgrade to >=2.0.0 shortly
> > > after
> > > > > >>> the release).
> > > > > >>>
> > > > > >>> The Python zipfile at
> > > > > >>> https://dist.apache.org/repos/dist/dev/beam/2.0.0-RC4/ is
> > missing
> > > > > >>> sdks/python/apache_beam/transforms/trigger_transcripts.yaml.
> > This
> > > > will
> > > > > >>> cause some tests to be skipped (but no failure). However, I
> don't
> > > > > >>> think it's worth cutting another release candidate for.
> > > > > >>>
> > > > > >>> Everything else is looking great. This is a release to be proud
> > of!
> > > > > >>>
> > > > > >>> - Robert
> > > > > >>>
> > > > > >>>
> > > > > >>>
> > > > > >>> On Sat, May 13, 2017 at 8:40 PM, Mingmin Xu <
> mingm...@gmail.com>
> > > > > wrote:
> > > > >  +1
> > > > > 
> > > > >  Test beam-examples with FlinkRunner, and several cases of
> > > > > >> KafkaIO/JdbcIO.
> > > > > 
> > > > >  Thanks!
> > > > >  Mingmin
> > > > > 
> > > > >  On Sat, May 13, 2017 at 7:38 PM, Ahmet Altay
> > > >  > > > > >
> > > > >  wrote:
> > > > > 
> 

Re: First stable release: Acceptance criteria

2017-05-11 Thread Dan Halperin
I'm focusing on:

* user reported bugs (Avro, TextIO, MongoDb)
* the actual Apache Release criteria (licensing, dependencies, etc.)

On Thu, May 11, 2017 at 3:04 PM, Lukasz Cwik 
wrote:

> I have been trying out various Python scenarios on Windows.
>
> On Thu, May 11, 2017 at 3:01 PM, Jason Kuster <
> jasonkus...@google.com.invalid> wrote:
>
> > I'll try to get wordcount running against a Spark cluster.
> >
> > On Wed, May 10, 2017 at 10:32 PM, Davor Bonaci  wrote:
> >
> > > Just a quick reminder to consider contributing here.
> > >
> > > We are now at 6 criteria -- thanks!
> > >
> > > On Tue, May 9, 2017 at 2:29 AM, Aljoscha Krettek 
> > > wrote:
> > >
> > > > Thanks for starting this document!
> > > >
> > > > I added a criterion and also verified it on the current RC.
> > > >
> > > > Best,
> > > > Aljoscha
> > > >
> > > > > On 8. May 2017, at 22:48, Davor Bonaci  wrote:
> > > > >
> > > > > Based on the process previously discussed [1], I've seeded the
> > > acceptance
> > > > > criteria document [2].
> > > > >
> > > > > Please consider contributing to this effort by:
> > > > > * proposing additional acceptance criteria, and/or
> > > > > * supporting criteria proposed by others, and/or
> > > > > * validating a criteria.
> > > > >
> > > > > Please note that acceptance criteria shouldn't be too deep or too
> > > broad
> > > > > -- those are covered by automated tests and the hackathon we had
> earlier.
> > > > This
> > > > > should be "sanity-check"-type of criteria: simple, surface-level
> > > things.
> > > > >
> > > > > If you discover issues while validating a criteria, please:
> > > > > * file a new JIRA issue, tag it as Fix Versions: “2.0.0”, and
> > > > > * post on the dev@ mailing list on the thread about that specific
> > > > release
> > > > > candidate.
> > > > >
> > > > > Thanks!
> > > > >
> > > > > Davor
> > > > >
> > > > > [1]
> > > > > https://lists.apache.org/thread.html/
> 37caa5a94cec1405638410857f489d
> > > > 7cf7fa12bbe3c36e9925b2d6e2@%3Cdev.beam.apache.org%3E
> > > > > [2]
> > > > > https://docs.google.com/document/d/1XwojJ4Mj3wSlnBO1YlBs51P8kuGyg
> > > > YRj2lrNrqmAUvo/
> > > >
> > > >
> > >
> >
> >
> >
> > --
> > ---
> > Jason Kuster
> > Apache Beam / Google Cloud Dataflow
> >
>


Re: TextIO and .withWindowedWrites() - filenamepolicy

2017-05-11 Thread Dan Halperin
(we should probably throw an exception at construction time in the various
FileBasedSinks if you use WindowedWrites and the default filename policy
though, that's a no-brainer and it's backwards-compatible.)

On Thu, May 11, 2017 at 8:41 AM, Dan Halperin <dhalp...@google.com> wrote:

> +Eugene, Reuven who reviewed and implemented this code. They may have
> opinions.
>
> Note that changing the default filename policy would be
> backwards-incompatible, so this would either need to go into 2.0.0 (and a
> new RC3) or it would not go in.
>
> On Thu, May 11, 2017 at 8:36 AM, Borisa Zivkovic <
> borisha.zivko...@gmail.com> wrote:
>
>> great JB, thanks
>>
>> I do not mind working on this - let's see if anyone else has additional
>> input.
>>
>> cheers
>>
>> On Thu, 11 May 2017 at 16:28 Jean-Baptiste Onofré <j...@nanthrax.net>
>> wrote:
>>
>> > Got it.
>> >
>> > Yes, agree, I think the PerWindowFilesPolicy could be the default and let
>> > the user provide their own policy if they want to.
>> >
>> > Regards
>> > JB
>> >
>> > On 05/11/2017 05:23 PM, Borisa Zivkovic wrote:
>> > > Hi  JB,
>> > >
>> > > yes I saw that thread - I also copied your code but did not want to
>> > pollute
>> > > it with my proposal :)
>> > >
>> > > Well ok maybe default FilePerWindow policy for windowedWrites in
>> TextIO
>> > > does not make sense - not sure TBH...
>> > >
>> > > But would it make sense to promote a version of PerWindowFiles from
>> > >
>> > https://github.com/jbonofre/beam-samples/blob/master/iot/src
>> /main/java/org/apache/beam/samples/iot/JmsToHdfs.java
>> > > so that it is easier to provide some kind of PerWindowFiles filename
>> > > policy..
>> > >
>> > >
>> > > something like (where user does not have to write
>> PerWindowFilesPolicy,
>> > it
>> > > comes with Beam)
>> > >
>> > >
>> > >
>> > > .withFilenamePolicy(PerWindowFilesPolicy.withSuffix("mySuffix"))
>> > > .withWindowedWrites()
>> > > .withNumShards(1));
>> > >
>> > > not sure if this was already discussed...
>> > >
>> > > cheers
>> > > Borisa
>> > >
>> > >
>> > > On Thu, 11 May 2017 at 16:15 Jean-Baptiste Onofré <j...@nanthrax.net>
>> > wrote:
>> > >
>> > >> Hi Borisa,
>> > >>
>> > >> You can take a look about the other thread ("Direct runner doesn't
>> seem
>> > to
>> > >> finalize checkpoint "quickly"").
>> > >>
>> > >> It's basically the same point ;)
>> > >>
>> > >> The default trigger (event-time) doesn't fire any data. I'm
>> > investigating
>> > >> the
>> > >> element timestamp and watermark.
>> > >>
>> > >> I'm also playing with that, for instance:
>> > >>
>> > >>
>> > >>
>> > https://github.com/jbonofre/beam-samples/blob/master/iot/src
>> /main/java/org/apache/beam/samples/iot/JmsToHdfs.java
>> > >>
>> > >> When you use WindowedWrite, you have to provide a filename policy. We could
>> > >> provide a default one, but I'm not sure it would fit well (as it depends a
>> > >> lot on the use cases).
>> > >>
>> > >> Regards
>> > >> JB
>> > >>
>> > >> On 05/11/2017 05:01 PM, Borisa Zivkovic wrote:
>> > >>> Hi guys,
>> > >>>
>> > >>> just playing with reading data from PubSub and writing using TextIO.
>> > >>>
>> > >>> First thing is that it is very hard to get any output - a lot of temp
>> > >>> files written, but final files were not always created.
>> > >>>
>> > >>> So, I am playing with triggers etc... If I do following
>> > >>>
>> > >>> PCollection<String> streamData = p.apply(
>> > >>>     PubsubIO.readStrings().fromTopic("projects/" + PROJECT_NAME
>> > >>>         + "/topics/myTopic"));
>> > >>>
>> > >>

Re: Process for getting the first stable release out

2017-05-08 Thread Dan Halperin
I like putting in master then CPing into release, because we should have a
high bar for what goes into release. It should absolutely NOT default to
everything; we should have to justify everything.

E.g., https://github.com/apache/beam/pull/2958 - where I open the CP but
suggest this may not be worthy for release as it's just cleaning up logs
and errors.

On Mon, May 8, 2017 at 1:10 PM, Robert Bradshaw <rober...@google.com.invalid
> wrote:

> On Mon, May 8, 2017 at 12:57 PM, Davor Bonaci <da...@apache.org> wrote:
> > We cannot do (clean) merges; both branches contain unwanted changes in
> the
> > other branch. So, we have to cherry-pick regardless where we merge first.
>
> Shouldn't the set of changes wanted in release but not in master be
> quite small (if any)? In that case, one could do an explicit revert
> following merge from release to master, when needed. The extra work
> scales in terms of how much we want to diverge (rather than all the
> changes we want in release that should also be in master, which is the
> bulk of them, touched by significantly more people.)
>
> > With post commits running automatically on master only, that seems like a
> > logical starting point. But, it doesn't matter really -- either way
> works.
> >
> > On Mon, May 8, 2017 at 12:30 PM, Robert Bradshaw <
> > rober...@google.com.invalid> wrote:
> >
> >> An alternative strategy, given the number of outstanding changes,
> >> would be to create release-intended PRs against the release branch
> >> itself, then periodically merge back to master. This would reduce the
> >> need for manual (and error-prone) cherry-picking.
> >>
> >> On Fri, May 5, 2017 at 5:21 PM, Davor Bonaci <da...@apache.org> wrote:
> >> > The release branch is now created [1]. Anything for the first stable
> >> > release should go into master as usual, and then get cherry-picked
> into
> >> the
> >> > release branch.
> >> >
> >> > I'll create the first RC shortly and then start a doc around the
> >> acceptance
> >> > criteria.
> >> >
> >> > From this point onward, backward-incompatible changes should *not* be
> >> > merged to master, unless they are also getting cherry-picked into the
> >> > release branch.
> >> >
> >> > Davor
> >> >
> >> > [1] https://github.com/apache/beam/tree/release-2.0.0
> >> >
> >> > On Fri, May 5, 2017 at 1:57 PM, Thomas Groh <tg...@google.com.invalid
> >
> >> > wrote:
> >> >
> >> >> I'm also +1 on the branch. It'll help us make sure that what we're
> >> getting
> >> >> in is what we need for the FSR.
> >> >>
> >> >> On Fri, May 5, 2017 at 12:41 PM, Dan Halperin <dhalp...@apache.org>
> >> wrote:
> >> >>
> >> >> > I am +1 on cutting the branch, and the sentiment that we expect the first
> >> >> > pancake
> >> >> > <https://www.quora.com/Why-do-you-have-to-throw-out-the-first-pancake>
> >> >> > will not be ready to serve customers.
> >> >> >
> >> >> > On Fri, May 5, 2017 at 11:40 AM, Kenneth Knowles
> >> <k...@google.com.invalid
> >> >> >
> >> >> > wrote:
> >> >> >
> >> >> > > On Thu, May 4, 2017 at 12:07 PM, Davor Bonaci <da...@apache.org>
> >> >> wrote:
> >> >> > >
> >> >> > > > I'd like to propose the following (tweaked) process for this
> >> special
> >> >> > > > release:
> >> >> > > >
> >> >> > > > * Create a release branch, and start building release
> candidates
> >> >> *now*
> >> >> > > > This would accelerate branch creation compared to the normal
> >> process,
> >> >> > but
> >> >> > > > would separate the first stable release from other development
> on
> >> the
> >> >> > > > master branch. This yields to stability and avoids unnecessary
> >> churn.
> >> >> > > >
> >> >> > >
> >> >> > > +1 to cutting a release branch now.
> >> >> > >
> >> >> > > This sounds compatible with the release process [1] to me,
> actually.
> >> >> This
> >> >> > > thread seems like the dev@ thread where we "decide to release"
> and
> >> I
> >> >> > agree
> >> >> > > that we should decide to release. Certainly `master` is not ready
> >> nor
> >> >> is
> >> >> > > the web site - there are ~29 issues as I write this though many
> are
> >> not
> >> >> > > really significant code changes. But we should never wait until
> >> >> `master`
> >> >> > is
> >> >> > > "ready".
> >> >> > >
> >> >> > > We know what we want to get done, and there are no radical
> changes,
> >> so
> >> >> I
> >> >> > > think that makes this the right time to branch. We can easily
> cherry
> >> >> pick
> >> >> > > fixes for our burndown list to ensure we don't introduce
> additional
> >> >> > > blockers.
> >> >> > >
> >> >> > > Some of the burndown list are of the form "investigate if this
> >> >> suspected
> >> >> > > bug still repros" and a release candidate is the perfect thing to
> >> use
> >> >> for
> >> >> > > that.
> >> >> > >
> >> >> > > [1] https://beam.apache.org/contribute/release-guide/#
> >> >> decide-to-release
> >> >> > >
> >> >> >
> >> >>
> >>
>


Re: Process for getting the first stable release out

2017-05-05 Thread Dan Halperin
I am +1 on cutting the branch, and the sentiment that we expect the first
pancake <https://www.quora.com/Why-do-you-have-to-throw-out-the-first-pancake>
will not be ready to serve customers.

On Fri, May 5, 2017 at 11:40 AM, Kenneth Knowles 
wrote:

> On Thu, May 4, 2017 at 12:07 PM, Davor Bonaci  wrote:
>
> > I'd like to propose the following (tweaked) process for this special
> > release:
> >
> > * Create a release branch, and start building release candidates *now*
> > This would accelerate branch creation compared to the normal process, but
> > would separate the first stable release from other development on the
> > master branch. This yields to stability and avoids unnecessary churn.
> >
>
> +1 to cutting a release branch now.
>
> This sounds compatible with the release process [1] to me, actually. This
> thread seems like the dev@ thread where we "decide to release" and I agree
> that we should decide to release. Certainly `master` is not ready nor is
> the web site - there are ~29 issues as I write this though many are not
> really significant code changes. But we should never wait until `master` is
> "ready".
>
> We know what we want to get done, and there are no radical changes, so I
> think that makes this the right time to branch. We can easily cherry pick
> fixes for our burndown list to ensure we don't introduce additional
> blockers.
>
> Some of the burndown list are of the form "investigate if this suspected
> bug still repros" and a release candidate is the perfect thing to use for
> that.
>
> [1] https://beam.apache.org/contribute/release-guide/#decide-to-release
>


Re: Slack Invites

2017-05-04 Thread Dan Halperin
My understanding is that if you use something like that plugin, and they
detect it, Slack will ban you from new invites entirely or otherwise punish
you. They want this friction for free projects so that there's pressure to
pay.

On Thu, May 4, 2017 at 9:18 AM, Jesse Anderson 
wrote:

> Is it possible to change how Slack invites are handled? This might encourage
> more community contributions.
>
> Right now, people have to email in (causing extra dev@/user@ emails). I
> did
> a quick search and found this so
> people
> can invite themselves.
>
> Thanks,
>
> Jesse
> --
> Thanks,
>
> Jesse
>


Re: Status of our CI tools

2017-04-30 Thread Dan Halperin
I think the confusion to new users is much worse than any temporary loss of
functionality here. +1 * 100!

On Fri, Apr 28, 2017 at 11:00 PM, Mingmin Xu  wrote:

> +1
> Have ignored TravisCI for some time, as the failures are not related to
> code/test issues.
>
> I still hope TravisCI could work with the Beam code repository some day, to
> run tests before creating a PR.
>
> Mingmin
>
> > On Apr 28, 2017, at 10:26 PM, Aljoscha Krettek 
> wrote:
> >
> > Big +1
> >
> >> On 29. Apr 2017, at 07:21, Robert Bradshaw 
> wrote:
> >>
> >> On Fri, Apr 28, 2017 at 9:56 PM, Jean-Baptiste Onofré 
> wrote:
> >>> +1
> >>>
> >>> Travis is useless and our Jenkins is good IMHO !
> >>
> >> Travis is really useful for the Python SDK, but I'm hopeful that soon
> >> Jenkins will be stable and quick enough that I won't miss it, and
> >> having only one CI to deal with should simplify things.
> >>
> >> - Robert
> >
>


Re: An Update on Jenkins

2017-04-26 Thread Dan Halperin
> If not, feel free to reply to this thread

... not. :) :(

On Tue, Apr 25, 2017 at 9:58 PM, Jean-Baptiste Onofré 
wrote:

> Thanks for the update !
>
> Regards
> JB
>
> On Apr 26, 2017, 05:51, at 05:51, Jason Kuster 
> 
> wrote:
> >Hey folks,
> >
> >There have been a couple of different issues over the last couple of
> >days
> >related to some necessary updates Infra has been working on. We've
> >tracked
> >down the last couple of issues, and the latest one seems to be that
> >we're
> >being hit by the rate limiter as a result of everything starting back
> >up
> >again. They expect that waiting a couple of hours should solve the
> >problem,
> >so hopefully by tomorrow things will be back to normal. If not, feel
> >free
> >to reply to this thread, and I'll try to keep things up to date with
> >status.
> >
> >Best,
> >
> >Jason
> >
> >--
> >---
> >Jason Kuster
> >Apache Beam / Google Cloud Dataflow
>


Re: Naming of Combine.Globally

2017-04-18 Thread Dan Halperin
Great discussion! As Aljoscha says, Fold, Reduce, and Combine are all
intertwined and not quite identical as we use them.

Another simple but perhaps coy answer is that if you read the MapReduce
paper by Dean and Ghemawat that started this all, they used "Map",
"Reduce", and "Combine" (see section 4.3:
https://research.google.com/archive/mapreduce.html)

So then it's likely just the lineage of Beam as "evolving from MapReduce"
:). [Looking around the source tree: we have MapElements, ReduceFn, and
Combine. And the DataflowRunner has Shuffle inside of GroupByKey. ;)]

Dan

On Tue, Apr 18, 2017 at 3:16 AM, Aljoscha Krettek 
wrote:

> The definition of foldl in Haskell is the same as the description I gave
> earlier:
>
> foldl :: (a -> b -> a) -> a -> [b] -> a
>
> The function (a -> b -> a) is what I described as (T, A) -> A and it’s
> used to fold a list of b’s into an a (the accumulator type).
>
> You’re right that the mapping AccumT->OutputT is not important and could
> be delegated to a separate method. The important part of the interface is
> mergeAccumulators() since this makes the operation distributive: we can
> “fold” a bunch of Ts into As in parallel (even on different machines) and
> then merge them together. This is what is missing from a functional fold.
>
> Best,
> Aljoscha
>
>
> > On 18. Apr 2017, at 12:03, Wesley Tanaka 
> wrote:
> >
> > I believe that foldl in Haskell https://www.haskell.org/
> hoogle/?hoogle=foldl admits a separate accumulator type from the type of
> the data structure being "folded"
> > And, well, python lets you have your way with mixing types, but this
> certainly works as another example:
> python -c "print(reduce(lambda ac, elem: '%s%d' % (ac, elem), [1,2,3,4,5], ''))"
> > Is there anything special about the AccumT->OutputT conversion that
> extractOutput() needs to be in the same interface as createAccumulator(),
> addInput() and mergeAccumulators()?  If the interface were segregated such
> that one interface managed the InputT->AccumT conversion, and the second
> managed the AccumT->OutputT conversion, it seems like maybe the
> AccumT->OutputT conversion could even get replaced with MapElements?  And
> then the full current "Combine" functionality could be implemented as a
> composition of the lower-level primitives?
> > I haven't dug that deeply into Combine yet, so I may be missing
> something obvious.
> > ---
> > Wesley Tanaka
> > https://wtanaka.com/
> >
> > On Monday, April 17, 2017, 11:32:29 PM HST, Aljoscha Krettek <
> aljos...@apache.org> wrote:Hi,
> > I think both fold and reduce fail to capture all the power or (what we
> call) combine. Reduce requires a function of type (T, T) -> T. It requires
> that the output type be the same as the input type. Fold takes a function
> (T, A) -> A where T is the input type and A is the accumulation type. Here,
> the output type can be different from the input type. However, there is no
> way of combining these aggregators so the operation is not distributive,
> i.e. we cannot hierarchically apply the operation.
> >
> > Combine is the generalisation of this: We have three types, T (input), A
> (accumulator), O (output) and we require a function that can merge
> accumulators. The operation is distributive, meaning we can efficiently
> execute it and we can also have an output type that is different from the
> input type.
> >
> > Quick FYI: in Flink the CombineFn is called AggregatingFunction and
> CombiningState is AggregatingState.
> >
> > Best,
> > Aljoscha
> >> On 18. Apr 2017, at 04:29, Wesley Tanaka 
> wrote:
> >>
> >> As I start to understand Combine.Globally, it seems that it is, in
> spirit, Beam's implementation of the "fold" higher-order function
> >> https://en.wikipedia.org/wiki/Fold_(higher-order_function)#
> Folds_in_various_languages
> >>
> >> Was there a reason the word "combine" was picked instead of either
> "fold" or "reduce"?  From the wikipedia list above, it seems as though
> "fold" and "reduce" are in much more common usage, so either of those might
> be easier for newcomers to understand.
> >> ---
> >> Wesley Tanaka
> >> http://wtanaka.com/
>
>
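For readers following along: a minimal sketch of the three-type combine shape
described above, written against Beam's CombineFn of this era (a sketch, not a
definitive implementation; the accumulator is made Serializable so a coder can
be inferred):

import java.io.Serializable;
import org.apache.beam.sdk.transforms.Combine;

// Mean over integers: InputT = Integer, AccumT = (sum, count), OutputT = Double.
// mergeAccumulators() is the piece a plain functional fold lacks: it lets
// partial folds computed on different workers be merged hierarchically.
public class MeanFn extends Combine.CombineFn<Integer, MeanFn.Accum, Double> {
  public static class Accum implements Serializable {
    long sum;
    long count;
  }

  @Override
  public Accum createAccumulator() {
    return new Accum();
  }

  @Override
  public Accum addInput(Accum accum, Integer input) {
    accum.sum += input;
    accum.count++;
    return accum;
  }

  @Override
  public Accum mergeAccumulators(Iterable<Accum> accums) {
    Accum merged = createAccumulator();
    for (Accum a : accums) {
      merged.sum += a.sum;
      merged.count += a.count;
    }
    return merged;
  }

  @Override
  public Double extractOutput(Accum accum) {
    return accum.count == 0 ? 0.0 : ((double) accum.sum) / accum.count;
  }
}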


Re: [DISCUSSION] PAssert success/failure count validation for all runners

2017-04-17 Thread Dan Halperin
(I'll also note that the bit about URNs and whatnot is decouplable -- we
have Pipeline surgery APIs right now, and will someday have
URN-with-payload-based-surgery APIs, but we can certainly do the work to
make PAssert more overridable now and be ready for full Runner API work
later).

On Mon, Apr 17, 2017 at 11:14 AM, Dan Halperin <dhalp...@google.com> wrote:

> I believe Pablo's existing proposal is here:
> https://lists.apache.org/thread.html/CADJfNJBEuWYhhH1mzMwwvUL9Wv2HyFc8_E=9zYBKwUgT8ca1HA@mail.gmail.com
>
> The idea is that we'll stick with the current design -- aggregator- (but
> now metric)-driven validation of PAssert. Runners that do not support these
> things can override the validation step to do something different.
>
> This seems to me to satisfy all parties and unblock removal of
> aggregators. If a runner supports aggregators but not metrics because the
> semantics are slightly different, that runner can override the behavior.
>
> I agree that all runners doing sensible things with PAssert should be a
> first stable release blocker. But I do not think it's important that all
> runners verify them the same way. There has been no proposal that provides
> a single validation mechanism that works well with all runners.
>
> On Wed, Apr 12, 2017 at 9:24 AM, Aljoscha Krettek <aljos...@apache.org>
> wrote:
>
>> That sounds very good! Now we only have to manage to get this in before
>> the first stable release because I think this is a very important signal
>> for ensuring Runner correctness.
>>
>> @Pablo Do you already have plans regarding 3., i.e. stable URNs for the
>> assertions. And also for verifying them in a runner-agnostic way in
>> TestStream, i.e. https://issues.apache.org/jira/browse/BEAM-1763? <
>> https://issues.apache.org/jira/browse/BEAM-1763?>
>>
>> Best,
>> Aljoscha
>>
>> > On 10. Apr 2017, at 10:10, Kenneth Knowles <k...@google.com.INVALID>
>> wrote:
>> >
>> > On Sat, Apr 8, 2017 at 7:00 AM, Aljoscha Krettek <aljos...@apache.org>
>> > wrote:
>> >
>> >> @kenn What’s the design you’re mentioning? (I probably missed it
>> because
>> >> I’m not completely up to data on the Jiras and ML because of Flink
>> Forward
>> >> preparations)
>> >>
>> >
>> > There are three parts (I hope I say this in a way that makes everyone
>> happy)
>> >
>> > 1. Each assertion transform is followed by a verifier transform that
>> fails
>> > if it sees a non-success (in addition to bumping metrics).
>> > 2. Use the same trick PAssert already uses, flatten in a dummy value to
>> > reduce the risk that the verifier transform never runs.
>> > 3. Stable URNs for the assertion and verifier transforms so a runner
>> has a
>> > good chance to wire custom implementations, if it helps.
>> >
>> > I think someone mentioned it earlier, but these also work better with
>> > metrics that overcount, since it is now about covering the verifier
>> > transforms rather than an absolute number of successes.
>> >
>> > Kenn
>> >
>> >
>> >>> On 7. Apr 2017, at 12:42, Kenneth Knowles <k...@google.com.INVALID>
>> >> wrote:
>> >>>
>> >>> We also have a design that improves the signal even without metrics,
>> so
>> >> I'm
>> >>> pretty happy with this.
>> >>>
>> >>> On Fri, Apr 7, 2017 at 12:12 PM, Lukasz Cwik <lc...@google.com.invalid
>> >
>> >>> wrote:
>> >>>
>> >>>> I like the usage of metrics since it doesn't depend on external
>> >> resources.
>> >>>> I believe there could be some small amount of code shared between
>> >> runners
>> >>>> for the PAssert metric verification.
>> >>>>
>> >>>> I would say that PAssert by itself and PAssert with metrics are two
>> >> levels
>> >>>> of testing available. For runners that don't support metrics than
>> >> PAssert
>> >>>> gives a signal (albeit weaker one) and ones that do support metrics
>> will
>> >>>> have a stronger signal for execution correctness.
>> >>>>
>> >>>> On Fri, Apr 7, 2017 at 11:59 AM, Aviem Zur <aviem...@gmail.com>
>> wrote:
>> >>>>
>> >>>>> Currently, PAssert assertions may not happen and tests will pass
>> while
>> >>>>> silently hiding i
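A rough sketch of the metric-driven validation discussed in this thread
(hypothetical class and counter names; the metrics query API has shifted
across Beam versions, so the query side is shown only as comments):

import org.apache.beam.sdk.metrics.Counter;
import org.apache.beam.sdk.metrics.Metrics;
import org.apache.beam.sdk.transforms.DoFn;

// Bump a counter for every assertion that runs and passes; a verifier can
// then check the counter against the expected number of assertions.
class VerifyAssertionsFn extends DoFn<Object, Void> {
  private final Counter successes = Metrics.counter("PAssert", "successes");

  @ProcessElement
  public void processElement(ProcessContext c) {
    // ... fail fast here on a non-success element ...
    successes.inc();
  }
}

// After the pipeline finishes, a test harness could do roughly:
//   PipelineResult result = pipeline.run();
//   result.waitUntilFinish();
//   result.metrics().queryMetrics(...);  // look up "PAssert"/"successes"
//   // and compare the attempted count against the number of assertions.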

Re: Join to external table

2017-04-14 Thread Dan Halperin
Hi Jingsong,

This seems like a fantastic, reusable pattern to add, and indeed it's a
fairly common one. There are probably some interesting API issues too --
such as how you make a nice clean interface that works for many backends
(Bigtable? HBase? Redis? Memcache? etc.), and how you let users supply a
caching policy.

It sounds like you may have already worked through these -- would you like
to write down what you've learned and send out a short proposal?

Thanks!

On Thu, Apr 13, 2017 at 8:40 AM, JingsongLee 
wrote:

> Hi all,
>
>
> I've seen repeatedly the following pattern:
> Consider a sql (Joining stream to table, from Calcite):
> SELECT STREAM o.rowtime, o.productId, o.orderId, o.units,
>   p.name, p.unitPrice
> FROM Orders AS o
> JOIN Products AS p
>   ON o.productId = p.productId;
> A stream-to-table join is straightforward if the contents of the table are
> not changing (or are changing slowly). This query enriches a stream of
> orders with each product’s list price.
>
> This table is usually in HBase, MySQL, or Redis. Most of our users think
> that we should use SideInputs to implement it. But there are some
> difficulties here:
> 1. Maybe this table is very large! AFAIK, SideInputs will load all the data
> into memory. We cannot load it all, but we can do some caching work.
> 2. This table may be updated periodically, as mentioned in
> https://issues.apache.org/jira/browse/BEAM-1197
> 3. Sometimes users want to update this table, in scenarios which don’t
> need high accuracy. (Reads from and writes to the external storage can’t
> guarantee exactly-once.)
>
> So we developed a component called DimState (maybe the name is not right).
> It either uses a cache (a LoadingCache, currently) or loads everything. Both
> have a time-to-live mechanism. The abstract interface is called
> ExternalState, with implementations HBaseState, JDBCState, and RedisState.
> It is accessed by key and namespace, and provides bulk access to the
> external table for performance.
>
> Is there a better way to implement it? Can we make some abstracts in Beam
> Model?
>
> What do you think?
>
> Best,
> JingsongLee
>
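The pattern JingsongLee describes can be sketched as a DoFn that consults a
TTL-bounded Guava cache instead of a side input. This is only an illustration:
Order, Product, EnrichedOrder, and ProductClient are placeholders for the
user's types and external store client.

import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;
import java.util.concurrent.TimeUnit;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

class EnrichOrdersFn extends DoFn<KV<String, Order>, KV<String, EnrichedOrder>> {
  private transient LoadingCache<String, Product> cache;

  @Setup
  public void setup() {
    cache = CacheBuilder.newBuilder()
        .maximumSize(100_000)                    // don't hold the whole table
        .expireAfterWrite(10, TimeUnit.MINUTES)  // time-to-live for slow changes
        .build(new CacheLoader<String, Product>() {
          @Override
          public Product load(String productId) throws Exception {
            return ProductClient.lookup(productId);  // remote read; batch if possible
          }
        });
  }

  @ProcessElement
  public void processElement(ProcessContext c) throws Exception {
    KV<String, Order> kv = c.element();
    Product product = cache.get(kv.getKey());
    c.output(KV.of(kv.getKey(), new EnrichedOrder(kv.getValue(), product)));
  }
}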


Re: [DISCUSSION] Consistent use of loggers

2017-04-12 Thread Dan Halperin
For the examples module (which I think is auto-propagated to the examples
archetype, and I think also done manually for the starter archetype):

* Every runner, including DirectRunner, is in a profile: -Pdirect-runner:
https://github.com/apache/beam/blob/master/examples/java/pom.xml#L43
* The slf4j-jdk14 is already used, but not in a profile:
https://github.com/apache/beam/blob/master/examples/java/pom.xml#L517

I would be supportive of moving the slf4j-jdk14 to only Direct and Dataflow
profiles if that is what you think is best.

And yes, the intent is that only 1 runner profile in these cases is active
at a time.
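For concreteness, moving the binding into the runner profile would amount to
something like this in examples/java/pom.xml (a sketch, not the actual file
contents):

  <profile>
    <id>direct-runner</id>
    <dependencies>
      <dependency>
        <groupId>org.apache.beam</groupId>
        <artifactId>beam-runners-direct-java</artifactId>
        <scope>runtime</scope>
      </dependency>
      <!-- JDK logging binding so first-time users see log output locally -->
      <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-jdk14</artifactId>
        <scope>runtime</scope>
      </dependency>
    </dependencies>
  </profile>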

On Thu, Apr 6, 2017 at 9:21 PM, Aviem Zur <aviem...@gmail.com> wrote:

> >IMO I don't think the DirectRunner should depend directly on any specific
> logging
> backend (at least, not in the compile or runtime scopes). I think it should
> depend on JUL in the test scope, so that there are logs when executing
> DirectRunner tests.
> >My reasoning: I can see in any binary version of Beam that the SDK, the
> DirectRunner,
> and 1 or more other runners will all be on the classpath.
> >Ideally this should work regardless of whatever other runner is used;
> presumably
> the DirectRunner would "automagically" pick up the logging config of the
> other runner.
> That sounds like a very plausible scenario and this would "protect" the
> runner's binding from an intruding binding from direct runner, since it
> would have no binding.
> However, there is also the scenario where a user runs the examples using the
> direct runner as their first interaction with Beam, and they see no logs
> whatsoever; they would have to add a binding themselves.
> We could solve this by adding a binding in the 'direct-runner' profile in
> examples module and the maven archetypes (And allow only one runner profile
> to be specified at a time, in case their logger binding clashes).
>
> >I like the use of slf4j as it enables lots of publishers of logs, but I
> don't
> want to supply a default/required consumer of logs because that will
> restrict
> use cases in the future...
> I agree, forcing log4j binding might give the user a false sense of: "all
> runners use log4j" while this might not be true for future (and isn't true
> today, for Dataflow runner), but we can't assure that future runners could
> support this.
>
> So it seems we're left with:
> 1) Add documentation around logging in each runner.
> 2) Consider enabling a binding (JUL) for direct runner profile in examples
> module and maven archetypes.
> 3) Allow only one runner profile to be active at a time in examples and
> maven archetypes as their logger binding might clash.
>
> Thoughts?
>
> On Tue, Apr 4, 2017 at 8:51 AM Dan Halperin <dhalp...@google.com.invalid>
> wrote:
>
> > At this point, I'm a little unclear on what is the proposal. Can you
> > refresh a simplified/aggregated view after this conversation?
> >
> > IMO I don't think the DirectRunner should depend directly on any specific
> > logging backend (at least, not in the compile or runtime scopes). I think
> > it should depend on JUL in the test scope, so that there are logs when
> > executing DirectRunner tests.
> >
> > My reasoning: I can see in any binary version of Beam that the SDK, the
> > DirectRunner, and 1 or more other runners will all be on the classpath.
> > Ideally this should work regardless of whatever other runner is used;
> > presumably the DirectRunner would "automagically" pick up the logging
> > config of the other runner.
> >
> > I like the use of slf4j as it enables lots of publishers of logs, but I
> > don't want to supply a default/required consumer of logs because that
> will
> > restrict use cases in the future...
> >
> > On Mon, Apr 3, 2017 at 8:14 PM, Jean-Baptiste Onofré <j...@nanthrax.net>
> > wrote:
> >
> > > Fair enough. +1 especially for the documentation.
> > >
> > > Regards
> > > JB
> > >
> > >
> > > On 04/03/2017 08:48 PM, Aviem Zur wrote:
> > >
> > >> Upon further inspection there seems to be an issue we may have
> > overlooked:
> > >> In cluster mode, some of the runners will have dependencies added
> > directly
> > >> to the classpath by the cluster, and since SLF4J can only work with
> one
> > >> binding, the first one in the classpath will be used.
> > >>
> > >> So while what we suggested would work in local mode, the user's chosen
> > >> binding and configuration might be ignored in cluster mode, which is
> > >> detrimental to what we wanted to accomplish.
> >

Re: Style: how much testing for transform builder classes?

2017-03-21 Thread Dan Halperin
https://github.com/apache/beam/commit/b202548323b4d59b11bbdf06c99d0f99e6a947ef
is one example where tests of feature Bar exist but did not discover bugs
that could be introduced by builders.

AutoValue likely alleviates many, but not all, of these concerns - as Ismaël
points out.



On Tue, Mar 21, 2017 at 1:18 PM, Robert Bradshaw <
rober...@google.com.invalid> wrote:

> On Wed, Mar 15, 2017 at 2:11 AM, Ismaël Mejía  wrote:
>
> > +1 to Vikas' point: maybe the right place to enforce correct construction
> > is in validate(), and this way we reduce the test boilerplate and only
> > test the validation. But I wonder if this totally covers both cases (the
> > buildsCorrectly and buildsCorrectlyInDifferentOrder ones).
> >
> > I answer Eugene’s question here even if you are aware now since you
> > commented in the PR, so everyone understands the case.
> >
> > The case is pretty simple, when you extend an IO and add a new
> > configuration parameter, suppose we have withFoo(String foo) and we
> > want to add withBar(String bar). In some cases the implementation or
> > even worse the combination of those are not built correctly, so the
> > only way to guarantee that this works is to have code that tests the
> > complete parameter combination or tests that at least assert that the
> > object is built correctly.
> >
> > This is something that can happen both with or without AutoValue
> > because the with method is hand-written and the natural tendency with
> > boilerplate methods like this is to copy/paste, so we can end up doing
> > silly things like:
> >
> > private Read(String foo, String bar) { … }
> >
> > public Read withBar(String bar) {
> >   return new Read(foo, null);
> > }
> >
> > in this case the reference to bar is not stored or assigned (this is
> > similar to the case of the DatastoreIO PR), and AutoValue may seem to
> > solve this issue but you can also end up with this situation if you
> > copy paste the withFoo method and just change the method name:
> >
> > public Read withBar(String foo) {
> >   return builder().setFoo(foo).build();
> > }
> >
> > Of course both seem silly but both can happen and the tests at least
> > help to discover those,
> >
>
> Such mistakes should be entirely discovered by tests of feature Bar. If Bar
> is not actually being tested, that's a bigger problem with coverage that a
> construction-only test actually obscures (giving it negative value).
>
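For concreteness, the kind of construction-order test being debated might look
roughly like this (hypothetical Read transform with AutoValue-style getters,
not taken from the Beam codebase):

  @Test
  public void buildsCorrectlyInDifferentOrder() {
    Read read = Read.create().withBar("bar").withFoo("foo");
    assertEquals("foo", read.getFoo());
    assertEquals("bar", read.getBar());  // catches both copy/paste bugs shown above
  }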
>
> >
> > On Wed, Mar 15, 2017 at 1:05 AM, vikas rk  wrote:
> > > Yes, what I meant is: Necessary tests are ones that block users if not
> > > present. Trivial or non-trivial shouldn't be the issue in such cases.
> > >
> > > Some of the boilerplate code and tests is because IO PTransforms are
> > > returned to the user before they are fully constructed and actual
> > > validation happens in the validate method rather than at construction.
> I
> > > understand that the reasoning here is that we want to support
> > constructing
> > > them with options in any order and using Builder pattern can be
> > confusing.
> > >
> > > If the validate method is where all the validation happens, then we should
> > > be able to eliminate some redundant checks and tests during construction
> > > time, like in the *withOption* methods here
> > > <https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigtable/BigtableIO.java#L199>
> > > and here
> > > <https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/datastore/DatastoreV1.java#L387>,
> > > as these are also checked in the validate method.
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > -Vikas
> > >
> > >
> > >
> > > On 14 March 2017 at 15:40, Eugene Kirpichov
>  > >
> > > wrote:
> > >
> > >> Thanks all. Looks like people are on board with the general direction
> > >> though it remains to refine it to concrete guidelines to go into style
> > >> guide.
> > >>
> > >> Ismaël, can you give more details about the situation you described in
> > the
> > >> first paragraph? Is it perhaps that really a RunnableOnService test
> was
> > >> missing (and perhaps still is), rather than a builder test?
> > >>
> > >> Vikas, regarding trivial tests and user waiting for a work-around: in
> > the
> > >> situation I described, they don't really need a workaround - they
> > specified
> > >> an invalid value and have been minorly inconvenienced because the
> error
> > >> they got about it was not very readable, so fixing their value took
> > them a
> > >> little longer than it could have, but they fixed it and their work is
> > not
> > >> blocked. I think Robert's arguments about the cost of trivial tests
> > apply.
> > >>
> > >> I agree that the author should be at liberty to choose which
> validation
> > to
> > >> unit-test and which to skip 

Re: [DISCUSSION] using NexMark for Beam

2017-03-21 Thread Dan Halperin
Not a deep response, but this is awesome! We'd really like to have some
good benchmarks, and I'm excited you're updating Nexmark. This will be
great!

On Tue, Mar 21, 2017 at 9:38 AM, Etienne Chauchot 
wrote:

> Hi all,
>
> Ismael and I are working on upgrading the Nexmark implementation for Beam.
> See https://github.com/iemejia/beam/tree/BEAM-160-nexmark and
> https://issues.apache.org/jira/browse/BEAM-160. We are continuing the
> work done by Mark Shields. See https://github.com/apache/beam/pull/366
> for the original PR.
>
> The PR contains queries that have a wide coverage of the Beam model and
> that represent a realistic end user use case (some come from client
> experience on Google Cloud Dataflow).
>
> So far, we have upgraded the implementation to the latest Beam snapshot.
> And we are able to execute a good subset of the queries in the different
> runners. We upgraded the nexmark drivers to do so: direct driver (upgraded
> from inProcessDriver) and flink driver and we added a new one for spark.
>
> There is still a good amount of work to do and we would like to know if
> you think that this contribution can have its place into Beam eventually.
>
> The interests of having Nexmark on Beam that we have seen so far are:
>
> - Rich batch/streaming test
>
> - A-B testing of runners or runtimes (non-regression, performance
> comparison between versions ...)
>
> - Integration testing (sdk/runners, runner/runtime, ...)
>
> - Validate beam capability matrix
>
> - It can be used as part of the ongoing PerfKit work (if there is any
> interest).
>
> As a final note, we are tracking the issues in the same repo. If someone
> is interested in contributing, or have more ideas, you are welcome :)
>
> Etienne
>
>


Re: Pipeline termination in the unified Beam model

2017-03-02 Thread Dan Halperin
Note that even "unbounded pipeline in a streaming runner".waitUntilFinish()
can return, e.g., if you cancel it or terminate it. It's totally reasonable
for users to want to understand and handle these cases.

+1

Dan
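Handling those cases might look roughly like this (PipelineResult.State is
existing Beam API; the switch is only illustrative):

  PipelineResult result = pipeline.run();
  PipelineResult.State state = result.waitUntilFinish();
  switch (state) {
    case DONE:       // all output watermarks reached +infinity
      break;
    case CANCELLED:  // the job was cancelled; results may be partial
      break;
    case FAILED:
      throw new RuntimeException("pipeline failed");
    default:         // e.g. UNKNOWN, if the runner lost track of the job
      break;
  }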

On Thu, Mar 2, 2017 at 2:53 AM, Jean-Baptiste Onofré 
wrote:

> +1
>
> Good idea !!
>
> Regards
> JB
>
>
> On 03/02/2017 02:54 AM, Eugene Kirpichov wrote:
>
>> Raising this onto the mailing list from
>> https://issues.apache.org/jira/browse/BEAM-849
>>
>> The issue came up: what does it mean for a pipeline to finish, in the Beam
>> model?
>>
>> Note that I am deliberately not talking about "batch" and "streaming"
>> pipelines, because this distinction does not exist in the model. Several
>> runners have batch/streaming *modes*, which implement the same semantics
>> (potentially different subsets: in batch mode typically a runner will
>> reject pipelines that have at least one unbounded PCollection) but in an
>> operationally different way. However we should define pipeline termination
>> at the level of the unified model, and then make sure that all runners in
>> all modes implement that properly.
>>
>> One natural way is to say "a pipeline terminates when the output
>> watermarks
>> of all of its PCollection's progress to +infinity". (Note: this can be
>> generalized, I guess, to having partial executions of a pipeline: if
>> you're
>> interested in the full contents of only some collections, then you wait
>> until only the watermarks of those collections progress to infinity)
>>
>> A typical "batch" runner mode does not implement watermarks - we can think
>> of it as assigning watermark -infinity to an output of a transform that
>> hasn't started executing yet, and +infinity to output of a transform that
>> has finished executing. This is consistent with how such runners implement
>> termination in practice.
>>
>> Dataflow streaming runner additionally implements such termination for
>> pipeline drain operation: it has 2 parts: 1) stop consuming input from the
>> sources, and 2) wait until all watermarks progress to infinity.
>>
>> Let us fill the gap by making this part of the Beam model and declaring
>> that all runners should implement this behavior. This will give nice
>> properties, e.g.:
>> - A pipeline that has only bounded collections can be run by any runner in
>> any mode, with the same results and termination behavior (this is actually
>> my motivating example for raising this issue is: I was running Splittable
>> DoFn tests
>> <https://github.com/apache/beam/blob/master/sdks/java/core/src/test/java/org/apache/beam/sdk/transforms/SplittableDoFnTest.java>
>> with the streaming Dataflow runner - these tests produce only bounded
>> collections - and noticed that they wouldn't terminate even though all
>> data
>> was processed)
>> - It will be possible to implement pipelines that stream data for a while
>> and then eventually successfully terminate based on some condition. E.g. a
>> pipeline that watches a continuously growing file until it is marked
>> read-only, or a pipeline that reads a Kafka topic partition until it
>> receives a "poison pill" message. This seems handy.
>>
>>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: First stable release: version designation?

2017-03-01 Thread Dan Halperin
A large set of Beam users will be coming from the pre-Apache technologies
(aka Google Cloud Dataflow, Scio). Because Dataflow was 1.0 before Beam
started, there is a lot of pre-existing documentation, Stack Overflow, etc.
that refers to version 1.0 to mean what is now a year-and-a-half old
release.

I think starting Beam from "2.0.0" will be best for that set of users and
frankly also new ones -- this will make it unambiguous whether referring to
pre-Beam or Beam releases.

I understand the 1.0 motivation -- it's cleaner in isolation -- but I think
it would lead to long-term confusion in the user community.

On Wed, Mar 1, 2017 at 1:11 PM, Ted Yu  wrote:

> +1 to what Jesse and Amit said.
>
> On Wed, Mar 1, 2017 at 12:32 PM, Amit Sela  wrote:
>
> > I think 1.0.0 for a couple of reasons:
> >
> > * It makes sense coming after 0.X (+1 Jesse).
> > * It is the FIRST stable release as a project, regardless of its roots.
> > * while the SDK is definitely a 2.0.0, Beam is not made only of the SDK,
> > and I hope we'll have more mileage with users running all sorts of runners
> > in production before our 2.0.0 release.
> >
> > Amit.
> >
> > On Wed, Mar 1, 2017 at 10:25 PM Jesse Anderson 
> > wrote:
> >
> > I think 1.0 makes the most sense.
> >
> > On Wed, Mar 1, 2017, 10:57 AM Davor Bonaci  wrote:
> >
> > > The first stable release is our next major project-wide goal; see
> > > discussion in [1]. I've been referring to it as "the first stable
> > release"
> > > for a long time, not "1.0.0" or "2.0.0" or "2017" or something else, to
> > > make sure we have an unbiased discussion and a consensus-based decision
> > on
> > > this matter.
> > >
> > > I think that now is the time to consider the appropriate designation
> for
> > > our first stable release, and formally make a decision on it. A
> > reasonable
> > > choices could be "1.0.0" or "2.0.0", perhaps there are others.
> > >
> > > 1.0.0:
> > > * It logically comes after the current series, 0.x.y.
> > > * Most people would expect it, I suppose.
> > > * A possible confusion between Dataflow SDKs and Beam SDKs carrying the
> > > same number.
> > >
> > > 2.0.0:
> > > * Follows the pattern some other projects have taken -- continuing
> their
> > > version numbering scheme from their previous origin.
> > > * Better communicates project's roots, and degree of maturity.
> > > * May be unexpected to some users.
> > >
> > > I'd invite everyone to share their thoughts and preferences -- names
> are
> > > important and well correlated with success. Thanks!
> > >
> > > Davor
> > >
> > > [1] https://lists.apache.org/thread.html/c35067071aec9029d9100ae973c6299aa919c31d0de623ac367128e2@%3Cdev.beam.apache.org%3E
> > >
> >
>


Re: Beam join with double stream join key

2017-02-28 Thread Dan Halperin
Hi,

It looks like you may have tried to attach an image or something, but it
did not come through the mailing list. Can you please try again?

This is what we see:
https://lists.apache.org/thread.html/f4a1ce5291428a70ecd54d3eefff56daf2f32b7a558f575eddc3729e@%3Cdev.beam.apache.org%3E

Dan

On Tue, Feb 28, 2017 at 2:40 AM, 钱爽(子颢) 
wrote:

> hello, in my case, I want to join two streams, but my join key is a
> custom-defined class (DataWithMeta), and execution failed! Below is the
> result! I don't know why.
>
>


Re: Merge HadoopInputFormatIO and HDFSIO in a single module

2017-02-16 Thread Dan Halperin
Raghu, Amit -- +1 to your expertise :)

On Thu, Feb 16, 2017 at 3:39 PM, Amit Sela <amitsel...@gmail.com> wrote:

> I agree with Dan on everything regarding HdfsFileSystem - it's super
> convenient for users to use TextIO with HdfsFileSystem rather than
> replacing the IO and also specifying the InputFormat type.
>
> I disagree on "HadoopIO" - I think that people who work with Hadoop would
> find this name intuitive, and that's what's important.
> Even more, and joining Raghu's comment, it is also recognized as
> "compatible with Hadoop", so for example someone running a Beam pipeline
> using the Spark runner on Amazon's S3 and wants to read/write Hadoop
> sequence files would simply use HadoopIO and provide the appropriate
> runtime dependencies (actually true for GS as well).
>
> On Thu, Feb 16, 2017 at 9:08 PM Raghu Angadi <rang...@google.com.invalid>
> wrote:
>
> > FileInputFormat is extremely widely used, pretty much all the file based
> > input formats extend it. All of them call into to list the input files,
> > split (with some tweaks on top of that). The special API (
> > *FileInputFormat.setMinInputSplitSize(job,
> > desiredBundleSizeBytes)* ) is how the split size is normally
> communicated.
> > New IO can use the api directly.
> >
> > HdfsIO as implemented in Beam is not HDFS specific at all. There are no
> > hdfs imports and HDFS name does not appear anywhere other than in
> HdfsIO's
> > own class and method names. AvroHdfsFileSource etc would work just as
> well
> > with new IO.
> >
> > On Thu, Feb 16, 2017 at 8:17 AM, Dan Halperin
> <dhalp...@google.com.invalid
> > >
> > wrote:
> >
> > > (And I think renaming to HadoopIO doesn't make sense. "InputFormat" is
> > the
> > > key component of the name -- it reads things that implement the
> > InputFormat
> > > interface. "Hadoop" means a lot more than that.)
> > >
> >
> > Often 'IO' in Beam implies both sources and sinks. It might not be long
> > before we support Hadoop OutputFormat as well. In addition,
> > HadoopInputFormatIO is quite a mouthful.
> > things depending on the context. In 'IO' context it might not be too
> broad.
> > Normally it implies 'any FileSystem supported in Hadoop, e.g. S3'.
> >
> > Either way, I am quite confident once HadoopInputFormatIO is written, it
> > can easily replace HdfsIO. That decision could be made later.
> >
> > Raghu.
> >
>


Re: Should you always have a separate PTransform class for a new transform?

2017-02-07 Thread Dan Halperin
A little bit more inline:

On Tue, Feb 7, 2017 at 5:15 PM, Eugene Kirpichov <
kirpic...@google.com.invalid> wrote:

> Hello,
>
> I was auditing Beam for violations of PTransform style guide
> https://beam.apache.org/contribute/ptransform-style-guide/ and came across
> another style point that deserves discussion.
>
> Look at Count transform:
>
>   public static <T> Combine.Globally<T, Long> globally() {
>     return Combine.globally(new CountFn<T>());
>   }
>
>   public static <K, V> Combine.PerKey<K, V, Long> perKey() {
>     return Combine.perKey(new CountFn<V>());
>   }
>
>   public static <T> PerElement<T> perElement() {
>     return new PerElement<>();
>   }
>
> I asked myself: should globally() and perKey() also define wrapper classes
> - e.g. should it be "public static <T> Globally<T> globally()" where
> "Globally" is a new inner class of Count?
>
> I argue that the answer is yes, but it's not clear-cut.
> Cons:
> - If we return a Combine.Globally, the user can use the features provided
> by Combine.Globally - e.g. .withDefaults(), .withFanout(),
> .asSingletonView().
>

+1 to this point, which was a conscious decision in the pre-Beam days
(which of course means it IS worth revisiting ;).
Trying to replay the reasoning:

* If wrapping, the author of a new Count.Globally can now only make the
extra functionality in Combine available by similarly exposing all such
functions.

* Conversely, the current implementation makes new functionality in Combine
available "for free" to users of Count.globally(). Whereas new
functionality on Combine would otherwise mandate that *all wrappers* change
to actually be exposed.

* There's almost no data here, but: we have added new functionality to
Combine (withSideInputs) and have not added new functionality to Count.


> Pros:
> - Style consistency with other transforms. Almost all transforms have their
> own class, and their factory functions return that class.
>

IMO this should only happen if the user needs that class. For all examples
I'm aware of,
the returned class has stuff you need to do, like configuring coders or
side inputs or other parameters.
IMO if the user need not configure, return the least constraining thing you
can.


> - Implementation can evolve. However, in case of Count, that is unlikely.
>

+1 to "unlikely"


> - ...Or is it? If the transform has a concrete class, then the runner can
> intercept that class and e.g. provide an (even) more efficient
> implementation of Count.Globally. This gets much more awkward if the runner
> has to intercept every Combine and check whether it's combining using a
> CountFn.
>

Post-runner API, this argument goes away. Instead, the runner will need to
inspect generically the attributes of the transform to do the replacement.

Summarizing: I currently am a -0.8 on this proposal.
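For reference, the wrapper being proposed would look roughly like this (a
sketch only; CountFn stands in for Count's existing combine fn, and the
PTransform expansion method has been renamed across versions):

  public static <T> Globally<T> globally() {
    return new Globally<>();
  }

  public static class Globally<T>
      extends PTransform<PCollection<T>, PCollection<Long>> {
    @Override
    public PCollection<Long> expand(PCollection<T> input) {
      // Delegate to Combine; extra Combine features must now be re-exposed here.
      return input.apply(Combine.globally(new CountFn<T>()));
    }
  }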


>
> So, I propose to add this as a style guide rule, and solve the problem in
> "Cons" by saying "Yeah, if you want the extra features of the transform
> you're expanding into, you have to propagate them through the API of your
> transform and delegate to the underlying transform manually".
>
> Thoughts?
>


Re: Should you always have a separate PTransform class for a new transform?

2017-02-07 Thread Dan Halperin
I'll agree with the "Cons" by referencing back to this thread:

https://lists.apache.org/thread.html/caa8k_flvcmx+tyksxdmcxxe9y_zyohe4ovht9f2jb1wckob...@mail.gmail.com

On Tue, Feb 7, 2017 at 5:15 PM, Eugene Kirpichov <
kirpic...@google.com.invalid> wrote:

> Hello,
>
> I was auditing Beam for violations of PTransform style guide
> https://beam.apache.org/contribute/ptransform-style-guide/ and came across
> another style point that deserves discussion.
>
> Look at Count transform:
>
>   public static <T> Combine.Globally<T, Long> globally() {
>     return Combine.globally(new CountFn<T>());
>   }
>
>   public static <K, V> Combine.PerKey<K, V, Long> perKey() {
>     return Combine.perKey(new CountFn<V>());
>   }
>
>   public static <T> PerElement<T> perElement() {
>     return new PerElement<>();
>   }
>
> I asked myself: should globally() and perKey() also define wrapper classes
> - e.g. should it be "public static <T> Globally<T> globally()" where
> "Globally" is a new inner class of Count?
>
> I argue that the answer is yes, but it's not clear-cut.
> Cons:
> - If we return a Combine.Globally, the user can use the features provided
> by Combine.Globally - e.g. .withDefaults(), .withFanout(),
> .asSingletonView().
> Pros:
> - Style consistency with other transforms. Almost all transforms have their
> own class, and their factory functions return that class.
> - Implementation can evolve. However, in case of Count, that is unlikely.
> - ...Or is it? If the transform has a concrete class, then the runner can
> intercept that class and e.g. provide an (even) more efficient
> implementation of Count.Globally. This gets much more awkward if the runner
> has to intercept every Combine and check whether it's combining using a
> CountFn.
>
> So, I propose to add this as a style guide rule, and solve the problem in
> "Cons" by saying "Yeah, if you want the extra features of the transform
> you're expanding into, you have to propagate them through the API of your
> transform and delegate to the underlying transform manually".
>
> Thoughts?
>


Re: Doesn't PAssertTest.runExpectingAssertionFailure need to call waitUntilFinish?

2017-01-31 Thread Dan Halperin
Hi Shen,

Great question. The trick is that the `pipeline` object is an instance of
TestPipeline [0], for which p.run() is the same as
p.run().waitUntilFinish().

It might be better, for documentation purposes, to use p.run().waitUntilFinish()
to be consistent with real runners, or to add a method to TestPipeline such as
p.runTestPipeline() to signal that this works only in tests. At the same time,
that would complicate writing tests, which we don't really want to do... so
it's a tradeoff that may be okay as-is.

Dan

[0]
https://github.com/apache/beam/blob/master/sdks/java/core/src/test/java/org/apache/beam/sdk/testing/PAssertTest.java#L64
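In other words (sketch):

  testPipeline.run();                          // TestPipeline: blocks until finished
  productionPipeline.run().waitUntilFinish();  // required form on real runners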



On Tue, Jan 31, 2017 at 1:07 PM, Shen Li  wrote:

> Hi,
>
> In the PAssertTest, doesn't it need to append a "waitUntilFinish()" to the
> "pipeline.run()" (please see the link below)? Otherwise, the runner may
> return the PipelineResult immediately without actually kicking off the
> execution, and therefore the AssertionError won't be thrown. Or did I miss
> anything?
>
> https://github.com/apache/beam/blob/master/sdks/java/
> core/src/test/java/org/apache/beam/sdk/testing/PAssertTest.java#L399
>
> Thanks,
>
> Shen
>


Re: [VOTE] Apache Beam, version 0.5.0, release candidate #1

2017-01-31 Thread Dan Halperin
Should we revert the CLs that lost the functionality? I'd really rather not
ship a release with such a functional regression.

On Tue, Jan 31, 2017 at 10:07 AM, Jean-Baptiste Onofré <j...@nanthrax.net>
wrote:

> Fair enough. Let's do that.
>
> Thanks !
>
> Regards
> JB
>
>
> On 01/31/2017 06:58 PM, Aljoscha Krettek wrote:
>
>> I'm not sure. Properly fixing this will take some time, especially since we
>> have to add tests to prevent breakage from happening in the future. Plus,
>> if my analysis is correct, other runners might also not have proper late
>> data dropping, and it's fine to have a release with some missing features.
>> (There's more besides dropping.)
>>
>> I think we should go ahead and fix for 0.6.
>>
>> On Tue, Jan 31, 2017, 18:23 Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>
>> Hi Aljoscha,
>>>
>>> so you propose to cancel this vote to prepare a RC2 ?
>>>
>>> Regards
>>> JB
>>>
>>> On 01/31/2017 05:06 PM, Aljoscha Krettek wrote:
>>>
>>>> It's not just an issue with the Flink Runner, if I'm not mistaken.
>>>>
>>>> Flink had late-data dropping via the LateDataDroppingDoFnRunner (which
>>>>
>>> got
>>>
>>>> "disabled" by the two commits I mention in the issue) while I think that
>>>> the Apex and Spark Runners might not have had dropping in the first
>>>>
>>> place.
>>>
>>>> (Not sure about this last part.)
>>>>
>>>> As I now wrote to the issue I think this could be a blocker because we
>>>> don't have the correct output in some cases.
>>>>
>>>> On Tue, 31 Jan 2017 at 02:16 Davor Bonaci <da...@apache.org> wrote:
>>>>
>>>> It looks good to me, but let's hear Aljoscha's opinion on BEAM-1346.
>>>>>
>>>>> A passing suite of Jenkins jobs:
>>>>> * https://builds.apache.org/job/beam_PreCommit_Java_MavenInstall/6870/
>>>>> * https://builds.apache.org/job/beam_PostCommit_Java_MavenInstall/2474/
>>>>> * https://builds.apache.org/job/beam_PostCommit_Java_RunnableOnService_Apex/336/
>>>>> * https://builds.apache.org/job/beam_PostCommit_Java_RunnableOnService_Flink/1470/
>>>>> * https://builds.apache.org/job/beam_PostCommit_Java_RunnableOnService_Spark/786/
>>>>> * https://builds.apache.org/job/beam_PostCommit_Java_RunnableOnService_Dataflow/2130/
>>>
>>>>
>>>>> On Mon, Jan 30, 2017 at 4:40 PM, Dan Halperin <dhalp...@apache.org>
>>>>>
>>>> wrote:
>>>
>>>>
>>>>> I am worried about https://issues.apache.org/jira/browse/BEAM-1346 for
>>>>>>
>>>>> RC1
>>>>>
>>>>>> and would at least wait for resolution there before proceeding.
>>>>>>
>>>>>> On Mon, Jan 30, 2017 at 3:48 AM, Jean-Baptiste Onofré <
>>>>>> j...@nanthrax.net
>>>>>>
>>>>>
>>>> wrote:
>>>>>>
>>>>>> Good catch for the PPMC, I'm upgrading the email template in the
>>>>>>>
>>>>>> release
>>>>>
>>>>>> guide (it was a copy/paste).
>>>>>>>
>>>>>>> Regards
>>>>>>> JB
>>>>>>>
>>>>>>>
>>>>>>> On 01/30/2017 11:50 AM, Sergio Fernández wrote:
>>>>>>>
>>>>>>> +1 (non-binding)
>>>>>>>>
>>>>>>>> So far I've successfully checked:
>>>>>>>> * signatures and digests
>>>>>>>> * source releases file layouts
>>>>>>>> * matched git tags and commit ids
>>>>>>>> * incubator suffix and disclaimer
>>>>>>>> * NOTICE and LICENSE files
>>>>>>>> * license headers
>>>>>>>> * clean build (Java 1.8.0_91, Maven 3.3.9, Debian amd64)
>>>>>>>>
>>>>>>>> Two minor comments that do not block the release:

Re: [VOTE] Apache Beam, version 0.5.0, release candidate #1

2017-01-30 Thread Dan Halperin
I am worried about https://issues.apache.org/jira/browse/BEAM-1346 for RC1
and would at least wait for resolution there before proceeding.

On Mon, Jan 30, 2017 at 3:48 AM, Jean-Baptiste Onofré 
wrote:

> Good catch for the PPMC, I'm upgrading the email template in the release
> guide (it was a copy/paste).
>
> Regards
> JB
>
>
> On 01/30/2017 11:50 AM, Sergio Fernández wrote:
>
>> +1 (non-binding)
>>
>> So far I've successfully checked:
>> * signatures and digests
>> * source releases file layouts
>> * matched git tags and commit ids
>> * incubator suffix and disclaimer
>> * NOTICE and LICENSE files
>> * license headers
>> * clean build (Java 1.8.0_91, Maven 3.3.9, Debian amd64)
>>
>> Two minor comments that do not block the release:
>> * Usually I like to see the commit id referencing the rc, since git tags
>> can be changed.
>> * Just a formality: "PPMC" is not a committee that plays a role anymore;
>> you're a PMC now ;-)
>>
>>
>>
>> On Fri, Jan 27, 2017 at 9:55 PM, Jean-Baptiste Onofré 
>> wrote:
>>
>> Hi everyone,
>>>
>>> Please review and vote on the release candidate #1 for the version 0.5.0
>>> as follows:
>>>
>>> [ ] +1, Approve the release
>>> [ ] -1, Do not approve the release (please provide specific comments)
>>>
>>> The complete staging area is available for your review, which includes:
>>>
>>> * JIRA release notes [1],
>>> * the official Apache source release to be deployed to dist.apache.org
>>> [2], which is signed with the key with fingerprint C8282E76 [3],
>>> * all artifacts to be deployed to the Maven Central Repository [4],
>>> * source code tag "v0.5.0-RC1" [5],
>>> * website pull request listing the release and publishing the API
>>> reference manual [6].
>>>
>>> The vote will be open for at least 72 hours. It is adopted by majority
>>> approval, with at least 3 PPMC affirmative votes.
>>>
>>> Thanks,
>>> JB
>>>
>>> [1] https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12338859
>>> [2] https://dist.apache.org/repos/dist/dev/beam/0.5.0/
>>> [3] https://dist.apache.org/repos/dist/release/beam/KEYS
>>> [4] https://repository.apache.org/content/repositories/orgapachebeam-1010/
>>> [5] https://git-wip-us.apache.org/repos/asf?p=beam.git;a=tag;h=refs/tags/v0.5.0-RC1
>>> [6] https://github.com/apache/beam-site/pull/132
>>>
>>>
>>
>>
>>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: Build failed in Jenkins: beam_PostCommit_Java_MavenInstall #2473

2017-01-30 Thread Dan Halperin
Hey folks,

It looks like the python-sdk -> master merge went bad and, unfortunately,
we have it configured to email anyone who ever contributed a commit to the
merge, which I think devolves to "anyone who ever committed to that
branch". I've disabled further emails in this job's configuration for the
rest of the day, by which time the build will hopefully be green again.

On Mon, Jan 30, 2017 at 4:24 PM, Apache Jenkins Server <
jenk...@builds.apache.org> wrote:

> See <https://builds.apache.org/job/beam_PostCommit_Java_MavenInstall/2473/>
>
> --
> [...truncated 12560 lines...]
> [~85 repetitive "hard linking ... -> apache-beam-sdk-0.6.0.dev/..." lines omitted]

Re: TextIO binary file

2017-01-30 Thread Dan Halperin
Stas' comment is the right one. The "canonical" use of TextIO is using
something like a TextualIntegerCoder, but that should almost certainly be
replaced with TextIO.Read | ParDo.of(Parse integer). The `withCoder`
functions need to get removed or replaced.
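That replacement would look roughly like this (the TextIO API surface has
shifted across Beam versions; this follows the style of the era):

  PCollection<Integer> ints =
      p.apply(TextIO.Read.from("gs://bucket/ints*.txt"))  // PCollection<String>, one per line
       .apply(ParDo.of(new DoFn<String, Integer>() {
         @ProcessElement
         public void processElement(ProcessContext c) {
           c.output(Integer.parseInt(c.element().trim()));
         }
       }));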

For "holding a file of arbitrary records" -- simply producing a
delimiter-separated TextIO is probably not a good choice. Specifically,
splitting is broken when the delimiter might appear in the output (e.g.,
when using almost any coder). A better option is to design a file format to
hold arbitrary records. E.g., an Avro file where each record is just a
byte[].

Dan

On Mon, Jan 30, 2017 at 2:52 AM, Aviem Zur  wrote:

> The Javadoc of TextIO states:
>
> * By default, {@link TextIO.Read} returns a {@link PCollection} of
> {@link String Strings},
>  * each corresponding to one line of an input UTF-8 text file. To convert
> directly from the raw
>  * bytes (split into lines delimited by '\n', '\r', or '\r\n') to another
> object of type {@code T},
>  * supply a {@code Coder} using {@link TextIO.Read#withCoder(Coder)}.
>
> However, as I stated, `withCoder` doesn't seem to have tests, and probably
> won't work given the hard-coded '\n' delimiter.
>
> On Mon, Jan 30, 2017 at 12:18 PM Jean-Baptiste Onofré 
> wrote:
>
> > Hi Aviem,
> >
> > TextIO is not designed to write/read binary file: it's pure Text, so
> > String.
> >
> > Regards
> > JB
> >
> > On 01/30/2017 09:24 AM, Aviem Zur wrote:
> > > Hi,
> > >
> > > While trying to use TextIO to write/read a binary file rather than
> String
> > > lines from a textual file I ran into an issue - the delimiter TextIO
> uses
> > > seems to be hardcoded '\n'.
> > > See `findSeparatorBounds` -
> > >
> > https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextIO.java#L1024
> > >
> > > The use case is to have a file of objects, encoded into bytes using a
> > > coder. However, '\n' is not a good delimiter here, as you can imagine.
> > > A similar pattern is found in Spark's `saveAsObjectFile`
> > >
> > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1512
> > > where
> > > they use a more appropriate delimiter, to avoid such issues.
> > >
> > > I did not find any unit tests which use TextIO to read anything other
> > than
> > > Strings.
> > >
> >
> > --
> > Jean-Baptiste Onofré
> > jbono...@apache.org
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
> >
>


Re: Consistent Placement

2017-01-27 Thread Dan Halperin
On Fri, Jan 27, 2017 at 8:41 AM, Jesse Anderson <je...@smokinghand.com>
wrote:

> @dan I thought you were talking about the transform class definition:
>   public static class GroupedValues<K, InputT, OutputT>
>       extends PTransform<PCollection<? extends KV<K, ? extends Iterable<InputT>>>,
>                          PCollection<KV<K, OutputT>>> {
>

If the user needs to call functions on the returned type (in this case,
Combine.groupedValues() returns a GroupedValues, which allows the user to
configure side inputs using GroupedValues#withSideInputs), then both:

* the function groupedValues() needs to return a GroupedValues, so that the
calling code can access methods like GroupedValues#withSideInputs.
* the class GroupedValues needs to be public, so that the above works.

and, also, as a matter of practice,

* Comprehensive Javadoc should be class-level on public transforms,
especially when there's many factory methods for these transforms.

Dan
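A sketch of why the concrete return type matters to callers ('sumFn' and
'view' are placeholders created elsewhere; 'grouped' is a
PCollection<KV<String, Iterable<Long>>>):

  PCollection<KV<String, Long>> totals = grouped.apply(
      Combine.<String, Long, Long>groupedValues(sumFn)
          .withSideInputs(view));  // only compiles because groupedValues() returns GroupedValues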


>
>
> On Fri, Jan 27, 2017 at 11:30 AM Dan Halperin <dhalp...@google.com.invalid
> >
> wrote:
>
> > Hi Jesse, can you specifically say which functions on Combine and Count
> > you're thinking of? I believe these transforms are consistent with the
> > "principle of least visibility" -- make nothing more public than it needs
> > to be.
> >
> > Look at Combine.globally
> > <https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/Combine.java#L124>.
> > It returns a Globally, but that is because Globally has a useful public
> > API surface, adding functions like asSingletonView. I believe similar
> > reasoning applies to Count.
> >
> > However, for cases where the user will not further configure the return
> > value, it makes sense to return the most public type we can.
> >
> > On Fri, Jan 27, 2017 at 6:39 AM, Jesse Anderson <je...@smokinghand.com>
> > wrote:
> >
> > > One con to making transform classes be private would be that it is a
> > > breaking change. If anyone uses that class directly or extends that
> > class,
> > > we'd be breaking that.
> > >
> > > On Fri, Jan 27, 2017 at 9:37 AM Jesse Anderson <je...@smokinghand.com>
> > > wrote:
> > >
> > > > Continuing a discussion <https://github.com/apache/beam/pull/1830>
> > Dan,
> > > > Kenn, and I were having here since the bug is closed. They pointed
> out
> > > > three things:
> > > >
> > > >- Where the private constructor gets placed in the class
> > > >- Where the code samples of how to use the class get placed (in
> the
> > > >Transform versus in the static method)
> > > >- Are transform classes public or private
> > > >
> > > > I noted that those were inconsistent in the code. When I write a new
> > > > transform, I use one of the already written transforms as the basis.
> > > >
> > > > Looking at Combine and Count:
> > > >
> > > >- The private constructor is at the top of the class
> > > >- The code sample is in the Transform class
> > > >- The transform class is marked as public
> > > >
> > > > I don't have a strong opinion on private constructor and transform
> > being
> > > > marked as private/public. I think we should standardize on placing
> code
> > > > samples in the static helper methods. That's where people are looking
> > > when
> > > > using these methods.
> > > >
> > > > I think we need to do a general pass to make these consistent after
> we
> > > > decide on how they should be done.
> > > >
> > > > Thanks,
> > > >
> > > > Jesse
> > > >
> > >
> >
>


Re: Consistent Placement

2017-01-27 Thread Dan Halperin
Hi Jesse, can you specifically say which functions on Combine and Count
you're thinking of? I believe these transforms are consistent with the
"principle of least visibility" -- make nothing more public than it needs
to be.

Look at Combine.globally
<https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/Combine.java#L124>.
It returns a Globally, but that is because Globally has a useful public API
surface, adding functions like asSingletonView. I believe similar reasoning
applies to Count.

However, for cases where the user will not further configure the return
value, it makes sense to return the most public type we can.

On Fri, Jan 27, 2017 at 6:39 AM, Jesse Anderson 
wrote:

> One con to making transform classes be private would be that it is a
> breaking change. If anyone uses that class directly or extends that class,
> we'd be breaking that.
>
> On Fri, Jan 27, 2017 at 9:37 AM Jesse Anderson 
> wrote:
>
> > Continuing a discussion <https://github.com/apache/beam/pull/1830> Dan,
> > Kenn, and I were having here since the bug is closed. They pointed out
> > three things:
> >
> >- Where the private constructor gets placed in the class
> >- Where the code samples of how to use the class get placed (in the
> >Transform versus in the static method)
> >- Are transform classes public or private
> >
> > I noted that those were inconsistent in the code. When I write a new
> > transform, I use one of the already written transforms as the basis.
> >
> > Looking at Combine and Count:
> >
> >- The private constructor is at the top of the class
> >- The code sample is in the Transform class
> >- The transform class is marked as public
> >
> > I don't have a strong opinion on private constructor and transform being
> > marked as private/public. I think we should standardize on placing code
> > samples in the static helper methods. That's where people are looking
> when
> > using these methods.
> >
> > I think we need to do a general pass to make these consistent after we
> > decide on how they should be done.
> >
> > Thanks,
> >
> > Jesse
> >
>


Re: Better developer instructions for using Maven?

2017-01-25 Thread Dan Halperin
Here is my summary of the threads:

Overwhelming agreement:

- rename `release` to something more appropriate.
- add `checkstyle` to the default build (it's basically a compile error)
- add more information to contributor guide

Reasonable agreement:

- don't update the GitHub instructions to make passing `mvn verify -Prelease`
mandatory. Maybe add a hint that this is a good proxy for what
Jenkins will run.

Unresolved:

- whether all checks should be in `mvn verify`
- whether `mvn test` is useful for most workflows

I'll propose to proceed with the overwhelmingly agreed-upon changes, and as
we see increasingly many new contributors re-evaluate the remaining issues.

Thanks,
Dan
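
For reference, a sketch of what the "add checkstyle to the default build"
item could look like in the parent pom -- illustrative only; the plugin
version and execution id here are assumptions, not the actual Beam
configuration:

  <plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-checkstyle-plugin</artifactId>
    <version>2.17</version>
    <executions>
      <execution>
        <id>default-checkstyle</id>
        <!-- Bound to an early phase so every mvn test/verify runs it,
             making style violations surface like compile errors. -->
        <phase>validate</phase>
        <goals>
          <goal>check</goal>
        </goals>
      </execution>
    </executions>
  </plugin>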

On Tue, Jan 24, 2017 at 12:51 PM, Jean-Baptiste Onofré <j...@nanthrax.net>
wrote:

> +1 to at least update the contribution guide and improve the profile name.
>
> Regards
> JB
>
>
> On 01/24/2017 09:49 PM, Kenneth Knowles wrote:
>
>> My impression is that we don't have consensus on whether all checks or
>> minimal checks should be the default, or whether we can have both via `mvn
>> test` and `mvn verify`.
>>
>> But that doesn't prevent us from giving -P release a better name and
>> mentioning it in the dev guide and in some manner in our PR template.
>>
>> Right now we are living with the combination of the bad aspects - default
>> is not thorough but not actually fast and a thorough check is
>> undocumented.
>>
>> On Tue, Jan 24, 2017 at 2:22 AM, Ismaël Mejía <ieme...@gmail.com> wrote:
>>
>> I just wanted to know if we have achieved some consensus about this, I
>>> just
>>> saw this PR that reminded me about this discussion.
>>>
>>> ​https://github.com/apache/beam/pull/1829​
>>>
>>> It is important that we mention the existing profiles (and the intended
> >>> checks) in the contribution guide (e.g. -Prelease or -Pall-checks triggers
> >>> these validations).
>>>
>>> I can add this to the guide if you like once we define the checks per
>>> stage/profile.
>>>
>>> Ismaël
>>>
>>>
>>> On Wed, Jan 11, 2017 at 8:12 AM, Aviem Zur <aviem...@gmail.com> wrote:
>>>
>>> I agree with Dan and Lukasz.
>>>> Developers should not be expected to know beforehand which specific
>>>> profiles to run.
>>>> The phase specified in the PR instructions (`verify`) should run all the
>>>> relevant verifications and be the "slower" build, while a preceding
>>>> lifecycle, such as `test`, should run the "faster" verifications.
>>>>
>>>> Aviem.
>>>>
>>>> On Mon, Jan 9, 2017 at 7:57 PM Robert Bradshaw
>>>>
>>> <rober...@google.com.invalid
>>>
>>>>
>>>>> wrote:
>>>>
>>>> On Mon, Jan 9, 2017 at 3:49 AM, Aljoscha Krettek <aljos...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> I also usually prefer "mvn verify" to to the expected thing but I see
>>>>>>
>>>>> that
>>>>>
>>>>>> quick iteration times are key.
>>>>>>
>>>>>
>>>>> I see
>>>>> https://maven.apache.org/guides/introduction/
>>>>>
>>>> introduction-to-the-lifecycle.html
>>>>
>>>>>
>>>>> verify - run any checks on results of integration tests to ensure
>>>>> quality criteria are met
>>>>>
>>>>> Of course our integration tests are long enough that we shouldn't be
>>>>> putting all of them here, but I too would expect checkstyle.
>>>>>
>>>>> Perhaps we could introduce a verify-fast or somesuch for fast (but
>>>>> lower coverage) turnaround time. I would expect "mvn verify test" to
>>>>> pass before submitting a PR, and would want to run that before asking
>>>>> others to look at it. I think this should be our criteria (i.e. what
>>>>> will a new but maven-savvy user run before pushing their code).
>>>>>
>>>>> As long as the pre-commit hooks still check everything I'm ok with
>>>>>>
>>>>> making
>>>>
>>>>> the default a little more lightweight.
>>>>>>
>>>>>
>>>>> The fact that our pre-commit hooks take a long time to run does change
>>>>> things. Nothing more annoying than seeing that your PR failed 3 hours
>>>>> later because you had 

Re: Subscription to beam project

2017-01-23 Thread Dan Halperin
+original mailer, assuming he is not on dev@...

On Sun, Jan 22, 2017 at 7:31 PM, Davor Bonaci  wrote:

> Welcome! Please check out the support page [1] with all mailing lists and
> subscribe links.
>
> [1] https://beam.apache.org/get-started/support/
>
> On Sat, Jan 21, 2017 at 11:59 PM, Ritesh Kasat 
> wrote:
>
> > Hello,
> > Please add me to the beam mailing list.
> > Thanks
> > Ritesh
> >
>


Re: [VOTE] Merge Python SDK to the master branch

2017-01-20 Thread Dan Halperin
[X] +1, Merge python-sdk branch to master after the 0.5.0 release, and release
it in the subsequent minor release.

Thanks and woo!

On Fri, Jan 20, 2017 at 12:00 PM, Jean-Baptiste Onofré 
wrote:

> +1 to merge Python SDK after 0.5.0 release.
>
> Regards
> JB
>
>
> On 01/20/2017 06:03 PM, Ahmet Altay wrote:
>
>> Hi all,
>>
>>
>> Please review the earlier discussion on the status of the Python SDK [1]
>> and vote on merging the python-sdk branch to the master branch, as
>> follows:
>>
>> [ ] +1, Merge python-sdk branch to master after the 0.5.0 release, and
>> release it in the subsequent minor release.
>>
>> [ ] -1, Continue development in python-sdk branch (please provide specific
>> comments).
>>
>> The vote will be open for at least 72 hours. This is a procedural vote; it
>> is adopted by majority approval of qualified votes with no minimums [2].
>>
>> Thank you,
>>
>> Ahmet
>>
>> [1]
>> https://lists.apache.org/thread.html/84a36cf0ad95a76e6bc444603ae87e7312023bc167a6ff3c57a956f1@%3Cdev.beam.apache.org%3E
>> [2] http://apache.org/foundation/voting.html
>>
>>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: Runner-provided ValueProviders

2017-01-20 Thread Dan Halperin
I think this was initially motivated by BEAM-758
. Copying from that issue:

In the forthcoming runner API, a user will be able to save a pipeline
to JSON and then run it repeatedly.

Many pieces of code (e.g., BigQueryIO.Read or Write) rely on a single
random value (nonce). These values are typically generated at pipeline
construction time (in PTransform#expand), so that they are deterministic
(don't change across retries of DoFns) and global (are the same across all
workers).

However, once the runner API lands the existing code would result in
the same nonce being reused across jobs, which breaks BigQueryIO. Other
possible solutions:
   * Generate nonce in Create(1) | ParDo then use this as a side input.
Should work, as along as side inputs are actually checkpointed. But does
not work for BoundedSource, which cannot accept side inputs.
   * If a nonce is only needed for the lifetime of one bundle, can be
generated in startBundle and used in processElement/finishBundle/tearDown.
   * Add some context somewhere that lets user code access unique step
name, and somehow generate a nonce consistently e.g. by hashing. Will
usually work, but this is similarly not available to sources.

I believe your proposal is to add such a nonce to the root PipelineOptions
object -- perhaps, `String getRunNonce()` or something like that. This
would let us have a different nonce for every Pipeline.run() call, but it
would add the requirement to runners that they must populate it.

My 2c: This would be an easy change for runners and unblocks the issue, but
it complicates the demand on runner authors. Longer-term, plumbing a
context into places like BoundedSource and providing the value there is a
better idea.

Dan
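
For concreteness, a rough sketch of the options-based approach (the names
RunNonceOptions, getRunNonce, and RandomNonceFactory are hypothetical; the
DefaultValueFactory fallback corresponds to Sam's option 2 below):

  import java.util.UUID;
  import org.apache.beam.sdk.options.Default;
  import org.apache.beam.sdk.options.DefaultValueFactory;
  import org.apache.beam.sdk.options.Description;
  import org.apache.beam.sdk.options.PipelineOptions;

  public interface RunNonceOptions extends PipelineOptions {
    @Description("Nonce expected to be unique per Pipeline.run(); "
        + "runners should populate or override this.")
    @Default.InstanceFactory(RandomNonceFactory.class)
    String getRunNonce();

    void setRunNonce(String value);

    // Fallback for runners that never set the value. Note the caveat above:
    // a default computed at construction/access time reintroduces the reuse
    // problem for saved pipelines, which is why a runner-provided value is
    // preferable.
    class RandomNonceFactory implements DefaultValueFactory<String> {
      @Override
      public String create(PipelineOptions options) {
        return UUID.randomUUID().toString();
      }
    }
  }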

On Fri, Jan 20, 2017 at 11:30 AM, Davor Bonaci  wrote:

> Expecting runners to populate, or override, SDK-level pipeline options
> isn't a great thing, particularly in a scenario that would affect
> correctness.
>
> The main thing is discoverability of a subtle API like this -- there's
> little chance somebody writing a new runner would stumble across this and
> do the right thing. It would be much better to make expectations from a
> runner clear, say, via a runner-provided "context" API. I'd stay away from
> a pipeline option with a default value.
>
> The other contentious topic here is the usage of a job-level or
> execution-level identifier. This easily becomes ambiguous in the presence
> of Flink's savepoints, Dataflow's update, fast re-execution, canary vs.
> production pipeline, cross-job optimizations, etc. I think we'd be better
> off with a transform-level nonce than a job-level one.
>
> Finally, the real solution is to enhance the model and make such a
> functionality available to everyone, e.g., roughly "init" + "checkpoint" +
> "side-input to source / splittabledofn / composable io".
>
> --
>
> Practically, to solve the problem at hand quickly, I'd be in favor of a
> context-based approach.
>
> On Thu, Jan 19, 2017 at 10:22 AM, Sam McVeety 
> wrote:
>
> > Hi folks, I'm looking for feedback on whether the following is a
> reasonable
> > approach to handling ValueProviders that are intended to be populated at
> > runtime by a given Runner (e.g. a Dataflow job ID, which is a GUID from
> the
> > service).  Two potential pieces of a solution:
> >
> > 1. Annotate such parameters with @RunnerProvided, which results in an
> > Exception if the user manually tries to set the parameter.
> >
> > 2. Allow for a DefaultValueFactory to be present for the set of Runners
> > that do not override the parameter.
> >
> > Best,
> > Sam
> >
>


Re: Beam Fn API

2017-01-19 Thread Dan Halperin
"relatively little extra work" once the base APIs are implemented.

On Thu, Jan 19, 2017 at 11:26 PM, Dan Halperin <dhalp...@google.com> wrote:

> This is an extremely ambitious part of the technical vision. I think it's
> a lot of work, but well worth it -- Python-SDK-on-Java-runner with
> relatively extra work? I don't care what the overhead is, this is making
> the impossible possible.
>
> On Thu, Jan 19, 2017 at 3:56 PM, Lukasz Cwik <lc...@google.com.invalid>
> wrote:
>
>> I have been prototyping several components towards the Beam technical
>> vision of being able to execute an arbitrary language using an arbitrary
>> runner.
>>
>> I would like to share this overview [1] of what I have been working
>> towards. I also share this PR [2] with a proposed API, service definitions
>> and partial implementation.
>>
>> 1: https://s.apache.org/beam-fn-api
>> 2: https://github.com/apache/beam/pull/1801
>>
>> Please comment on the overview within this thread, and any specific code
>> comments on the PR directly.
>>
>> Luke
>>
>
>


Re: Beam Fn API

2017-01-19 Thread Dan Halperin
This is an extremely ambitious part of the technical vision. I think it's a
lot of work, but well worth it -- Python-SDK-on-Java-runner with relatively
extra work? I don't care what the overhead is, this is making the
impossible possible.

On Thu, Jan 19, 2017 at 3:56 PM, Lukasz Cwik 
wrote:

> I have been prototyping several components towards the Beam technical
> vision of being able to execute an arbitrary language using an arbitrary
> runner.
>
> I would like to share this overview [1] of what I have been working
> towards. I also share this PR [2] with a proposed API, service definitions
> and partial implementation.
>
> 1: https://s.apache.org/beam-fn-api
> 2: https://github.com/apache/beam/pull/1801
>
> Please comment on the overview within this thread, and any specific code
> comments on the PR directly.
>
> Luke
>


Re: Composite Types and the Runner API

2017-01-19 Thread Dan Halperin
skimmed doc and PR, +1.

On Tue, Jan 17, 2017 at 4:26 PM, Lukasz Cwik 
wrote:

> +1 since this brings us closer to a portability story.
>
> On Tue, Jan 17, 2017 at 3:10 PM, Jean-Baptiste Onofré 
> wrote:
>
> > +1
> >
> > It makes sense.
> >
> > Thanks !
> > Regards
> > JB
> >
> >
> > On 01/17/2017 10:46 AM, Thomas Groh wrote:
> >
> >> Hey everyone;
> >>
> >> I've been working on parts of the runner API recently, and part of that
> >> has
> >> included a shift of how composite inputs and outputs must be represented
> >> by
> >> the time a PipelineRunner begins to access them. I have a PR that
> >> completes
> >> this work within the Java SDK, but wanted to ensure that everyone agrees
> >> on
> >> the change and anything required on their end before I start fiddling
> with
> >> all of the runner internals. For anyone except current runner authors,
> >> this
> >> should be completely transparent; for current runner authors, I need a
> >> short code review but nothing else.
> >>
> >> I've written a one-pager about what's changing; the link is at
> >> https://s.apache.org/beam-runner-composites
> >>
> >> or directly at
> >> https://docs.google.com/document/d/1_CHLnj1RFAGKy_MfR54XmixakYNmCnhGZLWmuDSMJ10/edit#heading=h.qlkikisrzqqf
> >>
> >> Thanks,
> >>
> >> Thomas
> >>
> >>
> > --
> > Jean-Baptiste Onofré
> > jbono...@apache.org
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
> >
>


Re: [VOTE] Release 0.4.0, release candidate #1

2016-12-29 Thread Dan Halperin
* mvn verify passes with and without network enabled
* mvn apache-rat:check passes
* mvn verify passes with -Prelease
* release signature properly signed by JB (using the KEYS file as the
keyring)
* No binary files [one false positive empty file
in ./runners/core-java/src/test/java/.placeholder we should plausibly
delete in future]
(osx: find . -type f -exec file -I '{}' \; | grep 'charset=binary')

* Module changes are as expected (microbenchmarks had a licensing issue and
was removed). Licensing for dependencies of new modules is okay (all
Apache).

   new:
   > apache-beam/runners/apex/pom.xml
   > apache-beam/sdks/java/extensions/sorter/pom.xml
   > apache-beam/sdks/java/maven-archetypes/examples-java8/pom.xml
   >
apache-beam/sdks/java/maven-archetypes/examples-java8/src/main/resources/archetype-resources/pom.xml

   removed:
   < apache-beam/sdks/java/microbenchmarks/pom.xml

* No occurrences of the substring `incub` in the source zip.

* Ran all additional postcommits in Jenkins against the release tag, and
all passed.

So, looks good to me!

+1

Dan

On Wed, Dec 28, 2016 at 11:39 PM, Jean-Baptiste Onofré 
wrote:

> Minor fix & update: the source code tag is obviously v0.4.0-RC1
>
> https://git-wip-us.apache.org/repos/asf?p=beam.git;a=tag;h=refs/tags/v0.4.0-RC1
>
> I launched Jenkins on the tag and it passed:
>
> https://builds.apache.org/view/Beam/job/beam_PostCommit_Java_MavenInstall/2245/
>
> Regards
> JB
>
>
> On 12/29/2016 08:33 AM, Jean-Baptiste Onofré wrote:
>
>> Hi everyone,
>>
>> Please review and vote on the release candidate #1 for the version
>> 0.4.0, as follows:
>>
>> [ ] +1, Approve the release
>> [ ] -1, Do not approve the release (please provide specific comments)
>>
>> The complete staging area is available for your review, which includes:
>> * JIRA release notes [1],
>> * the official Apache source release to be deployed to dist.apache.org
>> [2], which is signed with the key with fingerprint C8282E76 [3],
>> * all artifacts to be deployed to the Maven Central Repository [4],
>> * source code tag "v1.2.3-RC3" [5],
>> * website pull request listing the release and publishing the API
>> reference manual [6].
>>
>> The vote will be open for at least 72 hours. It is adopted by majority
>> approval, with at least 3 PPMC affirmative votes.
>>
>> Thanks,
>> JB
>>
>> [1]
>> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12338590
>>
>> [2] https://dist.apache.org/repos/dist/dev/beam/0.4.0/
>> [3] https://dist.apache.org/repos/dist/release/beam/KEYS
>> [4] https://repository.apache.org/content/repositories/orgapachebeam-1009/
>> [5]
>> https://git-wip-us.apache.org/repos/asf?p=beam.git;a=tag;h=ab73a243ccfdae18f81435bfcf9de21c195fef4d
>>
>> [6] https://github.com/apache/beam-site/pull/117
>>
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: PCollection to PCollection Conversion

2016-12-29 Thread Dan Halperin
On Thu, Dec 29, 2016 at 2:10 PM, Jesse Anderson <je...@smokinghand.com>
wrote:

> I agree MapElements isn't hard to use. I think there is a demand for this
> built-in conversion.
>
> My thought on the formatter is that, worst case, we could do runtime type
> checking. It would be ugly and not as performant, but it should work. As
> we've said, we'd point them to MapElements for better code. We'd write the
> JavaDoc accordingly.
>

I think it will be good to see these proposals in PR form. I would stay far
away from reflection and varargs if possible, but properly-typed bits of
code (possibly exposed as SerializableFunctions in ToString?) would
probably make sense.

In the short-term, I can't find anyone arguing against a ToString.create()
that simply does input.toString().

To get started, how about we ask Vikas to clean up the PR to be more
future-proof for now? Aka make `ToString` itself not a PTransform,  but
instead ToString.create() returns ToString.Default which is a private class
implementing what ToString is now (PTransform<T, String>, wrapping
MapElements).

Then we can send PRs adding new features to that.
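
A rough sketch of that shape (names taken from the proposal above; the
details are assumed, not the final merged API):

  import org.apache.beam.sdk.transforms.MapElements;
  import org.apache.beam.sdk.transforms.PTransform;
  import org.apache.beam.sdk.transforms.SimpleFunction;
  import org.apache.beam.sdk.values.PCollection;

  public final class ToString {
    private ToString() {}

    // Factory method; leaves room to add kv()/iterable() variants later
    // without breaking callers.
    public static <T> PTransform<PCollection<T>, PCollection<String>> create() {
      return new Default<>();
    }

    // Private implementation: equivalent to a ParDo emitting toString(),
    // wrapping MapElements as suggested.
    private static class Default<T>
        extends PTransform<PCollection<T>, PCollection<String>> {
      @Override
      public PCollection<String> expand(PCollection<T> input) {
        return input.apply(
            MapElements.via(
                new SimpleFunction<T, String>() {
                  @Override
                  public String apply(T element) {
                    return element.toString();
                  }
                }));
      }
    }
  }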

> IME and to Ben's point, these will mostly be used in development. Some of
> our assumptions will break down when programmers aren't the ones using
> Beam. I can see from the user traffic already that not everyone using Beam
> is a programmer and they'll need classes like this to be productive.


> On Thu, Dec 29, 2016 at 1:46 PM Dan Halperin <dhalp...@google.com.invalid>
> wrote:
>
> On Thu, Dec 29, 2016 at 1:36 PM, Jesse Anderson <je...@smokinghand.com>
> wrote:
>
> > I prefer JB's take. I think there should be three overloaded methods on
> the
> > class. I like Vikas' name ToString. The methods for a simple conversion
> > should be:
> >
> > ToString.strings() - Outputs the .toString() of the objects in the
> > PCollection
> > ToString.strings(String delimiter) - Outputs the .toString() of KVs,
> Lists,
> > etc with the delimiter between every entry
> > ToString.formatted(String format) - Outputs the formatted
> > <https://docs.oracle.com/javase/8/docs/api/java/util/Formatter.html>
> > string
> > with the object passed in. For objects made up of different parts like
> KVs,
> > each one is passed in as separate toString() of a varargs.
> >
>
> Riffing a little, with some types:
>
> ToString.of() -- PTransform<T, String> that is equivalent to a ParDo
> that takes in a T and outputs T.toString().
>
> ToString.<K,V>kv(String delimiter) -- PTransform<KV<K, V>, String> that is
> equivalent to a ParDo that takes in a KV<K,V> and outputs
> kv.getKey().toString() + delimiter + kv.getValue().toString()
>
> ToString.iterable(String delimiter) -- PTransform<Iterable<T>,
> String> that is equivalent to a ParDo that takes in an Iterable<T> and
> outputs the iterable[0] + delimiter + iterable[1] + delimiter + ... +
> delimiter + iterable[N-1]
>
> ToString.custom(SerializableFunction<T, String> formatter) ?
>
> The last one is just MapElements.via, except you don't need to set the
> output type.
>
> I don't see a way to make the generic .formatted() that you propose that
> just works with anything "made of different parts".
>
> I think adding too many overrides beyond "of" and "custom" is opening
> up a Pandora's Box. The KV one might want to have left and right
> delimiters, might want to take custom formatters for K and V, etc. etc. The
> iterable one might want to have a special configuration for an empty
> iterable. So I'm inclined towards simplicity with the awareness that
> MapElements.via is just not that hard to use.
>
> Dan
>
>
> >
> > I think doing these three methods would cover every simple and advanced
> > "simple conversions." As JB says, we'll need other specific converters
> for
> > other formats like XML.
> >
> > I'd really like to see this class in the next version of Beam. What does
> > everyone think of the class name, methods name, and method operations so
> we
> > can have Vikas finish up?
> >
> > Thanks,
> >
> > Jesse
> >
> > On Wed, Dec 28, 2016 at 12:28 PM Jean-Baptiste Onofré <j...@nanthrax.net>
> > wrote:
> >
> > > Hi Vikas,
> > >
> > > did you take a look on:
> > >
> > >
> > > https://github.com/jbonofre/beam/tree/DATAFORMAT/sdks/java/extensions/dataformat
> > >
> > > You can see KV2String and ToString could be part of this extension.
> > > I'm also using JAXB for XML and Jackson for JSON
> > > marshalling/unmarshalling. I'm planning to de

Re: Build failed in Jenkins: beam_PostCommit_Java_RunnableOnService_Spark #574

2016-12-29 Thread Dan Halperin
Manual build by me in release testing -- I entered the wrong tag. Please
ignore.

On Thu, Dec 29, 2016 at 2:30 PM, Apache Jenkins Server <
jenk...@builds.apache.org> wrote:

> See <https://builds.apache.org/job/beam_PostCommit_Java_RunnableOnService_Spark/574/>
>
> --
> Started by user dhalperi
> [EnvInject] - Loading node environment variables.
> Building remotely on beam1 (beam) in workspace <https://builds.apache.org/job/beam_PostCommit_Java_RunnableOnService_Spark/ws/>
>  > git rev-parse --is-inside-work-tree # timeout=10
> Fetching changes from the remote Git repository
>  > git config remote.origin.url https://github.com/apache/beam.git #
> timeout=10
> Fetching upstream changes from https://github.com/apache/beam.git
>  > git --version # timeout=10
>  > git -c core.askpass=true fetch --tags --progress
> https://github.com/apache/beam.git +refs/heads/*:refs/remotes/origin/*
> +refs/pull/*:refs/remotes/origin/pr/*
>  > git rev-parse origin/masterv0.4.0-RC1^{commit} # timeout=10
>  > git rev-parse masterv0.4.0-RC1^{commit} # timeout=10
> ERROR: Couldn't find any revision to build. Verify the repository and
> branch configuration for this job.
> Retrying after 10 seconds
>  > git rev-parse --is-inside-work-tree # timeout=10
> Fetching changes from the remote Git repository
>  > git config remote.origin.url https://github.com/apache/beam.git #
> timeout=10
> Fetching upstream changes from https://github.com/apache/beam.git
>  > git --version # timeout=10
>  > git -c core.askpass=true fetch --tags --progress
> https://github.com/apache/beam.git +refs/heads/*:refs/remotes/origin/*
> +refs/pull/*:refs/remotes/origin/pr/*
>  > git rev-parse origin/masterv0.4.0-RC1^{commit} # timeout=10
>  > git rev-parse masterv0.4.0-RC1^{commit} # timeout=10
> ERROR: Couldn't find any revision to build. Verify the repository and
> branch configuration for this job.
> Retrying after 10 seconds
>  > git rev-parse --is-inside-work-tree # timeout=10
> Fetching changes from the remote Git repository
>  > git config remote.origin.url https://github.com/apache/beam.git #
> timeout=10
> Fetching upstream changes from https://github.com/apache/beam.git
>  > git --version # timeout=10
>  > git -c core.askpass=true fetch --tags --progress
> https://github.com/apache/beam.git +refs/heads/*:refs/remotes/origin/*
> +refs/pull/*:refs/remotes/origin/pr/*
>  > git rev-parse origin/masterv0.4.0-RC1^{commit} # timeout=10
>  > git rev-parse masterv0.4.0-RC1^{commit} # timeout=10
> ERROR: Couldn't find any revision to build. Verify the repository and
> branch configuration for this job.
>
>


Re: PCollection to PCollection Conversion

2016-12-29 Thread Dan Halperin
On Thu, Dec 29, 2016 at 1:36 PM, Jesse Anderson 
wrote:

> I prefer JB's take. I think there should be three overloaded methods on the
> class. I like Vikas' name ToString. The methods for a simple conversion
> should be:
>
> ToString.strings() - Outputs the .toString() of the objects in the
> PCollection
> ToString.strings(String delimiter) - Outputs the .toString() of KVs, Lists,
> etc with the delimiter between every entry
> ToString.formatted(String format) - Outputs the formatted
> <https://docs.oracle.com/javase/8/docs/api/java/util/Formatter.html>
> string
> with the object passed in. For objects made up of different parts like KVs,
> each one is passed in as separate toString() of a varargs.
>

Riffing a little, with some types:

ToString.of() -- PTransform<T, String> that is equivalent to a ParDo
that takes in a T and outputs T.toString().

ToString.<K,V>kv(String delimiter) -- PTransform<KV<K, V>, String> that is
equivalent to a ParDo that takes in a KV<K,V> and outputs
kv.getKey().toString() + delimiter + kv.getValue().toString()

ToString.iterable(String delimiter) -- PTransform<Iterable<T>, String>
that is equivalent to a ParDo that takes in an Iterable<T> and
outputs the iterable[0] + delimiter + iterable[1] + delimiter + ... +
delimiter + iterable[N-1]

ToString.custom(SerializableFunction<T, String> formatter) ?

The last one is just MapElements.via, except you don't need to set the
output type.

I don't see a way to make the generic .formatted() that you propose that
just works with anything "made of different parts".

I think adding too many overrides beyond "of" and "custom" is opening
up a Pandora's Box. The KV one might want to have left and right
delimiters, might want to take custom formatters for K and V, etc. etc. The
iterable one might want to have a special configuration for an empty
iterable. So I'm inclined towards simplicity with the awareness that
MapElements.via is just not that hard to use.

Dan
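
For comparison, the MapElements route referred to above looks roughly like
this (the KV<String, Long> element type and FormatExample wrapper are just
illustrative assumptions):

  import org.apache.beam.sdk.transforms.MapElements;
  import org.apache.beam.sdk.transforms.SimpleFunction;
  import org.apache.beam.sdk.values.KV;
  import org.apache.beam.sdk.values.PCollection;

  class FormatExample {
    static PCollection<String> format(PCollection<KV<String, Long>> counts) {
      // Explicit, fully typed formatter -- no reflection or varargs; the
      // output type is inferred from the SimpleFunction.
      return counts.apply(
          MapElements.via(
              new SimpleFunction<KV<String, Long>, String>() {
                @Override
                public String apply(KV<String, Long> kv) {
                  return kv.getKey() + ": " + kv.getValue();
                }
              }));
    }
  }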


>
> I think doing these three methods would cover every simple and advanced
> "simple conversions." As JB says, we'll need other specific converters for
> other formats like XML.
>
> I'd really like to see this class in the next version of Beam. What does
> everyone think of the class name, methods name, and method operations so we
> can have Vikas finish up?
>
> Thanks,
>
> Jesse
>
> On Wed, Dec 28, 2016 at 12:28 PM Jean-Baptiste Onofré 
> wrote:
>
> > Hi Vikas,
> >
> > did you take a look on:
> >
> >
> > https://github.com/jbonofre/beam/tree/DATAFORMAT/sdks/java/extensions/dataformat
> >
> > You can see KV2String and ToString could be part of this extension.
> > I'm also using JAXB for XML and Jackson for JSON
> > marshalling/unmarshalling. I'm planning to deal with Avro
> (IndexedRecord).
> >
> > Regards
> > JB
> >
> > On 12/28/2016 08:37 PM, Vikas Kedigehalli wrote:
> > > Hi All,
> > >
> > >   Not being aware of the discussion here, I sent out a PR
> > >  but JB and others directed
> > me to
> > > this thread. Having converted PCollection to PCollection
> > several
> > > times, I feel something like 'ToString' transform is common enough to
> be
> > > part of the core. What do you all think?
> > >
> > > Also, if someone else is already working on or interested in tackling
> > this,
> > > then I am happy to discard the PR.
> > >
> > > Regards,
> > > Vikas
> > >
> > > On Tue, Dec 13, 2016 at 1:56 AM, Amit Sela 
> wrote:
> > >
> > >> It seems that there were a lot of good points raised here, and I tend
> to
> > >> agree that something as trivial and lean as "ToString" should be a
> part
> > of
> > >> core.
> > >> I'm particularly fond of makeString(prefix, toString, suffix) in
> various
> > >> combinations (Scala-like).
> > >> For "fromString", I think JB has a good point leveraging JAXB and
> > Jackson -
> > >> though I think this should be in extensions as it is not as lean as
> > >> toString.
> > >>
> > >> Thanks,
> > >> Amit
> > >>
> > >> On Wed, Nov 30, 2016 at 5:13 AM Jean-Baptiste Onofré  >
> > >> wrote:
> > >>
> > >>> Hi Jesse,
> > >>>
> > >>> yes, I started something there (using JAXB and Jackson). Let me
> polish
> > >>> and push.
> > >>>
> > >>> Regards
> > >>> JB
> > >>>
> > >>> On 11/29/2016 10:00 PM, Jesse Anderson wrote:
> >  I went through the string conversions. Do you have an example of
> > >> writing
> >  out XML/JSON/etc too?
> > 
> >  On Tue, Nov 29, 2016 at 3:46 PM Jean-Baptiste Onofré <
> j...@nanthrax.net
> > >
> >  wrote:
> > 
> > > Hi Jesse,
> > >
> > >
> > >
> > >>> https://github.com/jbonofre/incubator-beam/tree/DATAFORMAT/sdks/java/extensions/dataformat
> > >
> > > it's very simple and stupid and of course not complete at all (I
> have
> > > other commits but not merged as they need some polishing), but as I
> > > said, it's a base of discussion.
> > >
> > > 

Re: Running a Specific Test

2016-12-29 Thread Dan Halperin
If you'd like early eyes on the blog post, let us know. Happy to review!

One thing worth noting: we've tried to structure Beam so that the pain is
mostly limited to the core. Many modules have module-specific unit tests
that use DirectRunner directly. The module simply has a test dependency on
DirectRunner, and unit tests that expect the DirectRunner to be there "just
work". It's only the 2 modules the DirectRunner depends on directly
(sdk-core and runners-core) that have this pain.

Now for tests that should work on *any* runner, there is similar
customization -- @RunnableOnService (today, some better name tomorrow) and
runnable-on-service-tests, etc. etc.

Dan
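
As an illustration of that pattern, a sketch of a core-SDK test
(hypothetical test class; current API names assumed) that only executes
when a runner module picks it up:

  import org.apache.beam.sdk.testing.NeedsRunner;
  import org.apache.beam.sdk.testing.PAssert;
  import org.apache.beam.sdk.testing.TestPipeline;
  import org.apache.beam.sdk.transforms.Create;
  import org.apache.beam.sdk.transforms.Regex;
  import org.apache.beam.sdk.values.PCollection;
  import org.junit.Rule;
  import org.junit.Test;
  import org.junit.experimental.categories.Category;

  public class RegexishTest {
    @Rule public final transient TestPipeline p = TestPipeline.create();

    @Test
    @Category(NeedsRunner.class) // runs only where a runner is on the test classpath
    public void testFind() {
      PCollection<String> output =
          p.apply(Create.of("a1", "b2")).apply(Regex.find("[0-9]"));
      PAssert.that(output).containsInAnyOrder("1", "2");
      p.run();
    }
  }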

On Thu, Dec 29, 2016 at 12:42 PM, Jesse Anderson <je...@smokinghand.com>
wrote:

> Thanks to everyone for their help. I'm writing a blog about the various
> Maven things you need to know with Beam.
>
> @Dan that command line worked. Thanks!
>
> On Thu, Dec 29, 2016 at 11:23 AM Stas Levin <stasle...@gmail.com> wrote:
>
> > I believe you raise a good point :)
> >
> > On Thu, Dec 29, 2016 at 9:00 PM Dan Halperin <dhalp...@google.com.invalid
> >
> > wrote:
> >
> > > I suspect -- but may be wrong -- that the command line Stas gives will
> > use
> > > the *installed* version of beam-sdks-java-core. If you are iterating
> on a
> > > @NeedsRunner test in the SDK core, you will either need to reinstall it
> > > over and over again, or use `-am` to force recompilation of the core.
> > >
> > > Here is a command that works for me. Please criticize :)
> > >
> > > mvn -Dtest=org.apache.beam.sdk.transforms.RegexTest
> -DfailIfNoTests=false
> > > -pl runners/direct-java -am integration-test
> > >
> > > Note that this is an `integration-test`, not a `test` because it tests
> > the
> > > integration of the SDK with the DirectRunner:
> > >
> > https://github.com/apache/beam/blob/master/runners/direct-java/pom.xml#L64
> > >
> > > Dan
> > >
> > > On Thu, Dec 29, 2016 at 10:53 AM, Stas Levin <stasle...@gmail.com>
> > wrote:
> > >
> > > > P.S
> > > > You can also do this from the main directory (without cd-ing into the
> > > > direct-runner):
> > > >
> > > > "mvn test -Dtest=RegexTest
> > > > -DdependenciesToScan=org.apache.beam:beam-sdks-java-core -pl
> > > > runners/direct-java"
> > > >
> > > > On Thu, Dec 29, 2016 at 8:50 PM Stas Levin <stasle...@gmail.com>
> > wrote:
> > > >
> > > > > Once you "cd" into "runners/direct-java" you can use:
> > > > >
> > > > > "mvn test -Dtest=RegexTest
> > > > > -DdependenciesToScan=org.apache.beam:beam-sdks-java-core"
> > > > >
> > > > > -Stas
> > > > >
> > > > > On Thu, Dec 29, 2016 at 8:27 PM Jesse Anderson <
> > je...@smokinghand.com>
> > > > > wrote:
> > > > >
> > > > > I tried that one already. It gives a no tests run error. If you
> > bypass
> > > > that
> > > > > error with -DfailIfNoTests=false, no tests get run at all.
> > > > >
> > > > > On Thu, Dec 29, 2016 at 10:20 AM Jean-Baptiste Onofré <
> > j...@nanthrax.net
> > > >
> > > > > wrote:
> > > > >
> > > > > > Hi Jesse
> > > > > >
> > > > > > Mvn test -Dtest=RegexTest
> > > > > >
> > > > > > Should work
> > > > > >
> > > > > > Don't forget the test goal. And no need to provide the fqcn.
> > > > > >
> > > > > > Regards
> > > > > > JB
> > > > > >
> > > > > > On Dec 29, 2016, 18:55, at 18:55, Jesse Anderson <
> > > > je...@smokinghand.com>
> > > > > > wrote:
> > > > > > >Does anyone know the Maven way to run a specific unit test with
> > > Beam?
> > > > > > >I've
> > > > > > >tried:
> > > > > > >mvn -Dtest=org.apache.beam.sdk.transforms.RegexTest
> > > > > > >-DfailIfNoTests=false
> > > > > > >-Dgroups="org.apache.beam.sdk.testing.NeedsRunner" -pl
> > > > > > >org.apache.beam:beam-sdks-java-core test
> > > > > > >
> > > > > > >The test still doesn't run. Does anyone know what I'm missing?
> > > > > > >
> > > > > > >Thanks,
> > > > > > >
> > > > > > >Jesse
> > > > > >
> > > > >
> > > > >
> > > >
> > >
> >
>