Re: Azure(ADLS) compatibility on Beam with Spark runner

2017-11-28 Thread Udi Meiri
Hi JB,
I'm working on adding HDFS support to the Python runner.
We're planning on using libhdfs3, which doesn't seem to support anything
other than HDFS.
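As context for what an HDFS filesystem layer needs regardless of the client library underneath, here is a toy sketch of hdfs:// path handling; the class and method names are illustrative only, not Beam's or libhdfs3's actual API:

```python
from urllib.parse import urlparse

class HdfsPath:
    """Toy helper for hdfs:// URL handling (illustrative names, not Beam's API)."""

    SCHEME = 'hdfs'

    @classmethod
    def parse(cls, url):
        # Split an hdfs://host:port/path URL into (host, port, path).
        parsed = urlparse(url)
        if parsed.scheme != cls.SCHEME:
            raise ValueError('not an hdfs:// URL: %r' % url)
        return parsed.hostname, parsed.port, parsed.path

    @staticmethod
    def join(base, *parts):
        # Join path components, collapsing duplicate slashes.
        return '/'.join([base.rstrip('/')] + [p.strip('/') for p in parts])

print(HdfsPath.parse('hdfs://namenode:8020/user/beam/input.txt'))
print(HdfsPath.join('hdfs://namenode:8020/user', 'beam', 'out'))
```

Whatever client is chosen, this parsing layer stays the same; only the read/write calls differ.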


On Mon, Nov 27, 2017 at 12:44 PM Lukasz Cwik 
wrote:

> Out of curiosity, does using the DirectRunner with ADL work for you?
> If not, then you'll be able to debug locally why it's failing.
>
> On Fri, Nov 24, 2017 at 8:09 PM, Milan Chandna <
> milan.chan...@microsoft.com.invalid> wrote:
>
> > Hi JB,
> >
> > Thanks for the updates.
> > BTW I am myself at Microsoft, but I am trying this out of my own interest.
> > And it's good to know that someone else is also working on this.
> >
> > -Milan.
> >
> > -Original Message-
> > From: Jean-Baptiste Onofré [mailto:j...@nanthrax.net]
> > Sent: Thursday, November 23, 2017 1:47 PM
> > To: dev@beam.apache.org
> > Subject: Re: Azure(ADLS) compatibility on Beam with Spark runner
> >
> > The Azure guys tried to use ADLS via Beam HDFS filesystem, but it seems
> > they didn't succeed.
> > The new approach we plan is to directly use the ADLS API.
> >
> > I keep you posted.
> >
> > Regards
> > JB
> >
> > On 11/23/2017 07:42 AM, Milan Chandna wrote:
> > > I tried both the ways.
> > > I passed the ADL-specific configuration in --hdfsConfiguration and
> > have set up core-site.xml/hdfs-site.xml as well.
> > > As I mentioned, it's an HDI + Spark cluster, so those things are already
> > set up.
> > > A Spark job (without Beam) is also able to read and write to ADLS on the
> > same machine.
> > >
> > > BTW if authentication or understanding ADL were the problem, it would
> > > have thrown an error like ADLFileSystem missing, or an access failure or
> > > something similar. Thoughts?
> > >
> > > -Milan.
> > >
> > > -Original Message-
> > > From: Lukasz Cwik [mailto:lc...@google.com.INVALID]
> > > Sent: Thursday, November 23, 2017 5:05 AM
> > > To: dev@beam.apache.org
> > > Subject: Re: Azure(ADLS) compatibility on Beam with Spark runner
> > >
> > > In your example it seems as though your HDFS configuration doesn't
> > contain any ADL specific configuration:  "--hdfsConfiguration='[{\"fs.
> > defaultFS\":
> > > \"hdfs://home/sample.txt\"]'"
> > > Do you have a core-site.xml or hdfs-site.xml configured as per:
> > > https://hadoop.apache.org/docs/current/hadoop-azure-datalake/index.html
> > >
> > >  From the documentation for --hdfsConfiguration:
> > > A list of Hadoop configurations used to configure zero or more Hadoop
> > filesystems. By default, Hadoop configuration is loaded from
> > 'core-site.xml' and 'hdfs-site.xml' based upon the HADOOP_CONF_DIR and
> > YARN_CONF_DIR environment variables. To specify configuration on the
> > command-line, represent the value as a JSON list of JSON maps, where each
> > map represents the entire configuration for a single Hadoop filesystem.
> For
> > example --hdfsConfiguration='[{\"fs.default.name\":
> > > \"hdfs://localhost:9998\", ...},{\"fs.default.name\": \"s3a://\",
> > ...},...]'
> > > From:
> > > https://github.com/apache/beam/blob/9f81fd299bd32e0d6056a7da9fa994cf74db0ed9/sdks/java/io/hadoop-file-system/src/main/java/org/apache/beam/sdk/io/hdfs/HadoopFileSystemOptions.java#L45
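To illustrate the documented format, the --hdfsConfiguration flag value is just a JSON list of maps, so it can be generated programmatically. The ADLS property names below follow the hadoop-azure-datalake documentation and are an assumption here, not something confirmed in this thread; the credential values are placeholders:

```python
import json

# Hypothetical ADLS configuration -- property names per hadoop-azure-datalake
# docs (assumption), credential values are placeholders.
adls_conf = {
    'fs.adl.impl': 'org.apache.hadoop.fs.adl.AdlFileSystem',
    'dfs.adls.oauth2.access.token.provider.type': 'ClientCredential',
    'dfs.adls.oauth2.client.id': 'CLIENT_ID',
    'dfs.adls.oauth2.credential': 'CLIENT_SECRET',
    'dfs.adls.oauth2.refresh.url': 'https://login.microsoftonline.com/TENANT/oauth2/token',
}

# --hdfsConfiguration expects a JSON list with one map per Hadoop filesystem.
flag = "--hdfsConfiguration='%s'" % json.dumps([adls_conf])
print(flag)
```

The point is that the map must carry the ADL-specific keys, not just fs.defaultFS, for the Hadoop filesystem to know how to talk to ADLS.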
> > >
> > > On Wed, Nov 22, 2017 at 1:12 AM, Jean-Baptiste Onofré
> > > 
> > > wrote:
> > >
> > >> Hi,
> > >>
> > >> FYI, I'm in touch with Microsoft Azure team about that.
> > >>
> > >> We are testing the ADLS support via HDFS.
> > >>
> > >> I keep you posted.
> > >>
> > >> Regards
> > >> JB
> > >>
> > >> On 11/22/2017 09:12 AM, Milan Chandna wrote:
> > >>
> > >>> Hi,
> > >>>
> > >>> Has anyone tried IO from(to) ADLS account on Beam with Spark runner?
> > >>> I was trying recently to do this but was unable to make it work.
> > >>>
> > >>> Steps that I tried:
> > >>>
> > >>> 1.  Took HDI + Spark 1.6 cluster with default storage as ADLS
> > account.
> > >>> 2.  Built Apache Beam on that, including the BEAM-2790 fix
> > >>> <https://issues.apache.org/jira/browse/BEAM-2790>, which I had been
> > >>> hitting for ADL earlier as well.
> > >>> 3.  Modified WordCount.java example to use
> 

Re: [jira] [Commented] (BEAM-3357) Python SDK head fails to run tests due to Requirement.parse('protobuf<=3.4.0,>=3.2.0')

2017-12-15 Thread Udi Meiri
Yeah, I guess that's the difference between releasing source vs running
your own service.

On Fri, Dec 15, 2017 at 2:58 PM Ahmet Altay <al...@google.com> wrote:

> I agree with Robert. Also there is usually a workaround to unbreak a
> previous version by installing a specific version of an offending
> dependency.
>
>
> On Fri, Dec 15, 2017 at 2:53 PM, Robert Bradshaw <rober...@google.com>
> wrote:
>
>> This has the downside of pinning the dependencies for all downstream
>> projects, making it impossible for them to use different versions than
>> the ones we happened to choose. (Imagine the pain if two or more of
>> our dependencies pinned all their dependencies...)
>>
>> On Fri, Dec 15, 2017 at 2:48 PM, Udi Meiri <eh...@google.com> wrote:
>> > +1 to pinning to exact versions, to be sure that our releases do not
>> break
>> > when newer versions of dependencies are released.
>> >
>> > On Fri, Dec 15, 2017 at 2:44 PM Ahmet Altay <al...@google.com> wrote:
>> >>
>> >> On Fri, Dec 15, 2017 at 2:42 PM, Chamikara Jayalath <
>> chamik...@google.com>
>> >> wrote:
>> >>>
>> >>> +1 for automating the process of checking for possible version bumps.
>> >>>
>> >>> Also, what do you think about pinning dependencies to exact versions
>> >>> (instead of ranges) after cutting a release branch? This should improve
>> >>> the stability of released SDKs (but not a perfect solution, since
>> >>> transitive dependencies can still change).
>> >>
>> >>
>> >> This is a reasonable suggestion. The issue with that is, by being less
>> >> flexible we will prevent users from using the latest versions of
>> dependencies.
>> >> On the other hand it will prevent breaking of already released
>> versions.
>> >>
>> >>>
>> >>>
>> >>> Thanks,
>> >>> Cham
>> >>>
>> >>> On Fri, Dec 15, 2017 at 2:19 PM Ahmet Altay <al...@google.com> wrote:
>> >>>>
>> >>>> On Fri, Dec 15, 2017 at 2:02 PM, Robert Bradshaw <
>> rober...@google.com>
>> >>>> wrote:
>> >>>>>
>> >>>>> On Fri, Dec 15, 2017 at 1:51 PM, Ahmet Altay <al...@google.com>
>> wrote:
>> >>>>> >
>> >>>>> > On Fri, Dec 15, 2017 at 1:38 PM, Robert Bradshaw
>> >>>>> > <rober...@google.com>
>> >>>>> > wrote:
>> >>>>> >>
>> >>>>> >> I am also in favor of pinning as an immediate fix, bumping the
>> bound
>> >>>>> >> otherwise.
>> >>>>> >>
>> >>>>> >> Regarding putting an upper bound to avoid being broken, the last
>> two
>> >>>>> >> breaks have been due to just having an (unneeded) upper bound
>> (which
>> >>>>> >> held us back to broken/incompatible releases in relationship to
>> >>>>> >> other
>> >>>>> >> dependencies). We should try to trust semantic versioning when
>> >>>>> >> possible, and when not we must regularly audit.
>> >>>>> >
>> >>>>> > +1 to this, especially the auditing part. We also had breaks
>> because
>> >>>>> > we
>> >>>>> > trusted semantic versioning. So far our semi-official policy was
>> to
>> >>>>> > trust a
>> >>>>> > package until they prove it otherwise. I will argue that grpc
>> here is
>> >>>>> > making
>> >>>>> > a breaking change in a minor version increment by changing the way
>> >>>>> > they are
>> >>>>> > depending on a major package.
>> >>>>>
>> >>>>> A minor version bump should be allowed to require a minor version
>> bump
>> >>>>> in its dependencies.
>> >>>>>
>> >>>>> > We have done a good job of auditing and updating those pinned (or
>> >>>>> > upper
>> >>>>> > bounded) dependencies, and probably we are behind in some of
>> those.
>> >>>>> >
>> >>>>> > I wonder if we can automate some of this? If we can

HDFS Support for Python SDK

2017-11-20 Thread Udi Meiri
Hi,

I've done some research into implementing HDFS support for Python SDK and
I'd like your input. This work is regarding BEAM-3099
<https://issues.apache.org/jira/browse/BEAM-3099>.

This doc lists several options for implementing HDFS support and attempts
to weigh the differences.
https://docs.google.com/document/d/1-uzKf4VPlGrkBMXM00sxxf3K01Ss3ZzXeju0w5L0LY0/edit?usp=sharing


Thanks,
Udi


smime.p7s
Description: S/MIME Cryptographic Signature


Pubsub on directrunner: direct_runner.py and transform_evaluator.py

2018-04-27 Thread Udi Meiri
Hi,
I'm having trouble understanding why there's an extra level of indirection
when doing pubsub reads via directrunner vs writes.

For reads, we have these translations:
beam_pubsub.ReadFromPubSub ->
direct_runner._DirectReadFromPubSub ->
transform_evaluator._PubSubReadEvaluator

For writes, this is abbreviated:
beam_pubsub.WriteStringsToPubSub ->
_DirectWriteToPubSub

What is the role of transform_evaluator._TransformEvaluator?
Why do we need it for reads and not for writes?
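For what it's worth, the extra hop looks like generic evaluator machinery that the DirectRunner applies to source-like transforms. A toy version of that three-level indirection, with illustrative names only (not Beam's actual classes):

```python
class ReadFromFakeSub:
    """User-facing transform (stands in for beam_pubsub.ReadFromPubSub)."""
    def __init__(self, topic):
        self.topic = topic

class _DirectRead:
    """Runner-specific override the DirectRunner swaps in for the user transform."""
    def __init__(self, source):
        self.source = source

class _ReadEvaluator:
    """Evaluator the runner invokes per bundle to actually pull data."""
    def __init__(self, transform, data):
        self.transform = transform
        self.data = data
    def process(self):
        # Emit (topic, message) pairs as a stand-in for real bundle output.
        return [(self.transform.source.topic, m) for m in self.data]

# Pipeline wiring: user transform -> runner override -> per-bundle evaluator.
user_read = ReadFromFakeSub('projects/p/topics/t')
override = _DirectRead(user_read)
evaluator = _ReadEvaluator(override, ['msg1', 'msg2'])
print(evaluator.process())
```

Writes can skip the evaluator level because they can run as a plain DoFn, while reads need per-bundle re-evaluation by the runner, which may be why only reads go through the evaluator layer.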


smime.p7s
Description: S/MIME Cryptographic Signature


Documentation for Beam on Windows

2018-05-23 Thread Udi Meiri
Hi all,

I was looking yesterday for a quickstart guide on how to use Beam on
Windows but saw that those guides are exclusively for Linux users.

What documentation is available for people wanting to use Beam on Windows
machines?

Thanks!


smime.p7s
Description: S/MIME Cryptographic Signature


Proposal: keeping precommit times fast

2018-05-17 Thread Udi Meiri
Hi,
I have a proposal to improve contributor experience by keeping precommit
times low.

I'm looking to get community consensus and approval about:
1. How long should precommits take? 2 hours @95th percentile over the past
4 weeks is the current proposal.
2. The process for dealing with slowness. Do we: fix, roll back, remove a
test from precommit?
Rolling back if a fix is estimated to take longer than 2 weeks is the
current proposal.

https://docs.google.com/document/d/1udtvggmS2LTMmdwjEtZCcUQy6aQAiYTI3OrTP8CLfJM/edit?usp=sharing
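As a sketch, the proposed "95th percentile over the past 4 weeks" check is easy to compute from a list of job durations; the threshold and sample data below are placeholders:

```python
import math

def percentile(durations_minutes, pct):
    # Nearest-rank percentile over sorted durations.
    ordered = sorted(durations_minutes)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]

# Placeholder durations for 4 weeks of precommit runs, in minutes.
runs = [45, 50, 55, 60, 70, 80, 90, 95, 100, 130]
p95 = percentile(runs, 95)
print('p95 = %d min, within 2h budget: %s' % (p95, p95 <= 120))
```

A single slow outlier run is enough to push the p95 over budget, which is exactly the kind of regression the proposal wants to surface.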


smime.p7s
Description: S/MIME Cryptographic Signature


Re: I'm back and ready to help grow our community!

2018-05-17 Thread Udi Meiri
Welcome back and congrats again!

On Thu, May 17, 2018 at 2:23 PM Dmitry Demeshchuk 
wrote:

> While this may be a bit off topic, I still want to say this.
>
> Congratulations on your graduation, Gris!
>
> On Thu, May 17, 2018 at 2:19 PM, Griselda Cuevas  wrote:
>
>> Hi Everyone,
>>
>>
>> I was absent from the mailing list, slack channel and our Beam community
>> for the past six weeks, the reason was that I took a leave to focus on
>> finishing my Masters Degree, which I finally did on May 15th.
>>
>>
>> I graduated as a Masters of Engineering in Operations Research with a
>> concentration in Data Science from UC Berkeley. I'm glad to be part of this
>> community and I'd like to share this accomplishment with you so I'm adding
>> two pictures of that day :)
>>
>>
>> Given that I've seen so many new folks around, I'd like to use this
>> opportunity to re-introduce myself. I'm Gris Cuevas and I work at Google.
>> Now that I'm back, I'll continue to work on supporting our community in two
>> main streams: Contribution Experience & Events, Meetups, and Conferences.
>>
>>
>> It's good to be back and I look forward to collaborating with you.
>>
>>
>> Cheers,
>>
>> Gris
>>
>
>
>
> --
> Best regards,
> Dmitry Demeshchuk.
>


smime.p7s
Description: S/MIME Cryptographic Signature


Re: [VOTE] Apache Beam, version 2.5.0, release candidate #1

2018-06-11 Thread Udi Meiri
Another bug: reading from PubSub with_attributes=True is broken on Python
with Dataflow.
https://issues.apache.org/jira/browse/BEAM-4536

JB, I'm making a PR that removes this keyword and I'd like to propose it as
a cherrypick to 2.5.0.
(feature should be fixed in the next release)

On Mon, Jun 11, 2018 at 6:19 PM Chamikara Jayalath 
wrote:

> FYI: looks like Python tests are failing for Windows. JIRA is
> https://issues.apache.org/jira/browse/BEAM-4535.
>
> I don't think this is a release blocker but this should probably go in
> release notes (for any user that tries to run tests on Python source
> build). And we should try to incorporate a fix if we happen to cut another
> release candidate for some reason.
>
> Thanks,
> Cham
>
> On Mon, Jun 11, 2018 at 5:46 PM Pablo Estrada  wrote:
>
>> Thanks everyone who has pitched in to validate the release!
>>
>> Boyuan Zhang and I have also run a few pipelines, and verified that they
>> work properly (see release validation spreadsheet[1]).
>>
>> We have also found that the Game Stats pipeline is failing in Python
>> Streaming Dataflow. I have filed BEAM-4534[2]. This is not a blocker, since
>> Python streaming is not yet fully supported.
>>
>> It seems that the uploaded artifacts look good.
>>
>> We have noticed that the Python artifacts are still missing Python wheel
>> files (compare [3] and [4]). JB, could you please add the wheel files?
>> Boyuan and I can try to help you prepare them / upload them if necessary.
>> Please let us know.
>>
>> Thanks again!
>> -P.
>>
>> [1]
>> https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=152451807
>> [2] https://issues.apache.org/jira/browse/BEAM-4534
>> [3] https://dist.apache.org/repos/dist/dev/beam/2.4.0/
>> [4] https://dist.apache.org/repos/dist/dev/beam/2.5.0/
>>
>> On Mon, Jun 11, 2018 at 12:37 PM Alan Myrvold 
>> wrote:
>>
>>> +1 (non-binding)
>>>
>>> tested some of the quickstarts
>>>
>>> On Sun, Jun 10, 2018 at 1:39 AM Tim  wrote:
>>>
 Tested by our team:
 - mvn inclusion
 - Avro, ES, Hadoop IF IO
 - Pipelines run on Spark (Cloudera 5.12.0 YARN cluster)
 - Reviewed release notes

 +1

 Thanks also to everyone who helped get over the gradle hurdle and in
 particular to JB.

 Tim

 > On 9 Jun 2018, at 05:56, Jean-Baptiste Onofré 
 wrote:
 >
 > No problem Pablo.
 >
 > The vote period is a minimum, it can be extended as requested or if we
 > don't have the minimum of 3 binding votes.
 >
 > Regards
 > JB
 >
 >> On 09/06/2018 01:54, Pablo Estrada wrote:
 >> Hello all,
 >> I'd like to request an extension of the voting period until Monday
 >> evening (US time, so later in other geographical regions). This is
 >> because we were only now able to publish Dataflow Workers, and have
 not
 >> had the chance to run release validation tests on them. The extension
 >> will allow us to validate and vote by Monday.
 >> Is this acceptable to the community?
 >>
 >> Best
 >> -P.
 >>
 >> On Fri, Jun 8, 2018 at 6:20 AM Alexey Romanenko
 >> mailto:aromanenko@gmail.com>> wrote:
 >>
 >>Thank you JB for your work!
 >>
 >>I tested running simple streaming (/KafkaIO/) and batch (/TextIO /
 >>HDFS/) pipelines with SparkRunner on YARN cluster - it works fine.
 >>
 >>WBR,
 >>Alexey
 >>
 >>
 >>>On 8 Jun 2018, at 10:00, Etienne Chauchot wrote:
 >>>
 >>>I forgot to vote:
 >>>+1 (non binding).
 >>>What I tested:
 >>>- no functional or performance regression comparing to v2.4
 >>>- dependencies in the poms are ok
 >>>
 >>>Etienne
 Le vendredi 08 juin 2018 à 08:27 +0200, Romain Manni-Bucau a
 écrit :
 +1 (non-binding), mainstream usage is not broken by the pom
 changes and runtime has no known regression compared to the
 2.4.0
 
 (side note: kudo to JB for this build tool change release, I
 know
 how it can hurt ;))
 
 Romain Manni-Bucau
 @rmannibucau | Blog | Old Blog | Github | LinkedIn | Book
 <https://www.packtpub.com/application-development/java-ee-8-high-performance>
 
 
 Le jeu. 7 juin 2018 à 16:17, Jean-Baptiste Onofré
 mailto:j...@nanthrax.net>> a écrit :
 >Thanks for the details Etienne !
 >
 >The good news is that the artifacts seem OK and the overall
 Nexmark
 >results are consistent with the 2.4.0 release ones.
 >
 

Re: Proposal: keeping precommit times fast

2018-06-07 Thread Udi Meiri
Would I need a vote on installing this plugin, or can I just open a ticket
to infra?

On Wed, Jun 6, 2018, 16:18 Robert Bradshaw  wrote:

> Even if it's not perfect, seems like it'd surely be a net win (and
> probably a large one). Also, the build cache should look back at more than
> just the single previous build, so if any previous jobs (up to the cache
> size limit) built/tested artifacts unchanged by the current PR, the results
> would live in the cache.
>
> I would look at (a) and (b) only if this isn't already good enough.
>
> On Wed, Jun 6, 2018 at 3:50 PM Udi Meiri  wrote:
>
>> To follow up on the Jenkins Job Cacher Plugin:
>>
>> Using a Jenkins plugin to save and reuse the Gradle cache for successive
>> precommit jobs.
>> The problem with this approach is that the precommit runs executed by a
>> Jenkins server are unrelated to each other.
>> Say you have 2 PRs, A and B, and the precommit job for B reuses the cache
>> left by the job for A.
>> The diff between the two will cause tests affected both by A and B to be
>> rerun (at least).
>> If A modifies Python code, then the job for B must rerun ALL Python tests
>> (since Gradle doesn't do dependency tracking for Python).
>>
>> Proposal:
>> a. The cache plugin is still useful for successive Java precommit jobs,
>> but not for Python. (Go, I have no idea)
>> We could use it exclusively for Java precommits.
>> b. To avoid running precommit jobs for code not touched by a PR, look at
>> the paths of files changed.
>> For example, a PR touching only files under sdks/python/... need only run
>> Python precommit tests.
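Proposal (b) above, mapping changed file paths to the precommit suites that need to run, could look something like the sketch below; the path prefixes are assumptions about the repo layout:

```python
# Map repo path prefixes to the precommit suites they affect (assumed layout).
SUITE_PREFIXES = {
    'sdks/python/': {'Python'},
    'sdks/java/': {'Java'},
    'sdks/go/': {'Go'},
}
ALL_SUITES = {'Java', 'Python', 'Go'}

def suites_for(changed_files):
    suites = set()
    for path in changed_files:
        for prefix, affected in SUITE_PREFIXES.items():
            if path.startswith(prefix):
                suites |= affected
                break
        else:
            # Unrecognized path (e.g. shared build files): be safe, run everything.
            return ALL_SUITES
    return suites

print(sorted(suites_for(['sdks/python/setup.py'])))
print(sorted(suites_for(['build.gradle', 'sdks/go/x.go'])))
```

The fall-through to ALL_SUITES matters: anything outside a recognized SDK directory could affect every suite, so it should trigger a full run.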
>>
>> On Tue, Jun 5, 2018 at 7:24 PM Udi Meiri  wrote:
>>
>>> I've been having a separate discussion on the proposal doc, which is
>>> ready for another round of reviews.
>>> Change summary:
>>> - Changed fast requirement to be < 30 minutes and simplify the check as
>>> an aggregate for each precommit job type.
>>> - Updated slowness notification methods to include automated methods: as
>>> a precommit check result type on GitHub, as a bug.
>>> - Merged in the metrics design doc.
>>> - Added detailed design section.
>>> - Added list of deliverables.
>>>
>>> What I would like is consensus regarding:
>>> - How fast we want precommit runs to be. I propose 30m.
>>> - Deadline for fixing a slow test before it is temporarily removed from
>>> precommit. I propose 24 hours.
>>>
>>>
>>> Replying to the thread:
>>>
>>> 1. I like the idea of using the Jenkins Job Cacher Plugin to skip
>>> unaffected tests (BEAM-4400).
>>>
>>> 2. Java Precommit tests include integration tests (example
>>> <https://builds.apache.org/view/A-D/view/Beam/job/beam_PreCommit_Java_GradleBuild/lastCompletedBuild/testReport/org.apache.beam.examples/>
>>> ).
>>> We could split these out to get much faster results, i.e., a separate
>>> precommit just for basic integration tests (which will still need to run in
>>> <30m).
>>> Perhaps lint checks for Python could be split out as well.
>>>
>>> I'll add these suggestions to the doc tomorrow.
>>>
>>> On Thu, May 24, 2018 at 9:25 AM Scott Wegner  wrote:
>>>
>>>> So, it sounds like there's agreement that we should improve precommit
>>>> times by only running necessary tests, and configuring Jenkins Job
>>>> Caching + Gradle build cache is a path to get there. I've filed BEAM-4400
>>>> [1] to follow-up on this.
>>>>
>>>> Getting back to Udi's original proposal [2]: I see value in defining a
>>>> metric and target for overall pre-commit timing. The proposal for an
>>>> initial "2 hour" target is helpful as a guardrail: we're already hitting
>>>> it, but if we drift to a point where we're not, that should trigger some
>>>> action to be taken to get back to a healthy state.
>>>>
>> >>>> I wouldn't mind separately setting a more aspirational goal of getting
>>>> the pre-commits even faster (i.e. 15-30 mins), but I suspect that would
>>>> require a concerted effort to evaluate and improve existing tests across
>> >>>> the codebase. One idea would be to ensure the metric reporting can
>>>> show the trend, and which tests are responsible for the most walltime, so
>>>> that we know where to invest any efforts to improve tests.
>>>>
>>>>
>>>> [1] https://issues.apache.org/jira/browse/BEAM-4400
>>>> [2]
>>

Re: [DISCUSS] Use Confluence wiki for non-user-facing stuff

2018-06-08 Thread Udi Meiri
(a) Yes.
(b) I'm interested in putting documentation for contributors there. (test
triage guide, precommit and postcommit guidelines, processes, etc.)
It'd be faster than having to go through the motions of a github pull
request and a review process.
(c) Anything that goes to a wide audience, such as SDK users. That would
need review.

Also, have you looked at https://wiki.apache.org/general/ ? (not sure if
that's Confluence)


On Fri, Jun 8, 2018 at 10:07 AM Andrew Pilloud  wrote:

> +1 It would be really nice to have a lightweight place to share that is
> more searchable then random Google docs.
>
> Andrew
>
> On Fri, Jun 8, 2018 at 9:35 AM Anton Kedin  wrote:
>
>> +1
>>
>> (a) we should;
>> (b) I think it will be a good place for all of the things you list;
>> (c) introductory things, like "getting started", or "programming guide"
>> that people not deeply involved in the project would expect to find on
>> beam.apache.org should stay there, not in the wiki;
>>
>> On Fri, Jun 8, 2018 at 12:56 AM Etienne Chauchot 
>> wrote:
>>
>>> Hi Kenn,
>>> I'm +1 of course. I remember that you and I and others discussed in a
>>> similar thread about dev facing docs but it got lost at some point in time.
>>> IMHO
>>>
>>> I would add
>>> - runners specifics (e.g. how runners implement state or timer, how they
>>> split data into bundles, etc...)
>>> - probably putting online the doc I did for nexmark that summarizes the
>>> architecture and pseudo code of the queries (because some are several
>>> thousand lines of code). I often use it to recall what a given query does.
>>>
>>> I would remove
>>>  - BIPs / summaries of collections of JIRA
>>> because it is hard to maintain and will end up being out of date I think.
>>>
>>> Etienne
>>>
>>> Le jeudi 07 juin 2018 à 13:23 -0700, Kenneth Knowles a écrit :
>>>
>>> Hi all,
>>>
>>> I've been in half a dozen conversations recently about whether to have a
>>> wiki and what to use it for. Some things I've heard:
>>>
>>>  - "why is all this stuff that users don't care about here?"
>>>  - "can we have a lighter weight place to put technical references for
>>> contributors"
>>>
>>> So I want to consider as a community starting up our wiki. Ideas for
>>> what could go there:
>>>
>>>  - Collection of links to design docs like
>>> https://beam.apache.org/contribute/design-documents/
>>>  - Specialized walkthroughs like
>>> https://beam.apache.org/contribute/docker-images/
>>>  - Best-effort notes that just try to help out like
>>> https://beam.apache.org/contribute/intellij/
>>>  - Docs on in-progress stuff like
>>> https://beam.apache.org/documentation/runners/jstorm/
>>>  - Expanded instructions for committers, more than
>>> https://beam.apache.org/contribute/committer-guide/
>>>  - BIPs / summaries of collections of JIRA
>>>  - Docs sitting in markdown in the repo like
>>> https://github.com/apache/beam/blob/master/sdks/CONTAINERS.md and
>>> https://github.com/apache/beam-site/blob/asf-site/README.md (which will
>>> soon not be a toplevel README)
>>>
>>> What do you think?
>>>
>>> (a) should we do it?
>>> (b) what should go there?
>>> (c) what should not go there?
>>>
>>> Kenn
>>>
>>>


smime.p7s
Description: S/MIME Cryptographic Signature


Re: [VOTE] Apache Beam, version 2.5.0, release candidate #1

2018-06-12 Thread Udi Meiri
Cherrypick created: https://github.com/apache/beam/pull/5607
Tests still running.

On Tue, Jun 12, 2018 at 9:59 AM Udi Meiri  wrote:

> -1: Would like to cherry pick a fix for
> https://issues.apache.org/jira/browse/BEAM-4536
>
> On Tue, Jun 12, 2018 at 9:48 AM Jean-Baptiste Onofré 
> wrote:
>
>> Any update about your vote and fix ?
>>
>> Thanks !
>> Regards
>> JB
>>
>> On 12/06/2018 04:02, Udi Meiri wrote:
>> > Another bug: reading from PubSub with_attributes=True is broken on
>> > Python with Dataflow.
>> > https://issues.apache.org/jira/browse/BEAM-4536
>> >
>> > JB, I'm making a PR that removes this keyword and I'd like to propose it
>> > as a cherrypick to 2.5.0.
>> > (feature should be fixed in the next release)
>> >
>> > On Mon, Jun 11, 2018 at 6:19 PM Chamikara Jayalath <
>> chamik...@google.com
>> > <mailto:chamik...@google.com>> wrote:
>> >
>> > FYI: looks like Python tests are failing for Windows. JIRA
>> > is https://issues.apache.org/jira/browse/BEAM-4535.
>> >
>> > I don't think this is a release blocker but this should probably go
>> > in release notes (for any user that tries to run tests on Python
>> > source build). And we should try to incorporate a fix if we happen
>> > to cut another release candidate for some reason.
>> >
>> > Thanks,
>> > Cham
>> >
>> > On Mon, Jun 11, 2018 at 5:46 PM Pablo Estrada > > <mailto:pabl...@google.com>> wrote:
>> >
>> > Thanks everyone who has pitched in to validate the release!
>> >
>> > Boyuan Zhang and I have also run a few pipelines, and verified
>> > that they work properly (see release validation spreadsheet[1]).
>> >
>> > We have also found that the Game Stats pipeline is failing in
>> > Python Streaming Dataflow. I have filed BEAM-4534[2]. This is
>> > not a blocker, since Python streaming is not yet fully
>> supported.
>> >
>> > It seems that the uploaded artifacts look good.
>> >
>> > We have noticed that the Python artifacts are still missing
>> > Python wheel files (compare [3] and [4]). JB, could you please
>> > add the wheel files? Boyuan and I can try to help you prepare
>> > them / upload them if necessary. Please let us know.
>> >
>> > Thanks again!
>> > -P.
>> >
>> > [1]
>> https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=152451807
>> > [2] https://issues.apache.org/jira/browse/BEAM-4534
>> > [3] https://dist.apache.org/repos/dist/dev/beam/2.4.0/
>> > [4] https://dist.apache.org/repos/dist/dev/beam/2.5.0/
>> >
>> > On Mon, Jun 11, 2018 at 12:37 PM Alan Myrvold
>> > mailto:amyrv...@google.com>> wrote:
>> >
>> > +1 (non-binding)
>> >
>> > tested some of the quickstarts
>> >
>> > On Sun, Jun 10, 2018 at 1:39 AM Tim
>> > > > <mailto:timrobertson...@gmail.com>> wrote:
>> >
>> > Tested by our team:
>> > - mvn inclusion
>> > - Avro, ES, Hadoop IF IO
>> > - Pipelines run on Spark (Cloudera 5.12.0 YARN cluster)
>> > - Reviewed release notes
>> >
>> > +1
>> >
>> > Thanks also to everyone who helped get over the gradle
>> > hurdle and in particular to JB.
>> >
>> > Tim
>> >
>> > > On 9 Jun 2018, at 05:56, Jean-Baptiste Onofré
>> > mailto:j...@nanthrax.net>> wrote:
>> > >
>> > > No problem Pablo.
>> > >
>> > > The vote period is a minimum, it can be extended as
>> > requested or if we
>> > > don't have the minimum of 3 binding votes.
>> > >
>> > > Regards
>> > > JB
>> > >
>> > >> On 09/06/2018 01:54, Pablo Estrada wrote:
>> > >> Hello all,
>> >  

Re: [CANCEL][VOTE] Apache Beam, version 2.5.0, release candidate #1

2018-06-13 Thread Udi Meiri
+1 to ignoring flaky test.

FYI there's a fourth cherrypick: https://github.com/apache/beam/pull/5624

On Wed, Jun 13, 2018 at 3:45 PM Pablo Estrada  wrote:

> Sent out https://github.com/apache/beam/pull/5640 to ignore the flaky
> test. As JB is the release manager, I'll let him make the call on what to do
> about it.
> Best
> -P.
>
> On Wed, Jun 13, 2018 at 3:34 PM Ahmet Altay  wrote:
>
>> I would vote for second option, not a release blocker and disable the
>> test in the release branch. My reasoning is:
>> - ReferenceRunner is not yet the official alternative to existing direct
>> runners.
>> - It is bad to have flaky tests on the release branch, and we would not
>> get good signal during validation.
>>
>> On Wed, Jun 13, 2018 at 3:14 PM, Pablo Estrada 
>> wrote:
>>
>>> Hello all,
>>> cherrypicks for the release branch seem to be going well, but thanks to
>>> them we were able to surface a flaky test in the release branch. JIRA is
>>> filed: https://issues.apache.org/jira/projects/BEAM/issues/BEAM-4558
>>>
>>> Given that test issue, I see the following options:
>>> - Consider that this test is not a release blocker. Go ahead with RC2
>>> after cherrypicks are brought in, or
>>> - Consider that this test is not a release blocker, so we disable it
>>> before cutting RC2.
>>> - Consider this test a release blocker, and triage the bug for fixing.
>>>
>>> What do you think?
>>>
>>> Best
>>> -P.
>>>
>>> On Wed, Jun 13, 2018 at 9:54 AM Pablo Estrada 
>>> wrote:
>>>
 Precommits for PR https://github.com/apache/beam/pull/5609 are now
 passing. For now I've simply set failOnWarning to false to cherrypick into
 the release, and fix in master later on.
 Best
 -P.

 On Wed, Jun 13, 2018 at 9:08 AM Scott Wegner 
 wrote:

> From my understanding, the @SuppressFBWarnings usage is in a
> dependency (ByteBuddy) rather than directly in our code, so we're not able
> to modify the usage.
>
> Pablo, feel free to disable failOnWarning for the sdks-java-core
> project temporarily. This isn't a major regression since we've only
> recently made the change to enable it [1]. We can work separately on
> figuring out how to resolve the warnings.
>
> [1] https://github.com/apache/beam/pull/5319
>
> On Tue, Jun 12, 2018 at 11:57 PM Tim Robertson <
> timrobertson...@gmail.com> wrote:
>
>> Hi Pablo,
>>
>> I'm afraid I couldn't find one either... there is an issue about it
>> [1] which is old so it doesn't look likely to be resolved either.
>>
>> If you have time (sorry I am a bit busy) could you please verify the
>> version does work if you install that version locally? I know the maven
>> version of that [2] but not sure on the gradle equivalent. If we know it
>> works, we can then find a repository that fits ok with Apache/Beam 
>> policy.
>>
>> Alternatively, we could consider using a fully qualified reference
>> (i.e. @edu.umd.cs.findbugs.annotations.SuppressWarnings) to the 
>> deprecated
>> version and leave the dependency at the 1.3.9-1. I believe our general
>> direction is to remove findbugs when errorprone covers all aspects so I
>> *expect* this should be considered reasonable.
>>
>> I hope this helps,
>> Tim
>>
>> [1] https://github.com/stephenc/findbugs-annotations/issues/4
>> [2]
>> https://maven.apache.org/guides/mini/guide-3rd-party-jars-local.html
>>
>> On Wed, Jun 13, 2018 at 8:39 AM, Pablo Estrada 
>> wrote:
>>
>>> Hi Tim,
>>> you're right. Thanks for pointing that out. There's just one problem
>>> that I'm running into now: The 3.0.1-1 version does not seem to be
>>> available in Maven Central[1]. Looking at the website, I am not quite 
>>> sure
>>> if there's another repository where they do stage the newer versions?[2]
>>>
>>> -P
>>>
>>> [1]
>>> https://repo.maven.apache.org/maven2/com/github/stephenc/findbugs/findbugs-annotations
>>> /
>>> [2] http://stephenc.github.io/findbugs-annotations/
>>>
>>> On Tue, Jun 12, 2018 at 11:10 PM Tim Robertson <
>>> timrobertson...@gmail.com> wrote:
>>>
 Hi Pablo,

 I took only a quick look.

 "- The JAR from the non-LGPL findbugs does not contain the
 SuppressFBWarnings annotation"

 Unless I misunderstand you it looks like SuppressFBWarnings was
 added in Stephen's version in this commit [1] which was introduced
 in version 2.0.3-1 - I've checked it is in the 3.0.1-1 build [2]
 I notice in your commits [1] you've been exploring version 3.0.0
 already though... what happens when you use 3.0.1-1? It sounds like the
 wrong version is coming in rather than the annotation being missing.

 Thanks,
 Tim

 [1]
 

Re: Precommits broken?

2018-06-14 Thread Udi Meiri
+1 for separate jobs if it gets us to pre-commit filtering faster

On Thu, Jun 14, 2018 at 11:22 AM Kenneth Knowles  wrote:

> I like Andrew's solution. Just totally separate jobs for automatic and
> manual.
>
> Kenn
>
> On Thu, Jun 14, 2018 at 9:56 AM Lukasz Cwik  wrote:
>
>> That seems like a pretty good interim solution.
>>
>> On Thu, Jun 14, 2018 at 9:53 AM Andrew Pilloud 
>> wrote:
>>
>>> If you always run one job for automated and another job for manual you
>>> wouldn't need to remember two trigger phrases. The automated jobs don't
>>> even need trigger phrases. As long as the status contexts are the same,
>>> GitHub users never have to know they are two separate jobs.
>>>
>>> Andrew
>>>
>>> On Thu, Jun 14, 2018 at 9:49 AM Lukasz Cwik  wrote:
>>>
 I thought of that as well but would find it annoying that I would need
 to remember two sets of triggers, the ones for the automated jobs and the
 ones for the manual runs. If we re-use the same precommit trigger phrase,
 we would get two runs (automated and manual) of effectively the same thing
 for the jobs where the automated one wouldn't get filtered out.

 On Thu, Jun 14, 2018 at 9:46 AM Andrew Pilloud 
 wrote:

> Might there be a third option of creating a different jenkins job for
> PR change and manual triggers? It would clutter up the jenkins interface a
> bit, but they could both post status to the same commitStatusContext on
> Github, so no one would notice there.
>
> Andrew
>
> On Wed, Jun 13, 2018 at 11:14 PM Jason Kuster 
> wrote:
>
>> Having submitted a patch to the ghprb-plugin repo before, I think
>> that regretfully option (b) is probably the right decision here given 
>> that
>> it's unlikely to get accepted, merged, released, and to have Infra update
>> the plugin in under a week.
>>
>> On Wed, Jun 13, 2018 at 10:42 PM Scott Wegner 
>> wrote:
>>
>>> Indeed, I was going to send out an email about pre-commit filtering,
>>> but we've already found some kinks and may need to revert it.
>>>
>>> The change was submitted in PR#5611 [1] and enables Jenkins
>>> triggering to only run pre-commits based on modified files. However, Udi
>>> noticed that this also prevents manually running pre-commits on a PR 
>>> with
>>> trigger phrases when your PR changes don't match the pre-commit include
>>> path [2]. This was blocking 2.5.0 release validation, so I have a PR 
>>> out to
>>> revert the change [3].
>>>
>>> I did some investigation and this is a deficiency in the Jenkins
>>> plugin used to trigger jobs on pull requests. I've filed a bug [4] and
>>> submitted a PR [5], but there's no guarantee that it'll get accepted or
>>> when it will be available.
>>>
>>> Question for others: we were hoping to enable pre-commit triggering
>>> as an optimization to decrease testing wait time and limit the impact of
>>> test flakiness [6]. But this bug in the plugin means we'd lose the 
>>> ability
>>> to manually trigger pre-commits which aren't automatically run. One
>>> workaround would be to run the tests locally instead of on Jenkins, 
>>> though
>>> that's clearly less desirable. Is this a blocker?
>>>
>>> Should we:
>>> (a) Keep pre-commit triggering enabled for now and hope the upstream
>>> patch gets accepted, or
>>> (b) Revert the pre-commit change and wait for the patch
>>>
>>> Thoughts?
>>>
>>> [1] https://github.com/apache/beam/pull/5611
>>> [2] https://github.com/apache/beam/pull/5607#issuecomment-397080770
>>> [3] https://github.com/apache/beam/pull/5638
>>> [4] https://github.com/jenkinsci/ghprb-plugin/issues/678
>>> [5] https://github.com/jenkinsci/ghprb-plugin/pull/680
>>> [6]
>>> https://docs.google.com/document/d/1lfbMhdIyDzIaBTgc9OUByhSwR94kfOzS_ozwKWTVl5U/edit#bookmark=id.6j8bwxnbp7fr
>>>
>>>
>>> On Wed, Jun 13, 2018 at 10:03 PM Rui Wang  wrote:
>>>
The precommit filter is a really cool optimization!

 -Rui

 On Wed, Jun 13, 2018 at 5:21 PM Andrew Pilloud 
 wrote:

> Ah, so this is intended and I didn't break anything? Cool! Sorry
> for the false alarm, looks like a great build optimization!
>
> Andrew
>
> On Wed, Jun 13, 2018 at 5:06 PM Yifan Zou 
> wrote:
>
>> Probably due to the precommit filter applied in #5611
>> ?
>>
>> On Wed, Jun 13, 2018 at 5:02 PM Andrew Pilloud <
>> apill...@google.com> wrote:
>>
>>> Looks like statuses got posted between me writing this email and
>>> sending it. Still wondering why the python and go jobs appear to be 
>>> missing?
>>>
>>> Andrew
>>>
>>> On Wed, Jun 

Re: [VOTE] Apache Beam, version 2.5.0, release candidate #2

2018-06-19 Thread Udi Meiri
+1

On Mon, Jun 18, 2018 at 6:16 PM Boyuan Zhang  wrote:

> Follow-up to the previous email (sorry for the inconvenience).
>
> Hey JB,
>
> If you haven't built python wheels yet, then you can use files in the
> previous email to move forward. If you have done all of them, then please
> ignore previous email.
>
> Thanks so much for your help.
>
> Boyuan
>
> On Mon, Jun 18, 2018 at 6:08 PM Boyuan Zhang  wrote:
>
>> Hey JB,
>>
>> I built python wheels and you can get them here:
>> https://storage.googleapis.com/python_wheels/python_wheels.zip. You
>> probably need to sign them and upload them into
>> https://dist.apache.org/repos/dist/dev/beam/2.5.0/ like what we did for
>> 2.4.0: https://dist.apache.org/repos/dist/dev/beam/2.4.0/ . After you
>> upload, we can do more validations on python wheels.
>>
>> Boyuan
>>
>> On Mon, Jun 18, 2018 at 4:49 PM Pablo Estrada  wrote:
>>
>>> +0.9
>>>
>>> I've run the Mobile Gaming examples in Java+Dataflow, and others have
>>> documented their tests in the spreadsheet[1]. All seems to work well.
>>> We will need the wheel files to be available, and run a few validations
>>> on them, but other than that, the release looks good to me.
>>>
>>> Best
>>> -P.
>>>
>>> [1]
>>> https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=152451807
>>>
>>> On Mon, Jun 18, 2018 at 10:08 AM Jean-Baptiste Onofré 
>>> wrote:
>>>
 Hi,

 I thought I did it, let me double check.

 Regards
 JB

 On 18/06/2018 19:01, Boyuan Zhang wrote:
 > Hey JB,
 >
 > Could you please also build python wheels for RC2 and stage them in
 > https://dist.apache.org/repos/dist/dev/beam/2.5.0/? There are some
 > instructions in this PR: https://github.com/apache/beam-site/pull/467
 .
 > Hope it can be helpful.
 >
 > Thanks.
 > Boyuan
 >
 > On Sun, Jun 17, 2018 at 9:22 PM Jean-Baptiste Onofré >>> > > wrote:
 >
 > s/add/had/g
 >
 > On 17/06/2018 19:32, Pablo Estrada wrote:
 > > Going to https://github.com/apache/beam/tree/v2.5.0-RC2 gives
 me a 404
 > > error. May need to push the label?
 > > Best
 > > -P.
 > >
 > > On Sun, Jun 17, 2018 at 1:15 AM Romain Manni-Bucau wrote:
 > >
 > > +1 (same tests as for take #1)
 > >
 > > On Sun, Jun 17, 2018 at 07:18, Jean-Baptiste Onofré wrote:
 > >
 > > Hi everyone,
 > >
 > > Please review and vote on the release candidate #2 for
 the
 > version
 > > 2.5.0, as follows:
 > >
 > > [ ] +1, Approve the release
 > > [ ] -1, Do not approve the release (please provide
 specific
 > > comments)
 > >
 > > NB: this is the first release using Gradle, so don't be
 too
 > > harsh ;) A
 > > PR about the release guide will follow thanks to this
 release.
 > >
 > > The complete staging area is available for your review,
 which
 > > includes:
 > > * JIRA release notes [1],
 > > * the official Apache source release to be deployed to
 > > dist.apache.org
 > > [2], which is signed with the key with fingerprint
 > C8282E76 [3],
 > > * all artifacts to be deployed to the Maven Central
 > Repository [4],
 > > * source code tag "v2.5.0-RC2" [5],
 > > * website pull request listing the release and
 publishing
 > the API
 > > reference manual [6].
 > > * Java artifacts were built with Gradle 4.7 (wrapper)
 and
 > > OpenJDK/Oracle
 > > JDK 1.8.0_172 (Oracle Corporation 25.172-b11).
 > > * Python artifacts are deployed along with the source
 > release to the
 > > dist.apache.org [2].
 > >
 > > The vote will be open for at least 72 hours. It is
 adopted by
 > > majority
 > > approval, with at least 3 PMC affirmative votes.
 > >
 > > Thanks,
 > > JB
 > >
 > > [1]
 > >
 >
 https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527=12342847
 > > [2] https://dist.apache.org/repos/dist/dev/beam/2.5.0/
 > > [3]
 

Re: [VOTE] Apache Beam, version 2.5.0, release candidate #1

2018-06-12 Thread Udi Meiri
-1: Would like to cherry pick a fix for
https://issues.apache.org/jira/browse/BEAM-4536

On Tue, Jun 12, 2018 at 9:48 AM Jean-Baptiste Onofré 
wrote:

> Any update about your vote and fix ?
>
> Thanks !
> Regards
> JB
>
> On 12/06/2018 04:02, Udi Meiri wrote:
> > Another bug: reading from PubSub with_attributes=True is broken on
> > Python with Dataflow.
> > https://issues.apache.org/jira/browse/BEAM-4536
> >
> > JB, I'm making a PR that removes this keyword and I'd like to propose it
> > as a cherrypick to 2.5.0.
> > (feature should be fixed in the next release)
> >
> > On Mon, Jun 11, 2018 at 6:19 PM Chamikara Jayalath <chamik...@google.com> wrote:
> >
> > FYI: looks like Python tests are failing for Windows. JIRA
> > is https://issues.apache.org/jira/browse/BEAM-4535.
> >
> > I don't think this is a release blocker but this should probably go
> > in release notes (for any user that tries to run tests on Python
> > source build). And we should try to incorporate a fix if we happen
> > to cut another release candidate for some reason.
> >
> > Thanks,
> > Cham
> >
> > On Mon, Jun 11, 2018 at 5:46 PM Pablo Estrada <pabl...@google.com> wrote:
> >
> > Thanks everyone who has pitched in to validate the release!
> >
> > Boyuan Zhang and I have also run a few pipelines, and verified
> > that they work properly (see release validation spreadsheet[1]).
> >
> > We have also found that the Game Stats pipeline is failing in
> > Python Streaming Dataflow. I have filed BEAM-4534[2]. This is
> > not a blocker, since Python streaming is not yet fully supported.
> >
> > It seems that the uploaded artifacts look good.
> >
> > We have noticed that the Python artifacts are still missing
> > Python wheel files (compare [3] and [4]). JB, could you please
> > add the wheel files? Boyuan and I can try to help you prepare
> > them / upload them if necessary. Please let us know.
> >
> > Thanks again!
> > -P.
> >
> > [1]
> https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=152451807
> > [2] https://issues.apache.org/jira/browse/BEAM-4534
> > [3] https://dist.apache.org/repos/dist/dev/beam/2.4.0/
> > [4] https://dist.apache.org/repos/dist/dev/beam/2.5.0/
> >
> > On Mon, Jun 11, 2018 at 12:37 PM Alan Myrvold <amyrv...@google.com> wrote:
> >
> > +1 (non-binding)
> >
> > tested some of the quickstarts
> >
> > On Sun, Jun 10, 2018 at 1:39 AM Tim <timrobertson...@gmail.com> wrote:
> >
> > Tested by our team:
> > - mvn inclusion
> > - Avro, ES, Hadoop IF IO
> > - Pipelines run on Spark (Cloudera 5.12.0 YARN cluster)
> > - Reviewed release notes
> >
> > +1
> >
> > Thanks also to everyone who helped get over the gradle
> > hurdle and in particular to JB.
> >
> > Tim
> >
> > > On 9 Jun 2018, at 05:56, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
> > >
> > > No problem Pablo.
> > >
> > > The vote period is a minimum, it can be extended as
> > requested or if we
> > > don't have the minimum of 3 binding votes.
> > >
> > > Regards
> > > JB
> > >
> > >> On 09/06/2018 01:54, Pablo Estrada wrote:
> > >> Hello all,
> > >> I'd like to request an extension of the voting period
> > until Monday
> > >> evening (US time, so later in other geographical
> > regions). This is
> > >> because we were only now able to publish Dataflow
> > Workers, and have not
> > >> had the chance to run release validation tests on
> > them. The extension
> > >> will allow us to validate and vote by Monday.
> > 

Re: Reducing Committer Load for Code Reviews

2018-05-30 Thread Udi Meiri
I thought this was the norm already? I have been the sole reviewer on a few
PRs by committers, and I'm only a contributor.

+1

On Wed, May 30, 2018 at 2:13 PM Kenneth Knowles  wrote:

> ++1
>
> This is good reasoning. If you trust someone with the committer
> responsibilities [1] you should trust them to find an appropriate reviewer.
>
> Also:
>
>  - adds a new way for non-committers and committers to bond
>  - makes committers seem less like gatekeepers because it goes both ways
>  - might help clear PR backlog, improving our community response latency
>  - encourages committers to code*
>
> Kenn
>
> [1] https://beam.apache.org/contribute/become-a-committer/
>
> *With today's system, if a committer and a few non-committers are working
> together, then when the committer writes code it is harder to get it merged
> because it takes an extra committer. It is easier to have non-committers
> write all the code and the committer just does reviews. It is 1 committer
> vs 2 being involved. This used to be fine when almost everyone was a
> committer and all working on the core, but it is not fine any more.
>
> On Wed, May 30, 2018 at 12:50 PM Thomas Groh  wrote:
>
>> Hey all;
>>
>> I've been thinking recently about the process we have for committing
>> code, and our current process. I'd like to propose that we change our
>> current process to require at least one committer is present for each code
>> review, but remove the need to have a second committer review the code
>> prior to submission if the original contributor is a committer.
>>
>> Generally, if we trust someone with the ability to merge code that
>> someone else has written, I think it's sensible to also trust them to
>> choose a capable reviewer. We expect that all of the people that we have
>> recognized as committers will maintain the project's quality bar - and
>> that's true for both code they author and code they review. Given that, I
>> think it's sensible to expect a committer will choose a reviewer who is
>> versed in the component they are contributing to who can provide insight
>> and will also hold up the quality bar.
>>
>> Making this change will help spread the review load out among regular
>> contributors to the project, and reduce bottlenecks caused by committers
>> who have few other committers working on their same component. Obviously,
>> this requires that committers act with the best interests of the project
>> when they send out their code for reviews - but this is the behavior we
>> demand before someone is recognized as a committer, so I don't see why that
>> would be cause for concern.
>>
>> Yours,
>>
>> Thomas
>>
>




Re: [VOTE] Use probot/stale to automatically manage stale pull requests

2018-06-01 Thread Udi Meiri
+1

On Fri, Jun 1, 2018 at 4:27 PM Lukasz Cwik  wrote:

> +1
>
> On Fri, Jun 1, 2018 at 2:53 PM Thomas Weise  wrote:
>
>> +1
>>
>> On Fri, Jun 1, 2018 at 2:17 PM, Robert Bradshaw 
>> wrote:
>>
>>> +1
>>>
>>> On Fri, Jun 1, 2018 at 1:43 PM Andrew Pilloud 
>>> wrote:
>>>
 +1

 On Fri, Jun 1, 2018 at 1:31 PM Huygaa Batsaikhan 
 wrote:

> +1
>
> On Fri, Jun 1, 2018 at 1:17 PM Henning Rohde 
> wrote:
>
>> +1
>>
>> On Fri, Jun 1, 2018 at 10:16 AM Chamikara Jayalath <
>> chamik...@google.com> wrote:
>>
>>> +1 (non-binding).
>>>
>>> Thanks,
>>> Cham
>>>
>>> On Fri, Jun 1, 2018 at 10:05 AM Kenneth Knowles 
>>> wrote:
>>>
 +1

 On Fri, Jun 1, 2018 at 9:54 AM Scott Wegner 
 wrote:

> +1 (non-binding)
>
> On Fri, Jun 1, 2018 at 9:39 AM Ahmet Altay 
> wrote:
>
>> +1
>>
>> On Fri, Jun 1, 2018, 9:32 AM Jason Kuster 
>> wrote:
>>
>>> +1 (non-binding): automating policy ensures it is applied fairly
>>> and evenly and lessens the load on project maintainers; hearty 
>>> agreement.
>>>
>>> On Fri, Jun 1, 2018 at 9:25 AM Alan Myrvold 
>>> wrote:
>>>
 +1 (non-binding) I updated the pull request to be 60 days
 (instead of 90) to match the contribute policy.

 On Fri, Jun 1, 2018 at 9:21 AM Kenneth Knowles 
 wrote:

> Hi all,
>
> Following the discussion, please vote on the move to activate
> probot/stale [3] to notify authors of stale PRs per current
> policy and then close them after a 7 day grace period.
>
> For more details, see:
>
>  - our stale PR policy [1]
>  - the discussion thread [2]
>  - Probot stale [3]
>  - BEAM ticket summarizing discussion [4]
>  - INFRA ticket to activate probot/stale [5]
>  - Example PR that would activate it [6]
>
> Please vote:
> [ ] +1, Approve that we activate probot/stale
> [ ] -1, Do not approve (please provide specific comments)
>
> Kenn
>
> [1] https://beam.apache.org/contribute/#stale-pull-requests
> [2]
> https://lists.apache.org/thread.html/bda552ea7073ca165aaf47034610afafe22d589e386525023d33609e@%3Cdev.beam.apache.org%3E
> [3] https://github.com/probot/stale
> [4] https://issues.apache.org/jira/browse/BEAM-4423
> [5] https://issues.apache.org/jira/browse/INFRA-16589
> [6] https://github.com/apache/beam/pull/5532
>

>>>
>>> --
>>> ---
>>> Jason Kuster
>>> Apache Beam / Google Cloud Dataflow
>>>
>>> See something? Say something. go/jasonkuster-feedback
>>> 
>>>
>>
>>




Re: Proposal: keeping precommit times fast

2018-06-05 Thread Udi Meiri
quicker immediate signal, plus more
>>> extensive tests before actually merging. It's also nice to not have to
>>> wait
>>> an hour to see that you have a lint error; quick stuff like that could be
>>> signaled quickly before a contributor loses context.
>>>
>>> >>>>> - Robert
>>>
>>>
>>>
>>> >>>>> On Fri, May 18, 2018 at 5:55 AM Kenneth Knowles 
>>> wrote:
>>>
>>> >>>>>> I like the idea. I think it is a good time for the project to
>>> start
>>> tracking this and keeping it usable.
>>>
>>> >>>>>> Certainly 2 hours is more than enough, is that not so? The Java
>>> precommit seems to take <=40 minutes while Python takes ~20 and Go is so
>>> fast it doesn't matter. Do we have enough stragglers that we don't make
>>> it
>>> in the 95th percentile? Is the time spent in the Jenkins queue?
>>>
>>> >>>>>> For our current coverage, I'd be willing to go for:
>>>
>>> >>>>>>- 1 hr hard cap (someone better at stats could choose %ile)
>>> >>>>>>- roll back or remove test from precommit if fix looks like
>>> more
>>> than 1 week (roll back if it is perf degradation, remove test from
>>> precommit if it is additional coverage that just doesn't fit in the time)
>>>
>>> >>>>>> There's a longer-term issue that doing a full build each time is
>>> expected to linearly scale up with the size of our repo (it is the
>>> monorepo
>>> problem but for a minirepo) so there is no cap that is feasible until we
>>> have effective cross-build caching. And my long-term goal would be <30
>>> minutes. At the latency of opening a pull request and then checking your
>>> email that's not burdensome, but an hour is.
>>>
>>> >>>>>> Kenn
>>>
>>> >>>>>> On Thu, May 17, 2018 at 6:54 PM Udi Meiri 
>>> wrote:
>>>
>>> >>>>>>> HI,
>>> >>>>>>> I have a proposal to improve contributor experience by keeping
>>> precommit times low.
>>>
>>> >>>>>>> I'm looking to get community consensus and approval about:
>>> >>>>>>> 1. How long should precommits take. 2 hours @95th percentile over
>>> the past 4 weeks is the current proposal.
>>> >>>>>>> 2. The process for dealing with slowness. Do we: fix, roll back,
>>> remove a test from precommit?
>>> >>>>>>> Rolling back if a fix is estimated to take longer than 2 weeks is
>>> the current proposal.
>>>
>>>
>>>
>>> https://docs.google.com/document/d/1udtvggmS2LTMmdwjEtZCcUQy6aQAiYTI3OrTP8CLfJM/edit?usp=sharing
>>>
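
The "2 hours @ 95th percentile" target in the quoted proposal is mechanical to check. A minimal sketch in plain Python — the duration numbers are invented for illustration, and in practice they would be pulled from the Jenkins API for each precommit job type:

```python
# Sketch: check a precommit suite against the proposed cap of 2 hours
# at the 95th percentile over the trailing window. The sample durations
# below are made up; real data would come from the Jenkins JSON API.

def percentile(durations_minutes, pct):
    """Nearest-rank percentile of a non-empty list of durations."""
    ordered = sorted(durations_minutes)
    rank = max(0, int(round(pct / 100.0 * len(ordered))) - 1)
    return ordered[rank]

def meets_target(durations_minutes, cap_minutes=120, pct=95):
    return percentile(durations_minutes, pct) <= cap_minutes

durations = [38, 41, 45, 52, 40, 44, 39, 160, 43, 47]  # one straggler run
print(meets_target(durations))  # -> False: the straggler pushes p95 to 160
```

A nearest-rank percentile is used here so a single slow straggler can still trip the check, which matches the intent of policing the tail rather than the average.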
>>




Re: Proposal: keeping precommit times fast

2018-06-06 Thread Udi Meiri
To follow up on the Jenkins Job Cacher Plugin:

Using a Jenkins plugin to save and reuse the Gradle cache for successive
precommit jobs.
The problem with this approach is that successive precommit runs on a
Jenkins server are unrelated.
Say you have 2 PRs, A and B, and the precommit job for B reuses the cache
left by the job for A.
The diff between the two will cause tests affected both by A and B to be
rerun (at least).
If A modifies Python code, then the job for B must rerun ALL Python tests
(since Gradle doesn't do dependency tracking for Python).

Proposal:
a. The cache plugin is still useful for successive Java precommit jobs, but
not for Python. (Go, I have no idea)
We could use it exclusively for Java precommits.
b. To avoid running precommit jobs for code not touched by a PR, look at
the paths of files changed.
For example, a PR touching only files under sdks/python/... need only run
Python precommit tests.
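
Proposal (b) boils down to a prefix lookup over the PR's changed paths. A rough sketch — the prefix-to-suite table below is an illustrative guess, not Beam's actual Jenkins configuration:

```python
# Sketch of proposal (b): map the files changed by a PR to the precommit
# suites that need to run. The prefix-to-suite table is illustrative only.
SUITE_PREFIXES = {
    "Python": ("sdks/python/",),
    "Go": ("sdks/go/",),
    "Java": ("sdks/java/", "runners/", "examples/java/"),
}

def suites_to_run(changed_files):
    suites = set()
    for path in changed_files:
        matched = False
        for suite, prefixes in SUITE_PREFIXES.items():
            if path.startswith(prefixes):
                suites.add(suite)
                matched = True
        if not matched:
            # Unrecognized paths (build files, docs, ...): be safe, run all.
            return set(SUITE_PREFIXES)
    return suites

print(suites_to_run(["sdks/python/apache_beam/io/filesystem.py"]))  # {'Python'}
```

Falling back to running every suite on unrecognized paths is the conservative choice: a change to a shared build file can affect any language's tests.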

On Tue, Jun 5, 2018 at 7:24 PM Udi Meiri  wrote:

> I've been having a separate discussion on the proposal doc, which is ready
> for another round of reviews.
> Change summary:
> - Changed fast requirement to be < 30 minutes and simplify the check as an
> aggregate for each precommit job type.
> - Updated slowness notification methods to include automated methods: as a
> precommit check result type on GitHub, as a bug.
> - Merged in the metrics design doc.
> - Added detailed design section.
> - Added list of deliverables.
>
> What I would like is consensus regarding:
> - How fast we want precommit runs to be. I propose 30m.
> - Deadline for fixing a slow test before it is temporarily removed from
> precommit. I propose 24 hours.
>
>
> Replying to the thread:
>
> 1. I like the idea of using the Jenkins Job Cacher Plugin to skip
> unaffected tests (BEAM-4400).
>
> 2. Java Precommit tests include integration tests (example
> <https://builds.apache.org/view/A-D/view/Beam/job/beam_PreCommit_Java_GradleBuild/lastCompletedBuild/testReport/org.apache.beam.examples/>
> ).
> We could split these out to get much faster results, i.e., a separate
> precommit just for basic integration tests (which will still need to run in
> <30m).
> Perhaps lint checks for Python could be split out as well.
>
> I'll add these suggestions to the doc tomorrow.
>
> On Thu, May 24, 2018 at 9:25 AM Scott Wegner  wrote:
>
>> So, it sounds like there's agreement that we should improve precommit
>> times by only running necessary tests, and configuring Jenkins Job
>> Caching + Gradle build cache is a path to get there. I've filed BEAM-4400
>> [1] to follow-up on this.
>>
>> Getting back to Udi's original proposal [2]: I see value in defining a
>> metric and target for overall pre-commit timing. The proposal for an
>> initial "2 hour" target is helpful as a guardrail: we're already hitting
>> it, but if we drift to a point where we're not, that should trigger some
>> action to be taken to get back to a healthy state.
>>
>> I wouldn't mind separately setting a more aspiration goal of getting the
>> pre-commits even faster (i.e. 15-30 mins), but I suspect that would require
>> a concerted effort to evaluate and improve existing tests across the
>> codebase. One idea would be to set up ensure the metric reporting can show
>> the trend, and which tests are responsible for the most walltime, so that
>> we know where to invest any efforts to improve tests.
>>
>>
>> [1] https://issues.apache.org/jira/browse/BEAM-4400
>> [2]
>> https://docs.google.com/document/d/1udtvggmS2LTMmdwjEtZCcUQy6aQAiYTI3OrTP8CLfJM/edit?usp=sharing
>>
>>
>> On Wed, May 23, 2018 at 11:46 AM Kenneth Knowles  wrote:
>>
>>> With regard to the Job Cacher Plugin: I think it is an infra ticket to
>>> install? And I guess we need it longer term when we move to containerized
>>> builds anyhow? One thing I've experienced with the Travis-CI cache is that
>>> the time spent uploading & downloading the remote cache - in that case of
>>> all the pip installed dependencies - negated the benefits. Probably for
>>> Beam it will have a greater benefit if we can skip most of the build.
>>>
>>> Regarding integration tests in precommit: I think it is OK to run maybe
>>> one Dataflow job in precommit, but it should be in parallel with the unit
>>> tests and just a smoke test that takes 5 minutes, not a suite that takes 35
>>> minutes. So IMO that is low-hanging fruit. If this would make postcommit
>>> unstable, then it also means precommit is unstable. Both are troublesome.
>>>
>>> More short term, some possible hacks:
>>>
>>>  - Point gradle to cache outside the 

Filesystems.copy and .rename behavior

2018-01-30 Thread Udi Meiri
Hi,
I've been working on HDFS code for the Python SDK and I've noticed some
behaviors which are surprising. I wanted to know if these behaviors are
known and intended.

1. When renaming files during finalize_write, rename errors are ignored
.
For example, if I run wordcount twice using HDFS code I get a warning the
second time because the file already exists:

WARNING:root:Rename not successful:
hdfs://beam-temp-counts2-7cb0a78005f211e8b6a08851fb5da245/1059f870-d64f-4f63-b1de-e4bd20fcd70a.counts2
-> hdfs://counts2-0-of-1, libhdfs error in renaming
hdfs://beam-temp-counts2-7cb0a78005f211e8b6a08851fb5da245/1059f870-d64f-4f63-b1de-e4bd20fcd70a.counts2
to hdfs://counts2-0-of-1 with exceptions Unable to rename
'/beam-temp-counts2-7cb0a78005f211e8b6a08851fb5da245/1059f870-d64f-4f63-b1de-e4bd20fcd70a.counts2'
to '/counts2-0-of-1'.

For GCS and local files there are no rename errors (in this case), since
the rename operation silently overwrites existing destination files.
However, blindly ignoring these errors might make the pipeline report
success even though output files are missing.

2. Output files (--output) overwrite existing files.

3. The Python SDK doesn't use Filesystems.copy(). The Java SDK doesn't use
Filesystem.Rename().

Thanks,
- Udi




Re: Filesystems.copy and .rename behavior

2018-02-01 Thread Udi Meiri
For GCS, I would do what I believe we already do.
rename(src, dst):
- if !exists(src) and exists(dst) return 0
- if !exists(src) and !exists(dst) return error
- if exists(src) and exists(dst) { if checksum(src) == checksum(dst) return
0 else delete(dst) }
- Start a GCS copy from src to dst.
- Wait for GCS copy to complete.
- delete(src)

For filesystems that don't have checksum() metadata, size() can be used
instead.

I've opened a bug to track this:
https://issues.apache.org/jira/browse/BEAM-3600
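
The pseudocode above can be exercised end to end against a toy in-memory filesystem. FakeFS and its methods are invented stand-ins for a GCS-like client (not a Beam API), and the checksum-match branch here additionally deletes the leftover source, so a retry after a crash mid-rename still finishes the cleanup:

```python
class FakeFS:
    """Toy stand-in for a GCS-like client; maps path -> bytes."""
    def __init__(self):
        self.files = {}
    def exists(self, path): return path in self.files
    def checksum(self, path): return hash(self.files[path])
    def copy(self, src, dst): self.files[dst] = self.files[src]
    def delete(self, path): del self.files[path]

def rename(fs, src, dst):
    """Retryable rename following the pseudocode above."""
    if not fs.exists(src):
        if fs.exists(dst):
            return  # a previous attempt already finished
        raise IOError("neither %s nor %s exists" % (src, dst))
    if fs.exists(dst):
        if fs.checksum(src) == fs.checksum(dst):
            fs.delete(src)  # copy already done; just finish the cleanup
            return
        fs.delete(dst)  # stale or partial destination: redo the copy
    fs.copy(src, dst)
    fs.delete(src)

fs = FakeFS()
fs.files["beam-temp/part-0"] = b"counts"
rename(fs, "beam-temp/part-0", "counts-0-of-1")
rename(fs, "beam-temp/part-0", "counts-0-of-1")  # retry is a no-op
print(sorted(fs.files))  # -> ['counts-0-of-1']
```

Note the one deviation from the pseudocode: returning early on a checksum match without deleting src would leave the temp file behind after a crash between copy and delete.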

On Thu, Feb 1, 2018 at 2:25 PM Eugene Kirpichov <kirpic...@google.com>
wrote:

> Yes, IGNORE_MISSING_FILES is unsafe because it will - well - ignore files
> that are missing for more ominous reasons than just this being a non-first
> attempt at renaming src to dst. E.g. if there was a bug in constructing the
> filename to be renamed, or if we somehow messed up the order of rename vs
> cleanup, etc. - these situations with IGNORE_MISSING_FILES would lead to
> silent data loss (likely caught by unit tests though - so this is not a
> super serious issue).
>
> Basically I just can't think of a case when I was copying files and
> thinking "oh man, I wish it didn't give an error if the stuff I'm copying
> doesn't exist" - the option exists only because we couldn't come up with
> another way to implement idempotent rename on GCS.
>
> What's your idea of how a safe retryable GCS rename() could work?
>
> On Wed, Jan 31, 2018 at 6:45 PM Udi Meiri <eh...@google.com> wrote:
>
>> Eugene, if I get this right, you're saying that IGNORE_MISSING_FILES is
>> unsafe because it will skip (src, dst) pairs where neither exist? (it only
>> looks if src exists)
>>
>> For GCS, we can construct a safe retryable rename() operation, assuming
>> that copy() and delete() are atomic for a single file or pair of files.
>>
>>
>>
>> On Wed, Jan 31, 2018 at 4:00 PM Raghu Angadi <rang...@google.com> wrote:
>>
>>> On Wed, Jan 31, 2018 at 2:43 PM, Eugene Kirpichov <kirpic...@google.com>
>>> wrote:
>>>
>>>> As far as I know, the current implementation of file sinks is the only
>>>> reason why the flag IGNORE_MISSING for copying even exists - there's no
>>>> other compelling reason to justify it. We implement "rename" as "copy, then
>>>> delete" (in a single DoFn), so for idempodency of this operation we need to
>>>> ignore the copying of a non-existent file.
>>>>
>>>> I think the right way to go would be to change the implementation of
>>>> renaming to have a @RequiresStableInput (or reshuffle) in the middle, so
>>>> it's made of 2 individually idempotent operations:
>>>> 1) copy, which would fail if input is missing, and would overwrite
>>>> output if it exists
>>>> -- reshuffle --
>>>> 2) delete, which would not fail if input is missing.
>>>>
>>>
>>> Something like this is needed only in streaming, right?
>>>
>>> Raghu.
>>>
>>>
>>>> That way first everything is copied (possibly via multiple attempts),
>>>> and then old files are deleted (possibly via multiple attempts).
>>>>
>>>> On Wed, Jan 31, 2018 at 2:26 PM Udi Meiri <eh...@google.com> wrote:
>>>>
>>>>> I agree that overwriting is more in line with user expectations.
>>>>> I believe that the sink should not ignore errors from the filesystem
>>>>> layer. Instead, the FileSystem API should be better defined.
>>>>> Examples: rename() and copy() should overwrite existing files at the
>>>>> destination, copy() should have an ignore_missing flag.
>>>>>
>>>>> On Wed, Jan 31, 2018 at 1:49 PM Raghu Angadi <rang...@google.com>
>>>>> wrote:
>>>>>
>>>>>> Original mail mentions that output from second run of word_count is
>>>>>> ignored. That does not seem as safe as ignoring error from a second 
>>>>>> attempt
>>>>>> of a step. How do we know second run didn't run on different output?
>>>>>> Overwriting seems more accurate than ignoring. Does handling this error 
>>>>>> at
>>>>>> sink level distinguish between the two (another run vs second attempt)?
>>>>>>
>>>>>>
>>>>>> On Wed, Jan 31, 2018 at 12:32 PM, Udi Meiri <eh...@google.com> wrote:
>>>>>>
>>>>>>> Yeah, another round of refactoring is due to move the rename via
>>>>>>> copy+delete logic 

Re: Filesystems.copy and .rename behavior

2018-01-31 Thread Udi Meiri
I agree that overwriting is more in line with user expectations.
I believe that the sink should not ignore errors from the filesystem layer.
Instead, the FileSystem API should be better defined.
Examples: rename() and copy() should overwrite existing files at the
destination, copy() should have an ignore_missing flag.

On Wed, Jan 31, 2018 at 1:49 PM Raghu Angadi <rang...@google.com> wrote:

> Original mail mentions that output from second run of word_count is
> ignored. That does not seem as safe as ignoring error from a second attempt
> of a step. How do we know second run didn't run on different output?
> Overwriting seems more accurate than ignoring. Does handling this error at
> sink level distinguish between the two (another run vs second attempt)?
>
>
> On Wed, Jan 31, 2018 at 12:32 PM, Udi Meiri <eh...@google.com> wrote:
>
>> Yeah, another round of refactoring is due to move the rename via
>> copy+delete logic up to the file-based sink level.
>>
>> On Wed, Jan 31, 2018, 10:42 Chamikara Jayalath <chamik...@google.com>
>> wrote:
>>
>>> Good point. There's always the chance of step that performs final rename
>>> being retried. So we'll have to ignore this error at the sink level. We
>>> don't necessarily have to do this at the FileSystem level though. I think
>>> the proper behavior might be to raise an error for the rename at the
>>> FileSystem level if the destination already exists (or source doesn't
>>> exist) while ignoring that error (and possibly logging a warning) at the
>>> sink level.
>>>
>>> - Cham
>>>
>>>
>>> On Tue, Jan 30, 2018 at 6:47 PM Reuven Lax <re...@google.com> wrote:
>>>
>>>> I think the idea was to ignore "already exists" errors. The reason
>>>> being that any step in Beam can be executed multiple times, including the
>>>> rename step. If the rename step gets run twice, the second run should
>>>> succeed vacuously.
>>>>
>>>>
>>>> On Tue, Jan 30, 2018 at 6:19 PM, Udi Meiri <eh...@google.com> wrote:
>>>>
>>>>> Hi,
>>>>> I've been working on HDFS code for the Python SDK and I've noticed
>>>>> some behaviors which are surprising. I wanted to know if these behaviors
>>>>> are known and intended.
>>>>>
>>>>> 1. When renaming files during finalize_write, rename errors are
>>>>> ignored
>>>>> <https://github.com/apache/beam/blob/3aa2bef87c93d2844dd7c8dbaf45db75ec607792/sdks/python/apache_beam/io/filebasedsink.py#L232>.
>>>>> For example, if I run wordcount twice using HDFS code I get a warning the
>>>>> second time because the file already exists:
>>>>>
>>>>> WARNING:root:Rename not successful:
>>>>> hdfs://beam-temp-counts2-7cb0a78005f211e8b6a08851fb5da245/1059f870-d64f-4f63-b1de-e4bd20fcd70a.counts2
>>>>> -> hdfs://counts2-0-of-1, libhdfs error in renaming
>>>>> hdfs://beam-temp-counts2-7cb0a78005f211e8b6a08851fb5da245/1059f870-d64f-4f63-b1de-e4bd20fcd70a.counts2
>>>>> to hdfs://counts2-0-of-1 with exceptions Unable to rename
>>>>> '/beam-temp-counts2-7cb0a78005f211e8b6a08851fb5da245/1059f870-d64f-4f63-b1de-e4bd20fcd70a.counts2'
>>>>> to '/counts2-0-of-1'.
>>>>>
>>>>> For GCS and local files there are no rename errors (in this case),
>>>>> since the rename operation silently overwrites existing destination files.
>>>>> However, blindly ignoring these errors might make the pipeline report
>>>>> success even though output files are missing.
>>>>>
>>>>> 2. Output files (--output) overwrite existing files.
>>>>>
>>>>> 3. The Python SDK doesn't use Filesystems.copy(). The Java SDK doesn't
>>>>> use Filesystem.Rename().
>>>>>
>>>>> Thanks,
>>>>> - Udi
>>>>>
>>>>
>>>>
>


smime.p7s
Description: S/MIME Cryptographic Signature


Re: Filesystems.copy and .rename behavior

2018-01-31 Thread Udi Meiri
Yeah, another round of refactoring is due to move the rename via
copy+delete logic up to the file-based sink level.

On Wed, Jan 31, 2018, 10:42 Chamikara Jayalath <chamik...@google.com> wrote:

> Good point. There's always the chance of the step that performs the final rename
> being retried. So we'll have to ignore this error at the sink level. We
> don't necessarily have to do this at the FileSystem level though. I think
> the proper behavior might be to raise an error for the rename at the
> FileSystem level if the destination already exists (or source doesn't
> exist) while ignoring that error (and possibly logging a warning) at the
> sink level.
>
> - Cham
>
>
> On Tue, Jan 30, 2018 at 6:47 PM Reuven Lax <re...@google.com> wrote:
>
>> I think the idea was to ignore "already exists" errors. The reason being
>> that any step in Beam can be executed multiple times, including the rename
>> step. If the rename step gets run twice, the second run should succeed
>> vacuously.
>>
>>
>> On Tue, Jan 30, 2018 at 6:19 PM, Udi Meiri <eh...@google.com> wrote:
>>
>>> Hi,
>>> I've been working on HDFS code for the Python SDK and I've noticed some
>>> behaviors which are surprising. I wanted to know if these behaviors are
>>> known and intended.
>>>
>>> 1. When renaming files during finalize_write, rename errors are ignored
>>> <https://github.com/apache/beam/blob/3aa2bef87c93d2844dd7c8dbaf45db75ec607792/sdks/python/apache_beam/io/filebasedsink.py#L232>.
>>> For example, if I run wordcount twice using HDFS code I get a warning the
>>> second time because the file already exists:
>>>
>>> WARNING:root:Rename not successful:
>>> hdfs://beam-temp-counts2-7cb0a78005f211e8b6a08851fb5da245/1059f870-d64f-4f63-b1de-e4bd20fcd70a.counts2
>>> -> hdfs://counts2-0-of-1, libhdfs error in renaming
>>> hdfs://beam-temp-counts2-7cb0a78005f211e8b6a08851fb5da245/1059f870-d64f-4f63-b1de-e4bd20fcd70a.counts2
>>> to hdfs://counts2-0-of-1 with exceptions Unable to rename
>>> '/beam-temp-counts2-7cb0a78005f211e8b6a08851fb5da245/1059f870-d64f-4f63-b1de-e4bd20fcd70a.counts2'
>>> to '/counts2-0-of-1'.
>>>
>>> For GCS and local files there are no rename errors (in this case), since
>>> the rename operation silently overwrites existing destination files.
>>> However, blindly ignoring these errors might make the pipeline report
>>> success even though output files are missing.
>>>
>>> 2. Output files (--output) overwrite existing files.
>>>
>>> 3. The Python SDK doesn't use Filesystems.copy(). The Java SDK doesn't
>>> use Filesystem.Rename().
>>>
>>> Thanks,
>>> - Udi
>>>
>>
>>




Re: Filesystems.copy and .rename behavior

2018-01-31 Thread Udi Meiri
Eugene, if I get this right, you're saying that IGNORE_MISSING_FILES is
unsafe because it will skip (src, dst) pairs where neither exists? (it only
checks whether src exists)

For GCS, we can construct a safe retryable rename() operation, assuming
that copy() and delete() are atomic for a single file or pair of files.
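
The copy+delete rename discussed in this thread decomposes into two
individually idempotent phases. A local-filesystem sketch of those phases
(names are illustrative; in a real sink the reshuffle/@RequiresStableInput
barrier would sit between them):

```python
import os
import shutil


def copy_phase(src, dst):
    """Phase 1: copy. Fails if src is missing; overwrites dst if it exists.

    Retrying is safe: a repeat copy of the same src produces the same dst.
    """
    if not os.path.exists(src):
        raise FileNotFoundError(src)
    shutil.copyfile(src, dst)


def delete_phase(src):
    """Phase 2 (after the barrier): delete. A missing src is not an error.

    Retrying is safe: deleting an already-deleted src succeeds vacuously.
    """
    try:
        os.remove(src)
    except FileNotFoundError:
        pass
```

Because the barrier guarantees every copy completed before any delete runs,
neither phase ever needs to ignore a "source missing during copy" error.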



On Wed, Jan 31, 2018 at 4:00 PM Raghu Angadi <rang...@google.com> wrote:

> On Wed, Jan 31, 2018 at 2:43 PM, Eugene Kirpichov <kirpic...@google.com>
> wrote:
>
>> As far as I know, the current implementation of file sinks is the only
>> reason why the flag IGNORE_MISSING for copying even exists - there's no
>> other compelling reason to justify it. We implement "rename" as "copy, then
>> delete" (in a single DoFn), so for idempotency of this operation we need to
>> ignore the copying of a non-existent file.
>>
>> I think the right way to go would be to change the implementation of
>> renaming to have a @RequiresStableInput (or reshuffle) in the middle, so
>> it's made of 2 individually idempotent operations:
>> 1) copy, which would fail if input is missing, and would overwrite output
>> if it exists
>> -- reshuffle --
>> 2) delete, which would not fail if input is missing.
>>
>
> Something like this is needed only in streaming, right?
>
> Raghu.
>
>
>> That way first everything is copied (possibly via multiple attempts), and
>> then old files are deleted (possibly via multiple attempts).
>>
>> On Wed, Jan 31, 2018 at 2:26 PM Udi Meiri <eh...@google.com> wrote:
>>
>>> I agree that overwriting is more in line with user expectations.
>>> I believe that the sink should not ignore errors from the filesystem
>>> layer. Instead, the FileSystem API should be better defined.
>>> Examples: rename() and copy() should overwrite existing files at the
>>> destination, copy() should have an ignore_missing flag.
>>>
>>> On Wed, Jan 31, 2018 at 1:49 PM Raghu Angadi <rang...@google.com> wrote:
>>>
>>>> Original mail mentions that output from second run of word_count is
>>>> ignored. That does not seem as safe as ignoring an error from a second attempt
>>>> of a step. How do we know second run didn't run on different output?
>>>> Overwriting seems more accurate than ignoring. Does handling this error at
>>>> sink level distinguish between the two (another run vs second attempt)?
>>>>
>>>>
>>>> On Wed, Jan 31, 2018 at 12:32 PM, Udi Meiri <eh...@google.com> wrote:
>>>>
>>>>> Yeah, another round of refactoring is due to move the rename via
>>>>> copy+delete logic up to the file-based sink level.
>>>>>
>>>>> On Wed, Jan 31, 2018, 10:42 Chamikara Jayalath <chamik...@google.com>
>>>>> wrote:
>>>>>
>>>>>> Good point. There's always the chance of the step that performs the final
>>>>>> rename being retried. So we'll have to ignore this error at the sink 
>>>>>> level.
>>>>>> We don't necessarily have to do this at the FileSystem level though. I
>>>>>> think the proper behavior might be to raise an error for the rename at 
>>>>>> the
>>>>>> FileSystem level if the destination already exists (or source doesn't
>>>>>> exist) while ignoring that error (and possibly logging a warning) at the
>>>>>> sink level.
>>>>>>
>>>>>> - Cham
>>>>>>
>>>>>>
>>>>>> On Tue, Jan 30, 2018 at 6:47 PM Reuven Lax <re...@google.com> wrote:
>>>>>>
>>>>>>> I think the idea was to ignore "already exists" errors. The reason
>>>>>>> being that any step in Beam can be executed multiple times, including 
>>>>>>> the
>>>>>>> rename step. If the rename step gets run twice, the second run should
>>>>>>> succeed vacuously.
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Jan 30, 2018 at 6:19 PM, Udi Meiri <eh...@google.com> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>> I've been working on HDFS code for the Python SDK and I've noticed
>>>>>>>> some behaviors which are surprising. I wanted to know if these 
>>>>>>>> behaviors
>>>>>>>> are known and intended.
>>>>>>>>
>>>>>>>> 1. When renami

Re: Removing documentation for old Beam versions

2018-08-02 Thread Udi Meiri
The older docs are not directly linked to and are in Github commit history.

If there are no objections I'm going to delete javadocs and pydocs for
releases older than 1 year,
meaning 2.0.0 and older (going by the dates here
<https://beam.apache.org/get-started/downloads/>).
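
The deletion described above could be scripted roughly like this (the
directory layout, function names, and cutoff are assumptions for
illustration; the real beam-site tree may differ):

```python
import os
import shutil


def parse_version(name):
    """Parse a directory name like '2.5.0' into a comparable tuple."""
    try:
        return tuple(int(part) for part in name.split("."))
    except ValueError:
        return None  # not a version directory; leave it alone


def remove_old_docs(docs_root, cutoff=(2, 0, 0)):
    """Delete javadoc/pydoc trees for releases at or below `cutoff`."""
    removed = []
    for kind in ("javadoc", "pydoc"):
        base = os.path.join(docs_root, kind)
        if not os.path.isdir(base):
            continue
        for name in os.listdir(base):
            version = parse_version(name)
            if version is not None and version <= cutoff:
                shutil.rmtree(os.path.join(base, name))
                removed.append(os.path.join(kind, name))
    return removed
```

Tuple comparison keeps the version ordering correct (e.g. 2.10.0 sorts after
2.9.0, which naive string comparison would get wrong).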

On Thu, Aug 2, 2018 at 11:51 AM Daniel Oliveira 
wrote:

> The older docs should be recorded in the commit history of the website
> repository, right? If they're not currently used in the website and they're
> in the commit history then I don't see a reason to save them.
>
> On Tue, Jul 31, 2018 at 1:51 PM Udi Meiri  wrote:
>
>> Hi all,
>> I'm writing a PR for apache/beam-site and beam_PreCommit_Website_Stage is
>> timing out after 100 minutes, because it's trying to delete 22k files and
>> then copy 22k files (warning large file
>> <https://builds.apache.org/job/beam_PreCommit_Website_Stage/1276/consoleText>
>> ).
>>
>> It seems that we could save a lot of time by deleting the older javadoc
>> and pydoc files for older versions. Is there a good reason to keep around
>> this kind of documentation for older versions (say 1 year back)?
>>
>




Re: Removing documentation for old Beam versions

2018-08-02 Thread Udi Meiri
Pablo, the docs are generated into versioned paths, e.g.,
https://beam.apache.org/documentation/sdks/javadoc/2.5.0/ so tags are not
necessary?
Also, once apache/beam-site is merged with apache/beam the release branch
should have the relevant docs (although perhaps it's better to put them in
a different repo or storage system).

Thomas, I would very much like to not have javadoc/pydoc generation be part
of the website review process, as it takes up a lot of time when changes
are staged (10s of thousands of files), especially when a PR is updated and
existing staged files need to be deleted.


On Thu, Aug 2, 2018 at 1:15 PM Mikhail Gryzykhin  wrote:

> +1 For removing old documentation.
>
> @Thomas: Migration work is in backlog and will be picked up in near time.
>
> --Mikhail
>
> Have feedback <http://go/migryz-feedback>?
>
>
> On Thu, Aug 2, 2018 at 12:54 PM Thomas Weise  wrote:
>
>> +1 for removing pre 2.0 documentation (as well as the entries from
>> https://beam.apache.org/get-started/downloads/)
>>
>> Isn't it part of the beam-site changes that we will no longer check in
>> generated documentation into the repository? Those can be generated and
>> deployed independently (when a commit to a branch occurs), such as done in
>> the Apex and Flink projects.
>>
>> I was told that Scott who was working in the beam-site changes is on
>> leave now and the migration is still pending (see note at
>> https://github.com/apache/beam/tree/master/website). Is anyone else
>> going to pick it up?
>>
>> Thanks,
>> Thomas
>>
>>
>> On Thu, Aug 2, 2018 at 12:33 PM Pablo Estrada  wrote:
>>
>>> Is it worth adding a tag / branch to the repositories every time we make
>>> a release, so that people are able to dive in and find the docs?
>>> Best
>>> -P.
>>>
>>> On Thu, Aug 2, 2018 at 12:09 PM Ahmet Altay  wrote:
>>>
>>>> I would guess that users are still using some of these old releases. It
>>>> is unclear from Beam website which releases are still supported or not. It
>>>> probably makes sense to drop documentation for releases < 2.0. (I would
>>>> suggest keeping docs for 2.0). For the future I can work on updating the
>>>> Beam website to clarify the state of each release.
>>>>
>>>> On Thu, Aug 2, 2018 at 12:06 PM, Udi Meiri  wrote:
>>>>
>>>>> The older docs are not directly linked to and are in Github commit
>>>>> history.
>>>>>
>>>>> If there are no objections I'm going to delete javadocs and pydocs for
>>>>> releases older than 1 year,
>>>>> meaning 2.0.0 and older (going by the dates here
>>>>> <https://beam.apache.org/get-started/downloads/>).
>>>>>
>>>>> On Thu, Aug 2, 2018 at 11:51 AM Daniel Oliveira <
>>>>> danolive...@google.com> wrote:
>>>>>
>>>>>> The older docs should be recorded in the commit history of the
>>>>>> website repository, right? If they're not currently used in the website 
>>>>>> and
>>>>>> they're in the commit history then I don't see a reason to save them.
>>>>>>
>>>>>> On Tue, Jul 31, 2018 at 1:51 PM Udi Meiri  wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>> I'm writing a PR for apache/beam-site and
>>>>>>> beam_PreCommit_Website_Stage is timing out after 100 minutes, because 
>>>>>>> it's
>>>>>>> trying to delete 22k files and then copy 22k files (warning large
>>>>>>> file
>>>>>>> <https://builds.apache.org/job/beam_PreCommit_Website_Stage/1276/consoleText>
>>>>>>> ).
>>>>>>>
>>>>>>> It seems that we could save a lot of time by deleting the older
>>>>>>> javadoc and pydoc files for older versions. Is there a good reason to 
>>>>>>> keep
>>>>>>> around this kind of documentation for older versions (say 1 year back)?
>>>>>>>
>>>>>>
>>>> --
>>> Got feedback? go/pabloem-feedback
>>> <https://goto.google.com/pabloem-feedback>
>>>
>>




Re: Removing documentation for old Beam versions

2018-08-02 Thread Udi Meiri
[image: pr-520.png]
(trying that image again)

On Thu, Aug 2, 2018 at 7:00 PM Udi Meiri  wrote:

> Alright, created https://github.com/apache/beam-site/pull/520
> [image: pr-520.png]
> Reduces staging upload from 500M down to 270M, and halves the number of
> files from ~22k to 11k.
>
>
>
> On Thu, Aug 2, 2018 at 6:58 PM Pablo Estrada  wrote:
>
>> I believe tags will be necessary because for anyone looking for old
>> docs that have been removed, they will need to browse back in history, not
>> just browse the tree of directories.
>> -P.
>>
>> On Thu, Aug 2, 2018, 6:46 PM Mikhail Gryzykhin  wrote:
>>
>>> Last time I talked with Scott I brought this idea in. I believe the plan
>>> was either to publish compiled site to website directly, or keep it in
>>> separate storage from apache/beam repo.
>>>
>>> One of the main reasons not to check in compiled version of website is
>>> that every developer will have to pull all the versions of website every
>>> time they clone repo, which is not that good of an idea to do.
>>>
>>> Regards,
>>> --Mikhail
>>>
>>> Have feedback <http://go/migryz-feedback>?
>>>
>>>
>>> On Thu, Aug 2, 2018 at 6:42 PM Udi Meiri  wrote:
>>>
>>>> Pablo, the docs are generated into versioned paths, e.g.,
>>>> https://beam.apache.org/documentation/sdks/javadoc/2.5.0/ so tags are
>>>> not necessary?
>>>> Also, once apache/beam-site is merged with apache/beam the release
>>>> branch should have the relevant docs (although perhaps it's better to put
>>>> them in a different repo or storage system).
>>>>
>>>> Thomas, I would very much like to not have javadoc/pydoc generation be
>>>> part of the website review process, as it takes up a lot of time when
>>>> changes are staged (10s of thousands of files), especially when a PR is
>>>> updated and existing staged files need to be deleted.
>>>>
>>>>
>>>> On Thu, Aug 2, 2018 at 1:15 PM Mikhail Gryzykhin 
>>>> wrote:
>>>>
>>>>> +1 For removing old documentation.
>>>>>
>>>>> @Thomas: Migration work is in backlog and will be picked up in near
>>>>> time.
>>>>>
>>>>> --Mikhail
>>>>>
>>>>> Have feedback <http://go/migryz-feedback>?
>>>>>
>>>>>
>>>>> On Thu, Aug 2, 2018 at 12:54 PM Thomas Weise  wrote:
>>>>>
>>>>>> +1 for removing pre 2.0 documentation (as well as the entries from
>>>>>> https://beam.apache.org/get-started/downloads/)
>>>>>>
>>>>>> Isn't it part of the beam-site changes that we will no longer check
>>>>>> in generated documentation into the repository? Those can be generated 
>>>>>> and
>>>>>> deployed independently (when a commit to a branch occurs), such as done 
>>>>>> in
>>>>>> the Apex and Flink projects.
>>>>>>
>>>>>> I was told that Scott who was working in the beam-site changes is on
>>>>>> leave now and the migration is still pending (see note at
>>>>>> https://github.com/apache/beam/tree/master/website). Is anyone else
>>>>>> going to pick it up?
>>>>>>
>>>>>> Thanks,
>>>>>> Thomas
>>>>>>
>>>>>>
>>>>>> On Thu, Aug 2, 2018 at 12:33 PM Pablo Estrada 
>>>>>> wrote:
>>>>>>
>>>>>>> Is it worth adding a tag / branch to the repositories every time we
>>>>>>> make a release, so that people are able to dive in and find the docs?
>>>>>>> Best
>>>>>>> -P.
>>>>>>>
>>>>>>> On Thu, Aug 2, 2018 at 12:09 PM Ahmet Altay 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I would guess that users are still using some of these old
>>>>>>>> releases. It is unclear from Beam website which releases are still
>>>>>>>> supported or not. It probably makes sense to drop documentation for
>>>>>>>> releases < 2.0. (I would suggest keeping docs for 2.0). For the future 
>>>>>>>> I
>>>>>>>> can work on updating the Beam website to clarify the state of each 
>>>>>>>> rele

Re: Removing documentation for old Beam versions

2018-08-02 Thread Udi Meiri
Alright, created https://github.com/apache/beam-site/pull/520
[image: pr-520.png]
Reduces staging upload from 500M down to 270M, and halves the number of
files from ~22k to 11k.



On Thu, Aug 2, 2018 at 6:58 PM Pablo Estrada  wrote:

> I believe tags will be necessary because for anyone looking for old docs
> that have been removed, they will need to browse back in history, not just
> browse the tree of directories.
> -P.
>
> On Thu, Aug 2, 2018, 6:46 PM Mikhail Gryzykhin  wrote:
>
>> Last time I talked with Scott I brought this idea in. I believe the plan
>> was either to publish compiled site to website directly, or keep it in
>> separate storage from apache/beam repo.
>>
>> One of the main reasons not to check in compiled version of website is
>> that every developer will have to pull all the versions of website every
>> time they clone repo, which is not that good of an idea to do.
>>
>> Regards,
>> --Mikhail
>>
>> Have feedback <http://go/migryz-feedback>?
>>
>>
>> On Thu, Aug 2, 2018 at 6:42 PM Udi Meiri  wrote:
>>
>>> Pablo, the docs are generated into versioned paths, e.g.,
>>> https://beam.apache.org/documentation/sdks/javadoc/2.5.0/ so tags are
>>> not necessary?
>>> Also, once apache/beam-site is merged with apache/beam the release
>>> branch should have the relevant docs (although perhaps it's better to put
>>> them in a different repo or storage system).
>>>
>>> Thomas, I would very much like to not have javadoc/pydoc generation be
>>> part of the website review process, as it takes up a lot of time when
>>> changes are staged (10s of thousands of files), especially when a PR is
>>> updated and existing staged files need to be deleted.
>>>
>>>
>>> On Thu, Aug 2, 2018 at 1:15 PM Mikhail Gryzykhin 
>>> wrote:
>>>
>>>> +1 For removing old documentation.
>>>>
>>>> @Thomas: Migration work is in backlog and will be picked up in near
>>>> time.
>>>>
>>>> --Mikhail
>>>>
>>>> Have feedback <http://go/migryz-feedback>?
>>>>
>>>>
>>>> On Thu, Aug 2, 2018 at 12:54 PM Thomas Weise  wrote:
>>>>
>>>>> +1 for removing pre 2.0 documentation (as well as the entries from
>>>>> https://beam.apache.org/get-started/downloads/)
>>>>>
>>>>> Isn't it part of the beam-site changes that we will no longer check in
>>>>> generated documentation into the repository? Those can be generated and
>>>>> deployed independently (when a commit to a branch occurs), such as done in
>>>>> the Apex and Flink projects.
>>>>>
>>>>> I was told that Scott who was working in the beam-site changes is on
>>>>> leave now and the migration is still pending (see note at
>>>>> https://github.com/apache/beam/tree/master/website). Is anyone else
>>>>> going to pick it up?
>>>>>
>>>>> Thanks,
>>>>> Thomas
>>>>>
>>>>>
>>>>> On Thu, Aug 2, 2018 at 12:33 PM Pablo Estrada 
>>>>> wrote:
>>>>>
>>>>>> Is it worth adding a tag / branch to the repositories every time we
>>>>>> make a release, so that people are able to dive in and find the docs?
>>>>>> Best
>>>>>> -P.
>>>>>>
>>>>>> On Thu, Aug 2, 2018 at 12:09 PM Ahmet Altay  wrote:
>>>>>>
>>>>>>> I would guess that users are still using some of these old releases.
>>>>>>> It is unclear from Beam website which releases are still supported or 
>>>>>>> not.
>>>>>>> It probably makes sense to drop documentation for releases < 2.0. (I 
>>>>>>> would
>>>>>>> suggest keeping docs for 2.0). For the future I can work on updating the
>>>>>>> Beam website to clarify the state of each release.
>>>>>>>
>>>>>>> On Thu, Aug 2, 2018 at 12:06 PM, Udi Meiri  wrote:
>>>>>>>
>>>>>>>> The older docs are not directly linked to and are in Github commit
>>>>>>>> history.
>>>>>>>>
>>>>>>>> If there are no objections I'm going to delete javadocs and pydocs
>>>>>>>> for releases older than 1 year,
>>>>>>>> meaning 2

Re: CODEOWNERS for apache/beam repo

2018-07-25 Thread Udi Meiri
So I configured Prow using their getting started guide (and found a bug in
it) on a test repo.

TLDR: Prow can work for us as a review assignment tool if all potential
reviewers are also added to the https://github.com/apache org.

Some findings:
1. Github doesn't allow non-collaborators to be listed as reviewers. :(
But! anyone added to the Apache org on Github may be added as a reviewer.
(no write access needed)
Is this something the ASF is willing to consider?

2. Prow works pretty well. I've configured it to just assign code reviewers.
Here's an example of it in action:
https://github.com/udim-org/prow-test/pull/6
Essentially, the command we would use is:
"/cc @user" - to explicitly add a reviewer (/uncc to remove)

Other commands in the example above are not necessary.
We can still use our current PR approval and merge process.

3. Prow currently tries to assign 2 code reviewers, and hopefully that's
configurable.

Still unsure:
1. How does Prow select reviewers? Does it load balance?
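
For reference, a Prow OWNERS file (one per directory of the repo) looks
roughly like this; the usernames and label are placeholders, not actual Beam
reviewers:

```yaml
# OWNERS file in some directory of the repo (hypothetical entries)
reviewers:          # candidates for automatic review assignment
  - some-contributor
approvers:          # people whose /approve counts for this directory
  - some-committer
labels:             # labels Prow applies to PRs touching this directory
  - sdk-py
```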

On Mon, Jul 23, 2018 at 9:51 PM Jean-Baptiste Onofré 
wrote:

> It looks interesting but I would like to see the complete video and
> explanation about prow. Especially what we concretely need.
>
> Regards
> JB
> Le 24 juil. 2018, à 04:17, Udi Meiri  a écrit:
>>
>> I was recently told about Prow
>> <https://github.com/kubernetes/test-infra/tree/master/prow>, which
>> automates testing and merging for Kubernetes and other projects.
>> It also automates assigning reviewers and suggesting approvers. Example
>> <https://github.com/kubernetes/community/pull/2381> PR, video explanation
>> <https://youtu.be/BsIC7gPkH5M?t=19m41s>
>> I propose trying out Prow, since it's maintained and it uses OWNERS
>> files to explicitly define both who should be reviewing and who should
>> approve a PR.
>>
>> I'm not suggesting we use it to replace Jenkins or do our merges for us.
>>
>>
>> On Tue, Jul 17, 2018 at 11:04 AM Udi Meiri  wrote:
>>
>>> +1 to generating the file.
>>> I'll go ahead and file a PR to remove CODEOWNERS
>>>
>>> On Tue, Jul 17, 2018 at 9:28 AM Holden Karau 
>>> wrote:
>>>
>>>> So it doesn’t support doing that right now, although if we find it’s a
>>>> problem we can specify an exclude file with folks who haven’t contributed
>>>> in the past year. Would people want me to generate that first?
>>>>
>>>> On Tue, Jul 17, 2018 at 10:22 AM Ismaël Mejía 
>>>> wrote:
>>>>
>>>>> Is there a way to put inactive people as not reviewers for the blame
>>>>> case? I think it can be useful considering that a good amount of our
>>>>> committers are not active at the moment and auto-assigning reviews to
>>>>> them seems like a waste of energy/time.
>>>>> On Tue, Jul 17, 2018 at 1:58 AM Eugene Kirpichov 
>>>>> wrote:
>>>>> >
>>>>> > We did not, but I think we should. So far, in 100% of the PRs I've
>>>>> authored, the default functionality of CODEOWNERS did the wrong thing and 
>>>>> I
>>>>> had to fix something up manually.
>>>>> >
>>>>> > On Mon, Jul 16, 2018 at 3:42 PM Andrew Pilloud 
>>>>> wrote:
>>>>> >>
>>>>> >> This sounds like a good plan. Did we want to rename the CODEOWNERS
>>>>> file to disable github's mass adding of reviewers while we figure this 
>>>>> out?
>>>>> >>
>>>>> >> Andrew
>>>>> >>
>>>>> >> On Mon, Jul 16, 2018 at 10:20 AM Jean-Baptiste Onofré <
>>>>> j...@nanthrax.net> wrote:
>>>>> >>>
>>>>> >>> +1
>>>>> >>>
>>>>> >>> Le 16 juil. 2018, à 19:17, Holden Karau 
>>>>> a écrit:
>>>>> >>>>
>>>>> >>>> Ok if no one objects I'll create the INFRA ticket after OSCON and
>>>>> we can test it for a week and decide if it helps or hinders.
>>>>> >>>>
>>>>> >>>> On Mon, Jul 16, 2018, 7:12 PM Jean-Baptiste Onofré <
>>>>> j...@nanthrax.net> wrote:
>>>>> >>>>>
>>>>> >>>>> Agree to test it for a week.
>>>>> >>>>>
>>>>> >>>>> Regards
>>>>> >>>>> JB
>>>>> >>>>> Le 16 juil. 2018, à 18:59, Holden Karau < holden.ka...@gmail.com>
>&g

Re: CODEOWNERS for apache/beam repo

2018-07-27 Thread Udi Meiri
Summary doc for CODEOWNERS, Mention-bot, Prow:
https://docs.google.com/document/d/1S8spggJsxDNYZ7aNwZN6VhLhNW372SVRezjblt-7lNQ/edit?usp=sharing
This doc will get updated as we gain experience with Mention-bot and Prow.

On Wed, Jul 25, 2018 at 5:15 PM Udi Meiri  wrote:

> So I configured Prow using their getting started guide (and found a bug in
> it) on a test repo.
>
> TLDR: Prow can work for us as a review assignment tool if all potential
> reviewers are also added to the https://github.com/apache org.
>
> Some findings:
> 1. Github doesn't allow non-collaborators to be listed as reviewers. :(
> But! anyone added to the Apache org on Github may be added as a reviewer.
> (no write access needed)
> Is this something the ASF is willing to consider?
>
> 2. Prow works pretty well. I've configured it to just assign code
> reviewers.
> Here's an example of it in action:
> https://github.com/udim-org/prow-test/pull/6
> Essentially, the command we would use is:
> "/cc @user" - to explicitly add a reviewer (/uncc to remove)
>
> Other commands in the example above are not necessary.
> We can still use our current PR approval and merge process.
>
> 3. Prow currently tries to assign 2 code reviewers, and hopefully that's
> configurable.
>
> Still unsure:
> 1. How does Prow select reviewers? Does it load balance?
>
> On Mon, Jul 23, 2018 at 9:51 PM Jean-Baptiste Onofré 
> wrote:
>
>> It looks interesting but I would like to see the complete video and
>> explanation about prow. Especially what we concretely need.
>>
>> Regards
>> JB
>> Le 24 juil. 2018, à 04:17, Udi Meiri  a écrit:
>>>
>>> I was recently told about Prow
>>> <https://github.com/kubernetes/test-infra/tree/master/prow>, which
>>> automates testing and merging for Kubernetes and other projects.
>>> It also automates assigning reviewers and suggesting approvers. Example
>>> <https://github.com/kubernetes/community/pull/2381> PR, video
>>> explanation <https://youtu.be/BsIC7gPkH5M?t=19m41s>
>>> I propose trying out Prow, since it's maintained and it uses OWNERS
>>> files to explicitly define both who should be reviewing and who should
>>> approve a PR.
>>>
>>> I'm not suggesting we use it to replace Jenkins or do our merges for us.
>>>
>>>
>>> On Tue, Jul 17, 2018 at 11:04 AM Udi Meiri  wrote:
>>>
>>>> +1 to generating the file.
>>>> I'll go ahead and file a PR to remove CODEOWNERS
>>>>
>>>> On Tue, Jul 17, 2018 at 9:28 AM Holden Karau 
>>>> wrote:
>>>>
>>>>> So it doesn’t support doing that right now, although if we find it’s a
>>>>> problem we can specify an exclude file with folks who haven’t contributed
>>>>> in the past year. Would people want me to generate that first?
>>>>>
>>>>> On Tue, Jul 17, 2018 at 10:22 AM Ismaël Mejía 
>>>>> wrote:
>>>>>
>>>>>> Is there a way to put inactive people as not reviewers for the blame
>>>>>> case? I think it can be useful considering that a good amount of our
>>>>>> committers are not active at the moment and auto-assigning reviews to
>>>>>> them seems like a waste of energy/time.
>>>>>> On Tue, Jul 17, 2018 at 1:58 AM Eugene Kirpichov <
>>>>>> kirpic...@google.com> wrote:
>>>>>> >
>>>>>> > We did not, but I think we should. So far, in 100% of the PRs I've
>>>>>> authored, the default functionality of CODEOWNERS did the wrong thing 
>>>>>> and I
>>>>>> had to fix something up manually.
>>>>>> >
>>>>>> > On Mon, Jul 16, 2018 at 3:42 PM Andrew Pilloud 
>>>>>> wrote:
>>>>>> >>
>>>>>> >> This sounds like a good plan. Did we want to rename the CODEOWNERS
>>>>>> file to disable github's mass adding of reviewers while we figure this 
>>>>>> out?
>>>>>> >>
>>>>>> >> Andrew
>>>>>> >>
>>>>>> >> On Mon, Jul 16, 2018 at 10:20 AM Jean-Baptiste Onofré <
>>>>>> j...@nanthrax.net> wrote:
>>>>>> >>>
>>>>>> >>> +1
>>>>>> >>>
>>>>>> >>> Le 16 juil. 2018, à 19:17, Holden Karau 
>>>>>> a écrit:
>>>>>> >>>>
>>>>>> >&

Removing documentation for old Beam versions

2018-07-31 Thread Udi Meiri
Hi all,
I'm writing a PR for apache/beam-site and beam_PreCommit_Website_Stage is
timing out after 100 minutes, because it's trying to delete 22k files and
then copy 22k files (warning large file

).

It seems that we could save a lot of time by deleting the older javadoc and
pydoc files for older versions. Is there a good reason to keep around this
kind of documentation for older versions (say 1 year back)?




Re: CODEOWNERS for apache/beam repo

2018-08-01 Thread Udi Meiri
Hi, so I saw mention bot working
<https://github.com/apache/beam/pull/6104#issuecomment-409077055> this week.
How was the quality of suggestions?

Holden, I would like to start testing Prow starting next week if
that's possible.
I'll be opening a ticket to INFRA to give my Github bot account read access
(for requesting reviews).

On Fri, Jul 27, 2018, 09:37 Udi Meiri  wrote:

> Summary doc for CODEOWNERS, Mention-bot, Prow:
> https://docs.google.com/document/d/1S8spggJsxDNYZ7aNwZN6VhLhNW372SVRezjblt-7lNQ/edit?usp=sharing
> This doc will get updated as we gain experience with Mention-bot and Prow.
>
> On Wed, Jul 25, 2018 at 5:15 PM Udi Meiri  wrote:
>
>> So I configured Prow using their getting started guide (and found a bug
>> in it) on a test repo.
>>
>> TLDR: Prow can work for us as a review assignment tool if all potential
>> reviewers are also added to the https://github.com/apache org.
>>
>> Some findings:
>> 1. Github doesn't allow non-collaborators to be listed as reviewers. :(
>> But! anyone added to the Apache org on Github may be added as a reviewer.
>> (no write access needed)
>> Is this something the ASF is willing to consider?
>>
>> 2. Prow works pretty well. I've configured it to just assign code
>> reviewers.
>> Here's an example of it in action:
>> https://github.com/udim-org/prow-test/pull/6
>> Essentially, the command we would use is:
>> "/cc @user" - to explicitly add a reviewer (/uncc to remove)
>>
>> Other commands in the example above are not necessary.
>> We can still use our current PR approval and merge process.
>>
>> 3. Prow currently tries to assign 2 code reviewers, and hopefully that's
>> configurable.
>>
>> Still unsure:
>> 1. How does Prow select reviewers? Does it load balance?
>>
>> On Mon, Jul 23, 2018 at 9:51 PM Jean-Baptiste Onofré 
>> wrote:
>>
>>> It looks interesting but I would like to see the complete video and
>>> explanation about prow. Especially what we concretely need.
>>>
>>> Regards
>>> JB
>>> Le 24 juil. 2018, à 04:17, Udi Meiri  a écrit:
>>>>
>>>> I was recently told about Prow
>>>> <https://github.com/kubernetes/test-infra/tree/master/prow>, which
>>>> automates testing and merging for Kubernetes and other projects.
>>>> It also automates assigning reviewers and suggesting approvers. Example
>>>> <https://github.com/kubernetes/community/pull/2381> PR, video
>>>> explanation <https://youtu.be/BsIC7gPkH5M?t=19m41s>
>>>> I propose trying out Prow, since it's maintained and it uses
>>>> OWNERS files to explicitly define both who should be reviewing and who
>>>> should approve a PR.
>>>>
>>>> I'm not suggesting we use it to replace Jenkins or do our merges for us.
>>>>
>>>>
>>>> On Tue, Jul 17, 2018 at 11:04 AM Udi Meiri  wrote:
>>>>
>>>>> +1 to generating the file.
>>>>> I'll go ahead and file a PR to remove CODEOWNERS
>>>>>
>>>>> On Tue, Jul 17, 2018 at 9:28 AM Holden Karau 
>>>>> wrote:
>>>>>
>>>>>> So it doesn’t support doing that right now, although if we find it’s
>>>>>> a problem we can specify an exclude file with folks who haven’t 
>>>>>> contributed
>>>>>> in the past year. Would people want me to generate that first?
>>>>>>
>>>>>> On Tue, Jul 17, 2018 at 10:22 AM Ismaël Mejía 
>>>>>> wrote:
>>>>>>
>>>>>>> Is there a way to put inactive people as not reviewers for the blame
>>>>>>> case? I think it can be useful considering that a good amount of our
>>>>>>> committers are not active at the moment and auto-assigning reviews to
>>>>>>> them seems like a waste of energy/time.
>>>>>>> On Tue, Jul 17, 2018 at 1:58 AM Eugene Kirpichov <
>>>>>>> kirpic...@google.com> wrote:
>>>>>>> >
>>>>>>> > We did not, but I think we should. So far, in 100% of the PRs I've
>>>>>>> authored, the default functionality of CODEOWNERS did the wrong thing 
>>>>>>> and I
>>>>>>> had to fix something up manually.
>>>>>>> >
>>>>>>> > On Mon, Jul 16, 2018 at 3:42 PM Andrew Pilloud <
>>>>>>> apill...@google.com> wrote:
>>>>>>&

Re: [DISCUSSION] Tracking & Visualizing various metrics of the Beam community

2018-08-07 Thread Udi Meiri
These tables look very cool!

I certainly don't object to using tools made by another organization. The
only issue might be compatibility with our process.

From my understanding of the Kubernetes review process, they use Github's
tagging feature to specify PR statuses such as LGTM and approval.
Random example: https://github.com/kubernetes/kubernetes/pull/67089
I'm not sure that statistics such as pr-time-to-approve-and-merge would be
captured at all on apache/beam since we don't use these tags.

I would definitely like to see how useful this tool is regardless, and
since it doesn't seem to require any special permissions it should be
simple to set up a test instance.


On Tue, Aug 7, 2018 at 11:24 AM Huygaa Batsaikhan  wrote:

> tl;dr - are there any objections to trying out the DevStats tool created by
> CNCF?
>
> On Mon, Aug 6, 2018 at 3:43 PM Huygaa Batsaikhan 
> wrote:
>
>> Continuing the discussion
>> 
>> about improving Beam code review, I am looking into visualizing various
>> helpful Beam community metrics such as code velocity, reviewer load, and
>> new contributor's engagement.
>>
>> So far, I found DevStats
>> , an
>> open source (github ) dashboarding
>> tool used by Kubernetes, seems to provide almost everything we need. For
>> example, they have dashboards for metrics such as:
>>
>>    - Time to approve or merge
>>    - PR Time to engagement
>>    - New and Episodic PR contributors
>>    - PR reviews by contributor
>>    - Company statistics
>>
>> It would be really cool if we can try it out for Beam. I don't have much
>> experience using open source projects. From what I understand: DevStats is
>> developed by CNCF  and they manage their incubator
>> projects' dashboard. Since Beam is not part of the CNCF, in order to use
>> DevStats, we have to fork the project and maintain it ourselves.
>>
>> 1. What do you think about using DevStats for Beam? Do you know how it is
>> usually done?
>> 2. If you are not sure about DevStats, do you know any other tool which
>> could help us track & visualize Beam metrics?
>>
>> Thanks, Huygaa
>>
>


smime.p7s
Description: S/MIME Cryptographic Signature


Re: gradlew :rat broken

2018-08-15 Thread Udi Meiri
Found an elegant solution: https://github.com/apache/beam/pull/6236
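For context, the general shape of such a fix in Gradle is to teach the `clean` task about files that tools generate outside `build/`; a generic sketch only (the paths and wiring here are illustrative, not the actual change in the PR):

```groovy
// build.gradle sketch: make `:clean` delete generated files that :rat
// would otherwise flag as unapproved licenses. Paths are illustrative.
clean {
    delete 'nosetests.xml'     // written by nosetests runs
    delete 'container/vendor'  // written by the container build
}
```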

On Wed, Aug 15, 2018 at 4:25 PM Udi Meiri  wrote:

> nosetests.xml doesn't seem to get cleaned by :clean. Same goes for the
> containers sub-directory.
>
> On Wed, Aug 15, 2018 at 4:23 PM Yifan Zou  wrote:
>
>> I've also encountered that problem before. The nosetests.xml doesn't have
>> the license specified and you have to remove it before running the :rat
>> task locally.
>>
>> On Wed, Aug 15, 2018 at 4:19 PM Pablo Estrada  wrote:
>>
>>> The files that are causing the failure seem to have been generated. They
>>> are not part of the checked-in source code:
>>> https://github.com/apache/beam/tree/master/sdks/python/container
>>>
>>> Your files are container/vendor/, and sdks/python/nosetests.xml
>>> (which is generated when running nosetests).
>>>
>>> On Wed, Aug 15, 2018 at 4:11 PM Udi Meiri  wrote:
>>>
>>>> Hi,
>>>> Whenever I run ../../gradlew :rat from the sdks/python directory I get
>>>> errors about unapproved licenses. Example:
>>>> https://scans.gradle.com/s/gdpcgbexwyrpm
>>>>
>>>> This didn't use to be the case a few weeks back.
>>>> Any ideas what could be causing this?
>>>>
>>> --
>>> Got feedback? go/pabloem-feedback
>>> <https://goto.google.com/pabloem-feedback>
>>>
>>




Re: gradlew :rat broken

2018-08-15 Thread Udi Meiri
nosetests.xml doesn't seem to get cleaned by :clean. Same goes for the
containers sub-directory.

On Wed, Aug 15, 2018 at 4:23 PM Yifan Zou  wrote:

> I've also encountered that problem before. The nosetests.xml doesn't have
> the license specified and you have to remove it before running the :rat
> task locally.
>
> On Wed, Aug 15, 2018 at 4:19 PM Pablo Estrada  wrote:
>
>> The files that are causing the failure seem to have been generated. They
>> are not part of the checked-in source code:
>> https://github.com/apache/beam/tree/master/sdks/python/container
>>
>> Your files are container/vendor/, and sdks/python/nosetests.xml
>> (which is generated when running nosetests).
>>
>> On Wed, Aug 15, 2018 at 4:11 PM Udi Meiri  wrote:
>>
>>> Hi,
>>> Whenever I run ../../gradlew :rat from the sdks/python directory I get
>>> errors about unapproved licenses. Example:
>>> https://scans.gradle.com/s/gdpcgbexwyrpm
>>>
>>> This didn't use to be the case a few weeks back.
>>> Any ideas what could be causing this?
>>>
>> --
>> Got feedback? go/pabloem-feedback
>> <https://goto.google.com/pabloem-feedback>
>>
>




gradlew :rat broken

2018-08-15 Thread Udi Meiri
Hi,
Whenever I run ../../gradlew :rat from the sdks/python directory I get
errors about unapproved licenses. Example:
https://scans.gradle.com/s/gdpcgbexwyrpm

This didn't use to be the case a few weeks back.
Any ideas what could be causing this?




Re: [PROPOSAL] Prepare Beam 2.7.0 release

2018-08-23 Thread Udi Meiri
+1

On Mon, Aug 20, 2018 at 3:33 PM Boyuan Zhang  wrote:

> +1
> Thanks for volunteering, Charles!
>
> On Mon, Aug 20, 2018 at 3:22 PM Rafael Fernandez 
> wrote:
>
>> +1, thanks for volunteering, Charles!
>>
>> On Mon, Aug 20, 2018 at 12:09 PM Charles Chen  wrote:
>>
>>> Thank you Andrew for pointing out my mistake.  We should follow the
>>> calendar and aim to cut on 8/29, not 9/7 as I incorrectly wrote earlier.
>>>
>>> On Mon, Aug 20, 2018 at 12:02 PM Andrew Pilloud 
>>> wrote:
>>>
 +1 Thanks for volunteering! The calendar I have puts the cut date at
 August 29th, which looks to be 6 weeks from when 2.6.0 was cut. Do I have
 the wrong calendar?

 See:
 https://calendar.google.com/calendar/embed?src=0p73sl034k80oob7seouanigd0%40group.calendar.google.com=America%2FLos_Angeles

 Andrew

 On Mon, Aug 20, 2018 at 11:43 AM Connell O'Callaghan <
 conne...@google.com> wrote:

> +1 Charles thank you for taking this up and helping us maintain this
> schedule.
>
> On Mon, Aug 20, 2018 at 11:29 AM Charles Chen  wrote:
>
>> Hey everyone,
>>
>> Our release calendar indicates that the process for the 2.7.0 Beam
>> release should start on September 7.
>>
>> I volunteer to perform this release and propose the following
>> schedule:
>>
>>- We start triaging issues in JIRA this week.
>>- I will cut the initial 2.7.0 release branch on September 7.
>>- After September 7, any blockers will need to be manually
>>cherry-picked into the release branch.
>>- After tests pass and blockers are fully addressed, I will move
>>on and perform other release tasks.
>>
>> What do you think?
>>
>> Best,
>> Charles
>>
>




Re: Removing documentation for old Beam versions

2018-08-24 Thread Udi Meiri
I'm picking up the website migration. The plan is to not include generated
files in the master branch.

However, I've been told that even putting generated files in a separate branch
could blow up the git repository for everyone (e.g. make git pulls a lot
longer?).
Not sure if this is a real issue or not.

On Mon, Aug 20, 2018 at 2:53 AM Robert Bradshaw  wrote:

> On Sun, Aug 5, 2018 at 5:28 AM Thomas Weise  wrote:
> >
> > Yes, I think the separation of generated code will need to occur prior
> to completing the merge and switching the web site to the main repo.
> >
> > There should be no reason to check generated documentation into either
> of the repos/branches.
>
> Huge +1 to this. Thomas, would have time to set something like this up
> for Beam? If not, could anyone else pick this up?
>
> > Please see as an example how this was solved in Flink, using the ASF
> buildbot infrastructure.
> >
> > Documentation per version/release, for example:
> >
> > https://ci.apache.org/projects/flink/flink-docs-release-1.5/
> >
> > The buildbot configuration is here (requires committer access):
> >
> >
> https://svn.apache.org/repos/infra/infrastructure/buildbot/aegis/buildmaster/master1/projects/flink.conf
> >
> > Thanks,
> > Thomas
> >
> > On Thu, Aug 2, 2018 at 6:46 PM Mikhail Gryzykhin 
> wrote:
> >>
> >> Last time I talked with Scott I brought this idea in. I believe the
> plan was either to publish compiled site to website directly, or keep it in
> separate storage from apache/beam repo.
> >>
> >> One of the main reasons not to check in a compiled version of the website
> is that every developer will have to pull all the versions of the website
> every time they clone the repo, which is not that good of an idea to do.
> >>
> >> Regards,
> >> --Mikhail
> >>
> >> Have feedback?
> >>
> >>
> >> On Thu, Aug 2, 2018 at 6:42 PM Udi Meiri  wrote:
> >>>
> >>> Pablo, the docs are generated into versioned paths, e.g.,
> https://beam.apache.org/documentation/sdks/javadoc/2.5.0/ so tags are not
> necessary?
> >>> Also, once apache/beam-site is merged with apache/beam the release
> branch should have the relevant docs (although perhaps it's better to put
> them in a different repo or storage system).
> >>>
> >>> Thomas, I would very much like to not have javadoc/pydoc generation be
> part of the website review process, as it takes up a lot of time when
> changes are staged (10s of thousands of files), especially when a PR is
> updated and existing staged files need to be deleted.
> >>>
> >>>
> >>> On Thu, Aug 2, 2018 at 1:15 PM Mikhail Gryzykhin 
> wrote:
> >>>>
> >>>> +1 For removing old documentation.
> >>>>
> >>>> @Thomas: Migration work is in the backlog and will be picked up in the
> near term.
> >>>>
> >>>> --Mikhail
> >>>>
> >>>> Have feedback?
> >>>>
> >>>>
> >>>> On Thu, Aug 2, 2018 at 12:54 PM Thomas Weise  wrote:
> >>>>>
> >>>>> +1 for removing pre 2.0 documentation (as well as the entries from
> https://beam.apache.org/get-started/downloads/)
> >>>>>
> >>>>> Isn't it part of the beam-site changes that we will no longer check
> in generated documentation into the repository? Those can be generated and
> deployed independently (when a commit to a branch occurs), such as done in
> the Apex and Flink projects.
> >>>>>
> >>>>> I was told that Scott who was working in the beam-site changes is on
> leave now and the migration is still pending (see note at
> https://github.com/apache/beam/tree/master/website). Is anyone else going
> to pick it up?
> >>>>>
> >>>>> Thanks,
> >>>>> Thomas
> >>>>>
> >>>>>
> >>>>> On Thu, Aug 2, 2018 at 12:33 PM Pablo Estrada 
> wrote:
> >>>>>>
> >>>>>> Is it worth adding a tag / branch to the repositories every time we
> make a release, so that people are able to dive in and find the docs?
> >>>>>> Best
> >>>>>> -P.
> >>>>>>
> >>>>>> On Thu, Aug 2, 2018 at 12:09 PM Ahmet Altay 
> wrote:
> >>>>>>>
> >>>>>>> I would guess that users are still using some of these old
> releases. It is unclear from Beam website which releases are still
> supported or not. It pr

Re: BEAM-5180 for 2.7.0 ?

2018-08-24 Thread Udi Meiri
+Ankur Goenka  (Kenneth is out of office)

On Fri, Aug 24, 2018 at 3:20 AM Tim Robertson 
wrote:

> Thanks Jozef for bringing this to dev@ and your work in reporting Jiras
> and offering fixes.
>
> I propose we consider BEAM-5180, BEAM-2277 blockers on 2.7.0. They break
> word count and file IO writing on HDFS unless the workaround is used (see
> BEAM-2277 commentary).
>
> In addition the performance of writing to HDFS is bad (to the point of
> being unusable) due to BEAM-5036 (blocked by BEAM-4861) which you have also
> observed. I don’t think we’ll make the cut for 2.7.0 on those because I
> anticipate significant testing is needed. I will focus on that as soon as I
> have time but I propose we make those blockers on 2.8.0 though.
>
>
>
> On Fri, Aug 24, 2018 at 7:53 AM Jozef Vilcek 
> wrote:
>
>> Hello,
>>
>> does this JIRA have a chance to be part of the 2.7.0 release?
>> https://issues.apache.org/jira/browse/BEAM-5180
>>
>> It is rather simple ask, but was not decided yet if attached PR is a
>> correct way of fixing it.
>>
>> Thanks,
>> Jozef
>>
>




Re: Design Proposal: Beam-Site Automation Reliability

2018-08-28 Thread Udi Meiri
FYI, we are about to add a new branch to apache/beam, named 'asf-site',
which will contain generated website sources.

On Thu, Jun 7, 2018 at 10:18 AM Jason Kuster  wrote:

> Sounds good; I'm really excited about these changes Scott. Thanks for
> taking this on!
>
> On Tue, Jun 5, 2018 at 4:00 PM Scott Wegner  wrote:
>
>> Thanks everyone; I've responded to feedback in the doc [1] and I believe
>> we've reached consensus. I've added implementation tasks in JIRA
>> under BEAM-4493 [2] and will start coding soon. As a recap, the high-level
>> plan is:
>>
>> * Migrate website source code to the main apache/beam repository
>> * Discontinue checking-in generated HTML during the PR workflow
>> * Align to the existing apache/beam PR process (code review policy,
>> precommits, generic Git merge)
>> * Filter pre-commit jobs to only run when necessary
>> * Add a post-commit Jenkins job to push generated HTML to a separate
>> publishing branch
>>
>> [1] https://s.apache.org/beam-site-automation
>> [2] https://issues.apache.org/jira/browse/BEAM-4493
>>
>> On Fri, Jun 1, 2018 at 10:33 AM Scott Wegner  wrote:
>>
>>> Pre-commit filtering has come up on previous discussions as well and is
>>> an obvious improvement. I've opened BEAM-4445 [1] for this and assigned it
>>> to myself.
>>>
>>> [1] https://issues.apache.org/jira/browse/BEAM-4445
>>>
>>> On Fri, Jun 1, 2018 at 10:01 AM Kenneth Knowles  wrote:
>>>
 +1

 Can we separate precommit filtering and get it set up independent from
 this? I think there's a lot of good directions to go once it is the norm.

 On Thu, May 31, 2018 at 9:25 PM Thomas Weise  wrote:

> Very nice, enthusiastic +1
>
> On Thu, May 31, 2018 at 3:24 PM, Scott Wegner 
> wrote:
>
>> Thanks to everyone who reviewed the doc. I put together a plan based
>> on the initial feedback to improve website automation reliability. At a
>> glance, I am proposing to:
>>
>> * Migrate website source code to the main apache/beam repository
>> * Discontinue checking-in generated HTML during the PR workflow
>> * Align to the existing apache/beam PR process (code review policy,
>> precommits, generic Git merge)
>> * Filter pre-commit jobs to only run when necessary
>> * Add a post-commit Jenkins job to push generated HTML to a separate
>> publishing branch
>>
>> Please take another look at the doc, specifically the new section
>> entitled "Proposed Solution":
>> https://s.apache.org/beam-site-automation
>> I'd like to gather feedback by Monday June 4, and if there is
>> consensus move forward with the implementation.
>>
>> Thanks,
>> Scott
>>
>>
>> Got feedback? tinyurl.com/swegner-feedback
>>
>> On Tue, May 29, 2018 at 4:32 PM Scott Wegner 
>> wrote:
>>
>>> I've been looking into the beam-site merge automation reliability,
>>> and I'd like to get some early feedback on ideas for improvement. Please
>>> take a look at https://s.apache.org/beam-site-automation:
>>>
>>> > Apache Beam's website is maintained via the beam-site Git
>>> repository, with a set of automation that manages the workflow from 
>>> merging
>>> a pull request to publishing. The automation is centralized in a tool
>>> called Mergebot, which was built for Beam and donated to the ASF. 
>>> However,
>>> the automation has been somewhat unreliable, and when there are issues,
>>> very few individuals have the necessary permissions and expertise to
>>> resolve them. Overall, the reliability of Beam-site automation is 
>>> impeding
>>> productivity for Beam-site development.
>>>
>>> At this point I'm seeking feedback on a few possible solutions:
>>>
>>> 1. Invest in improvements to Mergebot reliability. Make stability
>>> tweaks for various failure modes, distribute Mergebot expertise and
>>> operations permissions to more committers.
>>> 2. Deprecate Mergebot and revert to manual process. With the current
>>> unreliability, some committers choose to forego merge automation anyway.
>>> 3. Generate HTML only during publishing. This seems to be newly
>>> supported by the Apache GitPubSub workflow. This would eliminate most or
>>> all of the automation that Mergebot is responsible for.
>>>
>>> Feel free to add comments in the doc.
>>>
>>> Thanks,
>>> Scott
>>>
>>>
>>>
>>> Got feedback? tinyurl.com/swegner-feedback
>>>
>>
>
>
> --
> ---
> Jason Kuster
> Apache Beam / Google Cloud Dataflow
>
> See something? Say something. go/jasonkuster-feedback
> 
>




jira search in chrome omnibox

2018-08-27 Thread Udi Meiri
In case you want to quickly look up JIRA tickets, e.g., typing 'j', space,
'BEAM-4696'.
Search URL: https://issues.apache.org/jira/QuickSearch.jspa?searchString=%s




Re: CODEOWNERS for apache/beam repo

2018-07-17 Thread Udi Meiri
+1 to generating the file.
I'll go ahead and file a PR to remove CODEOWNERS

On Tue, Jul 17, 2018 at 9:28 AM Holden Karau  wrote:

> So it doesn’t support doing that right now, although if we find it’s a
> problem we can specify an exclude file with folks who haven’t contributed
> in the past year. Would people want me to generate that first?
>
> On Tue, Jul 17, 2018 at 10:22 AM Ismaël Mejía  wrote:
>
>> Is there a way to put inactive people as not reviewers for the blame
>> case? I think it can be useful considering that a good amount of our
>> committers are not active at the moment and auto-assigning reviews to
>> them seems like a waste of energy/time.
>> On Tue, Jul 17, 2018 at 1:58 AM Eugene Kirpichov 
>> wrote:
>> >
>> > We did not, but I think we should. So far, in 100% of the PRs I've
>> authored, the default functionality of CODEOWNERS did the wrong thing and I
>> had to fix something up manually.
>> >
>> > On Mon, Jul 16, 2018 at 3:42 PM Andrew Pilloud 
>> wrote:
>> >>
>> >> This sounds like a good plan. Did we want to rename the CODEOWNERS
>> file to disable github's mass adding of reviewers while we figure this out?
>> >>
>> >> Andrew
>> >>
>> >> On Mon, Jul 16, 2018 at 10:20 AM Jean-Baptiste Onofré 
>> wrote:
>> >>>
>> >>> +1
>> >>>
>> >>> On 16 Jul 2018, at 19:17, Holden Karau  wrote:
>> >>>>
>> >>>> Ok if no one objects I'll create the INFRA ticket after OSCON and we
>> can test it for a week and decide if it helps or hinders.
>> >>>>
>> >>>> On Mon, Jul 16, 2018, 7:12 PM Jean-Baptiste Onofré < j...@nanthrax.net>
>> wrote:
>> >>>>>
>> >>>>> Agree to test it for a week.
>> >>>>>
>> >>>>> Regards
>> >>>>> JB
>> >>>>> On 16 Jul 2018, at 18:59, Holden Karau < holden.ka...@gmail.com> wrote:
>> >>>>>>
>> >>>>>> Would folks be OK with me asking infra to turn on blame based
>> suggestions for Beam and trying it out for a week?
>> >>>>>>
>> >>>>>> On Mon, Jul 16, 2018, 6:53 PM Rafael Fernandez <
>> rfern...@google.com> wrote:
>> >>>>>>>
>> >>>>>>> +1 using blame -- nifty :)
>> >>>>>>>
>> >>>>>>> On Mon, Jul 16, 2018 at 2:31 AM Huygaa Batsaikhan <
>> bat...@google.com> wrote:
>> >>>>>>>>
>> >>>>>>>> +1. This is great.
>> >>>>>>>>
>> >>>>>>>> On Sat, Jul 14, 2018 at 7:44 AM Udi Meiri < eh...@google.com>
>> wrote:
>> >>>>>>>>>
>> >>>>>>>>> Mention bot looks cool, as it tries to guess the reviewer using
>> blame.
>> >>>>>>>>> I've written a quick and dirty script that uses only CODEOWNERS.
>> >>>>>>>>>
>> >>>>>>>>> Its output looks like:
>> >>>>>>>>> $ python suggest_reviewers.py --pr 5940
>> >>>>>>>>> INFO:root:Selected reviewer @lukecwik for:
>> /runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/PTransformMatchers.java
>> (path_pattern: /runners/core-construction-java*)
>> >>>>>>>>> INFO:root:Selected reviewer @lukecwik for:
>> /runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/SplittableParDoNaiveBounded.java
>> (path_pattern: /runners/core-construction-java*)
>> >>>>>>>>> INFO:root:Selected reviewer @echauchot for:
>> /runners/core-java/src/main/java/org/apache/beam/runners/core/SplittableParDoViaKeyedWorkItems.java
>> (path_pattern: /runners/core-java*)
>> >>>>>>>>> INFO:root:Selected reviewer @lukecwik for:
>> /runners/flink/build.gradle (path_pattern: */build.gradle*)
>> >>>>>>>>> INFO:root:Selected reviewer @lukecwik for:
>> /runners/flink/src/main/java/org/apache/beam/runners/flink/FlinkTransformOverrides.java
>> (path_pattern: *.java)
>> >>>>>>>>> INFO:root:Selected reviewer @pabloem for:
>> /runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowRun
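The selection logic behind such a script can be sketched as follows. This is a hypothetical reconstruction, not the actual suggest_reviewers.py; the patterns and reviewer handles are made up for illustration:

```python
# Sketch of CODEOWNERS-style matching: for each changed file, the owners
# of the LAST matching pattern are suggested, mirroring GitHub's rule.
import fnmatch

# (pattern, owners) pairs, in file order. Handles here are fictional.
RULES = [
    ("*.java", ["@java-reviewer"]),
    ("/runners/core-construction-java/*", ["@runners-reviewer"]),
]

def suggest_reviewers(path):
    """Return the owners of the last rule whose pattern matches `path`."""
    owners = []
    for pattern, names in RULES:
        if fnmatch.fnmatch(path, pattern):
            owners = names  # later rules take precedence
    return owners
```

A more specific pattern later in the file overrides an earlier catch-all, which is why rule order matters in a real CODEOWNERS file.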

Re: CODEOWNERS for apache/beam repo

2018-07-23 Thread Udi Meiri
I was recently told about Prow
<https://github.com/kubernetes/test-infra/tree/master/prow>, which
automates testing and merging for Kubernetes and other projects.
It also automates assigning reviewers and suggesting approvers. Example
<https://github.com/kubernetes/community/pull/2381> PR, video explanation
<https://youtu.be/BsIC7gPkH5M?t=19m41s>
I propose trying out Prow, since it's maintained and it uses OWNERS
files to explicitly define both who should be reviewing and who should
approve a PR.

I'm not suggesting we use it to replace Jenkins or do our merges for us.
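For context, Prow's reviewer auto-assignment comes from its blunderbuss plugin, enabled per-repo in plugins.yaml. A minimal sketch; the field names below are from memory of the Kubernetes test-infra setup and should be checked against Prow's docs:

```yaml
# plugins.yaml sketch (field names are assumptions, not verified):
# cap the number of reviewers blunderbuss requests per PR, and enable
# the plugin for the repo.
blunderbuss:
  max_request_count: 2
plugins:
  apache/beam:
  - blunderbuss
```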


On Tue, Jul 17, 2018 at 11:04 AM Udi Meiri  wrote:

> +1 to generating the file.
> I'll go ahead and file a PR to remove CODEOWNERS
>
> On Tue, Jul 17, 2018 at 9:28 AM Holden Karau  wrote:
>
>> So it doesn’t support doing that right now, although if we find it’s a
>> problem we can specify an exclude file with folks who haven’t contributed
>> in the past year. Would people want me to generate that first?
>>
>> On Tue, Jul 17, 2018 at 10:22 AM Ismaël Mejía  wrote:
>>
>>> Is there a way to put inactive people as not reviewers for the blame
>>> case? I think it can be useful considering that a good amount of our
>>> committers are not active at the moment and auto-assigning reviews to
>>> them seems like a waste of energy/time.
>>> On Tue, Jul 17, 2018 at 1:58 AM Eugene Kirpichov 
>>> wrote:
>>> >
>>> > We did not, but I think we should. So far, in 100% of the PRs I've
>>> authored, the default functionality of CODEOWNERS did the wrong thing and I
>>> had to fix something up manually.
>>> >
>>> > On Mon, Jul 16, 2018 at 3:42 PM Andrew Pilloud 
>>> wrote:
>>> >>
>>> >> This sounds like a good plan. Did we want to rename the CODEOWNERS
>>> file to disable github's mass adding of reviewers while we figure this out?
>>> >>
>>> >> Andrew
>>> >>
>>> >> On Mon, Jul 16, 2018 at 10:20 AM Jean-Baptiste Onofré <
>>> j...@nanthrax.net> wrote:
>>> >>>
>>> >>> +1
>>> >>>
>>> >>> On 16 Jul 2018, at 19:17, Holden Karau  wrote:
>>> >>>>
>>> >>>> Ok if no one objects I'll create the INFRA ticket after OSCON and
>>> we can test it for a week and decide if it helps or hinders.
>>> >>>>
>>> >>>> On Mon, Jul 16, 2018, 7:12 PM Jean-Baptiste Onofré <
>>> j...@nanthrax.net> wrote:
>>> >>>>>
>>> >>>>> Agree to test it for a week.
>>> >>>>>
>>> >>>>> Regards
>>> >>>>> JB
>>> >>>>> On 16 Jul 2018, at 18:59, Holden Karau < holden.ka...@gmail.com> wrote:
>>> >>>>>>
>>> >>>>>> Would folks be OK with me asking infra to turn on blame based
>>> suggestions for Beam and trying it out for a week?
>>> >>>>>>
>>> >>>>>> On Mon, Jul 16, 2018, 6:53 PM Rafael Fernandez <
>>> rfern...@google.com> wrote:
>>> >>>>>>>
>>> >>>>>>> +1 using blame -- nifty :)
>>> >>>>>>>
>>> >>>>>>> On Mon, Jul 16, 2018 at 2:31 AM Huygaa Batsaikhan <
>>> bat...@google.com> wrote:
>>> >>>>>>>>
>>> >>>>>>>> +1. This is great.
>>> >>>>>>>>
>>> >>>>>>>> On Sat, Jul 14, 2018 at 7:44 AM Udi Meiri < eh...@google.com>
>>> wrote:
>>> >>>>>>>>>
>>> >>>>>>>>> Mention bot looks cool, as it tries to guess the reviewer
>>> using blame.
>>> >>>>>>>>> I've written a quick and dirty script that uses only
>>> CODEOWNERS.
>>> >>>>>>>>>
>>> >>>>>>>>> Its output looks like:
>>> >>>>>>>>> $ python suggest_reviewers.py --pr 5940
>>> >>>>>>>>> INFO:root:Selected reviewer @lukecwik for:
>>> /runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/PTransformMatchers.java
>>> (path_pattern: /runners/core-construction-java*)
>>> >>>>>>>>> INFO:root:Selected reviewer @lukecwik for:
>>> /runners/core-const

Re: builds.apache.org refused connections since last night

2018-08-31 Thread Udi Meiri
My mistake. I thought you needed to appear as "member" on apache/beam-site
as well.

On Fri, Aug 31, 2018 at 5:06 PM Thomas Weise  wrote:

> All committers should be able to merge the site changes and wasn't that
> the case till the Jenkins outage?
>
>
> On Fri, Aug 31, 2018 at 3:19 PM Udi Meiri  wrote:
>
>> I believe the bot only listens to members.
>>
>> On Fri, Aug 31, 2018 at 2:56 PM Thomas Weise  wrote:
>>
>>> Any idea why the beam-site merge bot doesn't work?
>>>
>>> The PRs are showing a "Merging is blocked" check that I don't remember
>>> seeing before.
>>>
>>>
>>> On Fri, Aug 31, 2018 at 2:06 AM Maximilian Michels 
>>> wrote:
>>>
>>>> Jenkins is up again! (woho!)
>>>>
>>>> On 30.08.18 20:23, Thomas Weise wrote:
>>>> > I would be concerned with multiple folks running the Jekyll build
>>>> > locally to end up with inconsistent results. But if Jenkins stays
>>>> down
>>>> > for longer, then maybe one of us can be the Jenkins substitute :)
>>>> >
>>>> >
>>>> > On Thu, Aug 30, 2018 at 10:52 AM Boyuan Zhang >>> > <mailto:boyu...@google.com>> wrote:
>>>> >
>>>> > Hey Thomas,
>>>> >
>>>> > I guess a committer can push changes directly into
>>>> > https://gitbox.apache.org/repos/asf?p=beam-site.git. Maybe can
>>>> have
>>>> > a try with a committer's help.
>>>> >
>>>> > On Thu, Aug 30, 2018 at 10:34 AM Thomas Weise >>> > <mailto:t...@apache.org>> wrote:
>>>> >
>>>> > While Jenkins is down, is there an alternative process to
>>>> merge
>>>> > web site changes?
>>>> >
>>>> >
>>>> > On Wed, Aug 29, 2018 at 9:19 AM Boyuan Zhang <
>>>> boyu...@google.com
>>>> > <mailto:boyu...@google.com>> wrote:
>>>> >
>>>> > Thank you Andrew!
>>>> >
>>>> > On Wed, Aug 29, 2018 at 9:17 AM Andrew Pilloud
>>>> > mailto:apill...@google.com>> wrote:
>>>> >
>>>> > Down for me too. It sounds like the disk failed and it
>>>> > will be down for a while: https://status.apache.org/
>>>> >
>>>> > Andrew
>>>> >
>>>> > On Wed, Aug 29, 2018 at 9:13 AM Boyuan Zhang
>>>> > mailto:boyu...@google.com>>
>>>> wrote:
>>>> >
>>>> > Hey all,
>>>> >
>>>> > It seems like that builds.apache.org
>>>> > <http://builds.apache.org> cannot be reached from
>>>> > last night. Does anyone have the same connection
>>>> > problem? Any idea what can we do?
>>>> >
>>>> > Boyuan Zhang
>>>> >
>>>>
>>>




Re: builds.apache.org refused connections since last night

2018-08-31 Thread Udi Meiri
I believe the bot only listens to members.

On Fri, Aug 31, 2018 at 2:56 PM Thomas Weise  wrote:

> Any idea why the beam-site merge bot doesn't work?
>
> The PRs are showing a "Merging is blocked" check that I don't remember
> seeing before.
>
>
> On Fri, Aug 31, 2018 at 2:06 AM Maximilian Michels  wrote:
>
>> Jenkins is up again! (woho!)
>>
>> On 30.08.18 20:23, Thomas Weise wrote:
>> > I would be concerned with multiple folks running the Jekyll build
>> > locally to end up with inconsistent results. But if Jenkins stays down
>> > for longer, then maybe one of us can be the Jenkins substitute :)
>> >
>> >
>> > On Thu, Aug 30, 2018 at 10:52 AM Boyuan Zhang > > > wrote:
>> >
>> > Hey Thomas,
>> >
>> > I guess a committer can push changes directly into
>> > https://gitbox.apache.org/repos/asf?p=beam-site.git. Maybe can have
>> > a try with a committer's help.
>> >
>> > On Thu, Aug 30, 2018 at 10:34 AM Thomas Weise > > > wrote:
>> >
>> > While Jenkins is down, is there an alternative process to merge
>> > web site changes?
>> >
>> >
>> > On Wed, Aug 29, 2018 at 9:19 AM Boyuan Zhang <
>> boyu...@google.com
>> > > wrote:
>> >
>> > Thank you Andrew!
>> >
>> > On Wed, Aug 29, 2018 at 9:17 AM Andrew Pilloud
>> > mailto:apill...@google.com>> wrote:
>> >
>> > Down for me too. It sounds like the disk failed and it
>> > will be down for a while: https://status.apache.org/
>> >
>> > Andrew
>> >
>> > On Wed, Aug 29, 2018 at 9:13 AM Boyuan Zhang
>> > mailto:boyu...@google.com>> wrote:
>> >
>> > Hey all,
>> >
>> > It seems like that builds.apache.org
>> >  cannot be reached from
>> > last night. Does anyone have the same connection
>> > problem? Any idea what can we do?
>> >
>> > Boyuan Zhang
>> >
>>
>




Re: jira search in chrome omnibox

2018-08-30 Thread Udi Meiri
Correction: this is the correct URL:
https://issues.apache.org/jira/secure/QuickSearch.jspa?searchString=%s

It uses smart querying. Ex: Searching for "beam open pubsub" will search
for open bugs in project BEAM with the keyword "pubsub".
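The omnibox substitution above can be sketched in a few lines: Chrome URL-encodes the typed query and splices it in place of %s. A small illustrative sketch of that behavior:

```python
# Build the JIRA quick-search URL the way Chrome's custom search engine
# does: URL-encode the query and substitute it for %s.
from urllib.parse import quote_plus

BASE = "https://issues.apache.org/jira/secure/QuickSearch.jspa?searchString="

def quicksearch_url(query):
    """Return the URL opened for a query typed in the omnibox."""
    return BASE + quote_plus(query)
```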

On Tue, Aug 28, 2018 at 4:49 PM Valentyn Tymofieiev 
wrote:

> Thanks for sharing.
>
> I have also found useful following custom search query for PRs:
> https://github.com/apache/beam/pulls?q=is%3Apr%20%s
>
> Sample usage: type 'pr', space, type: 'author:tvalentyn'.
>
> You could also incorporate 'author:' into the query:
> https://github.com/apache/beam/pulls?q=is%3Apr%20author%3A
>
> On Tue, Aug 28, 2018 at 4:26 PM Daniel Oliveira 
> wrote:
>
>> This seems pretty useful. Thanks Udi!
>>
>> On Mon, Aug 27, 2018 at 3:54 PM Udi Meiri  wrote:
>>
>>> In case you want to quickly look up JIRA tickets, e.g., typing 'j',
>>> space, 'BEAM-4696'.
>>> Search URL:
>>> https://issues.apache.org/jira/QuickSearch.jspa?searchString=%s
>>>
>>>




CODEOWNERS for apache/beam repo

2018-07-09 Thread Udi Meiri
Hi everyone,

I'm proposing to add auto-reviewer-assignment using Github's CODEOWNERS
mechanism.
Initial version is here: https://github.com/apache/beam/pull/5909/files

I need help from the community in determining owners for each component.
Feel free to directly edit the PR (if you have permission) or add a comment.


Background
The idea is to:
1. Document good review candidates for each component.
2. Help choose reviewers using the auto-assignment mechanism. The
suggestion is in no way binding.
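For readers unfamiliar with the format, a CODEOWNERS file maps path patterns to GitHub users. A minimal illustrative sketch; the patterns and handles here are made up and are not the contents of the PR:

```text
# .github/CODEOWNERS sketch. Each line: a path pattern, then owners.
# For a changed file, the LAST matching pattern's owners are requested.
*               @default-reviewer
*.py            @python-reviewer
/runners/flink/ @flink-reviewer
```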




Re: CODEOWNERS for apache/beam repo

2018-07-10 Thread Udi Meiri
Robert, the docs <https://help.github.com/articles/about-codeowners/> say:
"The people you choose as code owners must have write permissions for the
repository,"
but I want to ignore that for now and see what happens. :)

Also, I'm not sure how the auto-assignment works. Does it round-robin? Will
it assign more than one reviewer?
We'll see how it goes.

Feel free to add components as you see fit. Most of what I've done is
guesswork.

On Tue, Jul 10, 2018 at 11:13 AM Ahmet Altay  wrote:

> +1
>
> I added my name and a few others names that frequently do reviews in some
> areas.
>
> On Tue, Jul 10, 2018 at 10:23 AM, Jean-Baptiste Onofré 
> wrote:
>
>> +1
>>
>> I added my name on some components ;)
>>
>> Thanks !
>> Regards
>> JB
>>
>> On 10/07/2018 02:06, Udi Meiri wrote:
>> > Hi everyone,
>> >
>> > I'm proposing to add auto-reviewer-assignment using Github's CODEOWNERS
>> > mechanism.
>> > Initial version is here: https://github.com/apache/beam/pull/5909/files
>> >
>> > I need help from the community in determining owners for each component.
>> > Feel free to directly edit the PR (if you have permission) or add a
>> comment.
>> >
>> >
>> > Background
>> > The idea is to:
>> > 1. Document good review candidates for each component.
>> > 2. Help choose reviewers using the auto-assignment mechanism. The
>> > suggestion is in no way binding.
>> >
>> >
>>
>> --
>> Jean-Baptiste Onofré
>> jbono...@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>
>
>


smime.p7s
Description: S/MIME Cryptographic Signature


Re: CODEOWNERS for apache/beam repo

2018-07-12 Thread Udi Meiri
:/ That makes it a little less useful.

On Thu, Jul 12, 2018 at 11:14 AM Tim Robertson 
wrote:

> Hi Udi
>
> I asked the GH helpdesk and they confirmed that only people with write
> access will actually be automatically chosen.
>
> I don't expect it should stop us using it, but we should be aware that
> there are non-committers also willing to review.
>
> Thanks,
> Tim
>
> On Thu, Jul 12, 2018 at 7:24 PM, Mikhail Gryzykhin 
> wrote:
>
>> Idea looks good in general.
>>
>> Did you look into ways to keep this file up-to-date? For example we can
>> run monthly job to see if owner was active during this period.
>>
>> --Mikhail
>>
>> Have feedback <http://go/migryz-feedback>?
>>
>>
>> On Thu, Jul 12, 2018 at 9:56 AM Udi Meiri  wrote:
>>
>>> Thanks all!
>>> I'll try to get the file merged today and see how it works out.
>>> Please surface any issues, such as with auto-assignment, here or in JIRA.
>>>
>>> On Thu, Jul 12, 2018 at 2:12 AM Etienne Chauchot 
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I added myself as a reviewer for some modules.
>>>>
>>>> Etienne
>>>>
>>>> Le lundi 09 juillet 2018 à 17:06 -0700, Udi Meiri a écrit :
>>>>
>>>> Hi everyone,
>>>>
>>>> I'm proposing to add auto-reviewer-assignment using Github's CODEOWNERS
>>>> mechanism.
>>>> Initial version is here: https://github.com/apache/beam/pull/5909/files
>>>>
>>>> I need help from the community in determining owners for each component.
>>>> Feel free to directly edit the PR (if you have permission) or add a
>>>> comment.
>>>>
>>>>
>>>> Background
>>>> The idea is to:
>>>> 1. Document good review candidates for each component.
>>>> 2. Help choose reviewers using the auto-assignment mechanism. The
>>>> suggestion is in no way binding.
>>>>
>>>>
>>>>
>


smime.p7s
Description: S/MIME Cryptographic Signature


Re: CODEOWNERS for apache/beam repo

2018-07-12 Thread Udi Meiri
Thanks all!
I'll try to get the file merged today and see how it works out.
Please surface any issues, such as with auto-assignment, here or in JIRA.

On Thu, Jul 12, 2018 at 2:12 AM Etienne Chauchot 
wrote:

> Hi,
>
> I added myself as a reviewer for some modules.
>
> Etienne
>
> Le lundi 09 juillet 2018 à 17:06 -0700, Udi Meiri a écrit :
>
> Hi everyone,
>
> I'm proposing to add auto-reviewer-assignment using Github's CODEOWNERS
> mechanism.
> Initial version is here: https://github.com/apache/beam/pull/5909/files
>
> I need help from the community in determining owners for each component.
> Feel free to directly edit the PR (if you have permission) or add a
> comment.
>
>
> Background
> The idea is to:
> 1. Document good review candidates for each component.
> 2. Help choose reviewers using the auto-assignment mechanism. The
> suggestion is in no way binding.
>
>
>


smime.p7s
Description: S/MIME Cryptographic Signature


Re: CODEOWNERS for apache/beam repo

2018-07-13 Thread Udi Meiri
Hi Etienne,

Yes you could be as precise as you want. The paths I listed are just
suggestions. :)


On Fri, Jul 13, 2018 at 1:12 AM Jean-Baptiste Onofré 
wrote:

> Hi,
>
> I think it's already do-able just providing the expected path.
>
> It's a good idea especially for the core.
>
> Regards
> JB
>
> On 13/07/2018 09:51, Etienne Chauchot wrote:
> > Hi Udi,
> >
> > I also have a question, related to what Eugene asked : I see that the
> > code paths are the ones of the modules. Can we be more precise than that
> > to assign reviewers ? As an example, I added myself to runner/core
> > because I wanted to take a look at the PRs related to
> > runner/core/metrics but I'm getting assigned to all runner-core PRs. Can
> > we specify paths like
> > runners/core-java/src/main/java/org/apache/beam/runners/core/metrics ?
> > I know it is a bit too precise so a bit risky, but in that particular
> > case, I doubt that the path will change.
> >
> > Etienne
> >
> > Le jeudi 12 juillet 2018 à 16:49 -0700, Eugene Kirpichov a écrit :
> >> Hi Udi,
> >>
> >> I see that the PR was merged - thanks! However it seems to have some
> >> unintended effects.
> >>
> >> On my PR https://github.com/apache/beam/pull/5940 , I assigned a
> >> reviewer manually, but the moment I pushed a new commit, it
> >> auto-assigned a lot of other people to it, and I had to remove them.
> >> This seems like a big inconvenience to me, is there a way to disable
> this?
> >>
> >> Thanks.
> >>
> >> On Thu, Jul 12, 2018 at 2:53 PM Udi Meiri <eh...@google.com> wrote:
> >>> :/ That makes it a little less useful.
> >>>
> >>> On Thu, Jul 12, 2018 at 11:14 AM Tim Robertson <timrobertson...@gmail.com> wrote:
> >>>> Hi Udi
> >>>>
> >>>> I asked the GH helpdesk and they confirmed that only people with
> >>>> write access will actually be automatically chosen.
> >>>>
> >>>> I don't expect it should stop us using it, but we should be aware
> >>>> that there are non-committers also willing to review.
> >>>>
> >>>> Thanks,
> >>>> Tim
> >>>>
> >>>> On Thu, Jul 12, 2018 at 7:24 PM, Mikhail Gryzykhin <mig...@google.com> wrote:
> >>>>> Idea looks good in general.
> >>>>>
> >>>>> Did you look into ways to keep this file up-to-date? For example we
> >>>>> can run monthly job to see if owner was active during this period.
> >>>>>
> >>>>> --Mikhail
> >>>>>
> >>>>> Have feedback <http://go/migryz-feedback>?
> >>>>>
> >>>>>
> >>>>> On Thu, Jul 12, 2018 at 9:56 AM Udi Meiri <eh...@google.com> wrote:
> >>>>>> Thanks all!
> >>>>>> I'll try to get the file merged today and see how it works out.
> >>>>>> Please surface any issues, such as with auto-assignment, here or
> >>>>>> in JIRA.
> >>>>>>
> >>>>>> On Thu, Jul 12, 2018 at 2:12 AM Etienne Chauchot <echauc...@apache.org> wrote:
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> I added myself as a reviewer for some modules.
> >>>>>>>
> >>>>>>> Etienne
> >>>>>>>
> >>>>>>> Le lundi 09 juillet 2018 à 17:06 -0700, Udi Meiri a écrit :
> >>>>>>>> Hi everyone,
> >>>>>>>>
> >>>>>>>> I'm proposing to add auto-reviewer-assignment using Github's
> >>>>>>>> CODEOWNERS mechanism.
> >>>>>>>> Initial version is
> >>>>>>>> here: _https://github.com/apache/beam/pull/5909/files_
> >>>>>>>>
> >>>>>>>> I need help from the community in determining owners for each
> >>>>>>>> component.
> >>>>>>>> Feel free to directly edit the PR (if you have permission) or
> >>>>>>>> add a comment.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Background
> >>>>>>>> The idea is to:
> >>>>>>>> 1. Document good review candidates for each component.
> >>>>>>>> 2. Help choose reviewers using the auto-assignment mechanism.
> >>>>>>>> The suggestion is in no way binding.
> >>>>>>>>
> >>>>>>>>
> >>>>
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


smime.p7s
Description: S/MIME Cryptographic Signature


Re: CODEOWNERS for apache/beam repo

2018-07-13 Thread Udi Meiri
While I like the idea of having a CODEOWNERS file, the Github
implementation is lacking:
1. Reviewers are automatically assigned at each push.
2. Reviewer assignment can be excessive (e.g. 5 reviewers in Eugene's PR
5940).
3. Non-committers aren't assigned as reviewers.
4. Non-committers can't change the list of reviewers.

I propose renaming the file to disable the auto-reviewer assignment feature.
In its place I'll add a script that suggests reviewers.

On Fri, Jul 13, 2018 at 9:09 AM Udi Meiri  wrote:

> Hi Etienne,
>
> Yes you could be as precise as you want. The paths I listed are just
> suggestions. :)
>
>
> On Fri, Jul 13, 2018 at 1:12 AM Jean-Baptiste Onofré 
> wrote:
>
>> Hi,
>>
>> I think it's already do-able just providing the expected path.
>>
>> It's a good idea especially for the core.
>>
>> Regards
>> JB
>>
>> On 13/07/2018 09:51, Etienne Chauchot wrote:
>> > Hi Udi,
>> >
>> > I also have a question, related to what Eugene asked : I see that the
>> > code paths are the ones of the modules. Can we be more precise than that
>> > to assign reviewers ? As an example, I added myself to runner/core
>> > because I wanted to take a look at the PRs related to
>> > runner/core/metrics but I'm getting assigned to all runner-core PRs. Can
>> > we specify paths like
>> > runners/core-java/src/main/java/org/apache/beam/runners/core/metrics ?
>> > I know it is a bit too precise so a bit risky, but in that particular
>> > case, I doubt that the path will change.
>> >
>> > Etienne
>> >
>> > Le jeudi 12 juillet 2018 à 16:49 -0700, Eugene Kirpichov a écrit :
>> >> Hi Udi,
>> >>
>> >> I see that the PR was merged - thanks! However it seems to have some
>> >> unintended effects.
>> >>
>> >> On my PR https://github.com/apache/beam/pull/5940 , I assigned a
>> >> reviewer manually, but the moment I pushed a new commit, it
>> >> auto-assigned a lot of other people to it, and I had to remove them.
>> >> This seems like a big inconvenience to me, is there a way to disable
>> this?
>> >>
>> >> Thanks.
>> >>
>> >> On Thu, Jul 12, 2018 at 2:53 PM Udi Meiri <eh...@google.com> wrote:
>> >>> :/ That makes it a little less useful.
>> >>>
>> >>> On Thu, Jul 12, 2018 at 11:14 AM Tim Robertson <timrobertson...@gmail.com> wrote:
>> >>>> Hi Udi
>> >>>>
>> >>>> I asked the GH helpdesk and they confirmed that only people with
>> >>>> write access will actually be automatically chosen.
>> >>>>
>> >>>> I don't expect it should stop us using it, but we should be aware
>> >>>> that there are non-committers also willing to review.
>> >>>>
>> >>>> Thanks,
>> >>>> Tim
>> >>>>
>> >>>> On Thu, Jul 12, 2018 at 7:24 PM, Mikhail Gryzykhin <mig...@google.com> wrote:
>> >>>>> Idea looks good in general.
>> >>>>>
>> >>>>> Did you look into ways to keep this file up-to-date? For example we
>> >>>>> can run monthly job to see if owner was active during this period.
>> >>>>>
>> >>>>> --Mikhail
>> >>>>>
>> >>>>> Have feedback <http://go/migryz-feedback>?
>> >>>>>
>> >>>>>
>> >>>>> On Thu, Jul 12, 2018 at 9:56 AM Udi Meiri <eh...@google.com> wrote:
>> >>>>>> Thanks all!
>> >>>>>> I'll try to get the file merged today and see how it works out.
>> >>>>>> Please surface any issues, such as with auto-assignment, here or
>> >>>>>> in JIRA.
>> >>>>>>
>> >>>>>> On Thu, Jul 12, 2018 at 2:12 AM Etienne Chauchot <echauc...@apache.org> wrote:
>> >>>>>>> Hi,
>> >>>>>>>
>> >>>>>>> I added myself as a reviewer for some modules.
>> >>>>>>>
>> >>>>>>> Etienne
>> >>>>>>>
>> >>>>>>> Le lundi 09 juillet 2018 à 17:06 -0700, Udi Meiri a écrit :
>> >>>>>>>> Hi everyone,
>> >>>>>>>>
>> >>>>>>>> I'm proposing to add auto-reviewer-assignment using Github's
>> >>>>>>>> CODEOWNERS mechanism.
>> >>>>>>>> Initial version is
>> >>>>>>>> here: _https://github.com/apache/beam/pull/5909/files_
>> >>>>>>>>
>> >>>>>>>> I need help from the community in determining owners for each
>> >>>>>>>> component.
>> >>>>>>>> Feel free to directly edit the PR (if you have permission) or
>> >>>>>>>> add a comment.
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> Background
>> >>>>>>>> The idea is to:
>> >>>>>>>> 1. Document good review candidates for each component.
>> >>>>>>>> 2. Help choose reviewers using the auto-assignment mechanism.
>> >>>>>>>> The suggestion is in no way binding.
>> >>>>>>>>
>> >>>>>>>>
>> >>>>
>>
>> --
>> Jean-Baptiste Onofré
>> jbono...@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>
>


smime.p7s
Description: S/MIME Cryptographic Signature


Re: CODEOWNERS for apache/beam repo

2018-07-13 Thread Udi Meiri
Mention bot looks cool, as it tries to guess the reviewer using blame.
I've written a quick and dirty script that uses only CODEOWNERS.

Its output looks like:
$ python suggest_reviewers.py --pr 5940
INFO:root:Selected reviewer @lukecwik for:
/runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/PTransformMatchers.java
(path_pattern: /runners/core-construction-java*)
INFO:root:Selected reviewer @lukecwik for:
/runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/SplittableParDoNaiveBounded.java
(path_pattern: /runners/core-construction-java*)
INFO:root:Selected reviewer @echauchot for:
/runners/core-java/src/main/java/org/apache/beam/runners/core/SplittableParDoViaKeyedWorkItems.java
(path_pattern: /runners/core-java*)
INFO:root:Selected reviewer @lukecwik for: /runners/flink/build.gradle
(path_pattern: */build.gradle*)
INFO:root:Selected reviewer @lukecwik for:
/runners/flink/src/main/java/org/apache/beam/runners/flink/FlinkTransformOverrides.java
(path_pattern: *.java)
INFO:root:Selected reviewer @pabloem for:
/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowRunner.java
(path_pattern: /runners/google-cloud-dataflow-java*)
INFO:root:Selected reviewer @lukecwik for:
/sdks/java/core/src/test/java/org/apache/beam/sdk/transforms/SplittableDoFnTest.java
(path_pattern: /sdks/java/core*)
Suggested reviewers: @echauchot, @lukecwik, @pabloem

Script is in: https://github.com/apache/beam/pull/5951


What does the community think? Do you prefer blame-based or rules-based
reviewer suggestions?
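For the curious, the core of the rules-based approach can be sketched in a few lines. This is a simplification, not the actual suggest_reviewers.py; the patterns and handles are lifted from the sample output above:

```python
import fnmatch

# CODEOWNERS-style rules: (path_pattern, owner). As in CODEOWNERS,
# later rules take precedence over earlier ones.
RULES = [
    ('*.java', '@lukecwik'),
    ('*/build.gradle*', '@lukecwik'),
    ('/runners/core-java*', '@echauchot'),
    ('/runners/google-cloud-dataflow-java*', '@pabloem'),
]


def suggest_reviewers(changed_paths):
    """Suggest reviewers for a list of changed file paths."""
    reviewers = set()
    for path in changed_paths:
        # Iterate in reverse so the last matching rule wins.
        for pattern, owner in reversed(RULES):
            if fnmatch.fnmatch(path, pattern):
                reviewers.add(owner)
                break
    return sorted(reviewers)
```

Note that fnmatch's `*` matches across path separators, which is what makes a pattern like `/runners/core-java*` cover the whole subtree.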

On Fri, Jul 13, 2018 at 11:13 AM Holden Karau  wrote:

> I'm looking at something similar in the Spark project, and while it's now
> archived by FB it seems like something like
> https://github.com/facebookarchive/mention-bot might do what we want. I'm
> going to spin up a version on my K8 cluster and see if I can ask infra to
> add a webhook and if it works for Spark we could ask INFRA to add a second
> webhook for Beam. (Or if the Beam folks are more interested in
> experimenting I can do Beam first as a smaller project and roll with that).
>
> Let me know :)
>
> On Fri, Jul 13, 2018 at 10:53 AM, Eugene Kirpichov 
> wrote:
>
>> Sounds reasonable for now, thanks!
>> It's unfortunate that Github's CODEOWNERS feature appears to be
>> effectively unusable for Beam but I'd hope that Github might pay attention
>> and fix things if we submit feedback, with us being one of the most active
>> Apache projects - did anyone do this yet / planning to?
>>
>> On Fri, Jul 13, 2018 at 10:23 AM Udi Meiri  wrote:
>>
>>> While I like the idea of having a CODEOWNERS file, the Github
>>> implementation is lacking:
>>> 1. Reviewers are automatically assigned at each push.
>>> 2. Reviewer assignment can be excessive (e.g. 5 reviewers in Eugene's PR
>>> 5940).
>>> 3. Non-committers aren't assigned as reviewers.
>>> 4. Non-committers can't change the list of reviewers.
>>>
>>> I propose renaming the file to disable the auto-reviewer assignment
>>> feature.
>>> In its place I'll add a script that suggests reviewers.
>>>
>>> On Fri, Jul 13, 2018 at 9:09 AM Udi Meiri  wrote:
>>>
>>>> Hi Etienne,
>>>>
>>>> Yes you could be as precise as you want. The paths I listed are just
>>>> suggestions. :)
>>>>
>>>>
>>>> On Fri, Jul 13, 2018 at 1:12 AM Jean-Baptiste Onofré 
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I think it's already do-able just providing the expected path.
>>>>>
>>>>> It's a good idea especially for the core.
>>>>>
>>>>> Regards
>>>>> JB
>>>>>
>>>>> On 13/07/2018 09:51, Etienne Chauchot wrote:
>>>>> > Hi Udi,
>>>>> >
>>>>> > I also have a question, related to what Eugene asked : I see that the
>>>>> > code paths are the ones of the modules. Can we be more precise than
>>>>> that
>>>>> > to assign reviewers ? As an example, I added myself to runner/core
>>>>> > because I wanted to take a look at the PRs related to
>>>>> > runner/core/metrics but I'm getting assigned to all runner-core PRs.
>>>>> Can
>>>>> > we specify paths like
>>>>> > runners/core-java/src/main/java/org/apache/beam/runners/core/metrics
>>>>> ?
>>>>> > I know it is a bit too precise so a bit risky, but in that particular
>>>>> > case, I doubt that the path will change.

Re: Jenkins slowness

2018-03-09 Thread Udi Meiri
Is there a way to see OS load statistics?

On Fri, Mar 9, 2018 at 10:24 AM Jason Kuster <jasonkus...@google.com> wrote:

> Jenkins executors are probably less beefy than your development machine
> and also have two slots, thus are likely running two builds at once,
> causing extra slowness.
>
>
> On Fri, Mar 9, 2018 at 10:18 AM Udi Meiri <eh...@google.com> wrote:
>
>> Hi,
>>
>> Does anybody know why Jenkins hosts take so long to run? For example,
>> beam1 was running beam_PostCommit_Python_Verify and I saw this time for
>> running "tox -e py27":
>>
>> Ran 1535 tests in 403.860s
>>
>> on my workstation I got:
>>
>> Ran 1535 tests in 160.242s
>>
>> Is there any way to troubleshoot this? Each run takes about an hour to
>> run, not including time waiting in the queue.
>>
>
>
> --
> ---
> Jason Kuster
> Apache Beam / Google Cloud Dataflow
>
> See something? Say something. go/jasonkuster-feedback
> <https://goto.google.com/jasonkuster-feedback>
>


smime.p7s
Description: S/MIME Cryptographic Signature


Jenkins slowness

2018-03-09 Thread Udi Meiri
Hi,

Does anybody know why Jenkins hosts take so long to run? For example, beam1
was running beam_PostCommit_Python_Verify and I saw this time for running
"tox -e py27":

Ran 1535 tests in 403.860s

on my workstation I got:

Ran 1535 tests in 160.242s

Is there any way to troubleshoot this? Each run takes about an hour to run,
not including time waiting in the queue.


smime.p7s
Description: S/MIME Cryptographic Signature


Python incremental testing

2018-04-19 Thread Udi Meiri
I cooked up a quick Bazel proof of concept to run Python tests. Bazel's
advantage over Nosetests is that it caches results and does dependency
tracking on tests to know when to rerun them. (I am not trying to replace
Gradle.)

The config is almost working (121 out of 123 test files work, 2 have
import errors that I don't understand). I'd appreciate it if anyone knows
what's going wrong.

Code: branch , commit


The idea was to use Tox to set up the virtual environment and to use
  bazel test //apache_beam/...
to run the tests in each environment (instead of "python setup.py
nosetests").


smime.p7s
Description: S/MIME Cryptographic Signature


Python postcommit and precommit

2018-03-30 Thread Udi Meiri
Hi,

I noticed that Python precommit runs using this command:
  mvn clean install -pl sdks/python -am -amd
while postcommit invocation is simply a bash script:
  bash sdks/python/run_postcommit.sh

Both run unit tests via Tox, however since the runtime environment setup is
configured in different files (pom.xml vs shell script), they don't always
agree in their results (precommit is currently succeeded while postcommit
is failing).

So my naive question is: why does Python precommit run via Maven/Gradle?
Could we not just use a script like run_postcommit.sh?

(Side note: there's a lot of code/config duplication, such as: pypi package
versions, *.c, *.so, etc. cleanup)


smime.p7s
Description: S/MIME Cryptographic Signature


Jenkins wait times

2018-03-22 Thread Udi Meiri
Hi,
I've been seeing increased wait times on Jenkins. It's frustrating to wait
8h <https://builds.apache.org/job/beam_PreCommit_Java_GradleBuild/3265/>
for a build, or 4h
<https://builds.apache.org/job/beam_PreCommit_Python_MavenInstall/3945/>
for it just to schedule.

Data point:
https://builds.apache.org/job/beam_PreCommit_Java_GradleBuild/buildTimeTrend
These builds seem to be timing out after 4 hours. Can we reduce the timeout
to just 1 hour?


smime.p7s
Description: S/MIME Cryptographic Signature


Pubsub API feedback

2018-03-19 Thread Udi Meiri
Hi,
I wanted to get feedback about the upcoming Python Pubsub API. It is
currently experimental and only supports reading and writing UTF-8 strings.
My current proposal only concerns reading from Pubsub.

Classes:
- PubsubMessage: encapsulates Pubsub message payload and attributes.

PTransforms:
- ReadMessagesFromPubSub: Outputs elements of type ``PubsubMessage``.

- ReadPayloadsFromPubSub: Outputs elements of type ``str``.

- ReadStringsFromPubSub: Outputs elements of type ``unicode``, decoded from
UTF-8.

Description of common PTransform arguments:
  topic: Cloud Pub/Sub topic in the form
"projects/<project>/topics/<topic>".
If provided, subscription must be None.
  subscription: Existing Cloud Pub/Sub subscription to use in the
form "projects/<project>/subscriptions/<subscription>". If not
specified,
a temporary subscription will be created from the specified topic. If
provided, topic must be None.
  id_label: The attribute on incoming Pub/Sub messages to use as a unique
record identifier. When specified, the value of this attribute (which
can be any string that uniquely identifies the record) will be used for
deduplication of messages. If not provided, we cannot guarantee
that no duplicate data will be delivered on the Pub/Sub stream. In this
case, deduplication of the stream will be strictly best effort.
  timestamp_attribute: Message value to use as element timestamp. If None,
uses message publishing time as the timestamp.
Timestamp values should be in one of two formats:
- A numerical value representing the number of milliseconds since the
Unix
  epoch.
- A string in RFC 3339 format. For example,
  2015-10-29T23:41:41.123Z. The sub-second component of the
  timestamp is optional, and digits beyond the first three (i.e., time
units
  smaller than milliseconds) will be ignored.
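As an illustrative sketch only (not the SDK's implementation), handling a timestamp_attribute value in the two formats above could look like:

```python
import datetime


def parse_timestamp_attribute(value):
    """Parse a timestamp attribute value into a UTC datetime.

    Accepts either milliseconds since the Unix epoch or an RFC 3339
    string; a sketch of the two formats described above.
    """
    try:
        millis = int(value)
        return (datetime.datetime(1970, 1, 1) +
                datetime.timedelta(milliseconds=millis))
    except ValueError:
        pass
    # RFC 3339 with an optional sub-second component. strptime's %f
    # accepts up to six fractional digits.
    for fmt in ('%Y-%m-%dT%H:%M:%S.%fZ', '%Y-%m-%dT%H:%M:%SZ'):
        try:
            return datetime.datetime.strptime(value, fmt)
        except ValueError:
            continue
    raise ValueError('Unrecognized timestamp format: %r' % value)
```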

Code:
https://github.com/udim/beam/blob/b981dd618e9e1f667597eec2a91c7265a389c405/sdks/python/apache_beam/io/gcp/pubsub.py
PR: https://github.com/apache/beam/pull/4901


smime.p7s
Description: S/MIME Cryptographic Signature


debugging python jenkins precommit

2018-03-21 Thread Udi Meiri
Hi,
I'm trying to debug a jenkins precommit error for PR #4877.
(Side rant: It's taking a long time to run precommits (between 24m and 3h),
and I don't have access to the jenkins VM to debug things locally.)

Partial log:
py27-cython create:
/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_MavenInstall/src/sdks/python/target/.tox/py27-cython
py27-cython installdeps: cython==0.26.1
ERROR: invocation failed (errno 2), args:
['/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_MavenInstall/src/sdks/python/target/.tox/py27-cython/bin/pip',
'install', 'cython==0.26.1'], cwd:
/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_MavenInstall/src/sdks/python
[a long traceback]
OSError: [Errno 2] No such file or directory
https://builds.apache.org/job/beam_PreCommit_Python_MavenInstall/3878/console

I can run `mvn install -pl sdks/python -am -amd` on my workstation
and it passes (after manually invoking `python setup.py build`).

The PR in question makes a lot of changes to tox.ini and updates the cython
version.
Any help would be appreciated.


smime.p7s
Description: S/MIME Cryptographic Signature


Re: Jenkins wait times

2018-03-22 Thread Udi Meiri
https://github.com/apache/beam/pull/4942 for anyone following.

On Thu, Mar 22, 2018 at 10:45 AM Lukasz Cwik <lc...@google.com> wrote:

> The maximum passing runtime was 70 mins in Gradle, so it seems reasonable
> to set it to 90 mins.
> For Maven, its about 130 mins so setting it to 150 mins is also reasonable.
>
> It turns out that ~1/8 in builds were aborted due to taking too long in
> both Maven / Java due to tests getting stuck.
> Would be worthwhile to set a maximum runtime per test to prevent this.
>
> Feel free to open a PR and send it my way.
>
>
> On Thu, Mar 22, 2018 at 10:14 AM Udi Meiri <eh...@google.com> wrote:
>
>> Hi,
>> I've been seeing increased wait times on Jenkins. It's frustrating to
>> wait 8h
>> <https://builds.apache.org/job/beam_PreCommit_Java_GradleBuild/3265/>
>> for a build, or 4h
>> <https://builds.apache.org/job/beam_PreCommit_Python_MavenInstall/3945/>
>> for it just to schedule.
>>
>> Data point:
>>
>> https://builds.apache.org/job/beam_PreCommit_Java_GradleBuild/buildTimeTrend
>> These builds seem to be timing out after 4 hours. Can we reduce the
>> timeout to just 1 hour?
>>
>>
>>


smime.p7s
Description: S/MIME Cryptographic Signature


Re: [DISCUSS] Automation for Java code formatting

2018-06-28 Thread Udi Meiri
Python already has lint checks: :beam-sdks-python:lint
There are autoformatting tools, which we don't use AFAIK:
https://github.com/myint/autoflake
https://github.com/hhatto/autopep8


On Thu, Jun 28, 2018 at 12:31 AM Daniel Kulp  wrote:

>
>
> On Jun 28, 2018, at 6:01 AM, Kenneth Knowles  wrote:
>
> So, there are a few modes of checkstyle failure that can be induced by
> this:
>
>  - lines get too long because of wrap & indent logic
>  - a lot of our HTML is actually already broken and doesn't have <p> tags
> where it should; spotless adds a blank line which makes checkstyle notice
> the errors
>
> I think that autoformat outweighs these. I propose that we comment out
> these checkstyle rules, turn on autoformat, then gradually restore the
> rules. I have adjusted my pull request accordingly.
>
>
> Yea… when I was adding all the whitespace rules to checkstyle config, I
> looked at the spotless stuff and could not find a set of rules for
> checkstyle that would 100% meet the spotless output.   I fully expect to
> need to disable a few of the rules.   However, if spotless is run
> automatically so the code is always formatted correctly, checkstyle is
> redundant.
>
>
> --
> Daniel Kulp
> dk...@apache.org - http://dankulp.com/blog
> Talend Community Coder - http://coders.talend.com
>
>


smime.p7s
Description: S/MIME Cryptographic Signature


Python SDK: pytest vs nose

2018-06-28 Thread Udi Meiri
Hi,
I'm currently leaning towards migrating us to pytest, since it has features
like --last-failed, is actively maintained, and seems more polished in
general.

(Pytest is supposed to be able to run nose tests, but not all our tests
work with it out of the box.)

Does anyone have a preference for one or the other?

- Udi


smime.p7s
Description: S/MIME Cryptographic Signature


Re: Python SDK: pytest vs nose

2018-06-28 Thread Udi Meiri
https://issues.apache.org/jira/browse/BEAM-3713

On Thu, Jun 28, 2018 at 3:35 PM Udi Meiri  wrote:

> Hi,
> I'm currently leaning towards migrating us to pytest, since it has
> features like --last-failed, is actively maintained, and seems more
> polished in general.
>
> (Pytest is supposed to be able to run nose tests, but not all our tests
> work with it out of the box.)
>
> Does anyone have a preference for one or the other?
>
> - Udi
>


smime.p7s
Description: S/MIME Cryptographic Signature


Re: python post-commit failures

2018-10-05 Thread Udi Meiri
More details in https://issues.apache.org/jira/browse/BEAM-5442

On Fri, Oct 5, 2018 at 10:26 AM Udi Meiri  wrote:

> I'm seeing these errors at least in one test:
> "Python sdk harness failed:
> Traceback (most recent call last):
>   File
> "/usr/local/lib/python2.7/dist-packages/apache_beam/runners/worker/sdk_worker_main.py",
> line 133, in main
> sdk_pipeline_options.get_all_options(drop_default=True))
>   File
> "/usr/local/lib/python2.7/dist-packages/apache_beam/options/pipeline_options.py",
> line 224, in get_all_options
> parser.add_argument(arg.split('=', 1)[0], nargs='?')
>   File "/usr/lib/python2.7/argparse.py", line 1308, in add_argument
> return self._add_action(action)
>   File "/usr/lib/python2.7/argparse.py", line 1682, in _add_action
> self._optionals._add_action(action)
>   File "/usr/lib/python2.7/argparse.py", line 1509, in _add_action
> action = super(_ArgumentGroup, self)._add_action(action)
>   File "/usr/lib/python2.7/argparse.py", line 1322, in _add_action
> self._check_conflict(action)
>   File "/usr/lib/python2.7/argparse.py", line 1460, in _check_conflict
> conflict_handler(action, confl_optionals)
>   File "/usr/lib/python2.7/argparse.py", line 1467, in
> _handle_conflict_error
> raise ArgumentError(action, message % conflict_string)
> ArgumentError: argument --beam_plugins: conflicting option string(s):
> --beam_plugins"
>
> This looks like https://github.com/apache/beam/pull/6557
>
> On Fri, Oct 5, 2018 at 9:41 AM Boyuan Zhang  wrote:
>
>> Seems like tests failed:
>> test_leader_board_it
>> (apache_beam.examples.complete.game.leader_board_it_test.LeaderBoardIT) ->
>> Bigquery table not found
>> test_game_stats_it
>> (apache_beam.examples.complete.game.game_stats_it_test.GameStatsIT) ->
>> Bigquery table not found
>> streaming related tests -> Assertion errors
>>
>> On Fri, Oct 5, 2018 at 9:33 AM Udi Meiri  wrote:
>>
>>> I'm seeing post-commit failures in :beam-sdks-python:postCommitITTests:
>>> https://builds.apache.org/job/beam_PostCommit_Python_Verify/6181/console
>>> https://builds.apache.org/job/beam_PostCommit_Python_Verify/6182/console
>>>
>>>
>>>


smime.p7s
Description: S/MIME Cryptographic Signature


python post-commit failures

2018-10-05 Thread Udi Meiri
I'm seeing post-commit failures in :beam-sdks-python:postCommitITTests:
https://builds.apache.org/job/beam_PostCommit_Python_Verify/6181/console
https://builds.apache.org/job/beam_PostCommit_Python_Verify/6182/console


smime.p7s
Description: S/MIME Cryptographic Signature


Re: python post-commit failures

2018-10-05 Thread Udi Meiri
I'm seeing these errors at least in one test:
"Python sdk harness failed:
Traceback (most recent call last):
  File
"/usr/local/lib/python2.7/dist-packages/apache_beam/runners/worker/sdk_worker_main.py",
line 133, in main
sdk_pipeline_options.get_all_options(drop_default=True))
  File
"/usr/local/lib/python2.7/dist-packages/apache_beam/options/pipeline_options.py",
line 224, in get_all_options
parser.add_argument(arg.split('=', 1)[0], nargs='?')
  File "/usr/lib/python2.7/argparse.py", line 1308, in add_argument
return self._add_action(action)
  File "/usr/lib/python2.7/argparse.py", line 1682, in _add_action
self._optionals._add_action(action)
  File "/usr/lib/python2.7/argparse.py", line 1509, in _add_action
action = super(_ArgumentGroup, self)._add_action(action)
  File "/usr/lib/python2.7/argparse.py", line 1322, in _add_action
self._check_conflict(action)
  File "/usr/lib/python2.7/argparse.py", line 1460, in _check_conflict
conflict_handler(action, confl_optionals)
  File "/usr/lib/python2.7/argparse.py", line 1467, in
_handle_conflict_error
raise ArgumentError(action, message % conflict_string)
ArgumentError: argument --beam_plugins: conflicting option string(s):
--beam_plugins"

This looks like https://github.com/apache/beam/pull/6557
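For reference, the ArgumentError is easy to reproduce in isolation. This sketch assumes, as the traceback suggests, that the same flag ends up being registered twice:

```python
import argparse

# Minimal reproduction: argparse's default conflict handler raises
# ArgumentError when the same option string is added a second time.
parser = argparse.ArgumentParser()
parser.add_argument('--beam_plugins', nargs='?')

try:
    parser.add_argument('--beam_plugins', nargs='?')  # second registration
    error_message = None
except argparse.ArgumentError as err:
    error_message = str(err)
```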

On Fri, Oct 5, 2018 at 9:41 AM Boyuan Zhang  wrote:

> Seems like tests failed:
> test_leader_board_it
> (apache_beam.examples.complete.game.leader_board_it_test.LeaderBoardIT) ->
> Bigquery table not found
> test_game_stats_it
> (apache_beam.examples.complete.game.game_stats_it_test.GameStatsIT) ->
> Bigquery table not found
> streaming related tests -> Assertion errors
>
> On Fri, Oct 5, 2018 at 9:33 AM Udi Meiri  wrote:
>
>> I'm seeing post-commit failures in :beam-sdks-python:postCommitITTests:
>> https://builds.apache.org/job/beam_PostCommit_Python_Verify/6181/console
>> https://builds.apache.org/job/beam_PostCommit_Python_Verify/6182/console
>>
>>
>>




Re: python post-commit failures

2018-10-05 Thread Udi Meiri
I was sure that we ran some basic Dataflow integration tests in pre-commit,
and that they should have caught this issue.
But then I remembered that we only have those in Java SDK.
I opened this bug to add end-to-end tests to Python pre-commits as well:
https://issues.apache.org/jira/browse/BEAM-5058

The same goes for Flink: we should be running wordcount and
wordcount_streaming integration tests as part of pre-commit tests.

On Fri, Oct 5, 2018 at 1:37 PM Thomas Weise  wrote:

> Fixed. Can someone please take a look at the usage of the --beam_plugins
> flag in the Dataflow runner so that we can address the root cause?
>
> We can probably do more to avoid Friday Python post commit excitement. In
> this case, extra checking was done pre-merge by running the Python VR tests
> for Flink, but the failure occurred with the Dataflow runner.
>
> The changes were pipeline options related, so (pre-existing) test coverage
> should have been better.
>
> But beyond that, we can probably make it easier for contributors and
> reviewers to know what extra checks are available and possibly appropriate
> to run pre-commit. Should we add some pointers to
> https://beam.apache.org/contribute/testing/#pre-commit or is there a
> better place?
>
> Thanks
>
>
>
>
> On Fri, Oct 5, 2018 at 10:38 AM Udi Meiri  wrote:
>
>> More details in https://issues.apache.org/jira/browse/BEAM-5442
>>
>> On Fri, Oct 5, 2018 at 10:26 AM Udi Meiri  wrote:
>>
>>> I'm seeing these errors at least in one test:
>>> "Python sdk harness failed:
>>> Traceback (most recent call last):
>>>   File
>>> "/usr/local/lib/python2.7/dist-packages/apache_beam/runners/worker/sdk_worker_main.py",
>>> line 133, in main
>>> sdk_pipeline_options.get_all_options(drop_default=True))
>>>   File
>>> "/usr/local/lib/python2.7/dist-packages/apache_beam/options/pipeline_options.py",
>>> line 224, in get_all_options
>>> parser.add_argument(arg.split('=', 1)[0], nargs='?')
>>>   File "/usr/lib/python2.7/argparse.py", line 1308, in add_argument
>>> return self._add_action(action)
>>>   File "/usr/lib/python2.7/argparse.py", line 1682, in _add_action
>>> self._optionals._add_action(action)
>>>   File "/usr/lib/python2.7/argparse.py", line 1509, in _add_action
>>> action = super(_ArgumentGroup, self)._add_action(action)
>>>   File "/usr/lib/python2.7/argparse.py", line 1322, in _add_action
>>> self._check_conflict(action)
>>>   File "/usr/lib/python2.7/argparse.py", line 1460, in _check_conflict
>>> conflict_handler(action, confl_optionals)
>>>   File "/usr/lib/python2.7/argparse.py", line 1467, in
>>> _handle_conflict_error
>>> raise ArgumentError(action, message % conflict_string)
>>> ArgumentError: argument --beam_plugins: conflicting option string(s):
>>> --beam_plugins"
>>>
>>> This looks like https://github.com/apache/beam/pull/6557
>>>
>>> On Fri, Oct 5, 2018 at 9:41 AM Boyuan Zhang  wrote:
>>>
>>>> Seems like tests failed:
>>>> test_leader_board_it
>>>> (apache_beam.examples.complete.game.leader_board_it_test.LeaderBoardIT) ->
>>>> Bigquery table not found
>>>> test_game_stats_it
>>>> (apache_beam.examples.complete.game.game_stats_it_test.GameStatsIT) ->
>>>> Bigquery table not found
>>>> streaming related tests -> Assertion errors
>>>>
>>>> On Fri, Oct 5, 2018 at 9:33 AM Udi Meiri  wrote:
>>>>
>>>>> I'm seeing post-commit failures in :beam-sdks-python:postCommitITTests:
>>>>>
>>>>> https://builds.apache.org/job/beam_PostCommit_Python_Verify/6181/console
>>>>>
>>>>> https://builds.apache.org/job/beam_PostCommit_Python_Verify/6182/console
>>>>>
>>>>>
>>>>>




Re: [DISCUSS] - Separate JIRA notifications to a new mailing list

2018-10-11 Thread Udi Meiri
+1 to split JIRA notifications

On Thu, Oct 11, 2018 at 9:13 AM Kenneth Knowles  wrote:

>
> On Thu, Oct 11, 2018 at 9:10 AM Mikhail Gryzykhin <
> gryzykhin.mikh...@gmail.com> wrote:
>
>> +1.
>> Should we separate Jenkins notifications as well?
>>
>
> I'm worried this question will get buried in the thread. Would you mind
> separating it into another thread if you would like to discuss?
>
> Kenn
>
>
>
 >> On Thu, Oct 11, 2018, 08:59 Scott Wegner wrote:
>>> +1, commits@ is too noisy to be useful currently.
>>>
>>> On Thu, Oct 11, 2018 at 8:04 AM Maximilian Michels 
>>> wrote:
>>>
 +1

 I guess most people already have filters in place to separate commits
 and JIRA issues. JIRA notifications really don't belong on the commits list.

 On 11.10.18 15:53, Kenneth Knowles wrote:
 > +1
 >
 > I've suggested the same. Canonical.
 >
 > On Thu, Oct 11, 2018, 06:19 Thomas Weise wrote:
 >
 > +1
 >
 >
 > On Thu, Oct 11, 2018 at 6:18 AM Etienne Chauchot wrote:
 >
 > +1 for me also, my gmail filters list is kind of overflowed :)
 >
 > Etienne
 >
 > On Thursday, 11 October 2018 at 14:44 +0200, Robert Bradshaw wrote:
 >> Huge +1 from me too.
 >> On Thu, Oct 11, 2018 at 2:42 PM Jean-Baptiste Onofré <
 j...@nanthrax.net > wrote:
 >>
 >> +1
 >>
 >> We are doing the same in Karaf as well.
 >>
 >> Regards
 >> JB
 >>
 >> On 11/10/2018 14:35, Colm O hEigeartaigh wrote:
 >> Hi all,
 >>
 >> Apologies in advance if this has already been discussed (and
 rejected).
 >> I was wondering if it would be a good idea to create a new
 mailing list
 >> and divert the JIRA notifications to it? Currently
 >> "comm...@beam.apache.org" receives both
 >> the git and JIRA notifications, and has a huge volume of
 traffic as a
 >> result.
 >>
 >> Separating JIRA notifications from commit messages would
 allow users to
 >> subscribe to whichever are of interest without having to
 write a mail
 >> filter if e.g. they are not interested in JIRA
 notifications. It also
 >> seems a bit unintuitive to me to expect JIRA notifications
 to go to an
 >> email list called "commits".
 >>
 >> As a reference point - Apache CXF maintains a "commits" list
 for git
 >> notifications and "issues" for JIRA notifications:
 >>
 >> http://cxf.apache.org/mailing-lists.html
 >>
 >> Thanks!
 >>
 >> Colm.
 >>
 >> --
 >> Colm O hEigeartaigh
 >>
 >> Talend Community Coder
 >> http://coders.talend.com
 >>
 >> --
 >> Jean-Baptiste Onofré
 >> jbono...@apache.org 
 >> http://blog.nanthrax.net
 >> Talend - http://www.talend.com
 >

>>>
>>>
>>> --
>>>
>>>
>>>
>>>
>>> Got feedback? tinyurl.com/swegner-feedback
>>>
>>




Re: [DISCUSS] Gradle for the build ?

2018-10-11 Thread Udi Meiri
I agree with the points made that our Gradle configuration is too
complicated (learning Groovy and BeamModulePlugin come to mind).

But I mainly write for the Python SDK, and I've done experiments in the past
with Bazel builds, which were a much better fit for Python as an incremental
and parallel build and test system.

On Thu, Oct 11, 2018 at 4:56 AM Alexey Romanenko 
wrote:

> I agree with all previous points related to Gradle-related problems. Among
> all others, I’d highlight IDE integration issue since it can have real
> negative impact on attracting new contributors and, finally, community
> grow. The more obstacles there are to start code contribution, the more
> chances that it will distract newcomers.
>
> In this way I see two points to do:
> 1) Continue working on IDE integration improvement. I hope there are no
> principal barriers for that and it’s only a question of applied forces,
> otherwise, it can be a very serious issue.
> 2) Improve our “Get started” documentation for contributors to make the
> process of first-time contribution as smooth and simple as possible.
>
>
> On 11 Oct 2018, at 00:22, Tim Robertson  wrote:
>
> Thank you JB for starting this discussion.
>
> Others comment on many of these points far better than I can, but my
> experience is similar to JB.
>
> 1. IDEA integration (and laptop slowing like crazy) being the biggest
> contributor to my feeling of being unproductive
> 2. Not knowing the correct way to modify the build scripts which I put
> down to my own limitations
>
> It seems we also need to help build Gradle expertise in our community, so
>> that those that are motivated are empowered to contribute.
>
>
> Nicely phrased. +1
>
>
>
> On Wed, Oct 10, 2018 at 7:15 PM Scott Wegner  wrote:
>
>> > Perhaps we should go through and prioritize (and add missing items to)
>> BEAM-4045
>>
>> +1. It's hard to know where to start when there's such a laundry list of
>> tasks. If you're having build issues, will you make sure it is represented
>> in BEAM-4045, and "Vote" for the issues that you believe are the highest
>> priority?
>>
>> I agree that the Gradle build is far from perfect (my top gripes are IDE
>> integration and parallel/incremental build support). I believe that we're
>> capable of making our build great, and continuing our investment in Gradle
>> would be a shorter path than changing course again. Remember that our Maven
>> build also had its share of issues, which is why we as a community voted
>> to replace it [1][2].
>>
>> It seems we also need to help build Gradle expertise in our community, so
>> that those that are motivated are empowered to contribute. Does anybody
>> have a good "Getting Started with Gradle" guide they recommend? Perhaps we
>> could also link to it from the website/wiki.
>>
>> [1]
>> https://lists.apache.org/thread.html/225dddcfc78f39bbb296a0d2bbef1caf37e17677c7e5573f0b6fe253@%3Cdev.beam.apache.org%3E
>> [2]
>> https://lists.apache.org/thread.html/bd399ecb17cd211be7c6089b562c09ba9116649c9eabe3b609606a3b@%3Cdev.beam.apache.org%3E
>>
>> On Wed, Oct 10, 2018 at 2:40 AM Robert Bradshaw 
>> wrote:
>>
>>> Some rough stats (because I was curious): The gradle files have been
>>> edited by ~79 unique contributors over 696 distinct commits, whereas the
>>> maven ones were edited (over a longer time period) by ~130 unique
>>> contributors over 1389 commits [1]. This doesn't capture how much effort
>>> was put into these edits, but neither is restricted to a small set of
>>> experts.
>>>
>>> Regarding "friendly for other languages" I don't think either is
>>> necessarily easy to learn, but my impression is that the maven learning
>>> curve is shallower for those already firmly embedded in the Java ecosystem
>>> (perhaps due to leveraging existing familiarity, and perhaps some due to
>>> the implicit java-centric conventions that maven assumed about your
>>> project), whereas with gradle at least I could keep pulling on the string
>>> to unwind things to the bottom. The "I just want to build/test X without
>>> editing/viewing the build files" seemed more natural with Gradle (e.g. I
>>> can easily list all tasks).
>>>
>>> That being said, I don't think everyone needs to understand the full
>>> build system. It's important that there be a critical mass that do (we have
>>> that for both, and if we can simplify to improve this that'd be great),
>>> it's easy enough to do basic changes (e.g. add a dependency, again I don't
>>> think the barrier is sufficiently different for either), and works well out
>>> of the box for someone who just wants to look up a command on the website
>>> and edit code (the CLI is an improvement with Gradle, but it's clear that
>>> (java) IDE support is a significant regression).
>>>
>>> Personally, I don't know much about IDE configuration (admittedly the
>>> larger issue), but one action item I can take on is trying to eliminate the
>>> need to do a "git clean" after building certain targets (assuming I can
>>> reproduce this).
>>>

post-commit failure emails

2018-10-11 Thread Udi Meiri
Hi,
https://github.com/apache/beam/pull/6635 is an attempt to notify commit
authors if their commit is suspected to have broken a post-commit test.

I'd like to get some feedback on this feature.
Is it accurate?
Is it spammy?

Thanks!




Re: post-commit failure emails

2018-10-11 Thread Udi Meiri
The email trigger is set up to fire on the first failure (after the last
green build, I presume), and to send an email to the "suspects" of the first
failing build (its authors and build triggerers).
If a new job's first build fails, it should send an email, but only to
suspects.

https://github.com/jenkinsci/email-ext-plugin/blob/master/src/main/java/hudson/plugins/emailext/plugins/trigger/FirstFailureTrigger.java
https://github.com/jenkinsci/email-ext-plugin/blob/9012fcd1b9b2040f2d34c2723198a751eb81cce5/src/main/java/hudson/plugins/emailext/plugins/recipients/FirstFailingBuildSuspectsRecipientProvider.java#L100
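For reference, the rule those links implement — notify only the suspects of the first failing build after the last green one — can be sketched as follows (the data shapes are illustrative, not the plugin's API):

```python
# Illustrative sketch of the "first failure" rule: only the authors of the
# first failing build after the most recent green build are notified.
def first_failure_suspects(builds):
    """builds: oldest-first list of (passed, authors) tuples.

    Returns the authors of the first failing build after the most recent
    green build, i.e. the people a FirstFailureTrigger-style rule would email.
    """
    suspects = set()
    prev_passed = True
    for passed, authors in builds:
        if not passed and prev_passed:
            suspects = set(authors)  # a fresh breakage: record its suspects
        prev_passed = passed
    return suspects

history = [(True, {'alice'}), (False, {'bob'}), (False, {'carol'})]
print(first_failure_suspects(history))  # {'bob'}
```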

On Thu, Oct 11, 2018 at 2:48 PM Andrew Pilloud  wrote:

> There are numerous reports on this list of Jenkins emails being spammy on
> new jobs. The list of emails generated on the first run is everyone who has
> ever had a commit merged to Beam. I don't suppose that would be a problem
> if this doesn't get turned on until at least one run has passed on each job.
>
> Andrew
>
> On Thu, Oct 11, 2018 at 2:44 PM Udi Meiri  wrote:
>
>> Hi,
>> https://github.com/apache/beam/pull/6635 is an attempt to notify commit
>> authors if their commit is suspected to have broken a post-commit test.
>>
>> I'd like to get some feedback on this feature.
>> Is it accurate?
>> Is it spammy?
>>
>> Thanks!
>>
>




[PROPOSAL] Using Bazel and Docker for Python SDK development and tests

2018-10-15 Thread Udi Meiri
Hi,

In light of increasing Python pre-commit times due to the added Python 3
tests,
I thought it might be time to re-evaluate the tools used for Python tests
and development, and propose an alternative.

Currently, we use nosetests, tox, and virtualenv for testing.
The proposal is to use Bazel, which I believe can replace the above tools
while adding:
- parallel testing: each target has its own build directory, providing
isolation from build artifacts such as from Cython
- incremental testing: it is possible to precisely define each test's
dependencies

There's also a requirement to test against specific Python versions, such
as 2.7 and 3.4.
This could be done using docker containers having the precise version of
interpreter and Bazel.

In summary:
Bazel could replace the need for virtualenv, tox, and nosetests.
The addition of Docker images would allow testing against specific Python
versions.
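For illustration, a Bazel BUILD file for one module might look like this (Starlark, not runnable Python; target names and paths are hypothetical):

```python
# Hypothetical BUILD file sketch. Declaring per-target sources and deps is
# what gives Bazel the information it needs for incremental and parallel
# test runs.
py_library(
    name = "pipeline_options",
    srcs = ["apache_beam/options/pipeline_options.py"],
)

py_test(
    name = "pipeline_options_test",
    srcs = ["apache_beam/options/pipeline_options_test.py"],
    deps = [":pipeline_options"],
)
```

With files like these in place, `bazel test //...` would rebuild and rerun only the targets whose declared inputs changed.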




Re: [PROPOSAL] Using Bazel and Docker for Python SDK development and tests

2018-10-17 Thread Udi Meiri
On Wed, Oct 17, 2018 at 1:38 AM Robert Bradshaw  wrote:

> On Tue, Oct 16, 2018 at 12:48 AM Udi Meiri  wrote:
>
>> Hi,
>>
>> In light of increasing Python pre-commit times due to the added Python 3
>> tests,
>> I thought it might be time to re-evaluate the tools used for Python tests
>> and development, and propose an alternative.
>>
>> Currently, we use nosetests, tox, and virtualenv for testing.
>> The proposal is to use Bazel, which I believe can replace the above tools
>> while adding:
>> - parallel testing: each target has its own build directory,
>>
>
> We could look at detox and/or retox again to get parallel testing (though
> not, possibly, at such a low level). We tried detox for a while, but there
> were issues debugging timeouts (specifically, it buffered the stdout while
> testing to avoid multiplexing it, but that meant little info when a test
> actually timed out on jenkins).
>
> We could alternatively look into leveraging gradle's within-project
> paralleliziaton to speed this up. It is a pain that right now every Python
> test is run sequentially.
>
I don't believe that Gradle has an easy solution. The only within-project
parallelization I can find requires using the Worker API
<https://guides.gradle.org/using-the-worker-api/?_ga=2.143780085.1705314017.1539791984-819557858.1539791984>.

I've tried pytest-xdist with limited success (pickling the session with
pytest-xdist's execnet dependency fails).


>
>
>> providing isolation from build artifacts such as from Cython
>>
>
> Each tox environment already has (I think) its own build directory. Or is
> this not what we're seeing?
>
Cython-based unit test runs leave behind artifacts that must be cleaned up,
which is why we can't run all tox environments in parallel.
We use this script to clean up:
https://github.com/apache/beam/blob/master/sdks/python/scripts/run_tox_cleanup.sh


I am 90% certain that this would not be an issue with bazel, since it
stages all build dependencies in a separate build directory, so any
generated files would be placed there.
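A stdlib sketch of what that cleanup amounts to (the file patterns are illustrative; the real logic lives in run_tox_cleanup.sh, and blindly deleting *.c files from a real checkout would be unsafe):

```python
import glob
import os
import tempfile

def clean_cython_artifacts(root):
    """Remove Cython-generated artifacts (*.so, *.c) left in a source tree."""
    removed = []
    for pattern in ('**/*.so', '**/*.c'):
        for path in glob.glob(os.path.join(root, pattern), recursive=True):
            os.remove(path)
            removed.append(path)
    return sorted(removed)

# Demo on a throwaway tree rather than a real checkout.
tree = tempfile.mkdtemp()
open(os.path.join(tree, 'stream_fast.so'), 'w').close()
open(os.path.join(tree, 'notes.txt'), 'w').close()
removed = clean_cython_artifacts(tree)
print(len(removed))  # 1
```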


>
>> - incremental testing: it is possible to precisely define each test's
>> dependencies
>>
>
> This is a big plus. It would allow us to enforce non-dependence on
> non-dependencies as well.
>
>
>> There's also a requirement to test against specific Python versions, such
>> as 2.7 and 3.4.
>> This could be done using docker containers having the precise version of
>> interpreter and Bazel.
>>
>
> I'm generally -1 on requiring docker to run our unittests.
>
You would still run unit tests using Bazel (in terminal or with IDE
integration, or even directly).
Docker would be used to verify they pass on specific Python versions. (2.7,
3.4, 3.5, 3.6)
I don't know how to maintain multiple Python versions on my workstation,
let alone on Jenkins.


>
>
>> In summary:
>> Bazel could replace the need for virtualenv, tox, and nosetests.
>> The addition of Docker images would allow testing against specific Python
>> versions.
>>
>
>
>  To be clear, I really like Bazel, and would have liked to see it for our
> top-level build, but there were some problems that were never adequately
> addressed.
>
> (1) There were difficulties managing upstream dependencies correctly.
> Perhaps there has been some improvement upstream since we last looked at
> this (it was fairly new), and perhaps it's not as big a deal in Python, but
> this was the blocker for using it for Beam as a whole.
> (2) Bazel still has poor support for C (including Cython) extensions.
> (3) It's unclear how this would interact with setup.py. Would we keep
> both, using one for testing and the other for releases (sdist, wheels)?
>
> There's also the downside of introducing yet another build tool that's not
> familiar to the Python community, rather than sticking with the "standard"
> ones.
>
> I would, however, be interested in hearing others' thoughts on this
> proposal.
>
>




Python Datastore client upgrade plan

2018-10-16 Thread Udi Meiri
Hi,
Sadly upgrading googledatastore -> google-cloud-datastore is non-trivial (
https://issues.apache.org/jira/browse/BEAM-4543). I wrote a doc to
summarize the plan:
https://docs.google.com/document/d/1sL9p7NE5Z0p-5SB5uwpxWrddj_UCESKSrsvDTWNKqb4/edit?usp=sharing

Contents pasted below:
Beam Python SDK: Datastore Client Upgrade

eh...@google.com

public, draft, 2018-10-16
Objective

Upgrade Beam's Python SDK dependency to use google-cloud-datastore v1.7.0
(or later), replacing googledatastore v7.0.1, providing Beam users a
migration path to a new Datastore PTransform API.
Background

Beam currently uses the googledatastore package to provide access to Google
Cloud Datastore; however, that package doesn't seem to be getting regular
releases (last release in 2017-04), and it doesn't officially support Python 3.

The current Beam API for Datastore queries exposes googledatastore types,
such as using a protobuf to define a query (see the wordcount example).
Conversely, google-cloud-datastore hides this implementation detail (see its
query API).
Since the Beam API has to change the data types it accepts, it forces users
to change their code. This makes the migration to google-cloud-datastore
non-trivial.
Proposal

This proposal includes a period in which two Beam APIs are available to
access Datastore:

- Add new PTransforms that use google-cloud-datastore, and mark the existing
  API (ReadFromDatastore, WriteToDatastore, DeleteFromDatastore) as
  deprecated.
- Implement apache_beam/io/datastore.py using google-cloud-datastore, taking
  care not to expose Datastore client internals.
- (optional) Remove googledatastore from the GCP_REQUIREMENTS package list,
  and add it to a separate list, e.g., pip install
  apache-beam[gcp,googledatastore].
- Remove the googledatastore-based API from Beam after 2 releases.
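One way to implement the deprecation period is a wrapper that emits a DeprecationWarning pointing users at the replacement. The decorator and names below are purely illustrative, not Beam's actual helper:

```python
import functools
import warnings

def deprecated(replacement):
    """Decorator marking a callable as deprecated in favor of `replacement`."""
    def wrap(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            warnings.warn(
                '%s is deprecated; use %s instead' % (fn.__name__, replacement),
                DeprecationWarning, stacklevel=2)
            return fn(*args, **kwargs)
        return wrapper
    return wrap

@deprecated('apache_beam.io.datastore.ReadFromDatastore')
def ReadFromDatastore(query):  # stand-in for the legacy transform
    return query

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    result = ReadFromDatastore('my-query')

print(str(caught[0].message))
```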




Re: [PROPOSAL] Using Bazel and Docker for Python SDK development and tests

2018-10-18 Thread Udi Meiri
Unless I'm missing something, we can't run cython builds in parallel with
other builds without first making cython build artifacts appear in a
separate build directory.

AFAICT --parallel--safe-build is deprecated
<https://github.com/tox-dev/tox/pull/1032/files> / on by default.

I vote to move the python 2 test suite to postcommit, since python
2 + cython seems to already test the non-gcp use case with regards to hard
dependencies on gcp.


On Thu, Oct 18, 2018 at 7:00 AM Robert Bradshaw  wrote:

> On Wed, Oct 17, 2018 at 10:57 PM Lukasz Cwik  wrote:
>
>> Gradle works pretty well at executing separate projects in parallel.
>> There are a few Java projects which contain only tests with different flags
>> which allow us to use the Gradle project based parallelization effectively.
>> See
>> https://github.com/apache/beam/blob/master/runners/google-cloud-dataflow-java/examples/build.gradle
>> and
>> https://github.com/apache/beam/blob/master/runners/google-cloud-dataflow-java/examples-streaming/build.gradle
>> since it runs the same set of tests, one with --streaming and the other
>> without. This should be able to work for Python as well.
>>
>
> +1, that's a great idea. Together with --parallel--safe-build should be
> sufficient.
>
> We could separately look into whether it's worth adding annotations
> (positive or negative) to mark tests which have low value to be run in all
> the different environments.
>
> On Wed, Oct 17, 2018 at 10:17 AM Udi Meiri  wrote:
>>
>>> On Wed, Oct 17, 2018 at 1:38 AM Robert Bradshaw 
>>> wrote:
>>>
>>>> On Tue, Oct 16, 2018 at 12:48 AM Udi Meiri  wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> In light of increasing Python pre-commit times due to the added Python
>>>>> 3 tests,
>>>>> I thought it might be time to re-evaluate the tools used for Python
>>>>> tests and development, and propose an alternative.
>>>>>
>>>>> Currently, we use nosetests, tox, and virtualenv for testing.
>>>>> The proposal is to use Bazel, which I believe can replace the above
>>>>> tools while adding:
>>>>> - parallel testing: each target has its own build directory,
>>>>>
>>>>
>>>> We could look at detox and/or retox again to get parallel testing
>>>> (though not, possibly, at such a low level). We tried detox for a while,
>>>> but there were issues debugging timeouts (specifically, it buffered the
>>>> stdout while testing to avoid multiplexing it, but that meant little info
>>>> when a test actually timed out on jenkins).
>>>>
>>>> We could alternatively look into leveraging gradle's within-project
>>>> parallelization to speed this up. It is a pain that right now every Python
>>>> test is run sequentially.
>>>>
>>> I don't believe that Gradle has an easy solution. The only
>>> within-project parallelization I can find requires using the Worker API
>>> <https://guides.gradle.org/using-the-worker-api/?_ga=2.143780085.1705314017.1539791984-819557858.1539791984>
>>> .
>>>
>>> I've tried pytest-xdist with limited success (pickling the session with
>>> pytest-xdist's execnet dependency fails).
>>>
>>>
>>>>
>>>>
>>>>> providing isolation from build artifacts such as from Cython
>>>>>
>>>>
>>>> Each tox environment already has (I think) its own build directory. Or
>>>> is this not what we're seeing?
>>>>
>>> Cython-based unit test runs leave behind artifacts that must be cleaned
>>> up, which is why we can't run all tox environments in parallel.
>>> We use this script to clean up:
>>>
>>> https://github.com/apache/beam/blob/master/sdks/python/scripts/run_tox_cleanup.sh
>>>
>>>
>>> I am 90% certain that this would not be an issue with bazel, since it
>>> stages all build dependencies in a separate build directory, so any
>>> generated files would be placed there.
>>>
>>>
>>>>
>>>>> - incremental testing: it is possible to precisely define each test's
>>>>> dependencies
>>>>>
>>>>
>>>> This is a big plus. It would allow us to enforce non-dependence on
>>>> non-dependencies as well.
>>>>
>>>>
>>>>> There's also a requirement to test against specific Python versions,
>>>>> such as 2.7 and 3.4.
>>>>> This could be done using docker containers having the precise version of
>>>>> interpreter and Bazel.

Re: [BEAM-5442] Store duplicate unknown (runner) options in a list argument

2018-10-15 Thread Udi Meiri
+1 for explicit --runner_option=param=val,...
It's hard to tell otherwise where an option is going.
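The explicit flag could be collected with a repeated argparse option like this (a sketch of the parsing side only; the flag name follows the proposal above, and the option keys are made up):

```python
import argparse

parser = argparse.ArgumentParser()
# Each --runner_option=key=value occurrence is appended to a list, never
# merged with known SDK-level options, so its destination stays unambiguous.
parser.add_argument('--runner_option', action='append', default=[],
                    metavar='KEY=VALUE')
args = parser.parse_args(
    ['--runner_option=parallelism=4',
     '--runner_option=checkpointing_interval=60000'])
runner_options = dict(opt.split('=', 1) for opt in args.runner_option)
print(runner_options)
```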

On Mon, Oct 15, 2018 at 8:04 AM Robert Bradshaw  wrote:

> On Mon, Oct 15, 2018 at 3:58 PM Maximilian Michels  wrote:
> >
> > I agree that the current approach breaks the pipeline options contract
> > because "unknown" options get parsed in the same way as options which
> > have been defined by the user.
>
> FWIW, I think we're already breaking this "contract." Unknown options
> are silently ignored; with this change we just change how we record
> them. It still feels a bit hacky though.
>
> > I'm not sure the `experiments` flag works for us. AFAIK it only allows
> > true/false flags. We want to pass all types of pipeline options to the
> > Runner.
>
> Experiments is an arbitrary set of strings, which can be of the form
> "param=value" if that's useful. (Dataflow does this.) There is, again,
> no namespacing on the param names, but we could use urns or impose
> some other structure here.
>
> > How to solve this?
> >
> > 1) Add all options of all Runners to each SDK
> > We added some of the FlinkRunner options to the Python SDK but realized
> > syncing is rather cumbersome in the long term. However, we want the most
> > important options to be validated on the client side.
>
> I don't think this is sustainable in the long run. However, thinking
> about this, in the worse case validation happens after construction
> but before execution (as with much of our other validation) so it
> isn't that bad.
>
> > 2) Pass "unknown" options via a separate list in the Proto which can
> > only be accessed internally by the Runners. This still allows passing
> > arbitrary options but we wouldn't leak unknown options and display them
> > as top-level options.
>
> I think there needs to be a way for the user to communicate values
> directly to the runner regardless of the SDK. My preference would be
> to make this explicit, e.g. (repeated) --runner_option=..., rather
> than scooping up all unknown flags at command line parsing time.
> Perhaps an SDK that is aware of some runners could choose to lift
> these as top-level options, but still pass them as runner options.
>
> > On 13.10.18 02:34, Charles Chen wrote:
> > > The current release branch
> > > (https://github.com/apache/beam/commits/release-2.8.0) was cut after
> the
> > > revert went in.  Sent out https://github.com/apache/beam/pull/6683 as
> a
> > > revert of the revert.  Regarding your comment above, I can help out
> with
> > > the design / PR reviews for common Python code as you suggest.
> > >
> > > On Fri, Oct 12, 2018 at 4:48 PM Thomas Weise  > > > wrote:
> > >
> > > Thanks, will tag you and looking forward to feedback so we can
> > > ensure that changes work for everyone.
> > >
> > > Looking at the PR, I see agreement from Max to revert the change on
> > > the release branch, but not in master. Would you mind to restore it
> > > in master?
> > >
> > > Thanks
> > >
> > > On Fri, Oct 12, 2018 at 4:40 PM Ahmet Altay  > > > wrote:
> > >
> > >
> > >
> > > On Fri, Oct 12, 2018 at 11:31 AM, Charles Chen  > > > wrote:
> > >
> > > What I mean is that a user may find that it works for them
> > > to pass "--myarg blah" and access it as "options.myarg"
> > > without explicitly defining a "my_arg" flag due to the
> added
> > > logic.  This is not the intended behavior and we may want
> to
> > > change this implementation detail in the future.  However,
> > > having this logic in a released version makes it hard to
> > > change this behavior since users may erroneously depend on
> > > this undocumented behavior.  Instead, we should namespace /
> > > scope this so that it is obvious that this is meant for
> > > runner (and not Beam user) consumption.
> > >
> > > On Fri, Oct 12, 2018 at 10:48 AM Thomas Weise
 > > > wrote:
> > >
> > > Can you please elaborate more what practical problems
> > > this introduces for users?
> > >
> > > I can see that this change allows a user to specify a
> > > runner specific option, which in the future may change
> > > because we decide to scope differently. If this only
> > > affects users of the portable Flink runner (like us),
> > > then no need to revert, because at this early stage we
> > > prefer something that works over being blocked.
> > >
> > > It would also be really great if some of the core
> Python
> > > SDK developers could help out with the design aspects
> > > and PR reviews of changes that affect common Python
> > > code. Anyone who specifically wants 

Re: error with DirectRunner

2018-10-29 Thread Udi Meiri
This looks like a FnApiRunner bug.
When I override use_fnapi_runner = False in direct_runner.py the pipeline
works.

It seems like either the side-input to _copy_number or the Flatten
operation is the culprit.

On Mon, Oct 29, 2018 at 2:37 PM Allie Chen  wrote:

> Hi,
>
> I have a project that started failing with DirectRunner, but works well
> using DataflowRunner (last working version is 2.4). The error message I
> received are:
> line 1088, in run_stage
>   pipeline_components.pcollections[actual_pcoll_id].coder_id]]
> KeyError: u'ref_Coder_WindowedValueCoder_1'
>
> I have simplified the pipeline to the following example. Can someone
> please take a look? Many thanks!
>
> Allie
>
>
> import apache_beam as beam
> import argparse
> from apache_beam import transforms
> from apache_beam import pvalue
> from apache_beam.options import pipeline_options
>
>
> def _copy_number(number, side=None):
>   yield number
>
>
> def fn_sum(values):
>   return sum(values)
>
>
> def run(argv=None):
>   parser = argparse.ArgumentParser()
>   _, pipeline_args = parser.parse_known_args(argv)
>   options = pipeline_options.PipelineOptions(pipeline_args)
>   numbers = [1, 2]
>   with beam.Pipeline(options=options) as p:
> sum_1 = (p
>  | 'ReadNumber1' >> transforms.Create(numbers)
>  | 'CalculateSum1' >> beam.CombineGlobally(fn_sum))
>
> sum_2 = (p
>  | 'ReadNumber2' >> transforms.Create(numbers)
>  | beam.ParDo(_copy_number, pvalue.AsSingleton(sum_1))
>  | 'CalculateSum2' >> beam.CombineGlobally(fn_sum))
>
> _ = ((sum_1, sum_2)
>  | beam.Flatten()
>  | 'CalculateSum3' >> beam.CombineGlobally(fn_sum)
>  | beam.io.WriteToText('gs://BUCKET/sum'))
>
>
>
>




Re: error with DirectRunner

2018-10-30 Thread Udi Meiri
Created https://issues.apache.org/jira/browse/BEAM-5927

On Tue, Oct 30, 2018 at 1:13 PM Lukasz Cwik  wrote:

> Udi, do you know if we have a bug tracking this issue?
>
> If not, can you file one referencing this e-mail thread?
>
> On Tue, Oct 30, 2018 at 6:33 AM Allie Chen  wrote:
>
>> Thanks Udi. I agree, since it works fine removing either the side input
>> or the last flatten and combine operation.
>>
>> On Mon, Oct 29, 2018 at 9:02 PM Udi Meiri  wrote:
>>
>>> This looks like a FnApiRunner bug.
>>> When I override use_fnapi_runner = False in direct_runner.py the
>>> pipeline works.
>>>
>>> It seems like either the side-input to _copy_number or the Flatten
>>> operation is the culprit.
>>>
>>> On Mon, Oct 29, 2018 at 2:37 PM Allie Chen 
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I have a project that started failing with DirectRunner, but works well
>>>> using DataflowRunner (last working version is 2.4). The error message I
>>>> received is:
>>>> line 1088, in run_stage
>>>>   pipeline_components.pcollections[actual_pcoll_id].coder_id]]
>>>> KeyError: u'ref_Coder_WindowedValueCoder_1'
>>>>
>>>> I have simplified the pipeline to the following example. Can someone
>>>> please take a look? Many thanks!
>>>>
>>>> Allie
>>>>
>>>>
>>>> import apache_beam as beam
>>>> import argparse
>>>> from apache_beam import transforms
>>>> from apache_beam import pvalue
>>>> from apache_beam.options import pipeline_options
>>>>
>>>>
>>>> def _copy_number(number, side=None):
>>>>   yield number
>>>>
>>>>
>>>> def fn_sum(values):
>>>>   return sum(values)
>>>>
>>>>
>>>> def run(argv=None):
>>>>   parser = argparse.ArgumentParser()
>>>>   _, pipeline_args = parser.parse_known_args(argv)
>>>>   options = pipeline_options.PipelineOptions(pipeline_args)
>>>>   numbers = [1, 2]
>>>>   with beam.Pipeline(options=options) as p:
>>>> sum_1 = (p
>>>>  | 'ReadNumber1' >> transforms.Create(numbers)
>>>>  | 'CalculateSum1' >> beam.CombineGlobally(fn_sum))
>>>>
>>>> sum_2 = (p
>>>>  | 'ReadNumber2' >> transforms.Create(numbers)
>>>>  | beam.ParDo(_copy_number, pvalue.AsSingleton(sum_1))
>>>>  | 'CalculateSum2' >> beam.CombineGlobally(fn_sum))
>>>>
>>>> _ = ((sum_1, sum_2)
>>>>  | beam.Flatten()
>>>>  | 'CalculateSum3' >> beam.CombineGlobally(fn_sum)
>>>>  | beam.io.WriteToText('gs://BUCKET/sum'))
>>>>
>>>>
>>>>
>>>>


smime.p7s
Description: S/MIME Cryptographic Signature


Re: error with DirectRunner

2018-10-30 Thread Udi Meiri
+Robert Bradshaw  I would be happy to debug and fix
this, but I'd need more guidance on where to look.

On Tue, Oct 30, 2018 at 4:07 PM Udi Meiri  wrote:

> Created https://issues.apache.org/jira/browse/BEAM-5927
>
> On Tue, Oct 30, 2018 at 1:13 PM Lukasz Cwik  wrote:
>
>> Udi, do you know if we have a bug tracking this issue?
>>
>> If not, can you file one referencing this e-mail thread?
>>
>> On Tue, Oct 30, 2018 at 6:33 AM Allie Chen  wrote:
>>
>>> Thanks Udi. I agree, since it works fine after removing either the side
>>> input or the last flatten-and-combine operation.
>>>
>>> On Mon, Oct 29, 2018 at 9:02 PM Udi Meiri  wrote:
>>>
>>>> This looks like a FnApiRunner bug.
>>>> When I override use_fnapi_runner = False in direct_runner.py the
>>>> pipeline works.
>>>>
>>>> It seems like either the side-input to _copy_number or the Flatten
>>>> operation is the culprit.
>>>>
>>>> On Mon, Oct 29, 2018 at 2:37 PM Allie Chen 
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I have a project that started failing with DirectRunner, but works
>>>>> well using DataflowRunner (last working version is 2.4). The error message
>>>>> I received is:
>>>>> line 1088, in run_stage
>>>>>   pipeline_components.pcollections[actual_pcoll_id].coder_id]]
>>>>> KeyError: u'ref_Coder_WindowedValueCoder_1'
>>>>>
>>>>> I have simplified the pipeline to the following example. Can someone
>>>>> please take a look? Many thanks!
>>>>>
>>>>> Allie
>>>>>
>>>>>
>>>>> import apache_beam as beam
>>>>> import argparse
>>>>> from apache_beam import transforms
>>>>> from apache_beam import pvalue
>>>>> from apache_beam.options import pipeline_options
>>>>>
>>>>>
>>>>> def _copy_number(number, side=None):
>>>>>   yield number
>>>>>
>>>>>
>>>>> def fn_sum(values):
>>>>>   return sum(values)
>>>>>
>>>>>
>>>>> def run(argv=None):
>>>>>   parser = argparse.ArgumentParser()
>>>>>   _, pipeline_args = parser.parse_known_args(argv)
>>>>>   options = pipeline_options.PipelineOptions(pipeline_args)
>>>>>   numbers = [1, 2]
>>>>>   with beam.Pipeline(options=options) as p:
>>>>> sum_1 = (p
>>>>>  | 'ReadNumber1' >> transforms.Create(numbers)
>>>>>  | 'CalculateSum1' >> beam.CombineGlobally(fn_sum))
>>>>>
>>>>> sum_2 = (p
>>>>>  | 'ReadNumber2' >> transforms.Create(numbers)
>>>>>  | beam.ParDo(_copy_number, pvalue.AsSingleton(sum_1))
>>>>>  | 'CalculateSum2' >> beam.CombineGlobally(fn_sum))
>>>>>
>>>>> _ = ((sum_1, sum_2)
>>>>>  | beam.Flatten()
>>>>>  | 'CalculateSum3' >> beam.CombineGlobally(fn_sum)
>>>>>  | beam.io.WriteToText('gs://BUCKET/sum'))
>>>>>
>>>>>
>>>>>
>>>>>


smime.p7s
Description: S/MIME Cryptographic Signature


BEAM-6018: memory leak in thread pool instantiation

2018-11-08 Thread Udi Meiri
Hi,
I've identified a memory leak when GcsUtil.java instantiates a
ThreadPoolExecutor (https://issues.apache.org/jira/browse/BEAM-6018).
The code uses the getExitingExecutorService
<https://github.com/apache/beam/blob/279a05604b83a54e8e5a79e13d8761f94841f326/sdks/java/extensions/google-cloud-platform-core/src/main/java/org/apache/beam/sdk/util/GcsUtil.java#L551>
wrapper, which leaks memory. The question is, why is that wrapper necessary
if executor.shutdown(); is later unconditionally called?
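For illustration, here is the create/use/shutdown pattern under discussion, sketched with the Python stdlib (copy_object is a hypothetical stand-in for a batched GCS operation; this is not the GcsUtil code itself):

```python
import concurrent.futures

def copy_object(name):
    # Hypothetical stand-in for one batched GCS copy/remove call.
    return name.upper()

# Create the executor, use it, then shut it down unconditionally --
# mirroring the unconditional executor.shutdown() mentioned above,
# which is what makes the exit-hook wrapper look redundant.
executor = concurrent.futures.ThreadPoolExecutor(max_workers=4)
try:
    results = list(executor.map(copy_object, ['a', 'b', 'c']))
finally:
    executor.shutdown(wait=True)
print(results)  # ['A', 'B', 'C']
```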


smime.p7s
Description: S/MIME Cryptographic Signature


Re: BEAM-6018: memory leak in thread pool instantiation

2018-11-08 Thread Udi Meiri
My thought was to use 1 executor per GcsUtil instance (or 1 per process as
you suggest), with a possible performance hit since I don't know how often
these batch copy and remove operations are called.
The other option is to leave things as they mostly are, and only remove the
call to getExitingExecutorService.

Both options risk delaying worker shutdown if the executor's shutdown()
hasn't been called, which I guess is why the executor in GcsOptions.java
creates daemon threads.
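A stdlib sketch of why daemon threads sidestep that risk: a daemon thread that never finishes cannot keep the process alive at exit. (Illustrative only; the Java daemon-thread semantics in GcsOptions.java are analogous, not identical.)

```python
import subprocess
import sys
import textwrap

# Run a child interpreter that starts a permanently-blocked daemon
# thread; the child still exits immediately, because daemon threads
# are not joined at interpreter shutdown.
child = textwrap.dedent("""
    import threading, time
    threading.Thread(target=lambda: time.sleep(3600), daemon=True).start()
    print('exited cleanly')
""")
result = subprocess.run([sys.executable, '-c', child],
                        capture_output=True, text=True, timeout=30)
print(result.stdout.strip())  # exited cleanly
```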

On Thu, Nov 8, 2018 at 1:02 PM Lukasz Cwik  wrote:

> Not certain, it looks like we should have been caching the executor within
> the GcsUtil as a static instance instead of creating one each time. Could
> have been missed during code review / slow code changes over time. GcsUtil
> is not well "loved".
>
> On Thu, Nov 8, 2018 at 11:00 AM Udi Meiri  wrote:
>
>> Hi,
>> I've identified a memory leak when GcsUtil.java instantiates a
>> ThreadPoolExecutor (https://issues.apache.org/jira/browse/BEAM-6018).
>> The code uses the getExitingExecutorService
>> <https://github.com/apache/beam/blob/279a05604b83a54e8e5a79e13d8761f94841f326/sdks/java/extensions/google-cloud-platform-core/src/main/java/org/apache/beam/sdk/util/GcsUtil.java#L551>
>>  wrapper,
>> which leaks memory. The question is, why is that wrapper necessary
>> if executor.shutdown(); is later unconditionally called?
>>
>


smime.p7s
Description: S/MIME Cryptographic Signature


Re: Please do not merge Python PRs

2018-11-15 Thread Udi Meiri
All clear, Python tests are reporting errors correctly again.

On Wed, Nov 14, 2018 at 5:57 PM Udi Meiri  wrote:

> https://github.com/apache/beam/pull/7048 is the rollback PR
>
> On Wed, Nov 14, 2018 at 5:28 PM Ahmet Altay  wrote:
>
>> Thank you Udi. Could you send a rollback PR?
>>
>> I believe this is https://issues.apache.org/jira/browse/BEAM-6048
>>
>> On Wed, Nov 14, 2018 at 5:16 PM, Udi Meiri  wrote:
>>
>>> It seems that Gradle is not getting the correct exit status from test
>>> runs.
>>> Possible culprit: https://github.com/apache/beam/pull/6903
>>>
>>
>>


smime.p7s
Description: S/MIME Cryptographic Signature


Please do not merge Python PRs

2018-11-14 Thread Udi Meiri
It seems that Gradle is not getting the correct exit status from test runs.
Possible culprit: https://github.com/apache/beam/pull/6903


smime.p7s
Description: S/MIME Cryptographic Signature


Re: Please do not merge Python PRs

2018-11-14 Thread Udi Meiri
https://github.com/apache/beam/pull/7048 is the rollback PR

On Wed, Nov 14, 2018 at 5:28 PM Ahmet Altay  wrote:

> Thank you Udi. Could you send a rollback PR?
>
> I believe this is https://issues.apache.org/jira/browse/BEAM-6048
>
> On Wed, Nov 14, 2018 at 5:16 PM, Udi Meiri  wrote:
>
>> It seems that Gradle is not getting the correct exit status from test
>> runs.
>> Possible culprit: https://github.com/apache/beam/pull/6903
>>
>
>


smime.p7s
Description: S/MIME Cryptographic Signature


Re: Please do not merge Python PRs

2018-11-14 Thread Udi Meiri
Recreated locally: https://gradle.com/s/psqgcywnc3h2m

On Wed, Nov 14, 2018 at 5:16 PM Udi Meiri  wrote:

> It seems that Gradle is not getting the correct exit status from test runs.
> Possible culprit: https://github.com/apache/beam/pull/6903
>


smime.p7s
Description: S/MIME Cryptographic Signature


Re: BEAM-6018: memory leak in thread pool instantiation

2018-11-09 Thread Udi Meiri
The reasoning for the unbounded thread pool is explained as:
/* The SDK requires an unbounded thread pool because a step may create X writers
 * each requiring their own thread to perform the writes otherwise a writer may
 * block causing deadlock for the step because the writers buffer is full.
 * Also, the MapTaskExecutor launches the steps in reverse order and completes
 * them in forward order thus requiring enough threads so that each step's writers
 * can be active.
 */

https://github.com/apache/beam/blob/17c2da6d981cae9f233aea1e2d6d64259362dd73/sdks/java/extensions/google-cloud-platform-core/src/main/java/org/apache/beam/sdk/extensions/gcp/options/GcsOptions.java#L133-L138
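The deadlock scenario in that comment can be sketched with the stdlib: a step that blocks on its own writers needs the pool to have threads free for those writers. With max_workers=1 the same code would hang, which is the motivation given for an unbounded pool (pool size and names here are illustrative):

```python
import concurrent.futures

pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def writer(value):
    # Stand-in for a per-writer task that performs the actual write.
    return value * 2

def step(values):
    # The step blocks until all of its writers complete; if the pool
    # had only one thread, this task would hold it and deadlock.
    futures = [pool.submit(writer, v) for v in values]
    return sum(f.result() for f in futures)

total = pool.submit(step, [1, 2, 3]).result()
pool.shutdown()
print(total)  # 12
```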

On Thu, Nov 8, 2018 at 11:41 PM Dan Halperin  wrote:

>
>> On Thu, Nov 8, 2018 at 2:12 PM Udi Meiri  wrote:
>>
>>> Both options risk delaying worker shutdown if the executor's shutdown()
>>> hasn't been called, which I guess is why the executor in GcsOptions.java
>>> creates daemon threads.
>>>
>>
> My guess (and it really is a guess at this point) is that this was a fix
> for DirectRunner issues - want that to exit quickly!
>
>
>>
>>> On Thu, Nov 8, 2018 at 1:02 PM Lukasz Cwik  wrote:
>>>
>>>> Not certain, it looks like we should have been caching the executor
>>>> within the GcsUtil as a static instance instead of creating one each time.
>>>> Could have been missed during code review / slow code changes over time.
>>>> GcsUtil is not well "loved".
>>>>
>>>> On Thu, Nov 8, 2018 at 11:00 AM Udi Meiri  wrote:
>>>>
>>>>> Hi,
>>>>> I've identified a memory leak when GcsUtil.java instantiates a
>>>>> ThreadPoolExecutor (https://issues.apache.org/jira/browse/BEAM-6018).
>>>>> The code uses the getExitingExecutorService
>>>>> <https://github.com/apache/beam/blob/279a05604b83a54e8e5a79e13d8761f94841f326/sdks/java/extensions/google-cloud-platform-core/src/main/java/org/apache/beam/sdk/util/GcsUtil.java#L551>
>>>>>  wrapper,
>>>>> which leaks memory. The question is, why is that wrapper necessary
>>>>> if executor.shutdown(); is later unconditionally called?
>>>>>
>>>>


smime.p7s
Description: S/MIME Cryptographic Signature


Re: [VOTE] Mark 2.7.0 branch as a long term support (LTS) branch

2018-11-09 Thread Udi Meiri
+1

On Fri, Nov 9, 2018 at 8:31 AM Maximilian Michels  wrote:

> +1
>
> On 09.11.18 09:38, Robert Bradshaw wrote:
> > +1 approve.
> > On Fri, Nov 9, 2018 at 2:47 AM Ahmet Altay  wrote:
> >>
> >> Hi all,
> >>
> >> Please review the following statement:
> >>
> >> "2.7.0 branch will be marked as the long-term-support (LTS) release
> branch. This branch will be supported for a window of 6 months starting
> from the day it is marked as an LTS branch. Beam community will decide on
> which issues will be backported and when patch releases on the branch will
> be made on a case by case basis.
> >>
> >> [ ] +1, Approve
> >> [ ] -1, Do not approve (please provide specific comments)
> >>
> >> Context:
> >> - Discussion on the dev@ list [1].
> >> - Beam's release policies [2].
> >>
> >> Thank you,
> >> Ahmet
> >>
> >> [1]
> https://lists.apache.org/thread.html/f604b2d688ad467ddccd4cf56b664b7309dae78f1bd8849e4bb9aae2@%3Cdev.beam.apache.org%3E
> >> [2] https://beam.apache.org/community/policies/
> >>
> >>
> >>
>


smime.p7s
Description: S/MIME Cryptographic Signature


Re: Spotless and lint precommit

2018-11-13 Thread Udi Meiri
+1 and parallelize the 3 lint tasks

On Tue, Nov 13, 2018 at 10:43 AM Thomas Weise  wrote:

> +1
>
>
> On Tue, Nov 13, 2018 at 9:06 AM Ruoyun Huang  wrote:
>
>> +1
>>
>> On Tue, Nov 13, 2018 at 8:29 AM Maximilian Michels 
>> wrote:
>>
>>> +1
>>>
>>> On 13.11.18 14:22, Robert Bradshaw wrote:
>>> > I really like how spottless runs separately and quickly for Java code.
>>> > Should we do the same for Python lint?
>>> >
>>>
>>
>>
>> --
>> 
>> Ruoyun  Huang
>>
>>


smime.p7s
Description: S/MIME Cryptographic Signature


Re: Python PostCommit failures

2018-10-04 Thread Udi Meiri
Thanks Yifan!

On Thu, Oct 4, 2018 at 11:45 AM Yifan Zou  wrote:

> The integration test has been fixed by #6567. The Python PostCommit
> Verify is back to normal.
> Thanks.
>
> - Yifan
>
> On Thu, Oct 4, 2018 at 12:22 AM Yifan Zou  wrote:
>
>> We are seeing consecutive Python PostCommit failures due to a broken IT
>> test. I am looking into it (BEAM-5643).
>>
>> Yifan
>>
>


smime.p7s
Description: S/MIME Cryptographic Signature


Re: python post-commit failures

2018-10-08 Thread Udi Meiri
It's the current status: I believe having a basic wordcount integration
test in precommit would have caught this issue, since it seems to have
broken all tests using the Dataflow service.

On Sun, Oct 7, 2018 at 9:06 PM Kenneth Knowles  wrote:

> Out of curiosity - is it a logical necessity, or just current status, that
> one needs to run a full job to catch this?
>
> On Sat, Oct 6, 2018 at 4:26 AM Maximilian Michels  wrote:
>
>> My changes to the Python option parsing broke the PostCommit. PreCommit
>> passed, as well as the Portable Runner tests. Sorry about that.
>>
>> +1 It would be great to have some more basic integration tests in the
>> PreCommit. That will give us more confidence before merge without always
>> running the PostCommit.
>>
>> > The same goes for Flink: we should be running wordcount and
>> wordcount_streaming integration tests as part of pre-commit tests.
>>
>> Can look into that.
>>
>> Thanks,
>> Max
>>
>> On 05.10.18 23:08, Ahmet Altay wrote:
>> >
>> >
>> > On Fri, Oct 5, 2018 at 1:51 PM, Udi Meiri <eh...@google.com> wrote:
>> >
>> > I was sure that we ran some basic Dataflow integration tests in
>> > pre-commit, and that they should have caught this issue.
>> > But then I remembered that we only have those in Java SDK.
>> > I opened this bug to add end-to-end tests to Python pre-commits as
>> > well: https://issues.apache.org/jira/browse/BEAM-5058
>> >
>> >
>> > +1
>> >
>> >
>> > The same goes for Flink: we should be running wordcount and
>> > wordcount_streaming integration tests as part of pre-commit tests.
>> >
>> > On Fri, Oct 5, 2018 at 1:37 PM Thomas Weise > > <mailto:t...@apache.org>> wrote:
>> >
>> > Fixed. Can someone please take a look at the usage of
>> > the --beam_plugins flag in the Dataflow runner so that we can
>> > address the root cause?
>> >
>> > We can probably do more to avoid Friday Python post commit
>> > excitement. In this case, extra checking was done pre-merge by
>> > running the Python VR tests for Flink, but the failure occurred
>> > with the Dataflow runner.
>> >
>> >
>> > It would be good to have a list of which post-commit tests are
>> > available, what they test, and what the keyword is to trigger them from
>> > a PR.
>> >
>> >
>> > The changes were pipeline options related, so (pre-existing)
>> > test coverage should have been better.
>> >
>> > But beyond that, we can probably make it easier for contributors
>> > and reviewers to know what extra checks are available and
>> > possibly appropriate to run pre-commit. Should we add some
>> > pointers to
>> > https://beam.apache.org/contribute/testing/#pre-commit or is
>> > there a better place?
>> >
>> > Thanks
>> >
>> >
>> >
>> > On Fri, Oct 5, 2018 at 10:38 AM Udi Meiri <eh...@google.com> wrote:
>> >
>> > More details in
>> > https://issues.apache.org/jira/browse/BEAM-5442
>> >
>> > On Fri, Oct 5, 2018 at 10:26 AM Udi Meiri <eh...@google.com> wrote:
>> >
>> > I'm seeing these errors at least in one test:
>> > "Python sdk harness failed:
>> > Traceback (most recent call last):
>> >File
>> >
>>  
>> "/usr/local/lib/python2.7/dist-packages/apache_beam/runners/worker/sdk_worker_main.py",
>> > line 133, in main
>> >
>> > sdk_pipeline_options.get_all_options(drop_default=True))
>> >File
>> >
>>  
>> "/usr/local/lib/python2.7/dist-packages/apache_beam/options/pipeline_options.py",
>> > line 224, in get_all_options
>> >  parser.add_argument(arg.split('=', 1)[0],
>> nargs='?')
>> >File "/usr/lib/python2.7/argparse.py", 
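The truncated traceback above ends inside argparse's add_argument. One plausible, purely hypothetical reconstruction of the failure mode is argparse rejecting a re-registered option string (--beam_plugins is taken from this thread; the real flag set came from the Dataflow runner's options):

```python
import argparse

# Registering the same option string twice makes argparse raise
# ArgumentError -- a plausible (hypothetical) model of the crash in
# the truncated traceback above.
parser = argparse.ArgumentParser()
parser.add_argument('--beam_plugins', nargs='?')
try:
    parser.add_argument('--beam_plugins', nargs='?')
    error = ''
except argparse.ArgumentError as exc:
    error = str(exc)
print('conflicting option' in error)  # True
```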

Python SDK: .options deprecation

2018-09-21 Thread Udi Meiri
Hey, does anybody know why the pipeline.options property was deprecated?
I found this bug: https://issues.apache.org/jira/browse/BEAM-2124
but there's no explanation.


smime.p7s
Description: S/MIME Cryptographic Signature


Re: Python SDK: .options deprecation

2018-09-24 Thread Udi Meiri
Sindy, I don't believe that pipeline.options() works; it's the deprecated
@property method.

Ahmet, the first paragraph of this doc seems to have some background about
the general direction: https://s.apache.org/no-beam-pipeline
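A minimal sketch of the point above: with a @property accessor, pipeline.options evaluates to the options object itself, so pipeline.options() only "works" if that object happens to be callable. (The Pipeline class and warning text here are illustrative, not the Beam source.)

```python
import warnings

class Pipeline:
    """Illustrative class only -- not the Beam Pipeline implementation."""

    def __init__(self, options):
        self._options = options

    @property
    def options(self):
        # Deprecated accessor pattern: a @property, not a method.
        warnings.warn('options is deprecated', DeprecationWarning, stacklevel=2)
        return self._options

p = Pipeline({'runner': 'DirectRunner'})
with warnings.catch_warnings():
    warnings.simplefilter('ignore', DeprecationWarning)
    # p.options evaluates the property and returns the dict itself;
    # "p.options()" would attempt to call that dict and raise TypeError.
    print(callable(p.options))  # False
```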


On Fri, Sep 21, 2018 at 6:17 PM Ahmet Altay  wrote:

> If I remember correctly, this was related to change of the signature of
> run on runners to run(pipeline, options) instead of just run(pipeline). A
> runner accepts a pipeline and set of options to run that pipeline, instead
> of a pipeline and options being a combined thing.
>
> It would be good to update the comments or bug with some explanation.
>
> On Fri, Sep 21, 2018 at 2:20 PM, Sindy Li  wrote:
>
>> I think it's just deprecating referencing the option as pipeline.options,
>> because it is made into a private variable, but we can still do
>> pipeline.options().
>>
>> On Fri, Sep 21, 2018 at 2:11 PM Udi Meiri  wrote:
>>
>>> Hey, does anybody know why the pipeline.options property was deprecated?
>>> I found this bug: https://issues.apache.org/jira/browse/BEAM-2124
>>> but there's no explanation.
>>>
>>
>


smime.p7s
Description: S/MIME Cryptographic Signature


Re: Removing documentation for old Beam versions

2018-09-24 Thread Udi Meiri
I believe that beam.apache.org is populated from the asf-site branch of the
apache/beam-site repo. (gitpubsub:
https://www.apache.org/dev/project-site.html#intro)
If we move the markdown-based docs to apache/beam, leave generated javadoc
and pydoc in apache/beam-site, and point gitpubsub to apache/beam, then
javadoc and pydoc will not get pushed to the website.

Is there some place where we can push javadoc and pydoc files? Or perhaps
there an alternative way to push updates to beam.apache.org? (not requiring
the asf-site branch)

On Fri, Sep 21, 2018 at 6:40 PM Thomas Weise  wrote:

> Hi Scott,
>
> Thanks for bringing the discussion back here.
>
> I agree that we should separate the changes for hosting of generated
> java/pydocs from the rest of website automation so that we can make the
> switch and fix the contributor headache soon.
>
> But perhaps we can avoid adding 4m lines of generated code to the main
> beam repository (and keep on adding with every release) if we continue to
> serve the site from the old beam-site repo? (I left a comment in the doc.)
>
> About trying buildbot, as mentioned earlier I would be happy to help with
> it. I prefer a setup that keeps the docs separate from the web site.
>
> Thomas
>
>
> On Fri, Sep 21, 2018 at 10:28 AM Scott Wegner  wrote:
>
>> Re-opening this thread as it came up today in the discussion for PR#6458
>> [1]. This PR is part of the work for Beam-Site Automation Reliability
>> improvements; design doc here: https://s.apache.org/beam-site-automation
>>
>> The current plan is to keep generated javadoc/pydoc sources only on the
>> asf-site branch, which is necessary for the current gitpubsub publishing
>> mechanism. This maintains our current approach, the only change being that
>> we're moving the asf-site branch from the retiring apache/beam-site
>> repository into a new apache/beam repo branch.
>>
>> The concern for committing generated content is the extra overhead during
>> git fetch. I did some analysis to measure the impact [2], and found that
>> fetching a week of source + generated content history from apache/beam-site
>> took 0.39 seconds.
>>
>> I like the idea of publishing javadoc/pydoc snapshots to an external
>> location like Flink does with buildbot, but that work is separable and
>> shouldn't be a prerequisite for this effort. The goal of this work is to
>> improve the reliability of automation for contributing website changes. At
>> last measure, only about half of beam-site PR merges use Mergebot without
>> experiencing some reliability issue [3].
>>
>> I've opened BEAM-5459 [4] to track moving our generated docs out of git.
>> Thomas, would you have bandwidth to look into this?
>>
>> [1] https://github.com/apache/beam/pull/6458#issuecomment-423406643
>> [2]
>> https://docs.google.com/document/d/1lfbMhdIyDzIaBTgc9OUByhSwR94kfOzS_ozwKWTVl5U/edit#heading=h.uqzivheohd7j
>> [3]
>> https://docs.google.com/document/d/1lfbMhdIyDzIaBTgc9OUByhSwR94kfOzS_ozwKWTVl5U/edit#heading=h.a208cwi78xmu
>> [4] https://issues.apache.org/jira/browse/BEAM-5459
>>
>> On Fri, Aug 24, 2018 at 11:48 AM Thomas Weise  wrote:
>>
>>> Hi Udi,
>>>
>>> Good to know you will continue this work.
>>>
>>> Let me know if you want to try the buildbot route (which does not
>>> require generated documentation to be checked into the repo). Happy to help
>>> with that.
>>>
>>> Thomas
>>>
>>> On Fri, Aug 24, 2018 at 11:36 AM Udi Meiri  wrote:
>>>
>>>> I'm picking up the website migration. The plan is to not include
>>>> generated files in the master branch.
>>>>
>>>> However, I've been told that even putting generated files a separate
>>>> branch could blow up the git repository for all (e.g. make git pulls a lot
>>>> longer?).
>>>> Not sure if this is a real issue or not.
>>>>
>>>> On Mon, Aug 20, 2018 at 2:53 AM Robert Bradshaw 
>>>> wrote:
>>>>
>>>>> On Sun, Aug 5, 2018 at 5:28 AM Thomas Weise  wrote:
>>>>> >
>>>>> > Yes, I think the separation of generated code will need to occur
>>>>> prior to completing the merge and switching the web site to the main repo.
>>>>> >
>>>>> > There should be no reason to check generated documentation into
>>>>> either of the repos/branches.
>>>>>
>>>>> Huge +1 to this. Thomas, would have time to set something like this up
>>>>> for Beam? If not, could anyone else pick this up?

Re: Removing documentation for old Beam versions

2018-09-24 Thread Udi Meiri
Staying on beam-site SGTM. We could add a new default branch (master?) and
keep all the non-generated files (src/) there, and put generated files
(content/) in the asf-site branch (like we already do).
That way there's no confusion as to which files you should update.
(This is of course assuming we still place generated docs in git repos.)

On Mon, Sep 24, 2018 at 11:23 AM Thomas Weise  wrote:

> My thought was to leave the asf-site branch in the beam-site repository,
> add generated docs to that branch (until we have a better solution), and
> have only sources in the beam repo.
>
> Scott had filed https://issues.apache.org/jira/browse/BEAM-5459 -
> it would eliminate the need to place generated docs into git repos.
>
> On Mon, Sep 24, 2018 at 11:06 AM Udi Meiri  wrote:
>
>> I believe that beam.apache.org is populated from the asf-site branch of
>> the apache/beam-site repo. (gitpubsub:
>> https://www.apache.org/dev/project-site.html#intro)
>> If we move the markdown-based docs to apache/beam, leave generated
>> javadoc and pydoc in apache/beam-site, and point gitpubsub to apache/beam,
>> then javadoc and pydoc will not get pushed to the website.
>>
>> Is there some place where we can push javadoc and pydoc files? Or perhaps
>> there an alternative way to push updates to beam.apache.org? (not
>> requiring the asf-site branch)
>>
>> On Fri, Sep 21, 2018 at 6:40 PM Thomas Weise  wrote:
>>
>>> Hi Scott,
>>>
>>> Thanks for bringing the discussion back here.
>>>
>>> I agree that we should separate the changes for hosting of generated
>>> java/pydocs from the rest of website automation so that we can make the
>>> switch and fix the contributor headache soon.
>>>
>>> But perhaps we can avoid adding 4m lines of generated code to the main
>>> beam repository (and keep on adding with every release) if we continue to
>>> serve the site from the old beam-site repo? (I left a comment in the doc.)
>>>
>>> About trying buildbot, as mentioned earlier I would be happy to help
>>> with it. I prefer a setup that keeps the docs separate from the web site.
>>>
>>> Thomas
>>>
>>>
>>> On Fri, Sep 21, 2018 at 10:28 AM Scott Wegner  wrote:
>>>
>>>> Re-opening this thread as it came up today in the discussion for
>>>> PR#6458 [1]. This PR is part of the work for Beam-Site Automation
>>>> Reliability improvements; design doc here:
>>>> https://s.apache.org/beam-site-automation
>>>>
>>>> The current plan is to keep generated javadoc/pydoc sources only on the
>>>> asf-site branch, which is necessary for the current gitpubsub publishing
>>>> mechanism. This maintains our current approach, the only change being that
>>>> we're moving the asf-site branch from the retiring apache/beam-site
>>>> repository into a new apache/beam repo branch.
>>>>
>>>> The concern for committing generated content is the extra overhead
>>>> during git fetch. I did some analysis to measure the impact [2], and found
>>>> that fetching a week of source + generated content history from
>>>> apache/beam-site took 0.39 seconds.
>>>>
>>>> I like the idea of publishing javadoc/pydoc snapshots to an external
>>>> location like Flink does with buildbot, but that work is separable and
>>>> shouldn't be a prerequisite for this effort. The goal of this work is to
>>>> improve the reliability of automation for contributing website changes. At
>>>> last measure, only about half of beam-site PR merges use Mergebot without
>>>> experiencing some reliability issue [3].
>>>>
>>>> I've opened BEAM-5459 [4] to track moving our generated docs out of
>>>> git. Thomas, would you have bandwidth to look into this?
>>>>
>>>> [1] https://github.com/apache/beam/pull/6458#issuecomment-423406643
>>>> [2]
>>>> https://docs.google.com/document/d/1lfbMhdIyDzIaBTgc9OUByhSwR94kfOzS_ozwKWTVl5U/edit#heading=h.uqzivheohd7j
>>>> [3]
>>>> https://docs.google.com/document/d/1lfbMhdIyDzIaBTgc9OUByhSwR94kfOzS_ozwKWTVl5U/edit#heading=h.a208cwi78xmu
>>>> [4] https://issues.apache.org/jira/browse/BEAM-5459
>>>>
>>>> On Fri, Aug 24, 2018 at 11:48 AM Thomas Weise  wrote:
>>>>
>>>>> Hi Udi,
>>>>>
>>>>> Good to know you will continue this work.
>>>>>
>>>>> Let me know if you want to try the buildbot route (which does not
>>>

Re: Removing documentation for old Beam versions

2018-09-26 Thread Udi Meiri
Just to be clear, generated HTML for javadoc and pydoc will be put in
apache/beam-site, but generated HTML for .md files will be put in
apache/beam under the asf-site branch.

On Wed, Sep 26, 2018 at 9:34 AM Thomas Weise  wrote:

> Looks like there is agreement that all sources should be in the main beam
> repository; the remaining discussion was where the generated content should
> be served from, specifically the generated docs.
>
> If the setup that Alan found allows us to keep using the beam-site
> repository for the generated stuff and that does not unreasonably
> complicate the CI process, then I'm in favor of that. It looks cleaner to
> not mingle source and generated files in the same repo. Otherwise we can do
> the asf-site branch in the main repo and get rid of docs from it once we
> found a better solution.
>
>
> On Wed, Sep 26, 2018 at 7:09 AM Robert Bradshaw 
> wrote:
>
>> OK, thanks. That link was very helpful. Of the three available options,
>> checking into git seems preferable to checking into svn, let alone the
>> CMS. Keeping the same repo means that it's harder to generate the docs for
>> version X while head is checked out.
>>
>> I'm in favor of moving forward with this in the short term, but we should
>> expore other options (like Flink has) for the longer term.
>>
>>
>>
>> On Wed, Sep 26, 2018 at 3:53 PM Scott Wegner  wrote:
>>
>>> Yes. There are few options for publishing your ASF website, described
>>> here: https://www.apache.org/dev/project-site.html. We can publish from
>>> a Git repo, SVN, or a UI-based CMS interface.
>>>
>>> On Wed, Sep 26, 2018 at 9:45 AM Robert Bradshaw 
>>> wrote:
>>>
>>>> I am also definitely in favor of a single repository. Perhaps I'm just
>>>> misunderstanding why the generated must be put in a git repository at
>>>> all--is it because that's the easiest way to serve them?
>>>>
>>>> On Wed, Sep 26, 2018 at 3:39 PM Scott Wegner  wrote:
>>>>
>>>>> Alan found the place where website publishing is configured [1], which
>>>>> has examples of project sites being configured with more than one git 
>>>>> root.
>>>>> This is great for us because it allows us to leave generated
>>>>> javadocs/pydocs in the beam-site repository and publish website markdown
>>>>> content from the main repo.
>>>>>
>>>>> Alan has a PR ready to publish generated HTML in a post-commit job
>>>>> [2]. Once that goes through the last step is to upgrade the publishing
>>>>> config.
>>>>>
>>>>> [1]
>>>>> https://github.com/apache/infrastructure-puppet/blob/deployment/modules/gitwcsub/files/config/gitwcsub.cfg
>>>>> [2] https://github.com/apache/beam/pull/6431
>>>>>
>>>>> On Mon, Sep 24, 2018 at 4:35 PM Scott Wegner 
>>>>> wrote:
>>>>>
>>>>>> > We could add a new default branch (master?) and keep all the
>>>>>> non-generated files (src/) there, and put generated files (content/) in 
>>>>>> the
>>>>>> asf-site branch (like we already do).
>>>>>>
>>>>>> I'm strongly in favor of having sources in a single repository. We
>>>>>> have significant process and infrastructure built up for the apache/beam
>>>>>> repo (for build, PR, CI, release, etc.) that we can take advantage of by
>>>>>> putting website sources in the same repo. The current beam-site repo PR
>>>>>> automation is flaky because it was custom-built and not given the same
>>>>>> level of attention as the main repo.
>>>>>>
>>>>>> The caveat to consolidating website sources in the main repo is that
>>>>>> it incentivizes putting the generated sources branch on the same repo. 
>>>>>> I've
>>>>>> documented a few of the reasons in the Appendix of the design doc [1]:
>>>>>>  - It's easier to maintain a single repository; easily apply existing
>>>>>> tooling/infrastructure
>>>>>> - Jenkins tooling for publishing generated HTML may not work
>>>>>> cross-repo [2]
>>>>>>
>>>>>> My preference is to move forward with the migration of sources to
>>>>>> apache/beam [master], and website generated HTML to apache/beam 
>>>>>> [asf-site].
>>>>>> I like the idea of separating the publishing/h

Python License code

2018-09-26 Thread Udi Meiri
Hi,
I'm reviewing a PR that has code licensed under the Python License. It is
under category A <https://www.apache.org/legal/resolved.html#category-a>
so it's okay to include.
The question is: where do we put the license notice? Is it sufficient to
place the code in a separate module with the license pasted at the top?

https://github.com/apache/beam/pull/6423/files#r220284399


smime.p7s
Description: S/MIME Cryptographic Signature


Re: Python License code

2018-09-26 Thread Udi Meiri
I've opened https://issues.apache.org/jira/browse/LEGAL-417

On Wed, Sep 26, 2018 at 11:43 AM Kenneth Knowles  wrote:

> We do have a top-level NOTICE file. I am neither a lawyer nor familiar
> with the details of the requirements of the Python Software Foundation
> License.
>
> In addition to the page you linked, have you followed the link to the
> LEGAL Jira space (
> https://www.apache.org/legal/resolved.html#asking-questions)? You may
> find your answer there, or could ask.
>
> Kenn
>
> On Wed, Sep 26, 2018 at 11:25 AM Udi Meiri  wrote:
>
>> Hi,
>> I'm reviewing a PR that has code licensed under Python License. It is
>> under category A <https://www.apache.org/legal/resolved.html#category-a> so
>> it's okay to include.
>> The question is: where do we put the license notice? Is it sufficient to
>> place the code in a separate module with the license pasted at the top?
>>
>> https://github.com/apache/beam/pull/6423/files#r220284399
>>
>




Re: [ANNOUNCEMENT] New Beam chair: Kenneth Knowles

2018-09-20 Thread Udi Meiri
Congrats!

On Thu, Sep 20, 2018 at 10:09 AM Raghu Angadi  wrote:

> Congrats Kenn!
>
> On Wed, Sep 19, 2018 at 12:54 PM Davor Bonaci  wrote:
>
>> Hi everyone --
>> It is with great pleasure that I announce that at today's meeting of the
>> Foundation's Board of Directors, the Board has appointed Kenneth Knowles as
>> the second chair of the Apache Beam project.
>>
>> Kenn has served on the PMC since its inception, and is very active and
>> effective in growing the community. His exemplary posts have been cited in
>> other projects. I'm super happy that Kenn accepted the nomination, and
>> I'm confident that he'll serve with distinction.
>>
>> As for myself, I'm not going anywhere. I'm still around and will be as
>> active as I have recently been. Thrilled to be able to pass the baton to
>> such a key member of this community and to have less administrative work to
>> do ;-).
>>
>> Please join me in welcoming Kenn to his new role, and I ask that you
>> support him as much as possible. As always, please let me know if you have
>> any questions.
>>
>> Davor
>>
>




Re: TestDirectRunner for Java?

2019-01-16 Thread Udi Meiri
Existing one? I'm not sure what you mean. There's already
TestPipelineOptions.setOnSuccessMatcher(), but that's silently ignored on
runners like DirectRunner that don't support it.

As I see it the differences are:
onSuccessMatcher - runs after the pipeline has completed successfully, can
be applied on file output contents (e.g. FileChecksumMatcher)
PAssert - runs as part of the pipeline and is applied to PCollection
contents

Meanwhile, I can also waitUntilFinish() and manually run the assert:
assertThat(result, new FileChecksumMatcher(EXPECTED_CHECKSUM,
filenamePrefix + "*-of-*"));

In my case I need to verify file output contents.
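The core of what a FileChecksumMatcher-style check does after waitUntilFinish()
can be sketched language-neutrally (shown in Python for brevity, though the
thread is about Java; file names and the helper are invented for illustration):

```python
# Sketch: after the pipeline finishes, hash the concatenated contents of
# all output shards matching a prefix and compare against an expected value.
import glob
import hashlib
import os
import tempfile


def output_checksum(filename_prefix):
    """SHA-1 over the sorted, concatenated shard files."""
    digest = hashlib.sha1()
    for path in sorted(glob.glob(filename_prefix + "*-of-*")):
        with open(path, "rb") as f:
            digest.update(f.read())
    return digest.hexdigest()


# Simulate sharded pipeline output written before verification.
tmp = tempfile.mkdtemp()
prefix = os.path.join(tmp, "result")
with open(prefix + "-00000-of-00002", "wb") as f:
    f.write(b"hello ")
with open(prefix + "-00001-of-00002", "wb") as f:
    f.write(b"world")

assert output_checksum(prefix) == hashlib.sha1(b"hello world").hexdigest()
```

The point of the manual approach is that the assertion runs in the test
process after completion, so it works on any runner, not just those that
honor onSuccessMatcher.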

On Tue, Jan 15, 2019 at 5:56 PM Kenneth Knowles  wrote:

> Since it is primarily for testing, how about just making it use the
> existing one in the pipeline options? I'm honestly a bit lost as to what
> the use case was when that was introduced, versus waiting for termination
> and running the assertion more directly. Can you enlighten me?
>
> Kenn
>
> On Tue, Jan 15, 2019 at 4:15 PM Udi Meiri  wrote:
>
>> Hi,
>> I want to use DirectRunner for a new IT I'm writing, since it's testing
>> I/O code that's runner agnostic. The problem is that DirectRunner doesn't
>> have a TestDataflowRunner analog, so features like OnSuccessMatcher aren't
>> available.
>>
>> Any objections to adding a TestDirectRunner class?
>>
>



