NPE in Calcite dialect when input PCollection has logical type in schema, from JdbcIO Transform

2020-04-30 Thread rahul patwari
Hi,

A JIRA ticket has been raised to track this bug: BEAM-8307


I have raised a PR: https://github.com/apache/beam/pull/11581 to fix the
issue.

This PR takes care of using BeamSql with JdbcIO.
I would be interested in contributing if any other IOs supported by Beam
require a fix similar to the one in this PR so that they can be used
with BeamSql.

What could be a cleaner approach, in general, to handle this for all the
IOs?

Also, what can be done to support BeamSql with User-Defined Logical Types?
Should they be converted to one of the Beam SQL Types[1] before applying
SqlTransform.query()?
Should we expose an interface to provide Calcite RelDataType Mapping for
User-Defined Logical Types?

Let me know your thoughts.

[1]:
https://github.com/apache/beam/blob/b8aa8486f336df6fc9cf581f29040194edad3b87/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/utils/CalciteUtils.java#L43

Regards,
Rahul
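
To make the first option above concrete (converting a logical type to a type
the SQL layer already understands before applying the query), here is a
minimal sketch in plain Python. It does not use the Beam API: FieldType,
SQL_SUPPORTED and to_sql_type are hypothetical names made up for
illustration, and the set of supported type names is an assumption.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: type names the SQL layer can map directly (illustrative only).
SQL_SUPPORTED = {"INT32", "INT64", "DOUBLE", "STRING", "BOOLEAN", "DATETIME"}


@dataclass(frozen=True)
class FieldType:
    """Hypothetical stand-in for a schema field type."""
    name: str                           # e.g. "STRING", or "LOGICAL" for a logical type
    base: Optional["FieldType"] = None  # base type, set only for logical types
    identifier: str = ""                # logical type identifier, used in error messages


def to_sql_type(field_type: FieldType) -> FieldType:
    """Unwrap (possibly nested) logical types until a SQL-supported type is reached."""
    current = field_type
    while current.name == "LOGICAL":
        if current.base is None:
            raise ValueError(f"logical type {current.identifier!r} has no base type")
        current = current.base
    if current.name not in SQL_SUPPORTED:
        raise ValueError(f"type {current.name!r} has no SQL mapping")
    return current
```

Failing with an explicit message when no mapping exists is the point of the
sketch: it replaces a NullPointerException deep in the planner with an error
that names the offending type.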


Re: Jenkins jobs not running for my PR 10438

2020-04-30 Thread Tomo Suzuki
Hi Beam committers,

Would you trigger the precommit checks for this PR?
https://github.com/apache/beam/pull/11586

Regards,
Tomo


Re: [REVIEW][please pause website changes] Migrated the Beam website to Hugo

2020-04-30 Thread Ahmet Altay
Nam,

 - Website looks good and looks the same as the current website. (Visually
comparing a few pages, not a deep analysis.)
- contribute.md looks good. (this is new content.)
- website/Dockerfile and website/README.md changes look good.
- I do not know what the new version of some files is, for example:
website/src/_data/authors.yml,  website/src/_data/capability-matrix.yml --
what replaces them?

There are 887 file changes. It is not easy to review this. I wanted to go
commit by commit, but that did not help much. How about we try to organize
this review as reviewable commits?
- Changes to the mechanics (jekyll to hugo), themes, build files, website
related readmes etc. This will likely be a smaller change in number of
files. (This will likely have many completely new and completely deleted
files. Only a few files will have meaningful diffs.)
- Changes to the content. This might be a large number of files with
minimal changes. I do not think we can manually review each file, but at
least a quick review of minimal changes to each file would be good enough.

What do you think?

Ahmet

On Thu, Apr 30, 2020 at 4:29 PM Hannah Jiang  wrote:

> Since we want to move forward with the PR, I would like to ask the
>> community to hold off changes to the current Beam website for a week, until
>> we are able to review and merge the PR. Is this acceptable to everyone?
>
> Do we have an exact date when we can push changes to the website? I have
> PRs to update documents so would like to plan ahead.
>
> On Thu, Apr 30, 2020 at 1:17 PM Nam Bui  wrote:
>
>> Hey guys,
>>
>> I tried my best to handle renamed files in Git. I have no clue why GitHub
>> doesn't show it, but finally, I made this commit [1] (thanks for your
>> idea @bhulette) so you guys can review changes with ease (there is no bunch
>> of deleted markdown files anymore :D). Also, new staged version is
>> deployed, you could check it out [2].
>>
>> In case you are interested in translation, here is the proof of concept
>> [3] (the earth icon on the right corner is temporarily used for switching
>> languages). You can take a look at the translation guide for this PoC [4].
>>
>> [1]
>> https://github.com/apache/beam/pull/11554/commits/b267bb360866a723ac2536f408f23de648c7cd4d
>> [2]
>> http://apache-beam-website-pull-requests.storage.googleapis.com/11554/index.html
>> [3] https://safe-relation.surge.sh/
>> [4]
>> https://github.com/PolideaInternal/beam/blob/website-develop/website/CONTRIBUTE.md#translation-guide
>>
>>
>> On Thu, Apr 30, 2020 at 7:24 PM Brian Hulette 
>> wrote:
>>
>>> Changing the URLs is fine with me as long as the old urls will work too.
>>>
>>> But do we need to change the filenames for the blog posts to accomplish
>>> that? It's nice that the blog post markdown files start with a date so they
>>> naturally sort chronologically. It looks like this hugo PR [1] made it
>>> possible to extract date metadata and slug
>>> (i.e. dataflow-python-sdk-is-now-public) separately from the filename.
>>>
>>> [1] https://github.com/gohugoio/hugo/pull/4494
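
The idea of keeping the date in the filename while leaving it out of the URL
can be sketched as below. This is only an illustration in Python, not how
Hugo implements it; the filename pattern (YYYY-MM-DD-slug.md), the example
filename, and the function names are assumptions.

```python
import re
from datetime import date

# Assumed filename convention: "<YYYY>-<MM>-<DD>-<slug>.md"
_POST_RE = re.compile(r"^(\d{4})-(\d{2})-(\d{2})-(.+)\.md$")


def split_post_filename(filename):
    """Return (publish_date, slug) parsed from a dated blog post filename."""
    match = _POST_RE.match(filename)
    if match is None:
        raise ValueError(f"not a dated post filename: {filename!r}")
    year, month, day, slug = match.groups()
    return date(int(year), int(month), int(day)), slug


def blog_url(filename):
    """Build the date-free URL path for a post; the date stays in metadata."""
    _, slug = split_post_filename(filename)
    return f"/blog/{slug}/"
```

With a hypothetical file named
2016-02-25-dataflow-python-sdk-is-now-public.md, the files still sort
chronologically on disk while the URL path contains only the slug.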
>>>
>>> On Thu, Apr 30, 2020 at 10:06 AM Ahmet Altay  wrote:
>>>


 On Thu, Apr 30, 2020 at 9:55 AM Thomas Weise  wrote:

> For changed URLs, will previous URLs be mapped to avoid broken
> external links?
>

 I believe the answer is yes from Nam's response "For now, we keep the
 old URLs working in terms of redirecting them". I very much agree that this
 is very important and should work for all existing urls.


>
>
> On Thu, Apr 30, 2020 at 9:34 AM Aizhamal Nurmamat kyzy <
> aizha...@apache.org> wrote:
>
>> Hi,
>>
>> To give a little more context regarding the URLs, the date should
>> still appear on the blog post, but not on the URL.
>> For example, we'd have:
>>
>> https://beam.apache.org/beam/python/sdk/2016/02/25/python-sdk-now-public.html
>> become
>> https://beam.apache.org/blog/dataflow-python-sdk-is-now-public/.
>>
>
 I am not a content marketer. IMO, this is a good change. In the past, a
 few times, we edited dates on posts (e.g. a release date was entered
 incorrectly) and we had to either have a mismatch between dates in the url
 and the date in the blog, or change the url. This change simplifies things by
 keeping the date in only one place (the content metadata).


>
>> The blog posts would have a small header showing the title, author
>> and publish date. But the URL would not have it.
>> Thoughts?
>>
>>
>> On Thu, Apr 30, 2020 at 9:23 AM Nam Bui  wrote:
>>
>>> Hi,
>>>
>>> @altay: Hey hey. Yeah, I didn't expect the baseUrl of staging
>>> version is "
>>> http://apache-beam-website-pull-requests.storage.googleapis.com/11554/"
>>> which also includes "/11554", and Hugo considers it as a path so it 
>>> breaks
>>> the path of "static files" (like images). We made a fix. Now I'm 
>>> working on
>>> "getti

Re: Jenkins jobs not running for my PR 10438

2020-04-30 Thread Ahmet Altay
Done.

On Thu, Apr 30, 2020 at 7:21 PM rahul patwari 
wrote:

> Hi Committers,
>
> Can you please trigger tests for
> https://github.com/apache/beam/pull/11569 and
> https://github.com/apache/beam/pull/11581
>
> Thanks,
> Rahul
>
> On Tue, 28 Apr 2020, 10:58 pm Alexey Romanenko, 
> wrote:
>
>> Thanks Udi! I'll track for updates on this.
>>
>> On 28 Apr 2020, at 19:16, Udi Meiri  wrote:
>>
>> Alexey, what you're doing should be working (commits should trigger
>> tests, as should "retest this please" and other phrases).
>>
>> https://issues.apache.org/jira/browse/INFRA-19836 tracks this issue
>>
>> On Tue, Apr 28, 2020 at 10:04 AM Alexey Romanenko <
>> aromanenko@gmail.com> wrote:
>>
>>> Does anyone know the “golden rule” how to trigger Jenkins tests?
>>>
>>> For example:
>>> https://github.com/apache/beam/pull/11341
>>> I tried several times and it’s still not triggered.
>>>
>>> On 28 Apr 2020, at 13:33, Ismaël Mejía  wrote:
>>>
>>> done
>>>
>>> On Tue, Apr 28, 2020 at 12:47 PM Shoaib Zafar <
>>> shoaib.za...@venturedive.com> wrote:
>>>
 Hello Beam Committers,

 I would appreciate if you could trigger precommit checks for the PR:
 https://github.com/apache/beam/pull/11210 along with the python
 post-commit check (Run Python 3.5 PostCommit).

 Thanks and Regards.

 *Shoaib Zafar*
 Software Engineering Lead
 Mobile: +92 333 274 6242
 Skype: live:shoaibzafar_1

 


 On Wed, Apr 22, 2020 at 9:40 PM Rehman Murad Ali <
 rehman.murad...@venturedive.com> wrote:

> Hello Beam Committers.
>
> Would you please trigger basic tests as well as all *validatesRunner*
> test on this PR:
> https://github.com/apache/beam/pull/11154 
> 
>
>
> *Thanks & Regards*
>
>
>
> *Rehman Murad Ali*
> Software Engineer
> Mobile: +92 3452076766
> Skype: rehman.muradali
>
>
> On Wed, Apr 22, 2020 at 9:25 PM Yoshiki Obata 
> wrote:
>
>> Hello Beam Committers,
>>
>> I would appreciate if you could trigger precommit checks for these
>> PRs;
>> https://github.com/apache/beam/pull/11493
>> https://github.com/apache/beam/pull/11494
>>
>> Regards
>> yoshiki
>>
>> 2020年4月21日(火) 1:11 Luke Cwik :
>>
>>> The precommits started and I provided the comments for the
>>> postcommits as you have requested but they have yet to start.
>>>
>>> On Mon, Apr 20, 2020 at 8:31 AM Shoaib Zafar <
>>> shoaib.za...@venturedive.com> wrote:
>>>
 Hello Beam Committers.

 Would you please trigger the pre-commit checks on the PR:
 https://github.com/apache/beam/pull/11210 along with the python
 post-commit checks (Run Python PostCommit, Run Python 3.5 PostCommit)?

 Thanks! Regards,

 *Shoaib Zafar*
 Software Engineering Lead
 Mobile: +92 333 274 6242
 Skype: live:shoaibzafar_1

 


 On Fri, Apr 17, 2020 at 1:19 PM Ismaël Mejía 
 wrote:

> done
>
> On Thu, Apr 16, 2020 at 4:32 PM Rehman Murad Ali <
> rehman.murad...@venturedive.com> wrote:
>
>> Hello Beam Committers.
>>
>> Would you please trigger basic tests as well as validatesRunner
>> test on this PR:
>>
>> 
>> https://github.com/apache/beam/pull/11350
>>
>>
>> *Thanks & Regards*
>>
>>
>>
>> *Rehman Murad Ali*
>> Software Engineer
>> Mobile: +92 3452076766
>> Skype: rehman.muradali
>>
>>
>> On Mon, Apr 13, 2020 at 10:16 PM Ahmet Altay 
>> wrote:
>>
>>> Done.
>>>
>>> On Mon, Apr 13, 2020 at 8:52 AM Shoaib Zafar <
>>> shoaib.za...@venturedive.com> wrote:
>>>
 Hello Beam Committers.

 Would you please trigger the pre-commit checks on the PR:
 https://github.com/apache/beam/pull/11210 along with the
 python post-commit checks (Run Python PostCommit, Run Python 3.5
 PostCommit)?

 Thanks!

 *Shoaib Zafar*
 Software Engineering Lead
 Mobile: +92 333 274 6242
 Skype: live:shoaibzafar_1

 


 On Mon, Apr 13, 2020 at 4:00 PM Ismaël Mejía 
 wrote:

> done
>
> On Mon, Apr 13, 2020 at 12:42 PM Rehman Murad Ali
>  wrote:
> >
> > Hi Beam Committers!
> >
> > Thanks(

Re: Jenkins jobs not running for my PR 10438

2020-04-30 Thread rahul patwari
Hi Committers,

Can you please trigger tests for  https://github.com/apache/beam/pull/11569
and https://github.com/apache/beam/pull/11581

Thanks,
Rahul

On Tue, 28 Apr 2020, 10:58 pm Alexey Romanenko, 
wrote:

> Thanks Udi! I'll track for updates on this.
>
> On 28 Apr 2020, at 19:16, Udi Meiri  wrote:
>
> Alexey, what you're doing should be working (commits should trigger tests,
> as should "retest this please" and other phrases).
>
> https://issues.apache.org/jira/browse/INFRA-19836 tracks this issue
>
> On Tue, Apr 28, 2020 at 10:04 AM Alexey Romanenko <
> aromanenko@gmail.com> wrote:
>
>> Does anyone know the “golden rule” how to trigger Jenkins tests?
>>
>> For example:
>> https://github.com/apache/beam/pull/11341
>> I tried several times and it’s still not triggered.
>>
>> On 28 Apr 2020, at 13:33, Ismaël Mejía  wrote:
>>
>> done
>>
>> On Tue, Apr 28, 2020 at 12:47 PM Shoaib Zafar <
>> shoaib.za...@venturedive.com> wrote:
>>
>>> Hello Beam Committers,
>>>
>>> I would appreciate if you could trigger precommit checks for the PR:
>>> https://github.com/apache/beam/pull/11210 along with the python
>>> post-commit check (Run Python 3.5 PostCommit).
>>>
>>> Thanks and Regards.
>>>
>>> *Shoaib Zafar*
>>> Software Engineering Lead
>>> Mobile: +92 333 274 6242
>>> Skype: live:shoaibzafar_1
>>>
>>> 
>>>
>>>
>>> On Wed, Apr 22, 2020 at 9:40 PM Rehman Murad Ali <
>>> rehman.murad...@venturedive.com> wrote:
>>>
 Hello Beam Committers.

 Would you please trigger basic tests as well as all *validatesRunner*
 test on this PR:
 https://github.com/apache/beam/pull/11154 
 


 *Thanks & Regards*



 *Rehman Murad Ali*
 Software Engineer
 Mobile: +92 3452076766
 Skype: rehman.muradali


 On Wed, Apr 22, 2020 at 9:25 PM Yoshiki Obata 
 wrote:

> Hello Beam Committers,
>
> I would appreciate if you could trigger precommit checks for these PRs;
> https://github.com/apache/beam/pull/11493
> https://github.com/apache/beam/pull/11494
>
> Regards
> yoshiki
>
> 2020年4月21日(火) 1:11 Luke Cwik :
>
>> The precommits started and I provided the comments for the
>> postcommits as you have requested but they have yet to start.
>>
>> On Mon, Apr 20, 2020 at 8:31 AM Shoaib Zafar <
>> shoaib.za...@venturedive.com> wrote:
>>
>>> Hello Beam Committers.
>>>
>>> Would you please trigger the pre-commit checks on the PR:
>>> https://github.com/apache/beam/pull/11210 along with the python
>>> post-commit checks (Run Python PostCommit, Run Python 3.5 PostCommit)?
>>>
>>> Thanks! Regards,
>>>
>>> *Shoaib Zafar*
>>> Software Engineering Lead
>>> Mobile: +92 333 274 6242
>>> Skype: live:shoaibzafar_1
>>>
>>> 
>>>
>>>
>>> On Fri, Apr 17, 2020 at 1:19 PM Ismaël Mejía 
>>> wrote:
>>>
 done

 On Thu, Apr 16, 2020 at 4:32 PM Rehman Murad Ali <
 rehman.murad...@venturedive.com> wrote:

> Hello Beam Committers.
>
> Would you please trigger basic tests as well as validatesRunner
> test on this PR:
>
> 
> https://github.com/apache/beam/pull/11350
>
>
> *Thanks & Regards*
>
>
>
> *Rehman Murad Ali*
> Software Engineer
> Mobile: +92 3452076766
> Skype: rehman.muradali
>
>
> On Mon, Apr 13, 2020 at 10:16 PM Ahmet Altay 
> wrote:
>
>> Done.
>>
>> On Mon, Apr 13, 2020 at 8:52 AM Shoaib Zafar <
>> shoaib.za...@venturedive.com> wrote:
>>
>>> Hello Beam Committers.
>>>
>>> Would you please trigger the pre-commit checks on the PR:
>>> https://github.com/apache/beam/pull/11210 along with the python
>>> post-commit checks (Run Python PostCommit, Run Python 3.5 
>>> PostCommit)?
>>>
>>> Thanks!
>>>
>>> *Shoaib Zafar*
>>> Software Engineering Lead
>>> Mobile: +92 333 274 6242
>>> Skype: live:shoaibzafar_1
>>>
>>> 
>>>
>>>
>>> On Mon, Apr 13, 2020 at 4:00 PM Ismaël Mejía 
>>> wrote:
>>>
 done

 On Mon, Apr 13, 2020 at 12:42 PM Rehman Murad Ali
  wrote:
 >
 > Hi Beam Committers!
 >
 > Thanks( Ismael )
 >
 > I appreciate if someone could trigger these tests on this PR
 https://github.com/apache/beam/pull/11154
 >
 > run dataflow validatesrunner
 > run flink validate

Re: [REVIEW][please pause website changes] Migrated the Beam website to Hugo

2020-04-30 Thread Hannah Jiang
>
> Since we want to move forward with the PR, I would like to ask the
> community to hold off changes to the current Beam website for a week, until
> we are able to review and merge the PR. Is this acceptable to everyone?

Do we have an exact date when we can push changes to the website? I have
PRs to update documents so would like to plan ahead.

On Thu, Apr 30, 2020 at 1:17 PM Nam Bui  wrote:

> Hey guys,
>
> I tried my best to handle renamed files in Git. I have no clue why GitHub
> doesn't show it, but finally, I made this commit [1] (thanks for your
> idea @bhulette) so you guys can review changes with ease (there is no bunch
> of deleted markdown files anymore :D). Also, new staged version is
> deployed, you could check it out [2].
>
> In case you are interested in translation, here is the proof of concept
> [3] (the earth icon on the right corner is temporarily used for switching
> languages). You can take a look at the translation guide for this PoC [4].
>
> [1]
> https://github.com/apache/beam/pull/11554/commits/b267bb360866a723ac2536f408f23de648c7cd4d
> [2]
> http://apache-beam-website-pull-requests.storage.googleapis.com/11554/index.html
> [3] https://safe-relation.surge.sh/
> [4]
> https://github.com/PolideaInternal/beam/blob/website-develop/website/CONTRIBUTE.md#translation-guide
>
>
> On Thu, Apr 30, 2020 at 7:24 PM Brian Hulette  wrote:
>
>> Changing the URLs is fine with me as long as the old urls will work too.
>>
>> But do we need to change the filenames for the blog posts to accomplish
>> that? It's nice that the blog post markdown files start with a date so they
>> naturally sort chronologically. It looks like this hugo PR [1] made it
>> possible to extract date metadata and slug
>> (i.e. dataflow-python-sdk-is-now-public) separately from the filename.
>>
>> [1] https://github.com/gohugoio/hugo/pull/4494
>>
>> On Thu, Apr 30, 2020 at 10:06 AM Ahmet Altay  wrote:
>>
>>>
>>>
>>> On Thu, Apr 30, 2020 at 9:55 AM Thomas Weise  wrote:
>>>
 For changed URLs, will previous URLs be mapped to avoid broken external
 links?

>>>
>>> I believe the answer is yes from Nam's response "For now, we keep the
>>> old URLs working in terms of redirecting them". I very much agree that this
>>> is very important and should work for all existing urls.
>>>
>>>


 On Thu, Apr 30, 2020 at 9:34 AM Aizhamal Nurmamat kyzy <
 aizha...@apache.org> wrote:

> Hi,
>
> To give a little more context regarding the URLs, the date should
> still appear on the blog post, but not on the URL.
> For example, we'd have:
>
> https://beam.apache.org/beam/python/sdk/2016/02/25/python-sdk-now-public.html
> become https://beam.apache.org/blog/dataflow-python-sdk-is-now-public/
> .
>

>>> I am not a content marketer. IMO, this is a good change. In the past, a
>>> few times, we edited dates on posts (e.g. a release date was entered
>>> incorrectly) and we had to either have a mismatch between dates in the url
>>> and the date in the blog, or change the url. This change simplifies things by
>>> keeping the date in only one place (the content metadata).
>>>
>>>

> The blog posts would have a small header showing the title, author and
> publish date. But the URL would not have it.
> Thoughts?
>
>
> On Thu, Apr 30, 2020 at 9:23 AM Nam Bui  wrote:
>
>> Hi,
>>
>> @altay: Hey hey. Yeah, I didn't expect the baseUrl of staging version
>> is "
>> http://apache-beam-website-pull-requests.storage.googleapis.com/11554/"
>> which also includes "/11554", and Hugo considers it as a path so it 
>> breaks
>> the path of "static files" (like images). We made a fix. Now I'm working 
>> on
>> "getting git to recognize files as renames" as you suggested.
>>
>> @robert: The dates are nice but it causes verbose/long/ugly URLs. We
>> discussed with Aizhamal in the development stage and agreed to get rid of
>> this. For now, we keep the old URLs working in terms of redirecting them.
>> However, from now on, we should change the name convention on blog posts 
>> to
>> have a fancy URL like "beam.apache.org/blog/myblogpost.md". :)
>>
>>
>>
>> On Thu, Apr 30, 2020 at 2:57 AM Robert Bradshaw 
>> wrote:
>>
>>> On Wed, Apr 29, 2020 at 5:08 PM Ahmet Altay 
>>> wrote:
>>>
 Nam, this looks better. At least links are working, and the website
 visually looks similar and generally in good shape. I think there are 
 still
 issues. For example, I do not see any of the images (e.g. the beam 
 logo on
 top left is missing.)

 On Wed, Apr 29, 2020 at 3:11 PM Brian Hulette 
 wrote:

> I left a comment on the PR [1]. I think the reason all of the
> website content is not being tracked as file renames is because there 
> was a
> series of commits that created files in the new di

Re: Rethinking Python's PortableRunner default job server

2020-04-30 Thread Ismaël Mejía
Exactly, and it is not the same because there is an extra layer: the
PortableRunner does not deal with the same issues as the other runners
(e.g. translation and execution in the target system), so it feels more
like a proxy than the 'translating runners' in the open source case.

On Thu, Apr 30, 2020 at 9:53 PM Kyle Weaver  wrote:
>
> > all runners (with perhaps the exception of the direct runner) are proxies 
> > for actual runners
>
> Agreed. The main difference is that this fact is more obvious for Dataflow 
> users, since it is "Cloud" Dataflow after all. The relationship of Beam to 
> its OSS runners is much less clear to new users (for example, folks are often 
> confused about the difference between Beam's Flink job server images and 
> Flink's own Docker images).
>
> > though we could argue that the direct runner would be a reasonable default
>
> Why set runner=PortableRunner then, when direct runner is the default?
> Besides, the direct runner has its own murky status with regard to 
> portability, and its own defaults and branching paths, so I'd rather leave 
> that out of the equation.
>
> On Thu, Apr 30, 2020 at 3:23 PM Robert Bradshaw  wrote:
>>
>> In a sense, all runners (with perhaps the exception of the direct runner) 
>> are proxies for actual runners. In that sense, I think it makes just as much 
>> sense to say "I want the portable runner with job endpoint X" as to say "I 
>> want the flink runner with master Y." Saying "I want the Portable Runner" 
>> without specifying an endpoint should, however, be undefined (though we 
>> could argue that the direct runner would be a reasonable default).
>>
>> On Thu, Apr 30, 2020 at 11:49 AM Ismaël Mejía  wrote:
>>>
>>> Thomas has a point on the PortableRunner name. I was super confused
>>> because `PortableRunner` is not a runner. I don't know if it is
>>> too late, but maybe it is still worth giving it a better name.
>>>
>>> On Thu, Apr 30, 2020 at 8:41 PM Thomas Weise  wrote:
>>> >
>>> > +1 for removing the default runner. It has always been the Beam user 
>>> > expectation that a runner needs to be selected.
>>> >
>>> > "PortableRunner" isn't a runner (despite its name) - it's a proxy to a 
>>> > runner that the user specifies via job_endpoint.
>>> >
>>> > Thanks for cleaning this up!
>>> >
>>> > On Thu, Apr 30, 2020 at 10:11 AM Kyle Weaver  wrote:
>>> >>
>>> >> I'll bite :) Thanks for the feedback everyone!
>>> >>
>>> >> On Thu, Apr 30, 2020 at 1:01 PM Robert Bradshaw  
>>> >> wrote:
>>> >>>
>>> >>> I filed https://issues.apache.org/jira/browse/BEAM-9860. Any takers?
>>> >>>
>>> >>> On Thu, Apr 30, 2020 at 5:49 AM Ismaël Mejía  wrote:
>>> 
>>>  +1 for A there are zero reasons to have a default runner set by
>>>  default, being explicit is better as Robert suggests and it resolves
>>>  the confusion that the user reported.
>>> 
>>>  On Wed, Apr 29, 2020 at 10:05 PM Robert Bradshaw  
>>>  wrote:
>>>  >
>>>  > +1, I was actually thinking about this just the other day. 
>>>  > PortableRunner should require job_endpoint to be set, and we can 
>>>  > have a nice error message directing the explicit use of FlinkRunner 
>>>  > for the old behavior.
>>>  >
>>>  > On Wed, Apr 29, 2020 at 11:50 AM Kyle Weaver  
>>>  > wrote:
>>>  >>
>>>  >> > Could the error message suggest switching to FlinkRunner (and/or 
>>>  >> > other runners that start a job server for you)? Then it seems 
>>>  >> > like the breakage would only be a minor annoyance.
>>>  >>
>>>  >> Definitely.
>>>  >>
>>>  >> On Wed, Apr 29, 2020 at 2:49 PM Brian Hulette  
>>>  >> wrote:
>>>  >>>
>>>  >>> Could the error message suggest switching to FlinkRunner (and/or 
>>>  >>> other runners that start a job server for you)? Then it seems like 
>>>  >>> the breakage would only be a minor annoyance.
>>>  >>>
>>>  >>> Brian
>>>  >>>
>>>  >>> On Wed, Apr 29, 2020 at 11:32 AM Kyle Weaver  
>>>  >>> wrote:
>>>  
>>>   Hi all,
>>>  
>>>   Currently, when running a pipeline that has the options 
>>>   runner=PortableRunner and job_endpoint unset, the Python SDK 
>>>   spins up a Dockerized Flink job server [1]. This is problematic 
>>>   because the PortableRunner can be used by any portable runner. So 
>>>   for example, a Spark runner user was recently baffled when their 
>>>   job ran successfully but printed a bunch of Flink log messages.
>>>  
>>>   There are not too many uses of this default behavior to my 
>>>   knowledge, at least within Beam itself. The only example I could 
>>>   find was in the portableWordCount tests, which is mostly the same 
>>>   as portableWordCountFlinkRunner tests [2]. The default behavior 
>>>   is entirely superseded by the FlinkRunner class, which provides 
>>>   better encapsulation.
>>>  
>>>  
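
The behavior proposed in this thread (require job_endpoint for PortableRunner
and point users to FlinkRunner in the error message) could look roughly like
the following. This is a sketch only, not the Python SDK's actual code;
check_runner_options and the exact error wording are hypothetical.

```python
def check_runner_options(runner, job_endpoint=None):
    """Reject runner=PortableRunner when no job_endpoint is given.

    Hypothetical helper: option names follow the thread (runner, job_endpoint),
    everything else is made up for illustration.
    """
    if runner == "PortableRunner" and not job_endpoint:
        raise ValueError(
            "PortableRunner requires job_endpoint to be set. "
            "To have a Flink job server started for you, use "
            "runner=FlinkRunner instead."
        )
    return runner, job_endpoint
```

Failing early here means a Spark user who forgets job_endpoint gets an
explicit error instead of a silently started Flink job server.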

Re: Python 3.7 docker container fails to build

2020-04-30 Thread Ahmet Altay
+1 to periodic cleanups of the workers. I do not know what would be a good
frequency, daily or something else. Do we have a JIRA for this?

On Thu, Apr 30, 2020 at 2:22 PM Udi Meiri  wrote:

> I summarized my idea here: https://issues.apache.org/jira/browse/BEAM-9865
>

+1 to this idea as well.
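
For the disk-space question in this thread ("Does your job require more than
that?"), a pre-build check could be as simple as the sketch below. It is an
illustration only and is not tied to Jenkins or the ws-cleanup plugin; the
function names and the idea of a per-job space requirement are assumptions.

```python
import shutil

GIB = 1024 ** 3


def has_enough_space(free_bytes, required_bytes):
    """Pure comparison, kept separate so it can be tested without a real disk."""
    return free_bytes >= required_bytes


def workspace_has_space(path, required_gib):
    """Check the filesystem that holds `path` for at least `required_gib` GiB free."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    return has_enough_space(usage.free, required_gib * GIB)
```

A job could call workspace_has_space on its workspace before building the
containers and fail fast with a clear message, rather than surfacing
"[Errno 28] No space left on device" from pip halfway through the build.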


>
>
> On Thu, Apr 30, 2020 at 2:01 PM Maximilian Michels  wrote:
>
>> On 30.04.20 21:48, Hannah Jiang wrote:
>> > --info tag was passed to docker image build commands with PythonDocker
>> > Precommit to capture more logs. Without the tag, errors from
>> > DockerFile step are not printed out to the console.
>>
>> Thanks for the info (pun intended).
>>
>> On 30.04.20 21:48, Hannah Jiang wrote:
>> > Indeed, I can see the no space left on device in the following but
>> > not in the log above:
>> >
>> > --info tag was passed to docker image build commands with PythonDocker
>> > Precommit to capture more logs. Without the tag, errors from DockerFile
>> > step are not printed out to the console.
>> >
>> > On Thu, Apr 30, 2020 at 11:19 AM Udi Meiri wrote:
>> >
>> > I checked node 8 and it had over 40GB space available. Does your job
>> > require more than that?
>> >
>> > Long term, I'm thinking we could clean up workspaces for successful
>> > jobs. This should free up additional space (I guess at least 100GB).
>> > https://plugins.jenkins.io/ws-cleanup/ - we already use this plugin
>> > to clean workspaces at job start.
>> >
>> >
>> > On Thu, Apr 30, 2020, 07:33 Maximilian Michels wrote:
>> >
>> > *It's working again, probably because it's running on a
>> different
>> > machine now.
>> >
>> > Who can check the disk space of the Jenkins hosts?
>> >
>> > Thanks,
>> > Max
>> >
>> > On 30.04.20 11:55, Maximilian Michels wrote:
>> > > Sorry, I meant to include the Jenkins log:
>> > >
>> >
>> https://builds.apache.org/job/beam_LoadTests_Python_ParDo_Flink_Streaming_PR/5/console
>> > >
>> > > Thanks for investigating Hannah! Indeed, I can see the no
>> > space left on
>> > > device in the following but not in the log above:
>> > >
>> >
>> https://builds.apache.org/job/beam_PreCommit_PythonDocker_Commit/473/console
>> > >
>> > > I'm going to try running the build again. Do you think we
>> > could add more
>> > > storage to our Jenkins hosts or delete old build data?
>> > >
>> > > Thanks,
>> > > Max
>> > >
>> > > On 30.04.20 08:43, Hannah Jiang wrote:
>> > >> Max, I found a link from your PR and noticed below errors.
>> > This would be
>> > >> the true error.
>> > >>
>> > >> *07:57:03* >*Task :sdks:python:container:py37:docker*
>> > >> *07:57:03*  ERROR: Could not install packages due to an EnvironmentError: [Errno 28] No space left on device
>> > >> *07:57:03*
>> > >> *07:57:03* >*Task :sdks:python:container:py35:docker*
>> > >> *07:57:03*  ERROR: Could not install packages due to an EnvironmentError: [Errno 28] No space left on device
>> > >>
>> > >>
>> > >>
>> > >> On Wed, Apr 29, 2020 at 5:59 PM Hannah Jiang wrote:
>> > >>
>> > >> There is a PythonDocker Precommit test running for PRs
>> > with Python
>> > >> changes. It seems running well.[1]
>> > >> Max, can you please give me a link so I can check more
>> > details? Do
>> > >> other images with different Python versions fail as well?
>> > >>
>> > >>
>> >  1.
>> https://builds.apache.org/job/beam_PreCommit_PythonDocker_Commit/
>> > >>
>> > >>
>> > >> On Wed, Apr 29, 2020 at 2:44 PM Ahmet Altay wrote:
>> > >>
>> > >> +Valentyn Tymofieiev +Hannah Jiang -- in case they have relevant
>> > >> information.
>> > >>
>> > >> On Wed, Apr 29, 2020 at 12:35 PM Maximilian Michels wrote:
>> > >>
>> > >> Hi,
>> > >>
>> > >> has anyone noticed the Python 3.7 Docker
>> > container fails to
>> > >> build? I
>> > >> haven't been able to build 

Re: Python 3.7 docker container fails to build

2020-04-30 Thread Udi Meiri
I summarized my idea here: https://issues.apache.org/jira/browse/BEAM-9865


On Thu, Apr 30, 2020 at 2:01 PM Maximilian Michels  wrote:

> On 30.04.20 21:48, Hannah Jiang wrote:
> > --info tag was passed to docker image build commands with PythonDocker
> > Precommit to capture more logs. Without the tag, errors from
> > DockerFile step are not printed out to the console.
>
> Thanks for the info (pun intended).
>
> On 30.04.20 21:48, Hannah Jiang wrote:
> > Indeed, I can see the no space left on device in the following but
> > not in the log above:
> >
> > --info tag was passed to docker image build commands with PythonDocker
> > Precommit to capture more logs. Without the tag, errors from DockerFile
> > step are not printed out to the console.
> >
> > On Thu, Apr 30, 2020 at 11:19 AM Udi Meiri wrote:
> >
> > I checked node 8 and it had over 40GB space available. Does your job
> > require more than that?
> >
> > Long term, I'm thinking we could clean up workspaces for successful
> > jobs. This should free up additional space (I guess at least 100GB).
> > https://plugins.jenkins.io/ws-cleanup/ - we already use this plugin
> > to clean workspaces at job start.
> >
> >
> > On Thu, Apr 30, 2020, 07:33 Maximilian Michels wrote:
> >
> > *It's working again, probably because it's running on a different
> > machine now.
> >
> > Who can check the disk space of the Jenkins hosts?
> >
> > Thanks,
> > Max
> >
> > On 30.04.20 11:55, Maximilian Michels wrote:
> > > Sorry, I meant to include the Jenkins log:
> > >
> >
> https://builds.apache.org/job/beam_LoadTests_Python_ParDo_Flink_Streaming_PR/5/console
> > >
> > > Thanks for investigating Hannah! Indeed, I can see the no
> > space left on
> > > device in the following but not in the log above:
> > >
> >
> https://builds.apache.org/job/beam_PreCommit_PythonDocker_Commit/473/console
> > >
> > > I'm going to try running the build again. Do you think we
> > could add more
> > > storage to our Jenkins hosts or delete old build data?
> > >
> > > Thanks,
> > > Max
> > >
> > > On 30.04.20 08:43, Hannah Jiang wrote:
> > >> Max, I found a link from your PR and noticed below errors.
> > This would be
> > >> the true error.
> > >>
> > >> *07:57:03* >*Task :sdks:python:container:py37:docker*
> > >> *07:57:03*  ERROR: Could not install packages due to an EnvironmentError: [Errno 28] No space left on device
> > >> *07:57:03*
> > >> *07:57:03* >*Task :sdks:python:container:py35:docker*
> > >> *07:57:03*  ERROR: Could not install packages due to an EnvironmentError: [Errno 28] No space left on device
> > >>
> > >>
> > >>
> > >> On Wed, Apr 29, 2020 at 5:59 PM Hannah Jiang wrote:
> > >>
> > >> There is a PythonDocker Precommit test running for PRs
> > with Python
> > >> changes. It seems running well.[1]
> > >> Max, can you please give me a link so I can check more
> > details? Do
> > >> other images with different Python versions fail as well?
> > >>
> > >>
> >  1.
> https://builds.apache.org/job/beam_PreCommit_PythonDocker_Commit/
> > >>
> > >>
> > >> On Wed, Apr 29, 2020 at 2:44 PM Ahmet Altay wrote:
> > >>
> > >> +Valentyn Tymofieiev +Hannah Jiang -- in case they have relevant
> > >> information.
> > >>
> > >> On Wed, Apr 29, 2020 at 12:35 PM Maximilian Michels wrote:
> > >>
> > >> Hi,
> > >>
> > >> has anyone noticed the Python 3.7 Docker
> > container fails to
> > >> build? I
> > >> haven't been able to build the Python 3.7
> > container, neither
> > >> locally nor
> > >> on Jenkins.
> > >>
> > >> I get:
> > >>
> > >> 17:48:10 > Task :sdks:python:container:py37:docker
> > >> 17:49:36 The command '/bin/sh -c pip install -r
> 

Re: Python 3.7 docker container fails to build

2020-04-30 Thread Maximilian Michels
On 30.04.20 21:48, Hannah Jiang wrote:
> --info tag was passed to docker image build commands with PythonDocker
> Precommit to capture more logs. Without the tag, errors from
> DockerFile step are not printed out to the console.

Thanks for the info (pun intended).

On 30.04.20 21:48, Hannah Jiang wrote:
> Indeed, I can see the no space left on device in the following but
> not in the log above:
> 
> --info tag was passed to docker image build commands with PythonDocker
> Precommit to capture more logs. Without the tag, errors from DockerFile
> step are not printed out to the console.
> 
> On Thu, Apr 30, 2020 at 11:19 AM Udi Meiri wrote:
> 
> I checked node 8 and it had over 40GB space available. Does your job
> require more than that?
> 
> Long term, I'm thinking we could clean up workspaces for successful
> jobs. This should free up additional space (I guess at least 100GB).
> https://plugins.jenkins.io/ws-cleanup/ - we already use this plugin
> to clean workspaces at job start.
> 
> 
> On Thu, Apr 30, 2020, 07:33 Maximilian Michels wrote:
> 
> *It's working again, probably because it's running on a different
> machine now.
> 
> Who can check the disk space of the Jenkins hosts?
> 
> Thanks,
> Max
> 
> On 30.04.20 11:55, Maximilian Michels wrote:
> > Sorry, I meant to include the Jenkins log:
> >
> > https://builds.apache.org/job/beam_LoadTests_Python_ParDo_Flink_Streaming_PR/5/console
> >
> > Thanks for investigating Hannah! Indeed, I can see the no space left on
> > device in the following but not in the log above:
> >
> > https://builds.apache.org/job/beam_PreCommit_PythonDocker_Commit/473/console
> >
> > I'm going to try running the build again. Do you think we could add more
> > storage to our Jenkins hosts or delete old build data?
> >
> > Thanks,
> > Max
> >
> > On 30.04.20 08:43, Hannah Jiang wrote:
> >> Max, I found a link from your PR and noticed below errors. This would be
> >> the true error.
> >>
> >> 07:57:03 > Task :sdks:python:container:py37:docker
> >> 07:57:03 ERROR: Could not install packages due to an EnvironmentError: [Errno 28] No space left on device
> >> 07:57:03 > Task :sdks:python:container:py35:docker
> >> 07:57:03 ERROR: Could not install packages due to an EnvironmentError: [Errno 28] No space left on device
> >>
> >> On Wed, Apr 29, 2020 at 5:59 PM Hannah Jiang wrote:
> >>
> >>     There is a PythonDocker Precommit test running for PRs with Python
> >>     changes. It seems running well.[1]
> >>     Max, can you please give me a link so I can check more details? Do
> >>     other images with different Python versions fail as well?
> >>
> >>     1. https://builds.apache.org/job/beam_PreCommit_PythonDocker_Commit/
> >>
> >>     On Wed, Apr 29, 2020 at 2:44 PM Ahmet Altay wrote:
> >>
> >>         +Valentyn Tymofieiev +Hannah Jiang -- in case they have relevant
> >>         information.
> >>
> >>         On Wed, Apr 29, 2020 at 12:35 PM Maximilian Michels wrote:
> >>
> >>             Hi,
> >>
> >>             has anyone noticed the Python 3.7 Docker container fails to
> >>             build? I haven't been able to build the Python 3.7 container,
> >>             neither locally nor on Jenkins.
> >>
> >>             I get:
> >>
> >>             17:48:10 > Task :sdks:python:container:py37:docker
> >>             17:49:36 The command '/bin/sh -c pip install -r
> >>             /tmp/base_image_requirements.txt && python -c "from
> >>             google.protobuf.internal import api_implementation; assert
> >>             api_implementation._default_implementation_type == 'cpp'; print
> >>             ('Verified fast protobuf used.')" && rm -rf
>  

Re: Python 3.7 docker container fails to build

2020-04-30 Thread Maximilian Michels
Is the issue that the workspace grows over time? Couldn't we delete it
daily to ensure it does not grow too much? Always deleting it on
successful runs may be too costly because we have to recreate the
workspace every time.

Logs are stored separately. I suppose they could also add up over time.
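
To make the disk-space question concrete, here is a minimal sketch of a check a job could run before starting the Docker build. It is not part of Beam's Jenkins configuration: the function name `has_enough_space` and the 40 GB threshold (borrowed from the figure Udi mentions in the quoted thread below) are assumptions for illustration only. It uses `shutil.disk_usage` from the Python standard library.

```python
import shutil


def has_enough_space(path, required_bytes):
    """Return True if the filesystem holding `path` has at least `required_bytes` free."""
    usage = shutil.disk_usage(path)  # named tuple with total, used, free (in bytes)
    return usage.free >= required_bytes


if __name__ == "__main__":
    # Hypothetical threshold; the real requirement of the container build is unknown.
    required = 40 * 1024 ** 3
    if has_enough_space(".", required):
        print("Enough free space to attempt the build.")
    else:
        print("Low disk space: '[Errno 28] No space left on device' is likely.")
```

A check like this would only report the problem earlier; it does not replace cleaning up workspaces or old build data.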

On 30.04.20 21:48, Hannah Jiang wrote:
> Indeed, I can see the no space left on device in the following but
> not in the log above:
> 
> --info tag was passed to docker image build commands with PythonDocker
> Precommit to capture more logs. Without the tag, errors from DockerFile
> step are not printed out to the console.
> 
> On Thu, Apr 30, 2020 at 11:19 AM Udi Meiri wrote:
> 
> I checked node 8 and it had over 40GB space available. Does your job
> require more than that?
> 
> Long term, I'm thinking we could clean up workspaces for successful
> jobs. This should free up additional space (I guess at least 100GB).
> https://plugins.jenkins.io/ws-cleanup/ - we already use this plugin
> to clean workspaces at job start.
> 
> 
> On Thu, Apr 30, 2020, 07:33 Maximilian Michels wrote:
> 
> *It's working again, probably because it's running on a different
> machine now.
> 
> Who can check the disk space of the Jenkins hosts?
> 
> Thanks,
> Max
> 
> On 30.04.20 11:55, Maximilian Michels wrote:
> > Sorry, I meant to include the Jenkins log:
> >
> > https://builds.apache.org/job/beam_LoadTests_Python_ParDo_Flink_Streaming_PR/5/console
> >
> > Thanks for investigating Hannah! Indeed, I can see the no space left on
> > device in the following but not in the log above:
> >
> > https://builds.apache.org/job/beam_PreCommit_PythonDocker_Commit/473/console
> >
> > I'm going to try running the build again. Do you think we could add more
> > storage to our Jenkins hosts or delete old build data?
> >
> > Thanks,
> > Max
> >
> > On 30.04.20 08:43, Hannah Jiang wrote:
> >> Max, I found a link from your PR and noticed below errors. This would be
> >> the true error.
> >>
> >> 07:57:03 > Task :sdks:python:container:py37:docker
> >> 07:57:03 ERROR: Could not install packages due to an EnvironmentError: [Errno 28] No space left on device
> >> 07:57:03 > Task :sdks:python:container:py35:docker
> >> 07:57:03 ERROR: Could not install packages due to an EnvironmentError: [Errno 28] No space left on device
> >>
> >> On Wed, Apr 29, 2020 at 5:59 PM Hannah Jiang wrote:
> >>
> >>     There is a PythonDocker Precommit test running for PRs with Python
> >>     changes. It seems running well.[1]
> >>     Max, can you please give me a link so I can check more details? Do
> >>     other images with different Python versions fail as well?
> >>
> >>     1. https://builds.apache.org/job/beam_PreCommit_PythonDocker_Commit/
> >>
> >>     On Wed, Apr 29, 2020 at 2:44 PM Ahmet Altay wrote:
> >>
> >>         +Valentyn Tymofieiev +Hannah Jiang -- in case they have relevant
> >>         information.
> >>
> >>         On Wed, Apr 29, 2020 at 12:35 PM Maximilian Michels wrote:
> >>
> >>             Hi,
> >>
> >>             has anyone noticed the Python 3.7 Docker container fails to
> >>             build? I haven't been able to build the Python 3.7 container,
> >>             neither locally nor on Jenkins.
> >>
> >>             I get:
> >>
> >>             17:48:10 > Task :sdks:python:container:py37:docker
> >>             17:49:36 The command '/bin/sh -c pip install -r
> >>             /tmp/base_image_requirements.txt && python -c "from
> >>             google.protobuf.internal import api_implementation; assert
> >>             api_implementation._default_implementation_type == 'cpp'; print
> >>             ('Verified fast 

Re: [REVIEW][please pause website changes] Migrated the Beam website to Hugo

2020-04-30 Thread Nam Bui
Hey guys,

I tried my best to handle renamed files in Git. I have no clue why GitHub
doesn't show it, but finally, I made this commit [1] (thanks for your
idea @bhulette) so you guys can review the changes with ease (there is no
longer a bunch of deleted markdown files :D). Also, a new staged version is
deployed; you can check it out [2].

In case you are interested in translation, here is the proof of concept [3]
(the earth icon on the right corner is temporarily used for switching
languages). You can take a look at the translation guide for this PoC [4].

[1]
https://github.com/apache/beam/pull/11554/commits/b267bb360866a723ac2536f408f23de648c7cd4d
[2]
http://apache-beam-website-pull-requests.storage.googleapis.com/11554/index.html
[3] https://safe-relation.surge.sh/
[4]
https://github.com/PolideaInternal/beam/blob/website-develop/website/CONTRIBUTE.md#translation-guide
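
On the blog filename question in the quoted thread below: keeping the date prefix in the filename while dropping it from the URL only requires splitting the filename into a date and a slug. The sketch below shows that split in plain Python. It illustrates the idea only and is not how Hugo implements it; `split_post_filename` and the sample filename are made up for the example.

```python
import re
from datetime import date

# Matches names such as "2016-02-25-python-sdk-now-public.md".
_POST_NAME = re.compile(r"^(\d{4})-(\d{2})-(\d{2})-(.+)\.md$")


def split_post_filename(filename):
    """Split a date-prefixed post filename into (publish_date, slug)."""
    match = _POST_NAME.match(filename)
    if match is None:
        raise ValueError(f"not a date-prefixed post filename: {filename!r}")
    year, month, day, slug = match.groups()
    return date(int(year), int(month), int(day)), slug
```

With a split like this, the files would still sort chronologically on disk while the URL is built from the slug alone.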


On Thu, Apr 30, 2020 at 7:24 PM Brian Hulette  wrote:

> Changing the URLs is fine with me as long as the old urls will work too.
>
> But do we need to change the filenames for the blog posts to accomplish
> that? It's nice that the blog post markdown files start with a date so they
> naturally sort chronologically. It looks like this hugo PR [1] made it
> possible to extract date metadata and slug
> (i.e. dataflow-python-sdk-is-now-public) separately from the filename.
>
> [1] https://github.com/gohugoio/hugo/pull/4494
>
> On Thu, Apr 30, 2020 at 10:06 AM Ahmet Altay  wrote:
>
>>
>>
>> On Thu, Apr 30, 2020 at 9:55 AM Thomas Weise  wrote:
>>
>>> For changed URLs, will previous URLs be mapped to avoid broken external
>>> links?
>>>
>>
>> I believe the answer is yes from Nam's response "For now, we keep the old
>> URLs working in terms of redirecting them". I very much agree that this is
>> very important and should work for all existing urls.
>>
>>
>>>
>>>
>>> On Thu, Apr 30, 2020 at 9:34 AM Aizhamal Nurmamat kyzy <
>>> aizha...@apache.org> wrote:
>>>
>>>> Hi,
>>>>
>>>> To give a little more context regarding the URLs, the date should still
>>>> appear on the blog post, but not on the URL.
>>>> For example, we'd have:
>>>>
>>>> https://beam.apache.org/beam/python/sdk/2016/02/25/python-sdk-now-public.html
>>>> become https://beam.apache.org/blog/dataflow-python-sdk-is-now-public/.
>>>>
>>>
>> I am not a content marketer. IMO, this is a good change. In the past, a
>> few times, we edited dates on posts (e.g. a release date was entered
>> incorrectly) and we had to either have a mismatch between dates in the url
>> and the date in the blog, or change the url. This change simplifies, by
>> having date only in place (in content metadata).
>>
>>
>>>
>>>> The blog posts would have a small header showing the title, author and
>>>> publish date. But the URL would not have it.
>>>> Thoughts?


 On Thu, Apr 30, 2020 at 9:23 AM Nam Bui  wrote:

> Hi,
>
> @altay: Hey hey. Yeah, I didn't expect the baseUrl of staging version
> is "
> http://apache-beam-website-pull-requests.storage.googleapis.com/11554/";
> which also includes "/11554", and Hugo considers it as a path so it breaks
> the path of "static files" (like images). We made a fix. Now I'm working 
> on
> "getting git to recognize files as renames" as you suggested.
>
> @robert: The dates are nice but it causes verbose/long/ugly URLs. We
> discussed with Aizhamal in the development stage and agreed to get rid of
> this. For now, we keep the old URLs working in terms of redirecting them.
> However, from now on, we should change the name convention on blog posts 
> to
> have a fancy URL like "beam.apache.org/blog/myblogpost.md". :)
>
>
>
> On Thu, Apr 30, 2020 at 2:57 AM Robert Bradshaw 
> wrote:
>
>> On Wed, Apr 29, 2020 at 5:08 PM Ahmet Altay  wrote:
>>
>>> Nam, this looks better. At least links are working, and the website
>>> visually looks similar and generally in good shape. I think there are 
>>> still
>>> issues. For example, I do not see any of the images (e.g. the beam logo 
>>> on
>>> top left is missing.)
>>>
>>> On Wed, Apr 29, 2020 at 3:11 PM Brian Hulette 
>>> wrote:
>>>
 I left a comment on the PR [1]. I think the reason all of the
 website content is not being tracked as file renames is because there 
 was a
 series of commits that created files in the new directory, and then one
 commit that deleted the old directory. If there were a single commit 
 with
 all of the deleted and new files, git would surely recognize they are
 effectively renames and mark them as such. Maybe we just need to get 
 all
 these commits squashed into one?

 [1]
 https://github.com/apache/beam/pull/11554#issuecomment-621489844

>>>
>>> Nam, could you try this? If we can get git to recognize these as
>>> renames, review process would be 

Re: contributor permission for Beam Jira tickets

2020-04-30 Thread Tyson Hamilton
Hi Darshan, I'll loop in some people for a discussion.

On 2020/04/30 04:29:13, Darshan Jani  wrote: 
> Thanks Luke.
> 
> Thanks for pointing that out.
> I am new to Beam contribution community.
> I would appreciate if you can point me or tag someone of the contributing
> members who are working on Beam SQL feature, so that we can have a
> discussion.
> My view is BEAM-9825 is more about set operations, which works on entire
> collections rather than keys like in joins.
> 
> -Regards
> Darshan
> 
> On Mon, Apr 27, 2020 at 11:57 PM Luke Cwik  wrote:
> 
> > Welcome, I granted you contributor permissions and assigned BEAM-9825.
> >
> > It looks like BEAM-9825 could have overlap with the Beam SQL effort
> > because it looks like different kinds of join functionality. Have you been
> > in contact with them?
> >
> > On Mon, Apr 27, 2020 at 8:28 AM Darshan Jani 
> > wrote:
> >
> >> Hi,
> >>
> >> Username : darshanjani
> >>
> >> -Regards
> >> Darshan
> >>
> >> On Mon, Apr 27, 2020 at 3:51 PM Darshan Jani 
> >> wrote:
> >>
> >>> Hi,
> >>>
> >>> This is Hai from Thoughtworks. I am currently working on consulting and 
> >>> implementations of GCP and Flink projects. We are also a Google partner. Can 
> >>> someone add me as a contributor for Beam's Jira issue tracker? I would 
> >>> like to create/assign tickets for my work.
> >>> I have created following JIRA issue in BEAM.
> >>> https://issues.apache.org/jira/browse/BEAM-9825
> >>>
> >>> Thanks,
> >>> Darshan
> >>>
> >>>
> 


Re: Rethinking Python's PortableRunner default job server

2020-04-30 Thread Kyle Weaver
> all runners (with perhaps the exception of the direct runner) are proxies
for actual runners

Agreed. The main difference is that this fact is more obvious for Dataflow
users, since it is "Cloud" Dataflow after all. The relationship of Beam to
its OSS runners is much less clear to new users (for example, folks are
often confused about the difference between Beam's Flink job server images
and Flink's own Docker images).

> though we could argue that the direct runner would be a reasonable default

Why set runner=PortableRunner then, when direct runner is the default?
Besides, the direct runner has its own murky status with regard to
portability, and its own defaults and branching paths, so I'd rather leave
that out of the equation.

On Thu, Apr 30, 2020 at 3:23 PM Robert Bradshaw  wrote:

> In a sense, all runners (with perhaps the exception of the direct runner)
> are proxies for actual runners. In that sense, I think it makes just as
> much sense to say "I want the portable runner with job endpoint X" as to
> say "I want the flink runner with master Y." Saying "I want the Portable
> Runner" without specifying an endpoint should, however, be undefined
> (though we could argue that the direct runner would be a reasonable
> default).
>
> On Thu, Apr 30, 2020 at 11:49 AM Ismaël Mejía  wrote:
>
>> Thomas has a point on the PortableRunner name, I was super confused
>> because of the `PortableRunner` not being a runner, I don't know if
>> too late but maybe it is still worth to give it a better name.
>>
>> On Thu, Apr 30, 2020 at 8:41 PM Thomas Weise  wrote:
>> >
>> > +1 for removing the default runner. It has always been the Beam user
>> expectation that a runner needs to be selected.
>> >
>> > "PortableRunner" isn't a runner (despite its name) - it's a proxy to a
>> runner that the user specifies via job_endpoint.
>> >
>> > Thanks for cleaning this up!
>> >
>> > On Thu, Apr 30, 2020 at 10:11 AM Kyle Weaver 
>> wrote:
>> >>
>> >> I'll bite :) Thanks for the feedback everyone!
>> >>
>> >> On Thu, Apr 30, 2020 at 1:01 PM Robert Bradshaw 
>> wrote:
>> >>>
>> >>> I filed https://issues.apache.org/jira/browse/BEAM-9860. Any takers?
>> >>>
>> >>> On Thu, Apr 30, 2020 at 5:49 AM Ismaël Mejía 
>> wrote:
>> 
>>  +1 for A there are zero reasons to have a default runner set by
>>  default, being explicit is better as Robert suggests and it resolves
>>  the confusion that the user reported.
>> 
>>  On Wed, Apr 29, 2020 at 10:05 PM Robert Bradshaw <
>> rober...@google.com> wrote:
>>  >
>>  > +1, I was actually thinking about this just the other day.
>> PortableRunner should require job_endpoint to be set, and we can have a
>> nice error message directing the explicit use of FlinkRunner for the old
>> behavior.
>>  >
>>  > On Wed, Apr 29, 2020 at 11:50 AM Kyle Weaver 
>> wrote:
>>  >>
>>  >> > Could the error message suggest switching to FlinkRunner
>> (and/or other runners that start a job server for you)? Then it seems like
>> the breakage would only be a minor annoyance.
>>  >>
>>  >> Definitely.
>>  >>
>>  >> On Wed, Apr 29, 2020 at 2:49 PM Brian Hulette <
>> bhule...@google.com> wrote:
>>  >>>
>>  >>> Could the error message suggest switching to FlinkRunner (and/or
>> other runners that start a job server for you)? Then it seems like the
>> breakage would only be a minor annoyance.
>>  >>>
>>  >>> Brian
>>  >>>
>>  >>> On Wed, Apr 29, 2020 at 11:32 AM Kyle Weaver <
>> kcwea...@google.com> wrote:
>>  
>>   Hi all,
>>  
>>   Currently, when running a pipeline that has the options
>> runner=PortableRunner and job_endpoint unset, the Python SDK spins up a
>> Dockerized Flink job server [1]. This is problematic because the
>> PortableRunner can be used by any portable runner. So for example, a Spark
>> runner user was recently baffled when their job ran successfully but
>> printed a bunch of Flink log messages.
>>  
>>   There are not too many uses of this default behavior to my
>> knowledge, at least within Beam itself. The only example I could find was
>> in the portableWordCount tests, which is mostly the same as
>> portableWordCountFlinkRunner tests [2]. The default behavior is entirely
>> superseded by the FlinkRunner class, which provides better encapsulation.
>>  
>>   I also noticed that DockerizedJobServer is only used by [3]. In
>> FlinkRunner, we pull the job server from Maven if necessary and call Java
>> directly. In general, I think there are already quite enough knobs in the
>> portability framework, so we should remove it unless there is reason to
>> prefer running the job server with Docker instead of calling Java directly.
>>  
>>   There are a couple options:
>>  
>>   A) Remove the default behavior and require job_endpoint to
>> always be set when using PortableRunner. This would be a breaking change.
>>   B) Keep t

Re: Python 3.7 docker container fails to build

2020-04-30 Thread Hannah Jiang
>
> Indeed, I can see the no space left on device in the following but not in
> the log above:

--info tag was passed to docker image build commands with PythonDocker
Precommit to capture more logs. Without the tag, errors from DockerFile
step are not printed out to the console.

On Thu, Apr 30, 2020 at 11:19 AM Udi Meiri  wrote:

> I checked node 8 and it had over 40GB space available. Does your job
> require more than that?
>
> Long term, I'm thinking we could clean up workspaces for successful jobs.
> This should free up additional space (I guess at least 100GB).
> https://plugins.jenkins.io/ws-cleanup/ - we already use this plugin to
> clean workspaces at job start.
>
>
> On Thu, Apr 30, 2020, 07:33 Maximilian Michels  wrote:
>
>> *It's working again, probably because it's running on a different
>> machine now.
>>
>> Who can check the disk space of the Jenkins hosts?
>>
>> Thanks,
>> Max
>>
>> On 30.04.20 11:55, Maximilian Michels wrote:
>> > Sorry, I meant to include the Jenkins log:
>> >
>> https://builds.apache.org/job/beam_LoadTests_Python_ParDo_Flink_Streaming_PR/5/console
>> >
>> > Thanks for investigating Hannah! Indeed, I can see the no space left on
>> > device in the following but not in the log above:
>> >
>> https://builds.apache.org/job/beam_PreCommit_PythonDocker_Commit/473/console
>> >
>> > I'm going to try running the build again. Do you think we could add more
>> > storage to our Jenkins hosts or delete old build data?
>> >
>> > Thanks,
>> > Max
>> >
>> > On 30.04.20 08:43, Hannah Jiang wrote:
>> >> Max, I found a link from your PR and noticed below errors. This would be
>> >> the true error.
>> >>
>> >> 07:57:03 > Task :sdks:python:container:py37:docker
>> >> 07:57:03 ERROR: Could not install packages due to an EnvironmentError: [Errno 28] No space left on device
>> >> 07:57:03 > Task :sdks:python:container:py35:docker
>> >> 07:57:03 ERROR: Could not install packages due to an EnvironmentError: [Errno 28] No space left on device
>> >>
>> >> On Wed, Apr 29, 2020 at 5:59 PM Hannah Jiang wrote:
>> >>
>> >> There is a PythonDocker Precommit test running for PRs with Python
>> >> changes. It seems running well.[1]
>> >> Max, can you please give me a link so I can check more details? Do
>> >> other images with different Python versions fail as well?
>> >>
>> >> 1. https://builds.apache.org/job/beam_PreCommit_PythonDocker_Commit/
>> >>
>> >> On Wed, Apr 29, 2020 at 2:44 PM Ahmet Altay wrote:
>> >>
>> >> +Valentyn Tymofieiev +Hannah Jiang -- in case they have relevant
>> >> information.
>> >>
>> >> On Wed, Apr 29, 2020 at 12:35 PM Maximilian Michels wrote:
>> >>
>> >> Hi,
>> >>
>> >> has anyone noticed the Python 3.7 Docker container fails to build? I
>> >> haven't been able to build the Python 3.7 container, neither locally nor
>> >> on Jenkins.
>> >>
>> >> I get:
>> >>
>> >> 17:48:10 > Task :sdks:python:container:py37:docker
>> >> 17:49:36 The command '/bin/sh -c pip install -r
>> >> /tmp/base_image_requirements.txt && python -c "from
>> >> google.protobuf.internal import api_implementation; assert
>> >> api_implementation._default_implementation_type == 'cpp'; print
>> >> ('Verified fast protobuf used.')" && rm -rf
>> >> /root/.cache/pip' returned a
>> >> non-zero code: 1
>> >> 17:49:36
>> >> 17:49:36 > Task :sdks:python:container:py37:docker FAILED
>> >>
>> >>
>> >> Cheers,
>> >> Max
>> >>
>>
>


Re: Jira PR links not being generated?

2020-04-30 Thread Ahmet Altay
Thank you!

On Wed, Apr 29, 2020 at 3:23 PM Kyle Weaver  wrote:

> Added a comment on https://issues.apache.org/jira/browse/INFRA-19967
>
> On Wed, Apr 29, 2020 at 5:57 PM Ahmet Altay  wrote:
>
>> Would it be worth filing this as an infra ticket?
>>
>> On Mon, Apr 27, 2020 at 4:29 PM Kyle Weaver  wrote:
>>
>>> Slight correction.. Jira to Github links are back. Github to Jira links
>>> (which were only recently added) are not being added.
>>>
>>> On Mon, Apr 27, 2020 at 7:20 PM Kyle Weaver  wrote:
>>>
 Well, links seem to work again at least.

 On Mon, Apr 27, 2020 at 7:10 PM Kyle Weaver 
 wrote:

> Thanks Pablo. I merged it. Everyone should keep an eye on Jira and the
> mailing lists to make sure I didn't accidentally mess something up.
>
> On Mon, Apr 27, 2020 at 7:01 PM Pablo Estrada 
> wrote:
>
>> I added it to add a list of non-committers that can trigger tests
>> from PR. I added like 100+, so they asked us to not do that : )
>> I've approved your PR, Kyle.
>>
>> On Mon, Apr 27, 2020 at 3:33 PM Kyle Weaver 
>> wrote:
>>
>>> I made a PR for this, though I still haven't found sufficient
>>> explanation as to why we did not need this file last week, and now we do
>>> this week. https://github.com/apache/beam/pull/11541
>>>
>>> On Mon, Apr 27, 2020 at 6:24 PM Udi Meiri  wrote:
>>>
 We had such a file for a short while but it was removed:
 https://github.com/apache/beam/pull/10645
 I don't believe it contained any PR link settings though
 +Pablo Estrada 

 On Mon, Apr 27, 2020 at 1:56 PM Kyle Weaver 
 wrote:

> I went ahead and filed
> https://issues.apache.org/jira/browse/BEAM-9833 since it looks
> like this is how things will be done from now on. Which raises the
> question, does anyone know how Beam managed these settings before? Or 
> were
> there previously no project-level controls?
>
> On Mon, Apr 27, 2020 at 4:39 PM Kyle Weaver 
> wrote:
>
>> Thanks for the pointer Kenn. I searched existing INFRA issues and
>> found [1] (among others). Looks like we may need to add a .asf.yaml 
>> file
>> [2]. I guess infra must have changed this recently without us 
>> picking up on
>> it?
>> 
>>
>> [1] https://issues.apache.org/jira/browse/INFRA-20171
>> [2]
>> https://cwiki.apache.org/confluence/display/INFRA/.asf.yaml+features+for+git+repositories#id-.asf.yamlfeaturesforgitrepositories-Notificationsettingsforrepositories
>>
>> On Mon, Apr 27, 2020 at 4:25 PM Kenneth Knowles 
>> wrote:
>>
>>> I suggest filing an issue with INFRA.
>>>
>>> Kenn
>>>
>>> On Fri, Apr 24, 2020 at 10:12 AM Kyle Weaver <
>>> kcwea...@google.com> wrote:
>>>
 Hi all,

 I've noticed links from Jira issues to related Github PRs have
 not been generated the past few days. Does anyone know why?

 Kyle

>>>


Re: Rethinking Python's PortableRunner default job server

2020-04-30 Thread Robert Bradshaw
In a sense, all runners (with perhaps the exception of the direct runner)
are proxies for actual runners. In that sense, I think it makes just as
much sense to say "I want the portable runner with job endpoint X" as to
say "I want the flink runner with master Y." Saying "I want the Portable
Runner" without specifying an endpoint should, however, be undefined
(though we could argue that the direct runner would be a reasonable
default).
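
As a rough illustration of the "undefined without an endpoint" behaviour, the check could look like the sketch below. This is not the Beam SDK's code: `validate_portable_options` and the plain dict are hypothetical stand-ins for the real pipeline options, `localhost:8099` is just an example address, and the error text only follows the suggestion Brian made in the quoted thread.

```python
def validate_portable_options(options):
    """Reject runner=PortableRunner when no job_endpoint is given.

    `options` is a plain dict standing in for pipeline options (an assumption
    made for this sketch; the real SDK uses its own options classes).
    """
    runner = options.get("runner")
    job_endpoint = options.get("job_endpoint")
    if runner == "PortableRunner" and not job_endpoint:
        raise ValueError(
            "PortableRunner requires job_endpoint to be set. "
            "To have a job server started for you, use a runner such as "
            "FlinkRunner instead."
        )
    return options


if __name__ == "__main__":
    # An explicit endpoint passes validation unchanged.
    validate_portable_options(
        {"runner": "PortableRunner", "job_endpoint": "localhost:8099"}
    )
    # A missing endpoint is reported instead of silently starting a job server.
    try:
        validate_portable_options({"runner": "PortableRunner"})
    except ValueError as err:
        print(err)
```

Failing early like this trades a breaking change for an explicit message, which is the trade-off discussed as option A below.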

On Thu, Apr 30, 2020 at 11:49 AM Ismaël Mejía  wrote:

> Thomas has a point on the PortableRunner name, I was super confused
> because of the `PortableRunner` not being a runner, I don't know if
> too late but maybe it is still worth to give it a better name.
>
> On Thu, Apr 30, 2020 at 8:41 PM Thomas Weise  wrote:
> >
> > +1 for removing the default runner. It has always been the Beam user
> expectation that a runner needs to be selected.
> >
> > "PortableRunner" isn't a runner (despite its name) - it's a proxy to a
> runner that the user specifies via job_endpoint.
> >
> > Thanks for cleaning this up!
> >
> > On Thu, Apr 30, 2020 at 10:11 AM Kyle Weaver 
> wrote:
> >>
> >> I'll bite :) Thanks for the feedback everyone!
> >>
> >> On Thu, Apr 30, 2020 at 1:01 PM Robert Bradshaw 
> wrote:
> >>>
> >>> I filed https://issues.apache.org/jira/browse/BEAM-9860. Any takers?
> >>>
> >>> On Thu, Apr 30, 2020 at 5:49 AM Ismaël Mejía 
> wrote:
> 
>  +1 for A there are zero reasons to have a default runner set by
>  default, being explicit is better as Robert suggests and it resolves
>  the confusion that the user reported.
> 
>  On Wed, Apr 29, 2020 at 10:05 PM Robert Bradshaw 
> wrote:
>  >
>  > +1, I was actually thinking about this just the other day.
> PortableRunner should require job_endpoint to be set, and we can have a
> nice error message directing the explicit use of FlinkRunner for the old
> behavior.
>  >
>  > On Wed, Apr 29, 2020 at 11:50 AM Kyle Weaver 
> wrote:
>  >>
>  >> > Could the error message suggest switching to FlinkRunner (and/or
> other runners that start a job server for you)? Then it seems like the
> breakage would only be a minor annoyance.
>  >>
>  >> Definitely.
>  >>
>  >> On Wed, Apr 29, 2020 at 2:49 PM Brian Hulette 
> wrote:
>  >>>
>  >>> Could the error message suggest switching to FlinkRunner (and/or
> other runners that start a job server for you)? Then it seems like the
> breakage would only be a minor annoyance.
>  >>>
>  >>> Brian
>  >>>
>  >>> On Wed, Apr 29, 2020 at 11:32 AM Kyle Weaver 
> wrote:
>  
>   Hi all,
>  
>   Currently, when running a pipeline that has the options
> runner=PortableRunner and job_endpoint unset, the Python SDK spins up a
> Dockerized Flink job server [1]. This is problematic because the
> PortableRunner can be used by any portable runner. So for example, a Spark
> runner user was recently baffled when their job ran successfully but
> printed a bunch of Flink log messages.
>  
>   There are not too many uses of this default behavior to my
> knowledge, at least within Beam itself. The only example I could find was
> in the portableWordCount tests, which is mostly the same as
> portableWordCountFlinkRunner tests [2]. The default behavior is entirely
> superseded by the FlinkRunner class, which provides better encapsulation.
>  
>   I also noticed that DockerizedJobServer is only used by [3]. In
> FlinkRunner, we pull the job server from Maven if necessary and call Java
> directly. In general, I think there are already quite enough knobs in the
> portability framework, so we should remove it unless there is reason to
> prefer running the job server with Docker instead of calling Java directly.
>  
>   There are a couple options:
>  
>   A) Remove the default behavior and require job_endpoint to
> always be set when using PortableRunner. This would be a breaking change.
>   B) Keep the current behavior, but warn when the user sets
> runner=PortableRunner without job_endpoint. This is easy to miss, but it's
> better than nothing.
>  
>   What do you think?
>  
>   [1]
> https://github.com/apache/beam/blob/33c73739cec8bc6a7c8319efa41eda7a2540bce1/sdks/python/apache_beam/runners/portability/job_server.py#L184
>   [2]
> https://github.com/apache/beam/blob/b3596b89dbc002c686bdaa7853074e757a81b6fb/buildSrc/src/main/groovy/org/apache/beam/gradle/BeamModulePlugin.groovy#L1983-L2048
>   [3]
> https://github.com/apache/beam/blob/33c73739cec8bc6a7c8319efa41eda7a2540bce1/sdks/python/apache_beam/runners/portability/job_server.py#L163
>


Re: possible bug in AvroUtils

2020-04-30 Thread Ismaël Mejía
Created https://issues.apache.org/jira/browse/BEAM-9863 to track this.
Any taker?

On Thu, Apr 30, 2020 at 5:54 PM Reuven Lax  wrote:
>
> I'm not sure who added that, but it's been there for a while. Making global 
> static changes like that in our module seems like poor form - I wonder if 
> there's a better approach.
>
> On Thu, Apr 30, 2020 at 8:36 AM Brian Hulette  wrote:
>>
>> It seems likely this is a side effect of some static initialization in 
>> AvroUtils: 
>> https://github.com/apache/beam/blob/763b7ccd17a420eb634d6799adcd3ecfcf33d6a7/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/utils/AvroUtils.java#L99
>>
>> On Wed, Apr 29, 2020 at 9:59 PM Reuven Lax  wrote:
>>>
>>> I've copied this failing test into my client, and it passes for me. I can't 
>>> reproduce the failure.
>>>
>>> On Wed, Apr 29, 2020 at 6:34 PM Luke Cwik  wrote:

 +dev +Brian Hulette +Reuven Lax

 On Wed, Apr 29, 2020 at 4:21 AM Paolo Tomeo  wrote:
>
> Hi all,
>
> I think the method AvroUtils.toBeamSchema has an unexpected side effect.
> I found out that, if you invoke it and then you run a pipeline of 
> GenericRecords containing a timestamp (I tried with logical-type 
> timestamp-millis), Beam converts such timestamp from long to 
> org.joda.time.DateTime. Even if you don't apply any transformation to the 
> pipeline.
> Do you think it's a bug?
>
> Below you can find a simple test class I wrote in order to replicate the 
> problem.
> The first test passes while the second fails.
>
>
> import org.apache.avro.Schema;
> import org.apache.avro.SchemaBuilder;
> import org.apache.avro.generic.GenericRecord;
> import org.apache.avro.generic.GenericRecordBuilder;
> import org.apache.beam.sdk.coders.AvroCoder;
> import org.apache.beam.sdk.schemas.utils.AvroUtils;
> import org.apache.beam.sdk.testing.TestPipeline;
> import org.apache.beam.sdk.transforms.Combine;
> import org.apache.beam.sdk.transforms.Create;
> import org.apache.beam.sdk.transforms.SerializableFunction;
> import org.junit.Rule;
>
> import java.sql.Timestamp;
>
> import static org.junit.Assert.assertEquals;
>
> public class AvroUtilsSideEffect {
>
>     @Rule
>     public final transient TestPipeline pipeline = TestPipeline.create();
>     @Rule
>     public final transient TestPipeline pipeline2 = TestPipeline.create();
>     public final Schema testSchema = SchemaBuilder
>             .record("record").namespace("test")
>             .fields()
>             .name("timestamp").type().longBuilder()
>             .prop("logicalType", "timestamp-millis").endLong().noDefault()
>             .endRecord();
>     public final GenericRecord record = new GenericRecordBuilder(testSchema)
>             .set("timestamp", new Timestamp(156392640L).getTime())
>             .build();
>
>     // Reported to pass: AvroUtils is not called before the pipeline runs.
>     @org.junit.Test
>     public void test() {
>         pipeline.apply(Create.of(record).withCoder(AvroCoder.of(testSchema)))
>                 .apply(Combine.globally(new TestFn()));
>
>         pipeline.run().waitUntilFinish();
>     }
>
>     // Reported to fail: the only difference is the call to AvroUtils.toBeamSchema.
>     @org.junit.Test
>     public void test2() {
>         AvroUtils.toBeamSchema(testSchema);
>
>         pipeline2.apply(Create.of(record).withCoder(AvroCoder.of(testSchema)))
>                 .apply(Combine.globally(new TestFn()));
>
>         pipeline2.run().waitUntilFinish();
>     }
>
>     // Checks that the timestamp field is still a plain Long inside the pipeline.
>     public static class TestFn implements
>             SerializableFunction<Iterable<GenericRecord>, GenericRecord> {
>
>         @Override
>         public GenericRecord apply(Iterable<GenericRecord> input) {
>             for (GenericRecord item : input) {
>                 if (item != null) {
>                     assertEquals(Long.class, item.get("timestamp").getClass());
>                     assertEquals(156392640L, item.get("timestamp"));
>                 }
>                 return item;
>             }
>             return null;
>         }
>     }
> }
>
> Thanks,
> Paolo
>
> --
> Paolo Tomeo, PhD
>
> Big Data and Machine Learning Engineer
>
> linkedin.com/in/ptomeo
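
The cause suspected in the thread above, a static initializer that changes process-wide state, can be reproduced in plain Java without Beam or Avro. The sketch below is only an illustration under that assumption: `GlobalConversions`, `SchemaUtils`, and `RecordReader` are hypothetical names standing in for a global conversion registry, a utility class such as AvroUtils, and a record decoder. It does not show what AvroUtils actually registers.

```java
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongFunction;

// Hypothetical process-wide registry of logical-type conversions.
final class GlobalConversions {
    static final Map<String, LongFunction<Object>> REGISTRY = new ConcurrentHashMap<>();

    private GlobalConversions() {}
}

// Hypothetical utility class whose static initializer mutates global state.
final class SchemaUtils {
    static {
        // Runs once, the first time SchemaUtils is initialized by the JVM.
        GlobalConversions.REGISTRY.put("timestamp-millis", Instant::ofEpochMilli);
    }

    private SchemaUtils() {}

    // Any static call is enough to trigger the initializer above.
    static String toSchemaName(String logicalType) {
        return "schema:" + logicalType;
    }
}

// Hypothetical decoder that consults the global registry on every read.
final class RecordReader {
    private RecordReader() {}

    static Object read(String logicalType, long raw) {
        LongFunction<Object> conversion = GlobalConversions.REGISTRY.get(logicalType);
        return conversion == null ? Long.valueOf(raw) : conversion.apply(raw);
    }
}

final class StaticSideEffectDemo {
    public static void main(String[] args) {
        Object before = RecordReader.read("timestamp-millis", 156392640L);
        SchemaUtils.toSchemaName("timestamp-millis"); // side effect happens here
        Object after = RecordReader.read("timestamp-millis", 156392640L);
        System.out.println(before.getClass().getSimpleName());
        System.out.println(after.getClass().getSimpleName());
    }
}
```

Running `StaticSideEffectDemo` prints `Long` and then `Instant`: the same read returns a different type once the utility class has been initialized, which mirrors the long-to-DateTime change reported above.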


Re: Rethinking Python's PortableRunner default job server

2020-04-30 Thread Ismaël Mejía
Thomas has a point about the PortableRunner name. I was very confused
by `PortableRunner` not being a runner. I don't know if it is too late,
but it may still be worth giving it a better name.

On Thu, Apr 30, 2020 at 8:41 PM Thomas Weise  wrote:
>
> +1 for removing the default runner. It has always been the Beam user 
> expectation that a runner needs to be selected.
>
> "PortableRunner" isn't a runner (despite its name) - it's a proxy to a runner 
> that the user specifies via job_endpoint.
>
> Thanks for cleaning this up!
>
> On Thu, Apr 30, 2020 at 10:11 AM Kyle Weaver  wrote:
>>
>> I'll bite :) Thanks for the feedback everyone!
>>
>> On Thu, Apr 30, 2020 at 1:01 PM Robert Bradshaw  wrote:
>>>
>>> I filed https://issues.apache.org/jira/browse/BEAM-9860. Any takers?
>>>
>>> On Thu, Apr 30, 2020 at 5:49 AM Ismaël Mejía  wrote:

 +1 for A there are zero reasons to have a default runner set by
 default, being explicit is better as Robert suggests and it resolves
 the confusion that the user reported.

 On Wed, Apr 29, 2020 at 10:05 PM Robert Bradshaw  
 wrote:
 >
 > +1, I was actually thinking about this just the other day. 
 > PortableRunner should require job_endpoint to be set, and we can have a 
 > nice error message directing the explicit use of FlinkRunner for the old 
 > behavior.
 >
 > On Wed, Apr 29, 2020 at 11:50 AM Kyle Weaver  wrote:
 >>
 >> > Could the error message suggest switching to FlinkRunner (and/or 
 >> > other runners that start a job server for you)? Then it seems like 
 >> > the breakage would only be a minor annoyance.
 >>
 >> Definitely.
 >>
 >> On Wed, Apr 29, 2020 at 2:49 PM Brian Hulette  
 >> wrote:
 >>>
 >>> Could the error message suggest switching to FlinkRunner (and/or other 
 >>> runners that start a job server for you)? Then it seems like the 
 >>> breakage would only be a minor annoyance.
 >>>
 >>> Brian
 >>>
 >>> On Wed, Apr 29, 2020 at 11:32 AM Kyle Weaver  
 >>> wrote:
 
  Hi all,
 
  Currently, when running a pipeline that has the options 
  runner=PortableRunner and job_endpoint unset, the Python SDK spins up 
  a Dockerized Flink job server [1]. This is problematic because the 
  PortableRunner can be used by any portable runner. So for example, a 
  Spark runner user was recently baffled when their job ran 
  successfully but printed a bunch of Flink log messages.
 
  There are not too many uses of this default behavior to my knowledge, 
  at least within Beam itself. The only example I could find was in the 
  portableWordCount tests, which is mostly the same as 
  portableWordCountFlinkRunner tests [2]. The default behavior is 
  entirely superseded by the FlinkRunner class, which provides better 
  encapsulation.
 
  I also noticed that DockerizedJobServer is only used by [3]. In 
  FlinkRunner, we pull the job server from Maven if necessary and call 
  Java directly. In general, I think there are already quite enough 
  knobs in the portability framework, so we should remove it unless 
  there is reason to prefer running the job server with Docker instead 
  of calling Java directly.
 
  There are a couple options:
 
  A) Remove the default behavior and require job_endpoint to always be 
  set when using PortableRunner. This would be a breaking change.
  B) Keep the current behavior, but warn when the user sets 
  runner=PortableRunner without job_endpoint. This is easy to miss, but 
  it's better than nothing.
 
  What do you think?
 
  [1] 
  https://github.com/apache/beam/blob/33c73739cec8bc6a7c8319efa41eda7a2540bce1/sdks/python/apache_beam/runners/portability/job_server.py#L184
  [2] 
  https://github.com/apache/beam/blob/b3596b89dbc002c686bdaa7853074e757a81b6fb/buildSrc/src/main/groovy/org/apache/beam/gradle/BeamModulePlugin.groovy#L1983-L2048
  [3] 
  https://github.com/apache/beam/blob/33c73739cec8bc6a7c8319efa41eda7a2540bce1/sdks/python/apache_beam/runners/portability/job_server.py#L163
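
Option A, together with the error message requested in the replies, amounts to a small validation step before job submission. The real change would be made in the Python SDK; the following Java sketch only illustrates the logic. `PortableRunnerOptionsCheck`, the option map, and the endpoint `localhost:8099` are assumptions made for the example, not Beam APIs.

```java
import java.util.Map;

// Hypothetical validator illustrating option A; not part of any Beam SDK.
final class PortableRunnerOptionsCheck {
    private PortableRunnerOptionsCheck() {}

    // Rejects runner=PortableRunner when job_endpoint is missing or blank.
    static void validate(Map<String, String> options) {
        String runner = options.get("runner");
        String endpoint = options.get("job_endpoint");
        boolean endpointMissing = endpoint == null || endpoint.trim().isEmpty();
        if ("PortableRunner".equals(runner) && endpointMissing) {
            throw new IllegalArgumentException(
                "PortableRunner requires job_endpoint to be set. "
                    + "To have a Flink job server started for you, use runner=FlinkRunner instead.");
        }
    }

    public static void main(String[] args) {
        // Accepted: an explicit endpoint is given.
        validate(Map.of("runner", "PortableRunner", "job_endpoint", "localhost:8099"));
        try {
            // Rejected: no endpoint, so the user is pointed at FlinkRunner.
            validate(Map.of("runner", "PortableRunner"));
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Failing fast with a pointer to FlinkRunner keeps the breaking change to a one-line fix for anyone who relied on the old default.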


Re: Rethinking Python's PortableRunner default job server

2020-04-30 Thread Thomas Weise
+1 for removing the default runner. It has always been the Beam user
expectation that a runner needs to be selected.

"PortableRunner" isn't a runner (despite its name) - it's a proxy to a
runner that the user specifies via job_endpoint.

Thanks for cleaning this up!

On Thu, Apr 30, 2020 at 10:11 AM Kyle Weaver  wrote:

> I'll bite :) Thanks for the feedback everyone!
>
> On Thu, Apr 30, 2020 at 1:01 PM Robert Bradshaw 
> wrote:
>
>> I filed https://issues.apache.org/jira/browse/BEAM-9860. Any takers?
>>
>> On Thu, Apr 30, 2020 at 5:49 AM Ismaël Mejía  wrote:
>>
>>> +1 for A there are zero reasons to have a default runner set by
>>> default, being explicit is better as Robert suggests and it resolves
>>> the confusion that the user reported.
>>>
>>> On Wed, Apr 29, 2020 at 10:05 PM Robert Bradshaw 
>>> wrote:
>>> >
>>> > +1, I was actually thinking about this just the other day.
>>> PortableRunner should require job_endpoint to be set, and we can have a
>>> nice error message directing the explicit use of FlinkRunner for the old
>>> behavior.
>>> >
>>> > On Wed, Apr 29, 2020 at 11:50 AM Kyle Weaver 
>>> wrote:
>>> >>
>>> >> > Could the error message suggest switching to FlinkRunner (and/or
>>> other runners that start a job server for you)? Then it seems like the
>>> breakage would only be a minor annoyance.
>>> >>
>>> >> Definitely.
>>> >>
>>> >> On Wed, Apr 29, 2020 at 2:49 PM Brian Hulette 
>>> wrote:
>>> >>>
>>> >>> Could the error message suggest switching to FlinkRunner (and/or
>>> other runners that start a job server for you)? Then it seems like the
>>> breakage would only be a minor annoyance.
>>> >>>
>>> >>> Brian
>>> >>>
>>> >>> On Wed, Apr 29, 2020 at 11:32 AM Kyle Weaver 
>>> wrote:
>>> 
>>>  Hi all,
>>> 
>>>  Currently, when running a pipeline that has the options
>>> runner=PortableRunner and job_endpoint unset, the Python SDK spins up a
>>> Dockerized Flink job server [1]. This is problematic because the
>>> PortableRunner can be used by any portable runner. So for example, a Spark
>>> runner user was recently baffled when their job ran successfully but
>>> printed a bunch of Flink log messages.
>>> 
>>>  There are not too many uses of this default behavior to my
>>> knowledge, at least within Beam itself. The only example I could find was
>>> in the portableWordCount tests, which is mostly the same as
>>> portableWordCountFlinkRunner tests [2]. The default behavior is entirely
>>> superseded by the FlinkRunner class, which provides better encapsulation.
>>> 
>>>  I also noticed that DockerizedJobServer is only used by [3]. In
>>> FlinkRunner, we pull the job server from Maven if necessary and call Java
>>> directly. In general, I think there are already quite enough knobs in the
>>> portability framework, so we should remove it unless there is reason to
>>> prefer running the job server with Docker instead of calling Java directly.
>>> 
>>>  There are a couple options:
>>> 
>>>  A) Remove the default behavior and require job_endpoint to always
>>> be set when using PortableRunner. This would be a breaking change.
>>>  B) Keep the current behavior, but warn when the user sets
>>> runner=PortableRunner without job_endpoint. This is easy to miss, but it's
>>> better than nothing.
>>> 
>>>  What do you think?
>>> 
>>>  [1]
>>> https://github.com/apache/beam/blob/33c73739cec8bc6a7c8319efa41eda7a2540bce1/sdks/python/apache_beam/runners/portability/job_server.py#L184
>>>  [2]
>>> https://github.com/apache/beam/blob/b3596b89dbc002c686bdaa7853074e757a81b6fb/buildSrc/src/main/groovy/org/apache/beam/gradle/BeamModulePlugin.groovy#L1983-L2048
>>>  [3]
>>> https://github.com/apache/beam/blob/33c73739cec8bc6a7c8319efa41eda7a2540bce1/sdks/python/apache_beam/runners/portability/job_server.py#L163
>>>
>>


Re: Python 3.7 docker container fails to build

2020-04-30 Thread Udi Meiri
I checked node 8 and it had over 40GB space available. Does your job
require more than that?

Long term, I'm thinking we could clean up workspaces for successful jobs.
This should free up additional space (I guess at least 100GB).
https://plugins.jenkins.io/ws-cleanup/ - we already use this plugin to
clean workspaces at job start.


On Thu, Apr 30, 2020, 07:33 Maximilian Michels  wrote:

> *It's working again, probably because it's running on a different
> machine now.
>
> Who can check the disk space of the Jenkins hosts?
>
> Thanks,
> Max
>
> On 30.04.20 11:55, Maximilian Michels wrote:
> > Sorry, I meant to include the Jenkins log:
> >
> https://builds.apache.org/job/beam_LoadTests_Python_ParDo_Flink_Streaming_PR/5/console
> >
> > Thanks for investigating Hannah! Indeed, I can see the no space left on
> > device in the following but not in the log above:
> >
> https://builds.apache.org/job/beam_PreCommit_PythonDocker_Commit/473/console
> >
> > I'm going to try running the build again. Do you think we could add more
> > storage to our Jenkins hosts or delete old build data?
> >
> > Thanks,
> > Max
> >
> > On 30.04.20 08:43, Hannah Jiang wrote:
> >> Max, I found a link from your PR and noticed below errors. This would be
> >> the true error.
> >>
> >> *07:57:03* >*Task :sdks:python:container:py37:docker*
> >> *07:57:03* ERROR: Could not install packages due to an EnvironmentError: [Errno 28] No space left on device
> >> *07:57:03*
> >> *07:57:03* >*Task :sdks:python:container:py35:docker*
> >> *07:57:03* ERROR: Could not install packages due to an EnvironmentError: [Errno 28] No space left on device
> >>
> >>
> >>
> >> On Wed, Apr 29, 2020 at 5:59 PM Hannah Jiang wrote:
> >>
> >> There is a PythonDocker Precommit test running for PRs with Python
> >> changes. It seems running well.[1]
> >> Max, can you please give me a link so I can check more details? Do
> >> other images with different Python versions fail as well?
> >>
> >> 1. https://builds.apache.org/job/beam_PreCommit_PythonDocker_Commit/
> >>
> >>
> >> On Wed, Apr 29, 2020 at 2:44 PM Ahmet Altay wrote:
> >>
> >> +Valentyn Tymofieiev +Hannah Jiang -- in case they have relevant
> >> information.
> >>
> >> On Wed, Apr 29, 2020 at 12:35 PM Maximilian Michels wrote:
> >>
> >> Hi,
> >>
> >> has anyone noticed the Python 3.7 Docker container fails to
> >> build? I
> >> haven't been able to build the Python 3.7 container, neither
> >> locally nor
> >> on Jenkins.
> >>
> >> I get:
> >>
> >> 17:48:10 > Task :sdks:python:container:py37:docker
> >> 17:49:36 The command '/bin/sh -c pip install -r
> >> /tmp/base_image_requirements.txt && python -c "from
> >> google.protobuf.internal import api_implementation; assert
> >> api_implementation._default_implementation_type == 'cpp'; print
> >> ('Verified fast protobuf used.')" && rm -rf
> >> /root/.cache/pip' returned a
> >> non-zero code: 1
> >> 17:49:36
> >> 17:49:36 > Task :sdks:python:container:py37:docker FAILED
> >>
> >>
> >> Cheers,
> >> Max
> >>
>


Re: [REVIEW][please pause website changes] Migrated the Beam website to Hugo

2020-04-30 Thread Brian Hulette
Changing the URLs is fine with me as long as the old URLs will work too.

But do we need to change the filenames for the blog posts to accomplish
that? It's nice that the blog post markdown files start with a date so they
naturally sort chronologically. It looks like this Hugo PR [1] made it
possible to extract date metadata and slug
(e.g. dataflow-python-sdk-is-now-public) separately from the filename.

[1] https://github.com/gohugoio/hugo/pull/4494
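
To make the idea concrete: splitting a date-prefixed filename into a date and a slug takes a single regular expression. The sketch below is a hypothetical helper, not Hugo code; the filename `2016-02-25-dataflow-python-sdk-is-now-public.md` and the `/blog/<slug>/` URL shape are assumed from the examples in this thread.

```java
import java.time.LocalDate;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper showing how a dated filename can yield both a date and a slug.
final class BlogFilename {
    // Matches names of the form YYYY-MM-DD-some-slug.md
    private static final Pattern DATED =
        Pattern.compile("^(\\d{4}-\\d{2}-\\d{2})-(.+)\\.md$");

    final LocalDate date;
    final String slug;

    private BlogFilename(LocalDate date, String slug) {
        this.date = date;
        this.slug = slug;
    }

    static BlogFilename parse(String filename) {
        Matcher m = DATED.matcher(filename);
        if (!m.matches()) {
            throw new IllegalArgumentException("Expected YYYY-MM-DD-slug.md, got: " + filename);
        }
        return new BlogFilename(LocalDate.parse(m.group(1)), m.group(2));
    }

    // The public URL carries only the slug; the date stays in the metadata.
    String url() {
        return "/blog/" + slug + "/";
    }

    public static void main(String[] args) {
        BlogFilename post = parse("2016-02-25-dataflow-python-sdk-is-now-public.md");
        System.out.println(post.date + " " + post.url());
    }
}
```

With this split, the files keep sorting chronologically on disk while the published URL stays short.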

On Thu, Apr 30, 2020 at 10:06 AM Ahmet Altay  wrote:

>
>
> On Thu, Apr 30, 2020 at 9:55 AM Thomas Weise  wrote:
>
>> For changed URLs, will previous URLs be mapped to avoid broken external
>> links?
>>
>
> I believe the answer is yes from Nam's response "For now, we keep the old
> URLs working in terms of redirecting them". I very much agree that this is
> very important and should work for all existing urls.
>
>
>>
>>
>> On Thu, Apr 30, 2020 at 9:34 AM Aizhamal Nurmamat kyzy <
>> aizha...@apache.org> wrote:
>>
>>> Hi,
>>>
>>> To give a little more context regarding the URLs, the date should still
>>> appear on the blog post, but not on the URL.
>>> For example, we'd have:
>>>
>>> https://beam.apache.org/beam/python/sdk/2016/02/25/python-sdk-now-public.html
>>> become https://beam.apache.org/blog/dataflow-python-sdk-is-now-public/.
>>>
>>
> I am not a content marketer. IMO, this is a good change. In the past, a
> few times, we edited dates on posts (e.g. a release date was entered
> incorrectly) and we had to either have a mismatch between dates in the url
> and the date in the blog, or change the url. This change simplifies, by
> having date only in place (in content metadata).
>
>
>>
>>> The blog posts would have a small header showing the title, author and
>>> publish date. But the URL would not have it.
>>> Thoughts?
>>>
>>>
>>> On Thu, Apr 30, 2020 at 9:23 AM Nam Bui  wrote:
>>>
 Hi,

 @altay: Hey hey. Yeah, I didn't expect the baseUrl of staging version
 is "
 http://apache-beam-website-pull-requests.storage.googleapis.com/11554/"
 which also includes "/11554", and Hugo considers it as a path so it breaks
 the path of "static files" (like images). We made a fix. Now I'm working on
 "getting git to recognize files as renames" as you suggested.

 @robert: The dates are nice but it causes verbose/long/ugly URLs. We
 discussed with Aizhamal in the development stage and agreed to get rid of
 this. For now, we keep the old URLs working in terms of redirecting them.
 However, from now on, we should change the name convention on blog posts to
 have a fancy URL like "beam.apache.org/blog/myblogpost.md". :)



 On Thu, Apr 30, 2020 at 2:57 AM Robert Bradshaw 
 wrote:

> On Wed, Apr 29, 2020 at 5:08 PM Ahmet Altay  wrote:
>
>> Nam, this looks better. At least links are working, and the website
>> visually looks similar and generally in good shape. I think there are 
>> still
>> issues. For example, I do not see any of the images (e.g. the beam logo 
>> on
>> top left is missing.)
>>
>> On Wed, Apr 29, 2020 at 3:11 PM Brian Hulette 
>> wrote:
>>
>>> I left a comment on the PR [1]. I think the reason all of the
>>> website content is not being tracked as file renames is because there 
>>> was a
>>> series of commits that created files in the new directory, and then one
>>> commit that deleted the old directory. If there were a single commit 
>>> with
>>> all of the deleted and new files, git would surely recognize they are
>>> effectively renames and mark them as such. Maybe we just need to get
>>> all
>>> these commits squashed into one?
>>>
>>> [1] https://github.com/apache/beam/pull/11554#issuecomment-621489844
>>>
>>
>> Nam, could you try this? If we can get git to recognize these as
>> renames, review process would be much easier.
>>
>
> +1.
>
> Alternatively, create a commit that just moves the files into a new
> location (which git can always detect), then sit the edits on top of that
> (which should preserve history better).
>
> Also, is there a reason the dates were removed from the blog post
> filenames? For content like that, the dates are nice.
>
>
>>
>>
>>>
>>> On Wed, Apr 29, 2020 at 10:39 AM Nam Bui 
>>> wrote:
>>>
 Hi guys,

 I'm Nam - from the responsible team of Apache Beam website
 migration. I am pleased to answer some of the questions here.

 @aizhamal: Thanks for informing to the community. :)
 @altay, @robertwb: Yes. there is a problem with the staged version
 at the moment. We didn't expect some behaviours on the build process. 
 So,
 we fixed it today and been waiting for @pablo to re-run it again. The
 purpose of this PR is to migrate completely Beam site from Jekyll to 
 Hugo.
>

Re: Jacek's new Apache Beam Internals Project

2020-04-30 Thread Reuven Lax
I took a look at the book - there's not really much there. Maybe he's
planning on adding more over time?

On Tue, Apr 28, 2020 at 1:48 PM Ismaël Mejía  wrote:

> The tweet URL for ref in case someone wants to like/RT
>
> https://twitter.com/jaceklaskowski/status/1255046717277376512?s=19
>
> On Tue, Apr 28, 2020, 8:04 PM Holden Karau  wrote:
>
>> Hi Folks,
>>
>> I just saw Jacek's tweet about his new Beam Internals project (he's done
>> a great job on his Spark Internals documentation and blog posts) and I
>> figured I'd share the link
>> https://leanpub.com/the-internals-of-apache-beam in case folks are
>> interested :)
>>
>> Cheers,
>>
>> Holden :)
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9  
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>


Re: Rethinking Python's PortableRunner default job server

2020-04-30 Thread Kyle Weaver
I'll bite :) Thanks for the feedback everyone!

On Thu, Apr 30, 2020 at 1:01 PM Robert Bradshaw  wrote:

> I filed https://issues.apache.org/jira/browse/BEAM-9860. Any takers?
>
> On Thu, Apr 30, 2020 at 5:49 AM Ismaël Mejía  wrote:
>
>> +1 for A there are zero reasons to have a default runner set by
>> default, being explicit is better as Robert suggests and it resolves
>> the confusion that the user reported.
>>
>> On Wed, Apr 29, 2020 at 10:05 PM Robert Bradshaw 
>> wrote:
>> >
>> > +1, I was actually thinking about this just the other day.
>> PortableRunner should require job_endpoint to be set, and we can have a
>> nice error message directing the explicit use of FlinkRunner for the old
>> behavior.
>> >
>> > On Wed, Apr 29, 2020 at 11:50 AM Kyle Weaver 
>> wrote:
>> >>
>> >> > Could the error message suggest switching to FlinkRunner (and/or
>> other runners that start a job server for you)? Then it seems like the
>> breakage would only be a minor annoyance.
>> >>
>> >> Definitely.
>> >>
>> >> On Wed, Apr 29, 2020 at 2:49 PM Brian Hulette 
>> wrote:
>> >>>
>> >>> Could the error message suggest switching to FlinkRunner (and/or
>> other runners that start a job server for you)? Then it seems like the
>> breakage would only be a minor annoyance.
>> >>>
>> >>> Brian
>> >>>
>> >>> On Wed, Apr 29, 2020 at 11:32 AM Kyle Weaver 
>> wrote:
>> 
>>  Hi all,
>> 
>>  Currently, when running a pipeline that has the options
>> runner=PortableRunner and job_endpoint unset, the Python SDK spins up a
>> Dockerized Flink job server [1]. This is problematic because the
>> PortableRunner can be used by any portable runner. So for example, a Spark
>> runner user was recently baffled when their job ran successfully but
>> printed a bunch of Flink log messages.
>> 
>>  There are not too many uses of this default behavior to my
>> knowledge, at least within Beam itself. The only example I could find was
>> in the portableWordCount tests, which is mostly the same as
>> portableWordCountFlinkRunner tests [2]. The default behavior is entirely
>> superseded by the FlinkRunner class, which provides better encapsulation.
>> 
>>  I also noticed that DockerizedJobServer is only used by [3]. In
>> FlinkRunner, we pull the job server from Maven if necessary and call Java
>> directly. In general, I think there are already quite enough knobs in the
>> portability framework, so we should remove it unless there is reason to
>> prefer running the job server with Docker instead of calling Java directly.
>> 
>>  There are a couple options:
>> 
>>  A) Remove the default behavior and require job_endpoint to always be
>> set when using PortableRunner. This would be a breaking change.
>>  B) Keep the current behavior, but warn when the user sets
>> runner=PortableRunner without job_endpoint. This is easy to miss, but it's
>> better than nothing.
>> 
>>  What do you think?
>> 
>>  [1]
>> https://github.com/apache/beam/blob/33c73739cec8bc6a7c8319efa41eda7a2540bce1/sdks/python/apache_beam/runners/portability/job_server.py#L184
>>  [2]
>> https://github.com/apache/beam/blob/b3596b89dbc002c686bdaa7853074e757a81b6fb/buildSrc/src/main/groovy/org/apache/beam/gradle/BeamModulePlugin.groovy#L1983-L2048
>>  [3]
>> https://github.com/apache/beam/blob/33c73739cec8bc6a7c8319efa41eda7a2540bce1/sdks/python/apache_beam/runners/portability/job_server.py#L163
>>
>


Re: [REVIEW][please pause website changes] Migrated the Beam website to Hugo

2020-04-30 Thread Ahmet Altay
On Thu, Apr 30, 2020 at 9:55 AM Thomas Weise  wrote:

> For changed URLs, will previous URLs be mapped to avoid broken external
> links?
>

I believe the answer is yes from Nam's response "For now, we keep the old
URLs working in terms of redirecting them". I very much agree that this is
very important and should work for all existing urls.


>
>
> On Thu, Apr 30, 2020 at 9:34 AM Aizhamal Nurmamat kyzy <
> aizha...@apache.org> wrote:
>
>> Hi,
>>
>> To give a little more context regarding the URLs, the date should still
>> appear on the blog post, but not on the URL.
>> For example, we'd have:
>>
>> https://beam.apache.org/beam/python/sdk/2016/02/25/python-sdk-now-public.html
>> become https://beam.apache.org/blog/dataflow-python-sdk-is-now-public/.
>>
>
I am not a content marketer. IMO, this is a good change. In the past, we
edited dates on posts a few times (e.g. a release date was entered
incorrectly) and we had to either accept a mismatch between the date in the
URL and the date in the blog, or change the URL. This change simplifies
things by keeping the date in only one place (the content metadata).


>
>> The blog posts would have a small header showing the title, author and
>> publish date. But the URL would not have it.
>> Thoughts?
>>
>>
>> On Thu, Apr 30, 2020 at 9:23 AM Nam Bui  wrote:
>>
>>> Hi,
>>>
>>> @altay: Hey hey. Yeah, I didn't expect the baseUrl of staging version is
>>> "http://apache-beam-website-pull-requests.storage.googleapis.com/11554/"
>>> which also includes "/11554", and Hugo considers it as a path so it breaks
>>> the path of "static files" (like images). We made a fix. Now I'm working on
>>> "getting git to recognize files as renames" as you suggested.
>>>
>>> @robert: The dates are nice but it causes verbose/long/ugly URLs. We
>>> discussed with Aizhamal in the development stage and agreed to get rid of
>>> this. For now, we keep the old URLs working in terms of redirecting them.
>>> However, from now on, we should change the name convention on blog posts to
>>> have a fancy URL like "beam.apache.org/blog/myblogpost.md". :)
>>>
>>>
>>>
>>> On Thu, Apr 30, 2020 at 2:57 AM Robert Bradshaw 
>>> wrote:
>>>
 On Wed, Apr 29, 2020 at 5:08 PM Ahmet Altay  wrote:

> Nam, this looks better. At least links are working, and the website
> visually looks similar and generally in good shape. I think there are 
> still
> issues. For example, I do not see any of the images (e.g. the beam logo on
> top left is missing.)
>
> On Wed, Apr 29, 2020 at 3:11 PM Brian Hulette 
> wrote:
>
>> I left a comment on the PR [1]. I think the reason all of the website
>> content is not being tracked as file renames is because there was a 
>> series
>> of commits that created files in the new directory, and then one commit
>> that deleted the old directory. If there were a single commit with all of
>> the deleted and new files, git would surely recognize they are 
>> effectively
>> renames and mark them as such. Maybe we just need to get all these
>> commits
>> squashed into one?
>>
>> [1] https://github.com/apache/beam/pull/11554#issuecomment-621489844
>>
>
> Nam, could you try this? If we can get git to recognize these as
> renames, review process would be much easier.
>

 +1.

 Alternatively, create a commit that just moves the files into a new
 location (which git can always detect), then sit the edits on top of that
 (which should preserve history better).

 Also, is there a reason the dates were removed from the blog post
 filenames? For content like that, the dates are nice.


>
>
>>
>> On Wed, Apr 29, 2020 at 10:39 AM Nam Bui  wrote:
>>
>>> Hi guys,
>>>
>>> I'm Nam - from the responsible team of Apache Beam website
>>> migration. I am pleased to answer some of the questions here.
>>>
>>> @aizhamal: Thanks for informing to the community. :)
>>> @altay, @robertwb: Yes. there is a problem with the staged version
>>> at the moment. We didn't expect some behaviours on the build process. 
>>> So,
>>> we fixed it today and been waiting for @pablo to re-run it again. The
>>> purpose of this PR is to migrate completely Beam site from Jekyll to 
>>> Hugo.
>>> Therefore, a bunch of deleted markdown files are from Jekyll which was
>>> located at `beam/website/src`, and Hugo is located at `beam/website/www`
>>> now. In `beam/website/README.md`, I wrote down about running the Hugo
>>> website locally, although it is actually same as Jekyll (because it's 
>>> also
>>> set up with Docker & Gradle). In `beam/website/CONTRIBUTE.md`, I guided
>>> people on how to get started with Hugo on the Beam website. There is 
>>> also a
>>> link in the "Translation Guide" section which points to a branch of
>>> multilingual provenance, and it will become a next PR soon.
>>>
>>> Ple

Re: Rethinking Python's PortableRunner default job server

2020-04-30 Thread Robert Bradshaw
I filed https://issues.apache.org/jira/browse/BEAM-9860. Any takers?

On Thu, Apr 30, 2020 at 5:49 AM Ismaël Mejía  wrote:

> +1 for A there are zero reasons to have a default runner set by
> default, being explicit is better as Robert suggests and it resolves
> the confusion that the user reported.
>
> On Wed, Apr 29, 2020 at 10:05 PM Robert Bradshaw 
> wrote:
> >
> > +1, I was actually thinking about this just the other day.
> PortableRunner should require job_endpoint to be set, and we can have a
> nice error message directing the explicit use of FlinkRunner for the old
> behavior.
> >
> > On Wed, Apr 29, 2020 at 11:50 AM Kyle Weaver 
> wrote:
> >>
> >> > Could the error message suggest switching to FlinkRunner (and/or
> other runners that start a job server for you)? Then it seems like the
> breakage would only be a minor annoyance.
> >>
> >> Definitely.
> >>
> >> On Wed, Apr 29, 2020 at 2:49 PM Brian Hulette 
> wrote:
> >>>
> >>> Could the error message suggest switching to FlinkRunner (and/or other
> runners that start a job server for you)? Then it seems like the breakage
> would only be a minor annoyance.
> >>>
> >>> Brian
> >>>
> >>> On Wed, Apr 29, 2020 at 11:32 AM Kyle Weaver 
> wrote:
> 
>  Hi all,
> 
>  Currently, when running a pipeline that has the options
> runner=PortableRunner and job_endpoint unset, the Python SDK spins up a
> Dockerized Flink job server [1]. This is problematic because the
> PortableRunner can be used by any portable runner. So for example, a Spark
> runner user was recently baffled when their job ran successfully but
> printed a bunch of Flink log messages.
> 
>  There are not too many uses of this default behavior to my knowledge,
> at least within Beam itself. The only example I could find was in the
> portableWordCount tests, which is mostly the same as
> portableWordCountFlinkRunner tests [2]. The default behavior is entirely
> superseded by the FlinkRunner class, which provides better encapsulation.
> 
>  I also noticed that DockerizedJobServer is only used by [3]. In
> FlinkRunner, we pull the job server from Maven if necessary and call Java
> directly. In general, I think there are already quite enough knobs in the
> portability framework, so we should remove it unless there is reason to
> prefer running the job server with Docker instead of calling Java directly.
> 
>  There are a couple options:
> 
>  A) Remove the default behavior and require job_endpoint to always be
> set when using PortableRunner. This would be a breaking change.
>  B) Keep the current behavior, but warn when the user sets
> runner=PortableRunner without job_endpoint. This is easy to miss, but it's
> better than nothing.
> 
>  What do you think?
> 
>  [1]
> https://github.com/apache/beam/blob/33c73739cec8bc6a7c8319efa41eda7a2540bce1/sdks/python/apache_beam/runners/portability/job_server.py#L184
>  [2]
> https://github.com/apache/beam/blob/b3596b89dbc002c686bdaa7853074e757a81b6fb/buildSrc/src/main/groovy/org/apache/beam/gradle/BeamModulePlugin.groovy#L1983-L2048
>  [3]
> https://github.com/apache/beam/blob/33c73739cec8bc6a7c8319efa41eda7a2540bce1/sdks/python/apache_beam/runners/portability/job_server.py#L163
>


Re: [REVIEW][please pause website changes] Migrated the Beam website to Hugo

2020-04-30 Thread Thomas Weise
For changed URLs, will previous URLs be mapped to avoid broken external
links?


On Thu, Apr 30, 2020 at 9:34 AM Aizhamal Nurmamat kyzy 
wrote:

> Hi,
>
> To give a little more context regarding the URLs, the date should still
> appear on the blog post, but not on the URL.
> For example, we'd have:
>
> https://beam.apache.org/beam/python/sdk/2016/02/25/python-sdk-now-public.html
> become https://beam.apache.org/blog/dataflow-python-sdk-is-now-public/.
>
> The blog posts would have a small header showing the title, author and
> publish date. But the URL would not have it.
> Thoughts?
>
>
> On Thu, Apr 30, 2020 at 9:23 AM Nam Bui  wrote:
>
>> Hi,
>>
>> @altay: Hey hey. Yeah, I didn't expect the baseUrl of staging version is "
>> http://apache-beam-website-pull-requests.storage.googleapis.com/11554/"
>> which also includes "/11554", and Hugo considers it as a path so it breaks
>> the path of "static files" (like images). We made a fix. Now I'm working on
>> "getting git to recognize files as renames" as you suggested.
>>
>> @robert: The dates are nice but it causes verbose/long/ugly URLs. We
>> discussed with Aizhamal in the development stage and agreed to get rid of
>> this. For now, we keep the old URLs working in terms of redirecting them.
>> However, from now on, we should change the name convention on blog posts to
>> have a fancy URL like "beam.apache.org/blog/myblogpost.md". :)
>>
>>
>>
>> On Thu, Apr 30, 2020 at 2:57 AM Robert Bradshaw 
>> wrote:
>>
>>> On Wed, Apr 29, 2020 at 5:08 PM Ahmet Altay  wrote:
>>>
 Nam, this looks better. At least links are working, and the website
 visually looks similar and generally in good shape. I think there are still
 issues. For example, I do not see any of the images (e.g. the beam logo on
 top left is missing.)

 On Wed, Apr 29, 2020 at 3:11 PM Brian Hulette 
 wrote:

> I left a comment on the PR [1]. I think the reason all of the website
> content is not being tracked as file renames is because there was a series
> of commits that created files in the new directory, and then one commit
> that deleted the old directory. If there were a single commit with all of
> the deleted and new files, git would surely recognize they are effectively
> renames and mark them as such. Maybe we just need to get all these commits
> squashed into one?
>
> [1] https://github.com/apache/beam/pull/11554#issuecomment-621489844
>

 Nam, could you try this? If we can get git to recognize these as
 renames, review process would be much easier.

>>>
>>> +1.
>>>
>>> Alternatively, create a commit that just moves the files into a new
>>> location (which git can always detect), then sit the edits on top of that
>>> (which should preserve history better).
>>>
>>> Also, is there a reason the dates were removed from the blog post
>>> filenames? For content like that, the dates are nice.
>>>
>>>


>
> On Wed, Apr 29, 2020 at 10:39 AM Nam Bui  wrote:
>
>> Hi guys,
>>
>> I'm Nam - from the responsible team of Apache Beam website migration.
>> I am pleased to answer some of the questions here.
>>
>> @aizhamal: Thanks for informing to the community. :)
>> @altay, @robertwb: Yes. there is a problem with the staged version at
>> the moment. We didn't expect some behaviours on the build process. So, we
>> fixed it today and been waiting for @pablo to re-run it again. The purpose
>> of this PR is to migrate completely Beam site from Jekyll to Hugo.
>> Therefore, a bunch of deleted markdown files are from Jekyll which was
>> located at `beam/website/src`, and Hugo is located at `beam/website/www`
>> now. In `beam/website/README.md`, I wrote down about running the Hugo
>> website locally, although it is actually same as Jekyll (because it's also
>> set up with Docker & Gradle). In `beam/website/CONTRIBUTE.md`, I guided
>> people on how to get started with Hugo on the Beam website. There is also a
>> link in the "Translation Guide" section which points to a branch of
>> multilingual provenance, and it will become a next PR soon.
>>
>> Please let me know if you need more details. Feel free to ask any
>> questions and I will get back to you with answers. I'm so sorry if I answer
>> a little bit due to the timezone. :)
>>
>> Best regards,
>> Nam
>>
>>
>>
>> On Tue, Apr 28, 2020 at 8:49 PM Aizhamal Nurmamat kyzy <
>> aizha...@apache.org> wrote:
>>
>>> Adding +Nam Bui  and +Karolina Rosół
>>>  to follow up on questions.
>>>
>>> On Tue, Apr 28, 2020 at 11:34 AM Ahmet Altay 
>>> wrote:
>>>
 I am having trouble reviewing the staged version. What is the best
 way to review this change?

 Do we expect any changes to markdown files, beyond some metadata?

 On Tue, Apr 28, 2020 at 

Re: Greetings from Tyson

2020-04-30 Thread Ruoyun Huang
Welcome Tyson!

On Thu, Apr 30, 2020 at 6:44 AM Connell O'Callaghan 
wrote:

> Welcome Tyson!!!
>
>
>
> On Thu, Apr 30, 2020 at 6:12 AM Ismaël Mejía  wrote:
>
>> Welcome!
>>
>> On Thu, Apr 30, 2020 at 12:27 AM Alan Myrvold 
>> wrote:
>> >
>> > Welcome, Tyson!
>> >
>> > On Wed, Apr 29, 2020 at 3:15 PM Rui Wang  wrote:
>> >>
>> >> Welcome!
>> >>
>> >> -Rui
>> >>
>> >> On Wed, Apr 29, 2020, 3:13 PM Brian Hulette 
>> wrote:
>> >>>
>> >>> Welcome Tyson!
>> >>>
>> >>> On Wed, Apr 29, 2020 at 2:54 PM Ahmet Altay  wrote:
>> 
>>  Welcome!
>> 
>>  On Tue, Apr 28, 2020 at 3:06 PM Hannah Jiang 
>> wrote:
>> >
>> > Welcome to the community!
>> >
>> >
>> > On Tue, Apr 28, 2020 at 2:45 PM Tyson Hamilton 
>> wrote:
>> >>
>> >> Hello Beam Community,
>> >>
>> >> This is just a simple 'Hello' to introduce myself. I'm a Software
>> Engineer at Google and have worked with data processing languages and
>> runtime systems on and off during my career. I now have the pleasure of
>> dedicating more time towards working with you lovely folks on Beam and I'm
>> really excited!
>> >>
>> >> I hope you're all doing well and staying safe in these difficult
>> times.
>> >>
>> >> -Tyson
>> >>
>> >>
>> >>
>>
>


Re: [REVIEW][please pause website changes] Migrated the Beam website to Hugo

2020-04-30 Thread Aizhamal Nurmamat kyzy
Hi,

To give a little more context regarding the URLs: the date should still
appear on the blog post, but not in the URL.
For example,
https://beam.apache.org/beam/python/sdk/2016/02/25/python-sdk-now-public.html
would become https://beam.apache.org/blog/dataflow-python-sdk-is-now-public/.

The blog posts would have a small header showing the title, author, and
publish date, but the URL would not include the date.
Thoughts?
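
The old-to-new URL mapping could be sketched as a simple lookup table. This
is only an illustration in Java under stated assumptions: BlogRedirects is a
hypothetical helper, not part of the Beam website build, and its single entry
is the pair of URLs quoted in this message. It also shows that the publish
date can be recovered from the old path, so it can still appear in the post
header.

```java
import java.time.LocalDate;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper; not part of the Beam website build. */
final class BlogRedirects {
    // Old dated path -> new date-free path. The single entry is the example from this thread.
    static final Map<String, String> REDIRECTS = Map.of(
        "/beam/python/sdk/2016/02/25/python-sdk-now-public.html",
        "/blog/dataflow-python-sdk-is-now-public/");

    // Matches the /YYYY/MM/DD/ segment that the old URLs carry.
    private static final Pattern DATE_SEGMENT =
        Pattern.compile("/(\\d{4})/(\\d{2})/(\\d{2})/");

    /** Returns the new path for an old URL path, if a redirect is registered. */
    static Optional<String> newPathFor(String oldPath) {
        return Optional.ofNullable(REDIRECTS.get(oldPath));
    }

    /** Recovers the publish date from an old path so the post header can still show it. */
    static Optional<LocalDate> publishDate(String oldPath) {
        Matcher m = DATE_SEGMENT.matcher(oldPath);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(LocalDate.of(
            Integer.parseInt(m.group(1)),
            Integer.parseInt(m.group(2)),
            Integer.parseInt(m.group(3))));
    }

    public static void main(String[] args) {
        String old = "/beam/python/sdk/2016/02/25/python-sdk-now-public.html";
        System.out.println(newPathFor(old).orElse("(no redirect)"));
        System.out.println(publishDate(old).map(LocalDate::toString).orElse("(no date)"));
    }
}
```

An explicit table is used here because the new slug in the example is not
derivable from the old file name, so a purely mechanical rewrite rule would
not cover it.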


On Thu, Apr 30, 2020 at 9:23 AM Nam Bui  wrote:

> Hi,
>
> @altay: Hey hey. Yeah, I didn't expect the baseUrl of staging version is
> "http://apache-beam-website-pull-requests.storage.googleapis.com/11554/"
> which also includes "/11554", and Hugo considers it as a path so it breaks
> the path of "static files" (like images). We made a fix. Now I'm working on
> "getting git to recognize files as renames" as you suggested.
>
> @robert: The dates are nice but it causes verbose/long/ugly URLs. We
> discussed with Aizhamal in the development stage and agreed to get rid of
> this. For now, we keep the old URLs working in terms of redirecting them.
> However, from now on, we should change the name convention on blog posts to
> have a fancy URL like "beam.apache.org/blog/myblogpost.md". :)
>
>
>
> On Thu, Apr 30, 2020 at 2:57 AM Robert Bradshaw 
> wrote:
>
>> On Wed, Apr 29, 2020 at 5:08 PM Ahmet Altay  wrote:
>>
>>> Nam, this looks better. At least links are working, and the website
>>> visually looks similar and generally in good shape. I think there are still
>>> issues. For example, I do not see any of the images (e.g. the beam logo on
>>> top left is missing.)
>>>
>>> On Wed, Apr 29, 2020 at 3:11 PM Brian Hulette 
>>> wrote:
>>>
 I left a comment on the PR [1]. I think the reason all of the website
 content is not being tracked as file renames is because there was a series
 of commits that created files in the new directory, and then one commit
 that deleted the old directory. If there were a single commit with all of
 the deleted and new files, git would surely recognize they are effectively
 renames and mark them as such. Maybe we just need to get all these commits
 squashed into one?

 [1] https://github.com/apache/beam/pull/11554#issuecomment-621489844

>>>
>>> Nam, could you try this? If we can get git to recognize these as
>>> renames, review process would be much easier.
>>>
>>
>> +1.
>>
>> Alternatively, create a commit that just moves the files into a new
>> location (which git can always detect), then sit the edits on top of that
>> (which should preserve history better).
>>
>> Also, is there a reason the dates were removed from the blog post
>> filenames? For content like that, the dates are nice.
>>
>>
>>>
>>>

 On Wed, Apr 29, 2020 at 10:39 AM Nam Bui  wrote:

> Hi guys,
>
> I'm Nam - from the responsible team of Apache Beam website migration.
> I am pleased to answer some of the questions here.
>
> @aizhamal: Thanks for informing to the community. :)
> @altay, @robertwb: Yes. there is a problem with the staged version at
> the moment. We didn't expect some behaviours on the build process. So, we
> fixed it today and been waiting for @pablo to re-run it again. The purpose
> of this PR is to migrate completely Beam site from Jekyll to Hugo.
> Therefore, a bunch of deleted markdown files are from Jekyll which was
> located at `beam/website/src`, and Hugo is located at `beam/website/www`
> now. In `beam/website/README.md`, I wrote down about running the Hugo
> website locally, although it is actually same as Jekyll (because it's also
> set up with Docker & Gradle). In `beam/website/CONTRIBUTE.md`, I guided
> people on how to get started with Hugo on the Beam website. There is also a
> link in the "Translation Guide" section which points to a branch of
> multilingual provenance, and it will become a next PR soon.
>
> Please let me know if you need more details. Feel free to ask any
> questions and I will get back to you with answers. I'm so sorry if I answer
> a little bit due to the timezone. :)
>
> Best regards,
> Nam
>
>
>
> On Tue, Apr 28, 2020 at 8:49 PM Aizhamal Nurmamat kyzy <
> aizha...@apache.org> wrote:
>
>> Adding +Nam Bui  and +Karolina Rosół
>>  to follow up on questions.
>>
>> On Tue, Apr 28, 2020 at 11:34 AM Ahmet Altay 
>> wrote:
>>
>>> I am having trouble reviewing the staged version. What is the best
>>> way to review this change?
>>>
>>> Do we expect any changes to markdown files, beyond some metadata?
>>>
>>> On Tue, Apr 28, 2020 at 10:45 AM Robert Bradshaw <
>>> rober...@google.com> wrote:
>>>
 Thanks. It'll be great to better support more languages.

 I looked at the PR and there seems to be no provenance/history.
 E.g. all the content seems to be entirely new files rather than diffs 

Re: Companies using Beam?

2020-04-30 Thread Austin Bennett
A first pass, something like:

https://druid.apache.org/druid-powered
https://spark.apache.org/powered-by.html

or even as simple as:
https://github.com/apache/airflow#who-uses-apache-airflow

Would go a long way in the sorts of very high-level conversations I'm
having around technology adoption/standardization.

Getting into more specifics/testimonials/case-studies is also great, but I
wouldn't expect most people to look at those until Beam passes the first bar
of appearing to have significant adoption.

@Aizhamal Nurmamat kyzy   - happy to contribute as I
can.

On Tue, Apr 28, 2020 at 10:13 PM Jean-Baptiste Onofre 
wrote:

> Hi,
>
> We already have some testimonials on Beam home page (I did the one about
> Beam use at Talend).
>
> It makes sense to have a dedicated section as it gives ideas about use
> case and production system running with Beam.
>
> Regards
> JB
>
> > Le 28 avr. 2020 à 23:42, Austin Bennett  a
> écrit :
> >
> > Hi All,
> >
> > Have we considered getting onto our website or our GitHub repo the
> ability for individuals to share that their company is using Beam?  Seeing
> - what I believe to be a reasonable list of - companies productively using
> Beam would be helpful to point others to.  For instance, a common question
> I get is whether anyone or who is using?  I'm not sure that's the best
> metric or datapoint in many cases for adoption, but a heuristic that some
> rely upon.
> >
> > Naturally, we could ask for a roll-call, esp. via user list, but
> imagining  a persistent web-list would be of interest.
> >
> > Cheers,
> > Austin
> >
> >
> > P.S.  If putting such a list into our repo, that would also get some
> people to submit PRs (so more contributors!) :-)
> >
> >
>
>


Re: [REVIEW][please pause website changes] Migrated the Beam website to Hugo

2020-04-30 Thread Nam Bui
Hi,

@altay: Hey hey. Yeah, I didn't expect the baseUrl of the staging version to be
"http://apache-beam-website-pull-requests.storage.googleapis.com/11554/",
which also includes "/11554". Hugo treats that as a path, so it breaks
the paths of static files (like images). We made a fix. Now I'm working on
"getting git to recognize files as renames" as you suggested.

@robert: The dates are nice, but they cause verbose/long/ugly URLs. We
discussed this with Aizhamal during development and agreed to get rid of
them. For now, we keep the old URLs working by redirecting them.
However, from now on, we should change the naming convention for blog posts
to get a fancy URL like "beam.apache.org/blog/myblogpost.md". :)
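
On the baseUrl issue mentioned at the top of this message: one way a path
component in the base URL can break static files is ordinary URL resolution,
where a root-relative reference discards the base path. A minimal sketch in
Java, assuming a made-up image path; it shows generic URL resolution and is
not necessarily what Hugo did internally:

```java
import java.net.URI;

/** Illustration only; BaseUrlResolution and the image path are hypothetical. */
final class BaseUrlResolution {

    /** Resolves a reference against a base URL using standard URI resolution rules. */
    static String resolve(String baseUrl, String reference) {
        return URI.create(baseUrl).resolve(reference).toString();
    }

    public static void main(String[] args) {
        String base = "http://apache-beam-website-pull-requests.storage.googleapis.com/11554/";

        // Root-relative reference: the "/11554" part of the base is dropped.
        System.out.println(resolve(base, "/images/beam_logo.png"));

        // Relative reference: the "/11554/" prefix is kept.
        System.out.println(resolve(base, "images/beam_logo.png"));
    }
}
```

The first call yields a URL directly under the host, which does not exist on
a staging bucket that serves each pull request under its own prefix.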



On Thu, Apr 30, 2020 at 2:57 AM Robert Bradshaw  wrote:

> On Wed, Apr 29, 2020 at 5:08 PM Ahmet Altay  wrote:
>
>> Nam, this looks better. At least links are working, and the website
>> visually looks similar and generally in good shape. I think there are still
>> issues. For example, I do not see any of the images (e.g. the beam logo on
>> top left is missing.)
>>
>> On Wed, Apr 29, 2020 at 3:11 PM Brian Hulette 
>> wrote:
>>
>>> I left a comment on the PR [1]. I think the reason all of the website
>>> content is not being tracked as file renames is because there was a series
>>> of commits that created files in the new directory, and then one commit
>>> that deleted the old directory. If there were a single commit with all of
>>> the deleted and new files, git would surely recognize they are effectively
>>> renames and mark them as such. Maybe we just need to get all these commits
>>> squashed into one?
>>>
>>> [1] https://github.com/apache/beam/pull/11554#issuecomment-621489844
>>>
>>
>> Nam, could you try this? If we can get git to recognize these as renames,
>> review process would be much easier.
>>
>
> +1.
>
> Alternatively, create a commit that just moves the files into a new
> location (which git can always detect), then sit the edits on top of that
> (which should preserve history better).
>
> Also, is there a reason the dates were removed from the blog post
> filenames? For content like that, the dates are nice.
>
>
>>
>>
>>>
>>> On Wed, Apr 29, 2020 at 10:39 AM Nam Bui  wrote:
>>>
 Hi guys,

 I'm Nam - from the responsible team of Apache Beam website migration. I
 am pleased to answer some of the questions here.

 @aizhamal: Thanks for informing to the community. :)
 @altay, @robertwb: Yes. there is a problem with the staged version at
 the moment. We didn't expect some behaviours on the build process. So, we
 fixed it today and been waiting for @pablo to re-run it again. The purpose
 of this PR is to migrate completely Beam site from Jekyll to Hugo.
 Therefore, a bunch of deleted markdown files are from Jekyll which was
 located at `beam/website/src`, and Hugo is located at `beam/website/www`
 now. In `beam/website/README.md`, I wrote down about running the Hugo
 website locally, although it is actually same as Jekyll (because it's also
 set up with Docker & Gradle). In `beam/website/CONTRIBUTE.md`, I guided
 people on how to get started with Hugo on the Beam website. There is also a
 link in the "Translation Guide" section which points to a branch of
 multilingual provenance, and it will become a next PR soon.

 Please let me know if you need more details. Feel free to ask any
 questions and I will get back to you with answers. I'm so sorry if I answer
 a little bit due to the timezone. :)

 Best regards,
 Nam



 On Tue, Apr 28, 2020 at 8:49 PM Aizhamal Nurmamat kyzy <
 aizha...@apache.org> wrote:

> Adding +Nam Bui  and +Karolina Rosół
>  to follow up on questions.
>
> On Tue, Apr 28, 2020 at 11:34 AM Ahmet Altay  wrote:
>
>> I am having trouble reviewing the staged version. What is the best
>> way to review this change?
>>
>> Do we expect any changes to markdown files, beyond some metadata?
>>
>> On Tue, Apr 28, 2020 at 10:45 AM Robert Bradshaw 
>> wrote:
>>
>>> Thanks. It'll be great to better support more languages.
>>>
>>> I looked at the PR and there seems to be no provenance/history. E.g.
>>> all the content seems to be entirely new files rather than diffs from the
>>> old. (There also seems to be a huge amount of auto-generated js code as
>>> well.)
>>>
>>
>> I agree. This makes it very hard to review. I also see a bunch of
>> deleted markdown files. Are they not getting migrated?
>>
>>
>>>
>>> On Tue, Apr 28, 2020 at 10:23 AM Aizhamal Nurmamat kyzy <
>>> aizha...@apache.org> wrote:
>>>
 Hello everybody,

 We are almost done migrating the Apache Beam website from Jekyll to
 Hugo. You can see the PR in [1], and we'd love to hear your
 feedback/comments on the PR. It includes  detailed gu

Re: possible bug in AvroUtils

2020-04-30 Thread Reuven Lax
I'm not sure who added that, but it's been there for a while. Making global
static changes like that in our module seems like poor form - I wonder if
there's a better approach.
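
One possible direction, sketched in plain Java with hypothetical names
(ScopedModel is not an Avro or Beam class): keep the conversions on an
instance that callers pass around, so registering one cannot change what
unrelated readers see. Whether AvroUtils can be restructured this way is an
open question; the sketch only shows the shape of the idea.

```java
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical model that owns its conversions instead of sharing a global registry. */
final class ScopedModel {
    private final Map<String, Function<Long, Object>> conversions = new HashMap<>();

    /** Registers a conversion on this instance only. */
    ScopedModel withConversion(String logicalType, Function<Long, Object> conversion) {
        conversions.put(logicalType, conversion);
        return this;
    }

    /** Reads a raw long, applying this instance's conversion if one is registered. */
    Object read(String logicalType, long raw) {
        Function<Long, Object> conversion = conversions.get(logicalType);
        return conversion == null ? (Object) Long.valueOf(raw) : conversion.apply(raw);
    }
}

final class ScopedModelDemo {
    public static void main(String[] args) {
        ScopedModel plain = new ScopedModel();
        ScopedModel withTimestamps =
            new ScopedModel().withConversion("timestamp-millis", Instant::ofEpochMilli);

        // Only the instance that opted in sees converted values.
        System.out.println(plain.read("timestamp-millis", 0L).getClass().getSimpleName());
        System.out.println(withTimestamps.read("timestamp-millis", 0L).getClass().getSimpleName());
    }
}
```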

On Thu, Apr 30, 2020 at 8:36 AM Brian Hulette  wrote:

> It seems likely this is a side effect of some static initialization in
> AvroUtils:
> https://github.com/apache/beam/blob/763b7ccd17a420eb634d6799adcd3ecfcf33d6a7/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/utils/AvroUtils.java#L99
>
> On Wed, Apr 29, 2020 at 9:59 PM Reuven Lax  wrote:
>
>> I've copied this failing test into my client, and it passes for me. I
>> can't reproduce the failure.
>>
>> On Wed, Apr 29, 2020 at 6:34 PM Luke Cwik  wrote:
>>
>>> +dev  +Brian Hulette  +Reuven
>>> Lax 
>>>
>>> On Wed, Apr 29, 2020 at 4:21 AM Paolo Tomeo  wrote:
>>>
 Hi all,

 I think the method AvroUtils.toBeamSchema has a not expected side
 effect.
 I found out that, if you invoke it and then you run a pipeline of
 GenericRecords containing a timestamp (l tried with logical-type
 timestamp-millis), Beam converts such timestamp from long to
 org.joda.time.DateTime. Even if you don't apply any transformation to the
 pipeline.
 Do you think it's a bug?

 Below you can find a simple test class I wrote in order to replicate
 the problem.
 The first test passes while the second fails.


 import org.apache.avro.Schema;
 import org.apache.avro.SchemaBuilder;
 import org.apache.avro.generic.GenericRecord;
 import org.apache.avro.generic.GenericRecordBuilder;
 import org.apache.beam.sdk.coders.AvroCoder;
 import org.apache.beam.sdk.schemas.utils.AvroUtils;
 import org.apache.beam.sdk.testing.TestPipeline;
 import org.apache.beam.sdk.transforms.Combine;
 import org.apache.beam.sdk.transforms.Create;
 import org.apache.beam.sdk.transforms.SerializableFunction;
 import org.junit.Rule;

 import java.sql.Timestamp;

 import static org.junit.Assert.assertEquals;

 public class AvroUtilsSideEffect {

   @Rule
   public final transient TestPipeline pipeline = TestPipeline.create();
   @Rule
   public final transient TestPipeline pipeline2 = TestPipeline.create();

   public final Schema testSchema = SchemaBuilder
       .record("record").namespace("test")
       .fields()
       .name("timestamp").type().longBuilder()
       .prop("logicalType", "timestamp-millis").endLong().noDefault()
       .endRecord();

   public final GenericRecord record = new GenericRecordBuilder(testSchema)
       .set("timestamp", new Timestamp(156392640L).getTime())
       .build();

   // Reported to pass: AvroUtils is not touched before the pipeline runs.
   @org.junit.Test
   public void test() {
     pipeline
         .apply(Create.of(record).withCoder(AvroCoder.of(testSchema)))
         .apply(Combine.globally(new TestFn()));

     pipeline.run().waitUntilFinish();
   }

   // Reported to fail: the only difference is the call to AvroUtils.toBeamSchema.
   @org.junit.Test
   public void test2() {
     AvroUtils.toBeamSchema(testSchema);

     pipeline2
         .apply(Create.of(record).withCoder(AvroCoder.of(testSchema)))
         .apply(Combine.globally(new TestFn()));

     pipeline2.run().waitUntilFinish();
   }

   public static class TestFn
       implements SerializableFunction<Iterable<GenericRecord>, GenericRecord> {

     @Override
     public GenericRecord apply(Iterable<GenericRecord> input) {
       for (GenericRecord item : input) {
         if (item != null) {
           // The timestamp is expected to still be a raw long.
           assertEquals(Long.class, item.get("timestamp").getClass());
           assertEquals(156392640L, item.get("timestamp"));
         }
         return item;
       }
       return null;
     }
   }
 }

 Thanks,
 Paolo

 --
 Paolo Tomeo, PhD

 Big Data and Machine Learning Engineer

 linkedin.com/in/ptomeo 

>>>


Re: possible bug in AvroUtils

2020-04-30 Thread Brian Hulette
It seems likely this is a side effect of some static initialization in
AvroUtils:
https://github.com/apache/beam/blob/763b7ccd17a420eb634d6799adcd3ecfcf33d6a7/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/utils/AvroUtils.java#L99
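
To illustrate the mechanism only: a static initializer runs the first time a
class is used, and if it mutates a process-wide registry, every later reader
is affected. In the sketch below, GlobalModel and SchemaUtils are
hypothetical stand-ins for a shared data model and a utility class; this is
not the actual Beam or Avro code.

```java
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical stand-in for a process-wide singleton such as a shared data model. */
final class GlobalModel {
    static final Map<String, Function<Long, Object>> CONVERSIONS = new HashMap<>();

    /** Reads a raw long, applying a conversion if one is registered for the logical type. */
    static Object read(String logicalType, long raw) {
        Function<Long, Object> conversion = CONVERSIONS.get(logicalType);
        return conversion == null ? (Object) Long.valueOf(raw) : conversion.apply(raw);
    }
}

/** Hypothetical stand-in for a utility class whose static initializer touches global state. */
final class SchemaUtils {
    static {
        // Runs once, the first time SchemaUtils is used, and affects every later reader.
        GlobalModel.CONVERSIONS.put("timestamp-millis", Instant::ofEpochMilli);
    }

    static String describe(String logicalType) {
        return "field of type " + logicalType;
    }
}

final class StaticInitSideEffectDemo {
    public static void main(String[] args) {
        Object before = GlobalModel.read("timestamp-millis", 0L);
        SchemaUtils.describe("timestamp-millis"); // first use triggers the static initializer
        Object after = GlobalModel.read("timestamp-millis", 0L);

        System.out.println(before.getClass().getSimpleName()); // Long
        System.out.println(after.getClass().getSimpleName());  // Instant
    }
}
```

Because the change depends on whether the utility class has been loaded yet
in the same JVM, a reproduction could be sensitive to test order or to how
the test runner forks processes.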

On Wed, Apr 29, 2020 at 9:59 PM Reuven Lax  wrote:

> I've copied this failing test into my client, and it passes for me. I
> can't reproduce the failure.
>
> On Wed, Apr 29, 2020 at 6:34 PM Luke Cwik  wrote:
>
>> +dev  +Brian Hulette  +Reuven
>> Lax 
>>
>> On Wed, Apr 29, 2020 at 4:21 AM Paolo Tomeo  wrote:
>>
>>> Hi all,
>>>
>>> I think the method AvroUtils.toBeamSchema has a not expected side
>>> effect.
>>> I found out that, if you invoke it and then you run a pipeline of
>>> GenericRecords containing a timestamp (l tried with logical-type
>>> timestamp-millis), Beam converts such timestamp from long to
>>> org.joda.time.DateTime. Even if you don't apply any transformation to the
>>> pipeline.
>>> Do you think it's a bug?
>>>
>>> Below you can find a simple test class I wrote in order to replicate the
>>> problem.
>>> The first test passes while the second fails.
>>>
>>>
>>> import org.apache.avro.Schema;
>>> import org.apache.avro.SchemaBuilder;
>>> import org.apache.avro.generic.GenericRecord;
>>> import org.apache.avro.generic.GenericRecordBuilder;
>>> import org.apache.beam.sdk.coders.AvroCoder;
>>> import org.apache.beam.sdk.schemas.utils.AvroUtils;
>>> import org.apache.beam.sdk.testing.TestPipeline;
>>> import org.apache.beam.sdk.transforms.Combine;
>>> import org.apache.beam.sdk.transforms.Create;
>>> import org.apache.beam.sdk.transforms.SerializableFunction;
>>> import org.junit.Rule;
>>>
>>> import java.sql.Timestamp;
>>>
>>> import static org.junit.Assert.assertEquals;
>>>
>>> public class AvroUtilsSideEffect {
>>>
>>>   @Rule
>>>   public final transient TestPipeline pipeline = TestPipeline.create();
>>>   @Rule
>>>   public final transient TestPipeline pipeline2 = TestPipeline.create();
>>>
>>>   public final Schema testSchema = SchemaBuilder
>>>       .record("record").namespace("test")
>>>       .fields()
>>>       .name("timestamp").type().longBuilder()
>>>       .prop("logicalType", "timestamp-millis").endLong().noDefault()
>>>       .endRecord();
>>>
>>>   public final GenericRecord record = new GenericRecordBuilder(testSchema)
>>>       .set("timestamp", new Timestamp(156392640L).getTime())
>>>       .build();
>>>
>>>   // Reported to pass: AvroUtils is not touched before the pipeline runs.
>>>   @org.junit.Test
>>>   public void test() {
>>>     pipeline
>>>         .apply(Create.of(record).withCoder(AvroCoder.of(testSchema)))
>>>         .apply(Combine.globally(new TestFn()));
>>>
>>>     pipeline.run().waitUntilFinish();
>>>   }
>>>
>>>   // Reported to fail: the only difference is the call to AvroUtils.toBeamSchema.
>>>   @org.junit.Test
>>>   public void test2() {
>>>     AvroUtils.toBeamSchema(testSchema);
>>>
>>>     pipeline2
>>>         .apply(Create.of(record).withCoder(AvroCoder.of(testSchema)))
>>>         .apply(Combine.globally(new TestFn()));
>>>
>>>     pipeline2.run().waitUntilFinish();
>>>   }
>>>
>>>   public static class TestFn
>>>       implements SerializableFunction<Iterable<GenericRecord>, GenericRecord> {
>>>
>>>     @Override
>>>     public GenericRecord apply(Iterable<GenericRecord> input) {
>>>       for (GenericRecord item : input) {
>>>         if (item != null) {
>>>           // The timestamp is expected to still be a raw long.
>>>           assertEquals(Long.class, item.get("timestamp").getClass());
>>>           assertEquals(156392640L, item.get("timestamp"));
>>>         }
>>>         return item;
>>>       }
>>>       return null;
>>>     }
>>>   }
>>> }
>>>
>>> Thanks,
>>> Paolo
>>>
>>> --
>>> Paolo Tomeo, PhD
>>>
>>> Big Data and Machine Learning Engineer
>>>
>>> linkedin.com/in/ptomeo 
>>>
>>


"DNS resolution failed"

2020-04-30 Thread Maximilian Michels
Hi,

Is anyone familiar with this GRPC error? The build logs are full of it.
Also getting it on my machine when I run tests:

23:17:02 ERROR:apache_beam.runners.worker.data_plane:Failed to read inputs in the data plane.
23:17:02 Traceback (most recent call last):
23:17:02   File "apache_beam/runners/worker/data_plane.py", line 528, in _read_inputs
23:17:02     for elements in elements_iterator:
23:17:02   File "/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python2_PVR_Flink_Phrase/src/build/gradleenv/1866363813/local/lib/python2.7/site-packages/grpc/_channel.py", line 413, in next
23:17:02     return self._next()
23:17:02   File "/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python2_PVR_Flink_Phrase/src/build/gradleenv/1866363813/local/lib/python2.7/site-packages/grpc/_channel.py", line 689, in _next
23:17:02     raise self
23:17:02 _MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
23:17:02   status = StatusCode.UNAVAILABLE
23:17:02   details = "DNS resolution failed"
23:17:02   debug_error_string = "{"created":"@1588108621.907750662","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3981,"referenced_errors":[{"created":"@1588108621.907745000","description":"Resolver transient failure","file":"src/core/ext/filters/client_channel/resolving_lb_policy.cc","file_line":214,"referenced_errors":[{"created":"@1588108621.907743049","description":"DNS resolution failed","file":"src/core/ext/filters/client_channel/resolver/dns/c_ares/dns_resolver_ares.cc","file_line":357,"grpc_status":14,"referenced_errors":[{"created":"@1588108621.907719737","description":"C-ares status is not ARES_SUCCESS: Misformatted domain name","file":"src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc","file_line":244,"referenced_errors":[{"created":"@1588108621.907691960","description":"C-ares status is not ARES_SUCCESS: Misformatted domain name","file":"src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc","file_line":244}]}]}]}]}"

https://builds.apache.org/job/beam_PreCommit_Python2_PVR_Flink_Phrase/158/console

Looks like a recent regression. Tracked here:
https://jira.apache.org/jira/browse/BEAM-9851

Thanks,
Max



Re: Python 3.7 docker container fails to build

2020-04-30 Thread Maximilian Michels
*It's working again, probably because it's running on a different
machine now.

Who can check the disk space of the Jenkins hosts?

Thanks,
Max

On 30.04.20 11:55, Maximilian Michels wrote:
> Sorry, I meant to include the Jenkins log:
> https://builds.apache.org/job/beam_LoadTests_Python_ParDo_Flink_Streaming_PR/5/console
> 
> Thanks for investigating Hannah! Indeed, I can see the no space left on
> device in the following but not in the log above:
> https://builds.apache.org/job/beam_PreCommit_PythonDocker_Commit/473/console
> 
> I'm going to try running the build again. Do you think we could add more
> storage to our Jenkins hosts or delete old build data?
> 
> Thanks,
> Max
> 
> On 30.04.20 08:43, Hannah Jiang wrote:
>> Max, I found a link from your PR and noticed below errors. This would be
>> the true error.
>>
>> *07:57:03* > *Task :sdks:python:container:py37:docker*
>> *07:57:03* ERROR: Could not install packages due to an EnvironmentError: [Errno 28] No space left on device
>> *07:57:03* > *Task :sdks:python:container:py35:docker*
>> *07:57:03* ERROR: Could not install packages due to an EnvironmentError: [Errno 28] No space left on device
>>
>>
>>
>> On Wed, Apr 29, 2020 at 5:59 PM Hannah Jiang wrote:
>>
>> There is a PythonDocker Precommit test running for PRs with Python
>> changes. It seems running well.[1]
>> Max, can you please give me a link so I can check more details? Do
>> other images with different Python versions fail as well?
>>
>> 1. https://builds.apache.org/job/beam_PreCommit_PythonDocker_Commit/
>>
>>
>> On Wed, Apr 29, 2020 at 2:44 PM Ahmet Altay wrote:
>>
>> +Valentyn Tymofieiev  +Hannah Jiang
>>  -- in case they have relevant
>> information.
>>
>> On Wed, Apr 29, 2020 at 12:35 PM Maximilian Michels wrote:
>>
>> Hi,
>>
>> has anyone noticed the Python 3.7 Docker container fails to
>> build? I
>> haven't been able to build the Python 3.7 container, neither
>> locally nor
>> on Jenkins.
>>
>> I get:
>>
>> 17:48:10 > Task :sdks:python:container:py37:docker
>> 17:49:36 The command '/bin/sh -c pip install -r
>> /tmp/base_image_requirements.txt && python -c "from
>> google.protobuf.internal import api_implementation; assert
>> api_implementation._default_implementation_type == 'cpp'; print
>> ('Verified fast protobuf used.')" && rm -rf
>> /root/.cache/pip' returned a
>> non-zero code: 1
>> 17:49:36
>> 17:49:36 > Task :sdks:python:container:py37:docker FAILED
>>
>>
>> Cheers,
>> Max
>>


Re: Greetings from Tyson

2020-04-30 Thread Connell O'Callaghan
Welcome Tyson!!!



On Thu, Apr 30, 2020 at 6:12 AM Ismaël Mejía  wrote:

> Welcome!
>
> On Thu, Apr 30, 2020 at 12:27 AM Alan Myrvold  wrote:
> >
> > Welcome, Tyson!
> >
> > On Wed, Apr 29, 2020 at 3:15 PM Rui Wang  wrote:
> >>
> >> Welcome!
> >>
> >> -Rui
> >>
> >> On Wed, Apr 29, 2020, 3:13 PM Brian Hulette 
> wrote:
> >>>
> >>> Welcome Tyson!
> >>>
> >>> On Wed, Apr 29, 2020 at 2:54 PM Ahmet Altay  wrote:
> 
>  Welcome!
> 
>  On Tue, Apr 28, 2020 at 3:06 PM Hannah Jiang 
> wrote:
> >
> > Welcome to the community!
> >
> >
> > On Tue, Apr 28, 2020 at 2:45 PM Tyson Hamilton 
> wrote:
> >>
> >> Hello Beam Community,
> >>
> >> This is just a simple 'Hello' to introduce myself. I'm a Software
> Engineer at Google and have worked with data processing languages and
> runtime systems on and off during my career. I now have the pleasure of
> dedicating more time towards working with you lovely folks on Beam and I'm
> really excited!
> >>
> >> I hope you're all doing well and staying safe in these difficult
> times.
> >>
> >> -Tyson
> >>
> >>
> >>
>


Re: Greetings from Tyson

2020-04-30 Thread Ismaël Mejía
Welcome!

On Thu, Apr 30, 2020 at 12:27 AM Alan Myrvold  wrote:
>
> Welcome, Tyson!
>
> On Wed, Apr 29, 2020 at 3:15 PM Rui Wang  wrote:
>>
>> Welcome!
>>
>> -Rui
>>
>> On Wed, Apr 29, 2020, 3:13 PM Brian Hulette  wrote:
>>>
>>> Welcome Tyson!
>>>
>>> On Wed, Apr 29, 2020 at 2:54 PM Ahmet Altay  wrote:

 Welcome!

 On Tue, Apr 28, 2020 at 3:06 PM Hannah Jiang  
 wrote:
>
> Welcome to the community!
>
>
> On Tue, Apr 28, 2020 at 2:45 PM Tyson Hamilton  wrote:
>>
>> Hello Beam Community,
>>
>> This is just a simple 'Hello' to introduce myself. I'm a Software 
>> Engineer at Google and have worked with data processing languages and 
>> runtime systems on and off during my career. I now have the pleasure of 
>> dedicating more time towards working with you lovely folks on Beam and 
>> I'm really excited!
>>
>> I hope you're all doing well and staying safe in these difficult times.
>>
>> -Tyson
>>
>>
>>


Re: Rethinking Python's PortableRunner default job server

2020-04-30 Thread Ismaël Mejía
+1 for A. There are zero reasons to have a runner set by default; being
explicit is better, as Robert suggests, and it resolves the confusion that
the user reported.
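
As a rough sketch of what option A could look like: fail fast when
job_endpoint is unset and point the user at FlinkRunner. The real logic lives
in the Python SDK and may look different; PortableRunnerOptionsCheck, the
error text, and the example endpoint are all hypothetical, and Java is used
here only for illustration.

```java
import java.util.Optional;

/** Hypothetical option check; not the Beam Python SDK's actual code. */
final class PortableRunnerOptionsCheck {

    /** Returns the configured job endpoint, or fails with a hint if it is unset or blank. */
    static String requireJobEndpoint(Optional<String> jobEndpoint) {
        return jobEndpoint
            .filter(endpoint -> !endpoint.trim().isEmpty())
            .orElseThrow(() -> new IllegalArgumentException(
                "PortableRunner requires job_endpoint to be set. "
                    + "To have a Flink job server started for you, use runner=FlinkRunner."));
    }

    public static void main(String[] args) {
        // "localhost:8099" is only an example value.
        System.out.println(requireJobEndpoint(Optional.of("localhost:8099")));
        try {
            requireJobEndpoint(Optional.empty());
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```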

On Wed, Apr 29, 2020 at 10:05 PM Robert Bradshaw  wrote:
>
> +1, I was actually thinking about this just the other day. PortableRunner 
> should require job_endpoint to be set, and we can have a nice error message 
> directing the explicit use of FlinkRunner for the old behavior.
>
> On Wed, Apr 29, 2020 at 11:50 AM Kyle Weaver  wrote:
>>
>> > Could the error message suggest switching to FlinkRunner (and/or other 
>> > runners that start a job server for you)? Then it seems like the breakage 
>> > would only be a minor annoyance.
>>
>> Definitely.
>>
>> On Wed, Apr 29, 2020 at 2:49 PM Brian Hulette  wrote:
>>>
>>> Could the error message suggest switching to FlinkRunner (and/or other 
>>> runners that start a job server for you)? Then it seems like the breakage 
>>> would only be a minor annoyance.
>>>
>>> Brian
>>>
>>> On Wed, Apr 29, 2020 at 11:32 AM Kyle Weaver  wrote:

 Hi all,

 Currently, when running a pipeline that has the options 
 runner=PortableRunner and job_endpoint unset, the Python SDK spins up a 
 Dockerized Flink job server [1]. This is problematic because the 
 PortableRunner can be used by any portable runner. So for example, a Spark 
 runner user was recently baffled when their job ran successfully but 
 printed a bunch of Flink log messages.

 There are not too many uses of this default behavior to my knowledge, at 
 least within Beam itself. The only example I could find was in the 
 portableWordCount tests, which is mostly the same as 
 portableWordCountFlinkRunner tests [2]. The default behavior is entirely 
 superseded by the FlinkRunner class, which provides better encapsulation.

 I also noticed that DockerizedJobServer is only used by [3]. In 
 FlinkRunner, we pull the job server from Maven if necessary and call Java 
 directly. In general, I think there are already quite enough knobs in the 
 portability framework, so we should remove it unless there is reason to 
 prefer running the job server with Docker instead of calling Java directly.

 There are a couple options:

 A) Remove the default behavior and require job_endpoint to always be set 
 when using PortableRunner. This would be a breaking change.
 B) Keep the current behavior, but warn when the user sets 
 runner=PortableRunner without job_endpoint. This is easy to miss, but it's 
 better than nothing.

 What do you think?

 [1] 
 https://github.com/apache/beam/blob/33c73739cec8bc6a7c8319efa41eda7a2540bce1/sdks/python/apache_beam/runners/portability/job_server.py#L184
 [2] 
 https://github.com/apache/beam/blob/b3596b89dbc002c686bdaa7853074e757a81b6fb/buildSrc/src/main/groovy/org/apache/beam/gradle/BeamModulePlugin.groovy#L1983-L2048
 [3] 
 https://github.com/apache/beam/blob/33c73739cec8bc6a7c8319efa41eda7a2540bce1/sdks/python/apache_beam/runners/portability/job_server.py#L163


Re: Python 3.7 docker container fails to build

2020-04-30 Thread Maximilian Michels
Sorry, I meant to include the Jenkins log:
https://builds.apache.org/job/beam_LoadTests_Python_ParDo_Flink_Streaming_PR/5/console

Thanks for investigating, Hannah! Indeed, I can see the "No space left on
device" error in the following log, but not in the log above:
https://builds.apache.org/job/beam_PreCommit_PythonDocker_Commit/473/console

I'm going to try running the build again. Do you think we could add more
storage to our Jenkins hosts or delete old build data?
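
Until then, a pre-build check along these lines might make the failure
easier to recognize. It is only a rough sketch: has_free_space, the default
path and the 5 GiB threshold are assumptions for illustration, not anything
configured on the Jenkins hosts.

```python
import shutil


def has_free_space(path=".", min_free_bytes=5 * 1024**3):
    """Return True if the filesystem holding `path` has enough free bytes.

    Hypothetical helper; the 5 GiB default is an arbitrary placeholder.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free
    return usage.free >= min_free_bytes
```

A build step could call this before the Docker task and fail fast with a
clear message instead of surfacing the pip EnvironmentError.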

Thanks,
Max

On 30.04.20 08:43, Hannah Jiang wrote:
> Max, I found a link in your PR and noticed the errors below. These appear
> to be the real cause.
> 
> 07:57:03 > Task :sdks:python:container:py37:docker
> 07:57:03 ERROR: Could not install packages due to an EnvironmentError:
> [Errno 28] No space left on device
> 07:57:03
> 07:57:03 > Task :sdks:python:container:py35:docker
> 07:57:03 ERROR: Could not install packages due to an EnvironmentError:
> [Errno 28] No space left on device
> 
> 
> 
> On Wed, Apr 29, 2020 at 5:59 PM Hannah Jiang wrote:
> 
> There is a PythonDocker precommit test that runs for PRs with Python
> changes. It seems to be running well [1].
> Max, can you please give me a link so I can check more details? Do
> the images for other Python versions fail as well?
> 
> 1. https://builds.apache.org/job/beam_PreCommit_PythonDocker_Commit/
> 
> 
> On Wed, Apr 29, 2020 at 2:44 PM Ahmet Altay wrote:
> 
> +Valentyn Tymofieiev +Hannah Jiang -- in case they have relevant
> information.
> 
> On Wed, Apr 29, 2020 at 12:35 PM Maximilian Michels <m...@apache.org> wrote:
> 
> Hi,
> 
> Has anyone noticed that the Python 3.7 Docker container fails to build?
> I haven't been able to build the Python 3.7 container, either locally or
> on Jenkins.
> 
> I get:
> 
> 17:48:10 > Task :sdks:python:container:py37:docker
> 17:49:36 The command '/bin/sh -c pip install -r
> /tmp/base_image_requirements.txt && python -c "from
> google.protobuf.internal import api_implementation; assert
> api_implementation._default_implementation_type == 'cpp'; print
> ('Verified fast protobuf used.')" && rm -rf
> /root/.cache/pip' returned a
> non-zero code: 1
> 17:49:36
> 17:49:36 > Task :sdks:python:container:py37:docker FAILED
> 
> 
> Cheers,
> Max
>