Re: Pipeline AttributeError on Python3

2019-11-06 Thread Rakesh Kumar
Thanks Valentyn,

aggregation_transform.py doesn't have any transform that extends beam.DoFn;
we are using plain Python methods passed to beam.Map(). I am not sure how to
get a dump of serialized_fn. Can you please let me know the process?

I also heard that some people ran into this issue on Python 3.7.1, but that
the same issue is not present on Python 3.7.3. Can you confirm this?



On Mon, Oct 28, 2019 at 5:00 PM Valentyn Tymofieiev 
wrote:

> +user@, bcc: dev@
> https://issues.apache.org/jira/browse/BEAM-6158 may be contributing to
> this issue, although we saw instances of this bug in exactly the opposite
> scenario - when the pipeline was defined *in one file*, but not when it was
> split across multiple files.
>
> Could you try replacing instances of super() in aggregation_transform.py
> as done in https://github.com/apache/beam/pull/9513 and see if this issue
> is still reproducible?
>
> If that doesn't work, I would try to get the dump of serialized_fn, and
> try to reproduce the issue in isolated environment, such as:
>
> from apache_beam.internal import pickler
> serialized_fn = "..content.."
> pickler.loads(serialized_fn)
>
> then I would try to trim the DoFn in the example down to a minimal
> reproducible example. It could also be another issue with the dill
> dependency.
>
>
> On Mon, Oct 28, 2019 at 2:48 PM Rakesh Kumar  wrote:
>
>> Hi All,
>>
>> We have noticed a weird intermittent issue on Python 3 that we don't run
>> into on Python 2. Sometimes when we try to submit the pipeline, we get an
>> AttributeError (check the stack trace below). We have double-checked that
>> the attributes/methods are present in the right module and in the right
>> place, but the pipeline still complains about them. In some cases, we
>> reference methods before their definition; we tried reordering the method
>> definitions, but that didn't help at all.
>>
>> We don't see the same issue when the entire pipeline is defined in one
>> file. Also, note that this doesn't happen every time we submit the
>> pipeline, so I suspect some kind of race condition. When we enable the
>> worker recycle logic, it happens most of the time when the SDK worker is
>> recycled.
>>
>> Some more information about the environment:
>> Python version: 3
>> Beam version: 2.16
>> Flink version: 1.8
>>
>> *Stack trace: *
>>
>>
>> TimerException{java.lang.RuntimeException: Failed to finish remote bundle}
>> at
>> org.apache.flink.streaming.runtime.tasks.SystemProcessingTimeService$RepeatedTriggerTask.run(SystemProcessingTimeService.java:335)
>> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>> at
>> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>> at
>> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
>> at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>> at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>> at java.lang.Thread.run(Thread.java:748)
>> Caused by: java.lang.RuntimeException: Failed to finish remote bundle
>> at
>> org.apache.beam.runners.flink.translation.wrappers.streaming.ExecutableStageDoFnOperator$SdkHarnessDoFnRunner.finishBundle(ExecutableStageDoFnOperator.java:667)
>> at
>> org.apache.beam.runners.core.StatefulDoFnRunner.finishBundle(StatefulDoFnRunner.java:144)
>> at
>> org.apache.beam.runners.flink.translation.wrappers.streaming.ExecutableStageDoFnOperator$2.finishBundle(ExecutableStageDoFnOperator.java:754)
>> at
>> org.apache.beam.runners.flink.metrics.DoFnRunnerWithMetricsUpdate.finishBundle(DoFnRunnerWithMetricsUpdate.java:86)
>> at
>> org.apache.beam.runners.core.SimplePushbackSideInputDoFnRunner.finishBundle(SimplePushbackSideInputDoFnRunner.java:118)
>> at
>> org.apache.beam.runners.flink.translation.wrappers.streaming.DoFnOperator.invokeFinishBundle(DoFnOperator.java:750)
>> at
>> org.apache.beam.runners.flink.translation.wrappers.streaming.DoFnOperator.checkInvokeFinishBundleByTime(DoFnOperator.java:744)
>> at
>> org.apache.beam.runners.flink.translation.wrappers.streaming.DoFnOperator.lambda$open$1(DoFnOperator.java:460)
>> at
>> org.apache.flink.streaming.runtime.tasks.SystemProcessingTimeService$RepeatedTriggerTask.run(SystemProcessingTimeService.java:330)
>> ... 7 more
>> Caused by: java.util.concurrent.ExecutionException:
>> java.lang.RuntimeException: Error received from SDK harness for instruction
>> 6: Traceback (most recent call last):
>>   File
>> "/srv/venvs/service/trusty/service_venv_python3.6/lib/python3.6/site-packages/apache_beam/runners/worker/sdk_worker.py",
>> line 307, in get
>> processor = self.cached_bundle_processors[bundle_descriptor_id].pop()
>> IndexError: pop from empty list
>>
>> During handling of the above exception, another exception occurred:
>>
>> Traceback (most 
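
Valentyn's isolation suggestion above amounts to roughly the following
sketch; the payload and the imported function are placeholders, not values
from this thread:

from apache_beam.internal import pickler

# Placeholder payload; in practice it would be captured from the failing job.
serialized_fn = "..content.."
fn = pickler.loads(serialized_fn)  # raises AttributeError when the bug reproduces

# A suspect callable can also be round-tripped locally to check that it
# pickles; the module and function names here are hypothetical.
from aggregation_transform import my_map_fn
assert pickler.loads(pickler.dumps(my_map_fn)) is not None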

Re: 10,000 Pull Requests

2019-11-06 Thread Pablo Estrada
iiipe : )

On Thu, Nov 7, 2019 at 12:59 AM Kenneth Knowles  wrote:

> Awesome!
>
> Number of days from PR #1 to PR #1000: 211
> Number of days from PR #9000 to PR #10000: 71
>
> Kenn
>
> On Wed, Nov 6, 2019 at 6:28 AM Łukasz Gajowy  wrote:
>
>> Yay! Nice! :)
>>
>> On Wed, Nov 6, 2019 at 14:38 Maximilian Michels  wrote:
>>
>>> Just wanted to point out, we have crossed the 10,000 PRs mark :)
>>>
>>> ...and the winner is: https://github.com/apache/beam/pull/1
>>>
>>> Seriously, I think Beam's culture to promote PRs over direct access to
>>> the repository is remarkable. To another 10,000 PRs!
>>>
>>> Cheers,
>>> Max
>>>
>>


Re: Cython unit test suites running without Cythonized sources

2019-11-06 Thread Chad Dombrova
Another potential solution would be to _not_ use the sdist task to build
the tarball and let tox do it.  Tox should install cython on supported
platforms before running sdist itself (which it does by default unless you
explicitly provide it with a tarball, which we are doing).  This has the
added benefit of one less virtualenv.  Right now we create a virtualenv to
build the sdist tarball, then we create another virtualenv to run tox, then
tox creates a virtualenv to run the task.   It's unclear (to me) whether
the tarball is rebuilt for each tox task or if it's reused.

-chad


On Wed, Nov 6, 2019 at 6:13 PM Udi Meiri  wrote:

> I opened this bug today after commenting
>  on
> Chad's type hints PR.
> https://issues.apache.org/jira/browse/BEAM-8572?filter=-1
>
> I am 95% sure that our Precommit tests are using tarballs that are built
> without Cython (including the Cython tasks).
>
> I'm NOT currently working on fixing this. One solution might be to add an
> additional task (sdistCython) and tell gradle that sdist and the new task
> should not run concurrently.
> Does anyone want to take this up?
>


Cython unit test suites running without Cythonized sources

2019-11-06 Thread Udi Meiri
I opened this bug today after commenting
 on Chad's
type hints PR.
https://issues.apache.org/jira/browse/BEAM-8572?filter=-1

I am 95% sure that our Precommit tests are using tarballs that are built
without Cython (including the Cython tasks).

I'm NOT currently working on fixing this. One solution might be to add an
additional task (sdistCython) and tell gradle that sdist and the new task
should not run concurrently.
Does anyone want to take this up?



Re: [Discuss] Beam Summit 2020 Dates & locations

2019-11-06 Thread Pablo Estrada
I've added some Geo information to the Beam GA report[1]. In my opinion, it
makes sense to target cities with the largest existing Beam presence (along
with, perhaps, places where people can travel easily).

- London and Paris are the top cities in Europe for visits to the Beam
site, so I feel that Paris could be a good location.
- In the US, NY is the largest single city, but if we count all of the
Silicon Valley locations near the top, then both NYC and Bay Area are very
large (Bay Area comes out on top).

- For Asia, I guess it's a little tricky. Are community members aware of
meetups that have been organized in Asia? We could try to work backwards
from that evidence of 'community engagement', and see if there are some
outstanding locations. In the GA report, Japan, China, India are all very
large - Bangalore is the second largest city globally for visits to the
Beam site. +Reza Rokni  thoughts?

[1] http://s.apache.org/beam-ga-report


On Thu, Nov 7, 2019 at 3:17 AM Griselda Cuevas  wrote:

> Hi Beam Community!
>
> I'd like to kick off a thread to discuss potential dates and venues for
> the 2020 Beam Summits.
>
> I did some research on industry conferences happening in 2020 and
> pre-selected a few ranges as follows:
>
> (2 days) NA between mid-May and mid-June
> (2 days) EU mid October
> (1 day) Asia Mini Summit:  March
>
> I'd like to hear your thoughts on these dates and get consensus on exact
> dates as the convo progresses.
>
> For locations these are the options I reviewed:
>
> *NA: *Austin Texas, Berkeley California, Mexico City.
> *Europe: *Warsaw, Barcelona, Paris
> *Asia: *Singapore
>
> Let the discussion begin!
> G (on behalf of the Beam Summit Steering Committee)
>


Getting contributor permission to JIRA

2019-11-06 Thread Wenjia Liu
Hi,

This is Wendy from Google. I'm contributing to adding more tests for Beam
Python. Could anyone add me as a contributor for JIRA? I'd like to assign
this issue BEAM-8575 to myself.

Thanks,
Wendy


(Question) SQL integration tests for MongoDb

2019-11-06 Thread Kirill Kozlov
Hi everyone!

I am trying to test the MongoDb SQL table, but I'm not quite sure how to pass
the pipeline options (hostName, port, and databaseName) used by Jenkins.

It looks like the integration test for the MongoDbIO connector obtains those
values from the
'beam/.test-infra/jenkins/job_PerformanceTests_MongoDBIO_IT.groovy' file
by calling the following methods in the 'build.gradle' file:
provideIntegrationTestingDependencies()
enableJavaPerformanceTesting()

The SQL build file already has a task named 'integrationTest' defined and
does not let us call `enableJavaPerformanceTesting()`.

I would really appreciate it if someone could provide a couple of pointers
on getting this to work.

-
Kirill


Re: Is there good way to make Python SDK docs draft accessible?

2019-11-06 Thread Valentyn Tymofieiev
Hi Yoshiki,

Were you able to find the information you need to regenerate the
documentation?

Thanks,
Valentyn

On Tue, Oct 29, 2019 at 8:01 AM Yoshiki Obata 
wrote:

> Thank you for advising, Udi and Ahmet.
> I'll take a look at the release process.
>
> Tue, Oct 29, 2019 at 3:47 AM Ahmet Altay :
> >
> > Thank you for doing this. It should be possible to run tox as Udi
> suggested and create a PR for review purposes similar to the release
> process (see:
> https://beam.apache.org/contribute/release-guide/#build-the-pydoc-api-reference
> )
> >
> > /cc +Valentyn Tymofieiev -- This is likely a required item before
> retiring python 2 support.
> >
> > Ahmet
> >
> > On Mon, Oct 28, 2019 at 11:21 AM Udi Meiri  wrote:
> >>
> >> I believe that generating pydoc for the website is still a manual
> process (unlike the rest of the website?).
> >> The reviewer will need to manually generate the docs (check out the PR,
> run tox -e docs).
> >>
> >> On Mon, Oct 28, 2019 at 10:55 AM Yoshiki Obata 
> wrote:
> >>>
> >>> Hi all.
> >>>
> >>> I'm working on enabling generation of the Python SDK docs with Python 3 [1].
> >>> I have modified the scripts, and the generated docs now need to be
> >>> reviewed by someone.
> >>>
> >>> But there seems to be no existing way to upload the generated docs
> >>> somewhere anyone can access them, unlike the website HTML, which can be
> >>> uploaded to GCS via a Jenkins job.
> >>> Would anyone know a good way to make the generated docs accessible to
> >>> reviewers?
> >>>
> >>> [1] https://issues.apache.org/jira/browse/BEAM-7847
> >>>
> >>>
> >>> Best regards,
> >>> Yoshiki
> >>>
> >>>
> >>> --
> >>> Yoshiki Obata
> >>> mail: yoshiki.ob...@gmail.com
> >>> gh: https://github.com/lazylynx
>


Re: published containers overwrite locally built containers

2019-11-06 Thread Valentyn Tymofieiev
On Wed, Nov 6, 2019 at 3:28 PM Heejong Lee  wrote:

> I think that implicitly (and forcefully) pulling the remote image is not good
> even in the case of a bug fix. The better approach would be releasing a
> separate bug-fix version. Implicitly pulling an updated version of the
> same container looks weird to me, since it feels like releasing a jar
> artifact with the same version multiple times or publishing an already
> published git branch again. However, I understand it's much easier to just
> update the container with the same tag than to release another Beam version.
>

Fair point. If we need to release an update to a Beam container image, we
would likely have a bugfix release. However, there are other use cases
(pre-release images, custom images) where a pull may result in more
deterministic behavior, as discussed in earlier messages.


>
> On Wed, Nov 6, 2019 at 8:05 AM Valentyn Tymofieiev 
> wrote:
>
>> I agree with the resolutions in the link Thomas mentioned [1]. Using
>> latest tag is not reliable, and a unique tag ID should be generated when
>> running tests on Jenkins against master branch.
>> I think pulling the latest image for the current tag is actually a
>> desired behavior, in case the external image was updated (due to a bug fix
>> for example). Our custom container documentation should reflect this
>> behavior.
>> Consider continuing the conversation in [1] to keep it in one place if
>> there are other suggestions/opinions.
>>
>> [1]
>> https://lists.apache.org/thread.html/07131e314e229ec60100eaa2c0cf6dfc206bf2b0f78c3cee9ebb0bda@%3Cdev.beam.apache.org%3E
>>
>>
>> On Fri, Nov 1, 2019 at 5:14 PM Thomas Weise  wrote:
>>
>>> More here:
>>> https://lists.apache.org/thread.html/07131e314e229ec60100eaa2c0cf6dfc206bf2b0f78c3cee9ebb0bda@%3Cdev.beam.apache.org%3E
>>>
>>>
>>> On Fri, Nov 1, 2019 at 10:56 AM Chamikara Jayalath 
>>> wrote:
>>>
 I think it makes sense to override published docker images with locally
 built versions when testing HEAD.

 Thanks,
 Cham

 On Thu, Oct 31, 2019 at 6:31 PM Heejong Lee  wrote:

> Hi, happy halloween!
>
> I'm looking into failing cross language post commit tests:
> https://issues.apache.org/jira/browse/BEAM-8534
> 
>
> After a few runs, I've found that published SDK harness containers
> overwrite locally built containers when a docker pull happens. I can think
> of two possible solutions here: 1) remove the published images with the
> latest tag, to make the locally built image with the latest tag available
> for testing and development purposes; 2) add a serialVersionUID to the
> class printing out the error.
>
> 2) doesn't sound like a fundamental solution if we're not going to
> attach serialVersionUID to all serializable classes. 1) might work, but I'm
> not sure whether there's another use for the latest tag somewhere. Any
> better ideas?
>
> Thanks,
> Heejong
>



Re: Command for Beam worker on Spark cluster

2019-11-06 Thread Kyle Weaver
> Where can I extract these parameters from?

These parameters should be passed automatically when the process is run
(note the use of $* in the example script):
https://github.com/apache/beam/blob/fbc84b61240a3d83d9c19f7ccc17ba22e5d7e2c9/runners/java-fn-execution/src/main/java/org/apache/beam/runners/fnexecution/environment/ProcessEnvironmentFactory.java#L115-L121

> Also, how spark executor can find the port that grpc server is running on?

Not sure which grpc server you mean here.

On Wed, Nov 6, 2019 at 3:32 PM Matthew K.  wrote:

> Thanks. I still need to pass parameters to the boot executable, such as
> the worker id, control endpoint, logging endpoint, etc.
>
> Where can I extract these parameters from? (In the apache_beam Python code,
> those can be extracted from the StartWorker request parameters.)
>
> Also, how can the Spark executor find the port that the grpc server is
> running on?
>
> *Sent:* Wednesday, November 06, 2019 at 5:07 PM
> *From:* "Kyle Weaver" 
> *To:* dev 
> *Subject:* Re: Command for Beam worker on Spark cluster
> In Docker mode, most everything's taken care of for you, but in process
> mode you have to do a lot of setup yourself. The command you're looking for
> is `sdks/python/container/build/target/launcher/linux_amd64/boot`. You will
> be required to have both that executable (which you can build from source
> using `./gradlew :sdks:python:container:build`) and a Python installation
> including Beam and other dependencies on all of your worker machines.
>
> The best example I know of is here:
> https://github.com/apache/beam/blob/cbf8a900819c52940a0edd90f59bf6aec55c817a/sdks/python/test-suites/portable/py2/build.gradle#L146-L165
>
> On Wed, Nov 6, 2019 at 2:24 PM Matthew K.  wrote:
>
>> Hi all,
>>
>> I am trying to run a *Python* Beam pipeline on a Spark cluster. Since
>> workers are running on separate nodes, I am using "PROCESS" for
>> "environment_type" in the pipeline options, but I couldn't find any
>> documentation on what "command" I should pass to "environment_config" to
>> run on the worker so that the executor can communicate with it.
>>
>> Can someone help me on that?
>>
>
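
Putting Kyle's pointers together, a submission sketch for the PROCESS
environment might look like the following. The job endpoint and the boot
path are placeholders; the boot binary and a Python installation with
Beam's dependencies must exist on every worker node.

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder endpoint and path; build the boot binary with
# `./gradlew :sdks:python:container:build` and distribute it to the workers.
options = PipelineOptions([
    '--runner=PortableRunner',
    '--job_endpoint=localhost:8099',
    '--environment_type=PROCESS',
    '--environment_config={"command": "/path/to/linux_amd64/boot"}',
])

with beam.Pipeline(options=options) as p:
    _ = p | beam.Create(['hello']) | beam.Map(print)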


Re: Deprecate some or all of TestPipelineOptions?

2019-11-06 Thread Robert Bradshaw
+1 to all of these are probably obsolete at this point and would be
nice to remove.


On Wed, Nov 6, 2019 at 3:00 PM Kenneth Knowles  wrote:
>
> Good find. I think TestPipelineOptions is from very early days. It makes 
> sense to me that these are all obsolete. Some guesses, though I haven't dug 
> through commit history to confirm:
>
>  - TempRoot: a while ago TempLocation was optional, so I think this would 
> provide a default for things like gcpTempLocation and stagingLocation
>  - OnSuccessMatcher: for runners where pipeline used to not terminate in 
> streaming mode. Now I think every runner can successfully waitUntilFinish. 
> Also the current API for waitUntilFinish went through some evolutions around 
> asynchrony so it wasn't always a good choice.
>  - OnCreateMatcher: just for symmetry? I don't know
>  - TestTimeoutSeconds: probably also for the asynchrony/waitUntilFinish issue
>
> Kenn
>
> On Wed, Nov 6, 2019 at 12:19 PM Brian Hulette  wrote:
>>
>> I recently came across TestPipelineOptions, and now I'm wondering if maybe 
>> it should be deprecated. It only seems to actually be supported for Spark 
>> and Dataflow (via TestSparkRunner and TestDataflowRunner), and I think it 
>> may make more sense to move the functionality it provides into the tests 
>> that need it.
>>
>> TestPipelineOptions currently has four attributes:
>>
>> # TempRoot
>> Its purpose isn't documented, but many tests read TempRoot and use it to 
>> set a TempLocation (example). I think this attribute makes sense (e.g. we 
>> can set TempRoot once and each test has its own subdirectory), but I'm not 
>> sure. Can anyone confirm the motivation for it? I'd like to at least add a 
>> docstring for it.
>>
>> # OnCreateMatcher
>> A way to register a matcher that will be checked right after a pipeline has 
>> started. It's never set except for in TestDataflowRunnerTest, so I think 
>> this is absolutely safe to remove.
>>
>> # OnSuccessMatcher
>> A way to register a matcher that will be checked right after a pipeline has 
>> successfully completed. This is used in several tests 
>> (RequiresStableInputIT, WordCountIT, ... 8 total occurrences), but I don't 
>> see why they couldn't all be replaced with a `p.run().waitUntilFinish()`, 
>> followed by an assert.
>>
>> I think the current approach is actually dangerous, because running these 
>> tests with runners other than TestDataflowRunner or TestSparkRunner means 
>> the matchers are never actually checked. This is actually how I came across 
>> TestPipelineOptions - I tried running a test with the DirectRunner and 
>> couldn't make it fail.
>>
>> # TestTimeoutSeconds
>> Seems to just be a wrapper for `waitUntilFinish(duration)`, and only used in 
>> one place. I think it would be cleaner for the test to be responsible for 
>> calling waitUntilFinish (which we do elsewhere); the only drawback is it 
>> requires a small refactor so the test has access to the PipelineResult 
>> object.
>>
>>
>> So I have a couple of questions for the community
>> 1) Are there thoughts on TempRoot? Can we get rid of it?
>> 2) Are there any objections to removing the other three attributes? Am I 
>> missing something? Unless there are any objections I think I'll write a 
>> patch to remove them.
>>
>> Thanks,
>> Brian


Re: Command for Beam worker on Spark cluster

2019-11-06 Thread Matthew K.

Thanks. I still need to pass parameters to the boot executable, such as the worker id, control endpoint, logging endpoint, etc.

 

Where can I extract these parameters from? (In the apache_beam Python code, those can be extracted from the StartWorker request parameters.)

 

Also, how can the Spark executor find the port that the grpc server is running on?

 

Sent: Wednesday, November 06, 2019 at 5:07 PM
From: "Kyle Weaver" 
To: dev 
Subject: Re: Command for Beam worker on Spark cluster


In Docker mode, most everything's taken care of for you, but in process mode you have to do a lot of setup yourself. The command you're looking for is `sdks/python/container/build/target/launcher/linux_amd64/boot`. You will be required to have both that executable (which you can build from source using `./gradlew :sdks:python:container:build`) and a Python installation including Beam and other dependencies on all of your worker machines.
 
The best example I know of is here: https://github.com/apache/beam/blob/cbf8a900819c52940a0edd90f59bf6aec55c817a/sdks/python/test-suites/portable/py2/build.gradle#L146-L165


 


On Wed, Nov 6, 2019 at 2:24 PM Matthew K.  wrote:




Hi all,

 

I am trying to run a *Python* Beam pipeline on a Spark cluster. Since workers are running on separate nodes, I am using "PROCESS" for "environment_type" in the pipeline options, but I couldn't find any documentation on what "command" I should pass to "environment_config" to run on the worker so that the executor can communicate with it.

 

Can someone help me on that?










Re: published containers overwrite locally built containers

2019-11-06 Thread Heejong Lee
I think that implicitly (and forcefully) pulling the remote image is not good
even in the case of a bug fix. The better approach would be releasing a
separate bug-fix version. Implicitly pulling an updated version of the
same container looks weird to me, since it feels like releasing a jar
artifact with the same version multiple times or publishing an already
published git branch again. However, I understand it's much easier to just
update the container with the same tag than to release another Beam version.

On Wed, Nov 6, 2019 at 8:05 AM Valentyn Tymofieiev 
wrote:

> I agree with the resolutions in the link Thomas mentioned [1]. Using
> latest tag is not reliable, and a unique tag ID should be generated when
> running tests on Jenkins against master branch.
> I think pulling the latest image for the current tag is actually a desired
> behavior, in case the external image was updated (due to a bug fix for
> example). Our custom container documentation should reflect this behavior.
> Consider continuing the conversation in [1] to keep it in one place if
> there are other suggestions/opinions.
>
> [1]
> https://lists.apache.org/thread.html/07131e314e229ec60100eaa2c0cf6dfc206bf2b0f78c3cee9ebb0bda@%3Cdev.beam.apache.org%3E
>
>
> On Fri, Nov 1, 2019 at 5:14 PM Thomas Weise  wrote:
>
>> More here:
>> https://lists.apache.org/thread.html/07131e314e229ec60100eaa2c0cf6dfc206bf2b0f78c3cee9ebb0bda@%3Cdev.beam.apache.org%3E
>>
>>
>> On Fri, Nov 1, 2019 at 10:56 AM Chamikara Jayalath 
>> wrote:
>>
>>> I think it makes sense to override published docker images with locally
>>> built versions when testing HEAD.
>>>
>>> Thanks,
>>> Cham
>>>
>>> On Thu, Oct 31, 2019 at 6:31 PM Heejong Lee  wrote:
>>>
 Hi, happy halloween!

 I'm looking into failing cross language post commit tests:
 https://issues.apache.org/jira/browse/BEAM-8534
 

 After a few runs, I've found that published SDK harness containers
 overwrite locally built containers when a docker pull happens. I can think of
 two possible solutions here: 1) remove the published images with the latest
 tag, to make the locally built image with the latest tag available for testing
 and development purposes; 2) add a serialVersionUID to the class printing out
 the error.

 2) doesn't sound like a fundamental solution if we're not going to
 attach serialVersionUID to all serializable classes. 1) might work, but I'm
 not sure whether there's another use for the latest tag somewhere. Any
 better ideas?

 Thanks,
 Heejong

>>>


Re: Command for Beam worker on Spark cluster

2019-11-06 Thread Kyle Weaver
In Docker mode, most everything's taken care of for you, but in process
mode you have to do a lot of setup yourself. The command you're looking for
is `sdks/python/container/build/target/launcher/linux_amd64/boot`. You will
be required to have both that executable (which you can build from source
using `./gradlew :sdks:python:container:build`) and a Python installation
including Beam and other dependencies on all of your worker machines.

The best example I know of is here:
https://github.com/apache/beam/blob/cbf8a900819c52940a0edd90f59bf6aec55c817a/sdks/python/test-suites/portable/py2/build.gradle#L146-L165

On Wed, Nov 6, 2019 at 2:24 PM Matthew K.  wrote:

> Hi all,
>
> I am trying to run a *Python* Beam pipeline on a Spark cluster. Since
> workers are running on separate nodes, I am using "PROCESS" for
> "environment_type" in the pipeline options, but I couldn't find any
> documentation on what "command" I should pass to "environment_config" to
> run on the worker so that the executor can communicate with it.
>
> Can someone help me on that?
>


Re: Deprecate some or all of TestPipelineOptions?

2019-11-06 Thread Kenneth Knowles
Good find. I think TestPipelineOptions is from very early days. It makes
sense to me that these are all obsolete. Some guesses, though I haven't dug
through commit history to confirm:

 - TempRoot: a while ago TempLocation was optional, so I think this would
provide a default for things like gcpTempLocation and stagingLocation
 - OnSuccessMatcher: for runners where pipeline used to not terminate in
streaming mode. Now I think every runner can successfully waitUntilFinish.
Also the current API for waitUntilFinish went through some evolutions
around asynchrony so it wasn't always a good choice.
 - OnCreateMatcher: just for symmetry? I don't know
 - TestTimeoutSeconds: probably also for the asynchrony/waitUntilFinish issue

Kenn

On Wed, Nov 6, 2019 at 12:19 PM Brian Hulette  wrote:

> I recently came across TestPipelineOptions, and now I'm wondering if maybe
> it should be deprecated. It only seems to actually be supported for Spark
> and Dataflow (via TestSparkRunner and TestDataflowRunner), and I think it
> may make more sense to move the functionality it provides into the tests
> that need it.
>
> TestPipelineOptions
> 
> currently has four attributes:
>
> # TempRoot
> Its purpose isn't documented, but many tests read TempRoot and use it to
> set a TempLocation (example
> ).
> I think this attribute makes sense (e.g. we can set TempRoot once and each
> test has its own subdirectory), but I'm not sure. Can anyone confirm the
> motivation for it? I'd like to at least add a docstring for it.
>
> # OnCreateMatcher
> A way to register a matcher that will be checked right after a pipeline
> has started. It's never set except for in TestDataflowRunnerTest, so I
> think this is absolutely safe to remove.
>
> # OnSuccessMatcher
> A way to register a matcher that will be checked right after a pipeline
> has successfully completed. This is used in several tests (
> RequiresStableInputIT
> ,
> WordCountIT
> ,
> ... 8 total occurrences), but I don't see why they couldn't all be replaced
> with a `p.run().waitUntilFinish()`, followed by an assert.
>
> I think the current approach is actually dangerous, because running these
> tests with runners other than TestDataflowRunner or TestSparkRunner means
> the matchers are never actually checked. This is actually how I came across
> TestPipelineOptions - I tried running a test with the DirectRunner and
> couldn't make it fail.
>
> # TestTimeoutSeconds
> Seems to just be a wrapper for `waitUntilFinish(duration)`, and only used
> in one place
> .
> I think it would be cleaner for the test to be responsible for calling
> waitUntilFinish (which we do elsewhere); the only drawback is it requires a
> small refactor so the test has access to the PipelineResult object.
>
>
> So I have a couple of questions for the community
> 1) Are there thoughts on TempRoot? Can we get rid of it?
> 2) Are there any objections to removing the other three attributes? Am I
> missing something? Unless there are any objections I think I'll write a
> patch to remove them.
>
> Thanks,
> Brian
>


Re: How to use a locally built worker image?

2019-11-06 Thread Kyle Weaver
> These are -SNAPSHOT or .dev and distinct from releases. So perhaps this
is just a matter of correct version number management?

In light of the concerns Valentyn & Ahmet raise, it seems safer to change
tags instead of removing the pull. PR for Java:
https://github.com/apache/beam/pull/10017

On Wed, Nov 6, 2019 at 1:41 PM Ahmet Altay  wrote:

>
>
> On Wed, Nov 6, 2019 at 1:34 PM Valentyn Tymofieiev 
> wrote:
>
>> On Wed, Nov 6, 2019 at 11:48 AM Kyle Weaver  wrote:
>>
>>> The way the Python SDK currently does this is to use the version as the
>>> default tag, e.g. 2.16.0, while master uses 2.16.0.dev. This means there
>>> should never be any conflicts between a release and a developer image, unless
>>> the user deliberately changes the image tags.
>>>
>>> > if a user's pipeline relies on a container image released by Beam
>>> (or maybe a third party), external updates to such a container image may not
>>> propagate to the pipeline workflow without an explicit pull
>>>
>>> There should only be one released container per release. Upgrades to a
>>> container image should not happen independently of the release process.
>>>
>>
>> Fair point, although we have not yet encountered issues requiring an
>> update of a previously released Docker image, so I would not rule out
>> considerations requiring us to re-release the image under the same tag. A
>> scenario that is possible today is multiple pushes of a container image to
>> the docker repo before the Beam release is finalized, so early adopters may
>> be affected by stale images without a pull.
>>
>
> This is an interesting problem. It is true that adopters of RCs may get
> stuck with pre-release candidates of those images. Could we still docker pull
> only if the user is trying to use a default, released image tag?
>
>
>>
>>
>>> Note that so far I've just been discussing defaults. It's always
>>> possible to use a custom container using environment_config, as mentioned
>>> earlier.
>>>
>>
>> My understanding is that the pull-or-not decision applies equally to a
>> custom image provided by the environment config.
>>
>>
>>> The goal is to make that unnecessary for most everyday use cases and
>>> development. Using different container images for different transforms is a
>>> more specialized use case worth a separate discussion.
>>>
>>> On Wed, Nov 6, 2019 at 11:33 AM Valentyn Tymofieiev 
>>> wrote:
>>>
 Anyway, I agree with Thomas that implicitly running `docker pull` is
> confusing and requires some adjustments to work around. The user can always
> run `docker pull` themselves if that's the intention.


 I understand that implicit pull may come across as surprising. However,
 I see the required adjustments as a better practice. I would argue that
 customized container images should not reuse the same name:tag
 combination, and it would also help users avoid a situation where a
 runner may use a different container image in different execution
 environments.
 It may also help avoid issues where a user reports an issue with Beam
 that others cannot reproduce, only because the user was running a customized
 container on their local machine (and forgot about it).
 Also, if a user's pipeline relies on a container image released by
 Beam (or maybe a third party), external updates to such a container image
 may not propagate to the pipeline workflow without an explicit pull.

> > 1. Read sdk version from gradle.properties and use this as the
> default tag. Done with Python, need to implement it with Java and Go.
>
> 100% agree with this one. Using the same tag for local and release
> images has already caused a good deal of confusion. Filed BEAM-8570 and
> BEAM-8571 [2][3].
>
> > 2. Remove pulling images before executing docker run command. This
> should be fixed for Python, Java and Go.
>
> Valentyn (from [1]):
> > I think pulling the latest image for the current tag is actually a
> desired behavior, in case the external image was updated (due to a bug fix
> for example).
>
> There's a PR for this [4]. Once we fix the default tag for Java/Go
> containers, the dev and release containers will be distinct, which makes 
> it
> seldom important whether or not the image is `docker pull`ed. Anyway, I
> agree with Thomas that implicitly running `docker pull` is confusing and
> requires some adjustments to work around. The user can always run `docker
> pull` themselves if that's the intention.
>
> [1]
> https://lists.apache.org/thread.html/0f2ccbbe7969b91dc21ba331c1a30d730e268cc0355c1ac1ba0b7988@%3Cdev.beam.apache.org%3E
> [2] https://issues.apache.org/jira/browse/BEAM-8570
> [3] https://issues.apache.org/jira/browse/BEAM-8571
> [4] https://github.com/apache/beam/pull/9972
>
> On Wed, Oct 2, 2019 at 5:32 PM Ahmet Altay  wrote:
>
>> I do not believe this is a blocker for Beam 2.16. 
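
As a sketch of the versioned-tag scheme discussed in this thread: the
Python SDK derives its default container tag from the SDK version, so dev
builds can never collide with released tags. The image name below is
illustrative only.

from apache_beam import version

tag = version.__version__  # e.g. '2.16.0' for a release, '2.17.0.dev' on master
print('example/beam_python_sdk:%s' % tag)  # hypothetical repository/name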

Command for Beam worker on Spark cluster

2019-11-06 Thread Matthew K.
Hi all,

 

I am trying to run a *Python* Beam pipeline on a Spark cluster. Since workers are running on separate nodes, I am using "PROCESS" for "environment_type" in the pipeline options, but I couldn't find any documentation on what "command" I should pass to "environment_config" to run on the worker so that the executor can communicate with it.

 

Can someone help me on that?


Re: How to use a locally built worker image?

2019-11-06 Thread Ahmet Altay
On Wed, Nov 6, 2019 at 1:34 PM Valentyn Tymofieiev 
wrote:

> On Wed, Nov 6, 2019 at 11:48 AM Kyle Weaver  wrote:
>
>> The way the Python SDK currently does this is to use the version as the
>> default tag, e.g. 2.16.0, while master uses 2.16.0.dev. This means there
>> should never be any conflicts between a release and a developer image, unless
>> the user deliberately changes the image tags.
>>
>> > if a user's pipeline relies on a container image released by Beam (
>> or maybe a third party), external updates to such a container image may not
>> propagate to the pipeline workflow without an explicit pull
>>
>> There should only be one released container per release. Upgrades to a
>> container image should not happen independently of the release process.
>>
>
> Fair point, although we have not yet encountered issues requiring an
> update of a previously released Docker image, so I would not rule out
> considerations requiring us to re-release the image under the same tag. A
> scenario that is possible today is multiple pushes of a container image to
> the docker repo before the Beam release is finalized, so early adopters may
> be affected by stale images without a pull.
>

This is an interesting problem. It is true that adopters of RCs may get
stuck with pre-release candidates of those images. Could we still docker pull
only if the user is trying to use a default, released image tag?


>
>
>> Note that so far I've just been discussing defaults. It's always possible
>> to use a custom container using environment_config, as mentioned earlier.
>>
>
> My understanding is that the pull-or-not decision applies equally to a
> custom image provided by the environment config.
>
>
>> The goal is to make that unnecessary for most everyday use cases and
>> development. Using different container images for different transforms is a
>> more specialized use case worth a separate discussion.
>>
>> On Wed, Nov 6, 2019 at 11:33 AM Valentyn Tymofieiev 
>> wrote:
>>
>>> Anyway, I agree with Thomas that implicitly running `docker pull` is
 confusing and requires some adjustments to work around. The user can always
 run `docker pull` themselves if that's the intention.
>>>
>>>
>>> I understand that implicit pull may come across as surprising. However,
>>> I see the required adjustments as a better practice. I would argue that
>>> customized container images should not reuse the same name:tag
>>> combination, and it would also help users avoid a situation where a
>>> runner may use a different container image in different execution
>>> environments.
>>> It may also help avoid issues where a user reports an issue with Beam
>>> that others cannot reproduce, only because the user was running a customized
>>> container on their local machine (and forgot about it).
>>> Also, if a user's pipeline relies on a container image released by
>>> Beam (or maybe a third party), external updates to such a container image
>>> may not propagate to the pipeline workflow without an explicit pull.
>>>
 > 1. Read sdk version from gradle.properties and use this as the
 default tag. Done with Python, need to implement it with Java and Go.

 100% agree with this one. Using the same tag for local and release
 images has already caused a good deal of confusion. Filed BEAM-8570 and
 BEAM-8571 [2][3].

 > 2. Remove pulling images before executing docker run command. This
 should be fixed for Python, Java and Go.

 Valentyn (from [1]):
 > I think pulling the latest image for the current tag is actually a
 desired behavior, in case the external image was updated (due to a bug fix
 for example).

 There's a PR for this [4]. Once we fix the default tag for Java/Go
 containers, the dev and release containers will be distinct, which makes it
 seldom important whether or not the image is `docker pull`ed. Anyway, I
 agree with Thomas that implicitly running `docker pull` is confusing and
 requires some adjustments to work around. The user can always run `docker
 pull` themselves if that's the intention.

 [1]
 https://lists.apache.org/thread.html/0f2ccbbe7969b91dc21ba331c1a30d730e268cc0355c1ac1ba0b7988@%3Cdev.beam.apache.org%3E
 [2] https://issues.apache.org/jira/browse/BEAM-8570
 [3] https://issues.apache.org/jira/browse/BEAM-8571
 [4] https://github.com/apache/beam/pull/9972

 On Wed, Oct 2, 2019 at 5:32 PM Ahmet Altay  wrote:

> I do not believe this is a blocker for Beam 2.16. I agree that it
> would be good to fix this.
>
> On Wed, Oct 2, 2019 at 3:15 PM Hannah Jiang 
> wrote:
>
>> Hi Thomas
>>
>> Thanks for bringing this up.
>>
>> Now Python uses sdk version as a default tag, while Java and Go use
>> latest as a default tag. I agree using latest as a tag is problematic. 
>> The
>> reason only Python uses the sdk version as a default tag is that Python has
>> version.py so the version is easy 

Re: How to use a locally built worker image?

2019-11-06 Thread Valentyn Tymofieiev
On Wed, Nov 6, 2019 at 11:48 AM Kyle Weaver  wrote:

> The way the Python SDK currently does this is to use the version as the
> default tag, e.g. 2.16.0, while master uses 2.16.0.dev. This means there
> should never be any conflicts between a release and a developer image, unless
> the user deliberately changes the image tags.
>
> > if a user's pipeline relies on a container image released by Beam (
> or maybe a third party), external updates to such a container image may not
> propagate to the pipeline workflow without an explicit pull
>
> There should only be one released container per release. Upgrades to a
> container image should not happen independently of the release process.
>

Fair point, although we have not yet encountered issues requiring an
update of a previously released Docker image, so I would not rule out
considerations requiring us to re-release the image under the same tag. A
scenario that is possible today is multiple pushes of a container image to
the docker repo before the Beam release is finalized, so early adopters may
be affected by stale images without a pull.


> Note that so far I've just been discussing defaults. It's always possible
> to use a custom container using environment_config, as mentioned earlier.
>

My understanding is that the pull-or-not decision applies equally to a
custom image provided by the environment config.


> The goal is to make that unnecessary for most everyday use cases and
> development. Using different container images for different transforms is a
> more specialized use case worth a separate discussion.
>
> On Wed, Nov 6, 2019 at 11:33 AM Valentyn Tymofieiev 
> wrote:
>
>> Anyway, I agree with Thomas that implicitly running `docker pull` is
>>> confusing and requires some adjustments to work around. The user can always
>>> run `docker pull` themselves if that's the intention.
>>
>>
>> I understand that implicit pull may come across as surprising. However,
>> I see the required adjustments as a better practice. I would argue that
>> customized container images should not reuse the same name:tag
>> combination, and it would also help users avoid a situation where a
>> runner may use a different container image in different execution
>> environments.
>> It may also help avoid issues where a user reports an issue with Beam
>> that others cannot reproduce, only because the user was running a customized
>> container on their local machine (and forgot about it).
>> Also, if a user's pipeline relies on a container image released by
>> Beam (or maybe a third party), external updates to such a container image
>> may not propagate to the pipeline workflow without an explicit pull.
>>
>>> > 1. Read sdk version from gradle.properties and use this as the default
>>> tag. Done with Python, need to implement it with Java and Go.
>>>
>>> 100% agree with this one. Using the same tag for local and release
>>> images has already caused a good deal of confusion. Filed BEAM-8570 and
>>> BEAM-8571 [2][3].
>>>
>>> > 2. Remove pulling images before executing docker run command. This
>>> should be fixed for Python, Java and Go.
>>>
>>> Valentyn (from [1]):
>>> > I think pulling the latest image for the current tag is actually a
>>> desired behavior, in case the external image was updated (due to a bug fix
>>> for example).
>>>
>>> There's a PR for this [4]. Once we fix the default tag for Java/Go
>>> containers, the dev and release containers will be distinct, which makes it
>>> seldom important whether or not the image is `docker pull`ed. Anyway, I
>>> agree with Thomas that implicitly running `docker pull` is confusing and
>>> requires some adjustments to work around. The user can always run `docker
>>> pull` themselves if that's the intention.
>>>
>>> [1]
>>> https://lists.apache.org/thread.html/0f2ccbbe7969b91dc21ba331c1a30d730e268cc0355c1ac1ba0b7988@%3Cdev.beam.apache.org%3E
>>> [2] https://issues.apache.org/jira/browse/BEAM-8570
>>> [3] https://issues.apache.org/jira/browse/BEAM-8571
>>> [4] https://github.com/apache/beam/pull/9972
>>>
>>> On Wed, Oct 2, 2019 at 5:32 PM Ahmet Altay  wrote:
>>>
 I do not believe this is a blocker for Beam 2.16. I agree that it would
 be good to fix this.

 On Wed, Oct 2, 2019 at 3:15 PM Hannah Jiang 
 wrote:

> Hi Thomas
>
> Thanks for bringing this up.
>
> Now Python uses sdk version as a default tag, while Java and Go use
> latest as a default tag. I agree using latest as a tag is problematic. The
> reason only Python uses the sdk version as a default tag is that Python has
> version.py so the version is easy to read. For Java and Go, we need to 
> read
> it from gradle.properties when creating images with the default tag and
> when setting the default image.
>
> Here is what we need to do:
> 1. Read sdk version from gradle.properties and use this as the default
> tag. Done with Python, need to implement it with Java and Go.
> 2. Remove pulling images before 

Re: Contributing to Beam javadoc

2019-11-06 Thread Ismaël Mejía
Done, you can now self-assign issues too. Welcome, Jonathan!

On Wed, Nov 6, 2019 at 10:00 PM Jonathan Alvarez-Gutierrez
 wrote:
>
> Hey,
>
> I just filed https://issues.apache.org/jira/browse/BEAM-8573 and wanted to 
> create a PR with a fix.
>
> I should also check if there's an extant documentation / Splittable DoFn 
> project that would pre-empt or subsume my teeny documentation fix.
>
> If not, I'd like to assign the issue to jagthebeetle (myself).
>
> Best,
> Jonathan


Re: Contributor Permission for ThriftIO

2019-11-06 Thread Ismaël Mejía
Hello,

Welcome! You now have contributor permissions, and the ticket is assigned to you.
Sounds like a nice addition to have.

Best,
Ismaël

On Wed, Nov 6, 2019 at 6:07 PM Christopher Larsen
 wrote:
>
> Hello,
>
> I would like to contribute to the addition of ThriftIO. Would someone be able 
> to add me as a contributor? I have opened a ticket at BEAM-8561 and my Jira 
> ID is: clarsen.
>
> Best,
> Chris Larsen
>
> This message contains information that may be privileged or confidential and 
> is the property of the Quantiphi Inc and/or its affiliates. It is intended 
> only for the person to whom it is addressed. If you are not the intended 
> recipient, any review, dissemination, distribution, copying, storage or other 
> use of all or any portion of this message is strictly prohibited. If you 
> received this message in error, please immediately notify the sender by reply 
> e-mail and delete this message in its entirety


Contributing to Beam javadoc

2019-11-06 Thread Jonathan Alvarez-Gutierrez
Hey,

I just filed https://issues.apache.org/jira/browse/BEAM-8573 and wanted to
create a PR with a fix.

I should also check if there's an extant documentation / Splittable DoFn
project that would pre-empt or subsume my teeny documentation fix.

If not, I'd like to assign the issue to jagthebeetle (myself).

Best,
Jonathan


Deprecate some or all of TestPipelineOptions?

2019-11-06 Thread Brian Hulette
I recently came across TestPipelineOptions, and now I'm wondering if maybe
it should be deprecated. It only seems to actually be supported for Spark
and Dataflow (via TestSparkRunner and TestDataflowRunner), and I think it
may make more sense to move the functionality it provides into the tests
that need it.

TestPipelineOptions

currently has four attributes:

# TempRoot
Its purpose isn't documented, but many tests read TempRoot and use it to
set a TempLocation (example
).
I think this attribute makes sense (e.g. we can set TempRoot once and each
test has its own subdirectory), but I'm not sure. Can anyone confirm the
motivation for it? I'd like to at least add a docstring for it.

# OnCreateMatcher
A way to register a matcher that will be checked right after a pipeline has
started. It's never set except for in TestDataflowRunnerTest, so I think
this is absolutely safe to remove.

# OnSuccessMatcher
A way to register a matcher that will be checked right after a pipeline has
successfully completed. This is used in several tests (RequiresStableInputIT
,
WordCountIT
,
... 8 total occurrences), but I don't see why they couldn't all be replaced
with a `p.run().waitUntilFinish()`, followed by an assert.

I think the current approach is actually dangerous, because running these
tests with runners other than TestDataflowRunner or TestSparkRunner means
the matchers are never actually checked. This is actually how I came across
TestPipelineOptions - I tried running a test with the DirectRunner and
couldn't make it fail.

# TestTimeoutSeconds
Seems to just be a wrapper for `waitUntilFinish(duration)`, and only used
in one place
.
I think it would be cleaner for the test to be responsible for calling
waitUntilFinish (which we do elsewhere); the only drawback is it requires a
small refactor so the test has access to the PipelineResult object.


So I have a couple of questions for the community
1) Are there thoughts on TempRoot? Can we get rid of it?
2) Are there any objections to removing the other three attributes? Am I
missing something? Unless there are any objections I think I'll write a
patch to remove them.

Thanks,
Brian
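
The wait-then-assert replacement described above, sketched with the Python
SDK's API purely for illustration (the thread concerns the Java SDK, where
the call would be p.run().waitUntilFinish() followed by an assertion):

import apache_beam as beam
from apache_beam.runners.runner import PipelineState

p = beam.Pipeline()
_ = p | beam.Create([1, 2, 3]) | beam.Map(lambda x: x * x)

result = p.run()
result.wait_until_finish()                  # subsumes TestTimeoutSeconds
# A plain post-run assert replaces OnSuccessMatcher and runs on any runner.
assert result.state == PipelineState.DONE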


Re: How to use a locally built worker image?

2019-11-06 Thread Kyle Weaver
The way the Python SDK currently does this is to use the version as the
default tag, e.g. 2.16.0, while master uses 2.16.0.dev. This means there
should never be any conflicts between a release and a developer image, unless
the user deliberately changes the image tags.

> if a user's pipeline relies on a container image released by Beam (
or maybe a third party), external updates to such a container image may not
propagate to the pipeline workflow without an explicit pull

There should only be one released container per release. Upgrades to a
container image should not happen independently of the release process.

Note that so far I've just been discussing defaults. It's always possible
to use a custom container using environment_config, as mentioned earlier.
The goal is to make that unnecessary for most everyday use cases and
development. Using different container images for different transforms is a
more specialized use case worth a separate discussion.

On Wed, Nov 6, 2019 at 11:33 AM Valentyn Tymofieiev 
wrote:

> Anyway, I agree with Thomas that implicitly running `docker pull` is
>> confusing and requires some adjustments to work around. The user can always
>> run `docker pull` themselves if that's the intention.
>
>
> I understand that implicit pull may come across as surprising. However,
> I see the required adjustments as a better practice. I would argue that
> customized container images should not reuse the same name:tag
> combination, and it would also help users avoid a situation where a
> runner may use a different container image in different execution
> environments.
> It may also help avoid issues where a user reports an issue with Beam
> that others cannot reproduce, only because the user was running a customized
> container on their local machine (and forgot about it).
> Also, if a user's pipeline relies on a container image released by
> Beam (or maybe a third party), external updates to such a container image
> may not propagate to the pipeline workflow without an explicit pull.
>
>> > 1. Read sdk version from gradle.properties and use this as the default
>> tag. Done with Python, need to implement it with Java and Go.
>>
>> 100% agree with this one. Using the same tag for local and release images
>> has already caused a good deal of confusion. Filed BEAM-8570 and BEAM-8571
>> [2][3].
>>
>> > 2. Remove pulling images before executing docker run command. This
>> should be fixed for Python, Java and Go.
>>
>> Valentyn (from [1]):
>> > I think pulling the latest image for the current tag is actually a
>> desired behavior, in case the external image was updated (due to a bug fix
>> for example).
>>
>> There's a PR for this [4]. Once we fix the default tag for Java/Go
>> containers, the dev and release containers will be distinct, which makes it
>> seldom important whether or not the image is `docker pull`ed. Anyway, I
>> agree with Thomas that implicitly running `docker pull` is confusing and
>> requires some adjustments to work around. The user can always run `docker
>> pull` themselves if that's the intention.
>>
>> [1]
>> https://lists.apache.org/thread.html/0f2ccbbe7969b91dc21ba331c1a30d730e268cc0355c1ac1ba0b7988@%3Cdev.beam.apache.org%3E
>> [2] https://issues.apache.org/jira/browse/BEAM-8570
>> [3] https://issues.apache.org/jira/browse/BEAM-8571
>> [4] https://github.com/apache/beam/pull/9972
>>
>> On Wed, Oct 2, 2019 at 5:32 PM Ahmet Altay  wrote:
>>
>>> I do not believe this is a blocker for Beam 2.16. I agree that it would
>>> be good to fix this.
>>>
>>> On Wed, Oct 2, 2019 at 3:15 PM Hannah Jiang 
>>> wrote:
>>>
 Hi Thomas

 Thanks for bringing this up.

 Now Python uses sdk version as a default tag, while Java and Go use
 latest as a default tag. I agree using latest as a tag is problematic. The
 reason only Python uses the sdk version as a default tag is that Python has
 version.py so the version is easy to read. For Java and Go, we need to read
 it from gradle.properties when creating images with the default tag and
 when setting the default image.

 Here is what we need to do:
 1. Read sdk version from gradle.properties and use this as the default
 tag. Done with Python, need to implement it with Java and Go.
 2. Remove pulling images before executing docker run command. This
 should be fixed for Python, Java and Go.

 Is this a blocker for 2.16? If so and above are too much work for 2.16
 at the moment, we can hardcode the default tag for release branch for now.

 Using timestamp as a tag is an option as well, as long as runners know
 which timestamp they should use.

 Hannah

 On Wed, Oct 2, 2019 at 10:13 AM Alan Myrvold 
 wrote:

> Yes, using the latest tag is problematic and can lead to unexpected
> behavior.
> Using a date/time or 2.17.0.dev-$USER tag would be better. The
> validates container shell script uses a datetime
> 

Re: How to use a locally built worker image?

2019-11-06 Thread Thomas Weise
As a developer in Beam, I expect something that I built locally to be used.
That's the case with Java and Python dependencies as well.

These are -SNAPSHOT or .dev and distinct from releases. So perhaps this is
just a matter of correct version number management?

Users that consume our releases (version) should see the published Docker
images.

It would be great if that just works out of the box, w/o extra pipeline
options magic.

Thomas

On Wed, Nov 6, 2019 at 11:33 AM Valentyn Tymofieiev 
wrote:

> Anyway, I agree with Thomas that implicitly running `docker pull` is
>> confusing and requires some adjustments to work around. The user can always
>> run `docker pull` themselves if that's the intention.
>
>
> I understand that implicit pull may come across as surprising. However,
> I see the required adjustments as a better practice. I would argue that
> customized container images should not reuse the same name:tag
> combination, and it would also help users avoid a situation where a
> runner may use a different container image in different execution
> environments.
> It may also help avoid issues where a user reports an issue with Beam
> that others cannot reproduce, only because the user was running a customized
> container on their local machine (and forgot about it).
> Also, if a user's pipeline relies on a container image released by
> Beam (or maybe a third party), external updates to such a container image
> may not propagate to the pipeline workflow without an explicit pull. Always
> pulling the image may help ensure more deterministic behavior.
>
> On Wed, Nov 6, 2019 at 10:38 AM Kyle Weaver  wrote:
>
>> Bumping this thread from the other one [1].
>>
>> > 1. Read sdk version from gradle.properties and use this as the default
>> tag. Done with Python, need to implement it with Java and Go.
>>
>> 100% agree with this one. Using the same tag for local and release images
>> has already caused a good deal of confusion. Filed BEAM-8570 and BEAM-8571
>> [2][3].
>>
>> > 2. Remove pulling images before executing docker run command. This
>> should be fixed for Python, Java and Go.
>>
>> Valentyn (from [1]):
>> > I think pulling the latest image for the current tag is actually a
>> desired behavior, in case the external image was updated (due to a bug fix
>> for example).
>>
>> There's a PR for this [4]. Once we fix the default tag for Java/Go
>> containers, the dev and release containers will be distinct, which makes it
>> seldom important whether or not the image is `docker pull`ed. Anyway, I
>> agree with Thomas that implicitly running `docker pull` is confusing and
>> requires some adjustments to work around. The user can always run `docker
>> pull` themselves if that's the intention.
>>
>> [1]
>> https://lists.apache.org/thread.html/0f2ccbbe7969b91dc21ba331c1a30d730e268cc0355c1ac1ba0b7988@%3Cdev.beam.apache.org%3E
>> [2] https://issues.apache.org/jira/browse/BEAM-8570
>> [3] https://issues.apache.org/jira/browse/BEAM-8571
>> [4] https://github.com/apache/beam/pull/9972
>>
>> On Wed, Oct 2, 2019 at 5:32 PM Ahmet Altay  wrote:
>>
>>> I do not believe this is a blocker for Beam 2.16. I agree that it would
>>> be good to fix this.
>>>
>>> On Wed, Oct 2, 2019 at 3:15 PM Hannah Jiang 
>>> wrote:
>>>
 Hi Thomas

 Thanks for bringing this up.

 Now Python uses the sdk version as a default tag, while Java and Go use
 latest as a default tag. I agree using latest as a tag is problematic. The
 reason only Python uses the sdk version as a default tag is that Python has
 version.py, so the version is easy to read. For Java and Go, we need to read
 it from gradle.properties when creating images with the default tag and
 when setting the default image.

 Here is what we need to do:
 1. Read the sdk version from gradle.properties and use this as the default
 tag (see the sketch after this message). Done with Python, need to
 implement it with Java and Go.
 2. Remove pulling images before executing the docker run command. This
 should be fixed for Python, Java and Go.

 Is this a blocker for 2.16? If so, and the above is too much work for 2.16
 at the moment, we can hardcode the default tag for the release branch for now.

 Using timestamp as a tag is an option as well, as long as runners know
 which timestamp they should use.

 Hannah
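
A rough sketch of step 1 from Hannah's list above: deriving the default tag
from the SDK version in gradle.properties (a minimal sketch in Python; the
file location and the "version" property name are assumptions, and the real
fix would live in the Gradle build itself):

    def default_container_tag(path='gradle.properties'):
        # Parse simple key=value lines, skipping comments.
        with open(path) as f:
            props = dict(
                line.strip().split('=', 1)
                for line in f
                if '=' in line and not line.lstrip().startswith('#'))
        # e.g. '2.17.0.dev', which is distinct from any released tag,
        # unlike 'latest'.
        return props['version']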

 On Wed, Oct 2, 2019 at 10:13 AM Alan Myrvold 
 wrote:

> Yes, using the latest tag is problematic and can lead to unexpected
> behavior.
> Using a date/time or 2.17.0.dev-$USER tag would be better. The
> validates container shell script uses a datetime tag, which allows a
> unique name if no two tests are run in the same second.
>
> On Wed, Oct 2, 2019 at 10:05 AM Thomas Weise  wrote:
>
>> Want to bump this thread.
>>
>> If the current behavior is to 

Re: [DISCUSS] Avoid redundant encoding and decoding between runner and harness

2019-11-06 Thread Robert Bradshaw
Yes, the portability framework is designed to support this, and
possibly even more efficient transfers of data than element-by-element
as per the wire coder specified in the IO port operators. I left some
comments on the doc as well, and would also prefer approach 2.

On Wed, Nov 6, 2019 at 11:03 AM Kenneth Knowles  wrote:
>
> I think the portability framework is designed for this. The runner controls 
> the coder on the grpc ports and the runner controls the process bundle 
> descriptor.
>
> I commented on the doc. I think what is missing is an analysis of the scope
> of SDK harness changes and the risk to model consistency.
>
> Approach 2: probably no SDK harness work / compatible with existing Beam 
> model so no risk of introducing inconsistency
>
> Approach 1: what are all the details?
> option a: if the SDK harness has to understand "values without 
> windows" then very large changes and high risk of introducing inconsistency 
> (I eliminated many of these inconsistencies)
> option b: if the coder just puts default window/timestamp/pane info 
> on elements, then it is the same as approach 2, no work / no risk
>
> Kenn
>
> On Wed, Nov 6, 2019 at 1:09 AM jincheng sun  wrote:
>>
>> Hi all,
>>
>> I am trying to make some improvements to the portability framework to make
>> it usable in other projects. However, we find that the coder between runner
>> and harness can only be FullWindowedValueCoder. This means each time we
>> send a WindowedValue, we have to encode/decode the timestamp, windows and
>> pane info. In some circumstances (such as using the portability framework in
>> Flink), only the values are needed between runner and harness. So it would
>> be nice if we could configure the coder and avoid redundant encoding and
>> decoding between runner and harness to improve performance.
>>
>> There are two approaches to solve this issue:
>>
>> Approach 1:  Support ValueOnlyWindowedValueCoder between runner and 
>> harness.
>> Approach 2:  Add a "constant" window coder that embeds all the windowing 
>> information as part of the coder that should be used to wrap the value 
>> during decoding.
>>
>> More details can be found here [1].
>>
>> Because of the shortcomings of “Approach 2”, which still needs to
>> encode/decode the timestamp and pane info, we tend to choose “Approach 1”,
>> which brings better performance and is more thorough.
>>
>> Welcome any feedback :)
>>
>> Best,
>> Jincheng
>>
>> [1] 
>> https://docs.google.com/document/d/1TTKZC6ppVozG5zV5RiRKXse6qnJl-EsHGb_LkUfoLxY/edit?usp=sharing
>>


Re: How to use a locally built worker image?

2019-11-06 Thread Valentyn Tymofieiev
>
> Anyway, I agree with Thomas that implicitly running `docker pull` is
> confusing and requires some adjustments to work around. The user can always
> run `docker pull` themselves if that's the intention.


I understand that implicit pull may come across as surprising. However, I
see the required adjustments as a better practice. I would argue that
customized container images should not reuse the same name:tag
combination, and it would also help users avoid a situation where a
runner may use a different container image in different execution
environments.
It may also help avoid a situation where a user reports an issue with Beam
that others cannot reproduce, only because the user was running a customized
container on their local machine (and forgot about it).
Also, if a user's pipeline relies on a container image released by Beam
(or maybe a third party), external updates to such a container image may not
propagate to the pipeline workflow without an explicit pull. Always pulling
the image may help ensure more deterministic behavior.
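
To make that concrete, here is a minimal sketch (Python) of pinning a
pipeline to a uniquely tagged custom container, so an implicit `docker pull`
cannot silently swap it out. The image name and tag are made up, and the
flags are the portable runner's environment options:

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions([
        '--runner=PortableRunner',
        '--job_endpoint=localhost:8099',
        '--environment_type=DOCKER',
        # A unique name:tag, distinct from any released image.
        '--environment_config=example.com/my-beam-python:2.17.0.dev-myuser',
    ])

    with beam.Pipeline(options=options) as p:
        p | beam.Create([1, 2, 3]) | beam.Map(print)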

On Wed, Nov 6, 2019 at 10:38 AM Kyle Weaver  wrote:

> Bumping this thread from the other one [1].
>
> > 1. Read sdk version from gradle.properties and use this as the default
> tag. Done with Python, need to implement it with Java and Go.
>
> 100% agree with this one. Using the same tag for local and release images
> has already caused a good deal of confusion. Filed BEAM-8570 and BEAM-8571
> [2][3].
>
> > 2. Remove pulling images before executing docker run command. This
> should be fixed for Python, Java and Go.
>
> Valentyn (from [1]):
> > I think pulling the latest image for the current tag is actually a
> desired behavior, in case the external image was updated (due to a bug fix
> for example).
>
> There's a PR for this [4]. Once we fix the default tag for Java/Go
> containers, the dev and release containers will be distinct, which makes it
> seldom important whether or not the image is `docker pull`ed. Anyway, I
> agree with Thomas that implicitly running `docker pull` is confusing and
> requires some adjustments to work around. The user can always run `docker
> pull` themselves if that's the intention.
>
> [1]
> https://lists.apache.org/thread.html/0f2ccbbe7969b91dc21ba331c1a30d730e268cc0355c1ac1ba0b7988@%3Cdev.beam.apache.org%3E
> [2] https://issues.apache.org/jira/browse/BEAM-8570
> [3] https://issues.apache.org/jira/browse/BEAM-8571
> [4] https://github.com/apache/beam/pull/9972
>
> On Wed, Oct 2, 2019 at 5:32 PM Ahmet Altay  wrote:
>
>> I do not believe this is a blocker for Beam 2.16. I agree that it would
>> be good to fix this.
>>
>> On Wed, Oct 2, 2019 at 3:15 PM Hannah Jiang 
>> wrote:
>>
>>> Hi Thomas
>>>
>>> Thanks for bringing this up.
>>>
>>> Now Python uses the sdk version as a default tag, while Java and Go use
>>> latest as a default tag. I agree using latest as a tag is problematic. The
>>> reason only Python uses the sdk version as a default tag is that Python has
>>> version.py, so the version is easy to read. For Java and Go, we need to read
>>> it from gradle.properties when creating images with the default tag and
>>> when setting the default image.
>>>
>>> Here is what we need to do:
>>> 1. Read sdk version from gradle.properties and use this as the default
>>> tag. Done with Python, need to implement it with Java and Go.
>>> 2. Remove pulling images before executing docker run command. This
>>> should be fixed for Python, Java and Go.
>>>
>>> Is this a blocker for 2.16? If so, and the above is too much work for 2.16
>>> at the moment, we can hardcode the default tag for the release branch for now.
>>>
>>> Using timestamp as a tag is an option as well, as long as runners know
>>> which timestamp they should use.
>>>
>>> Hannah
>>>
>>> On Wed, Oct 2, 2019 at 10:13 AM Alan Myrvold 
>>> wrote:
>>>
 Yes, using the latest tag is problematic and can lead to unexpected
 behavior.
 Using a date/time or 2.17.0.dev-$USER tag would be better. The
 validates container shell script uses a datetime tag, which allows a
 unique name if no two tests are run in the same second.

 On Wed, Oct 2, 2019 at 10:05 AM Thomas Weise  wrote:

> Want to bump this thread.
>
> If the current behavior is to replace locally built image with the
> last published, then this is not only unexpected for developers but also
> problematic for the CI, where tests should run against what was built from
> source. Or am I missing something?
>
> Thanks,
> Thomas
>
>
> On Tue, Sep 24, 2019 at 7:08 PM Thomas Weise  wrote:
>
>> Hi Hannah,
>>
>> I believe this is unexpected from the developer perspective. When
>> building something locally, we do expect that to be used. We may need to
>> change to not pull when the image is available locally, at least when it 
>> is

[Discuss] Beam Summit 2020 Dates & locations

2019-11-06 Thread Griselda Cuevas
Hi Beam Community!

I'd like to kick off a thread to discuss potential dates and venues for the
2020 Beam Summits.

I did some research on industry conferences happening in 2020 and
pre-selected a few ranges as follows:

(2 days) NA between mid-May and mid-June
(2 days) EU mid October
(1 day) Asia Mini Summit:  March

I'd like to hear your thoughts on these dates and get consensus on exact
dates as the convo progresses.

For locations these are the options I reviewed:

NA: Austin Texas, Berkeley California, Mexico City
Europe: Warsaw, Barcelona, Paris
Asia: Singapore

Let the discussion begin!
G (on behalf of the Beam Summit Steering Committee)


Re: [EXTERNAL] Re: FirestoreIO connector [JavaSDK]

2019-11-06 Thread Chamikara Jayalath
BTW, FYI, I'm also talking with folks from the Google Firestore team
regarding this. I think they had shown some interest in taking this up, but
I'm not sure.
If they are able to contribute here, and if we can coordinate with some of
those folks on this effort, it will be great for the long term health and
maintenance of the connector.

Thanks,
Cham

On Wed, Nov 6, 2019 at 1:56 AM Stefan Djelekar 
wrote:

> Thanks for the valuable information.
>
>
>
> We’ve been looking at exactly that Datastore connector, but there are
> some differences, of course.
>
>
>
> I’ll make the PR in the next few days and let’s pick it up from there.
>
>
>
> Kind regards,
>
> Stefan
>
>
>
>
>
> *From:* Chamikara Jayalath 
> *Sent:* Tuesday, November 5, 2019 2:24 AM
> *To:* dev 
> *Subject:* [EXTERNAL] Re: FirestoreIO connector [JavaSDK]
>
>
>
> Thanks for the contribution. Happy to help with the review.
>
> Also, probably it'll be good to follow patterns employed by the existing
> Datastore connector when applicable:
> https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/datastore/DatastoreV1.java
>
>
>
> Thanks,
>
> Cham
>
>
>
> On Mon, Nov 4, 2019 at 2:37 AM Ismaël Mejía  wrote:
>
> Hello,
>
> Please open a PR with the code of the new connector rebased against
> the latest master.
>
> It is worth to take a look at Beam's contribution guide [1] and
> PTransform style guide [2] in advance.
> In any case we can guide you on any issue during the PR review so
> please go ahead.
>
> Regards,
> Ismaël
>
> [1] https://beam.apache.org/contribute/
> [2] https://beam.apache.org/contribute/ptransform-style-guide/
>
>
> On Mon, Nov 4, 2019 at 10:48 AM Stefan Djelekar
>  wrote:
> >
> > Hi beam devs,
> >
> >
> >
> > I'm Stefan from Redbox. We are a customer of GCP and we are in need of
> beam Java connector for Firestore.
> >
> >
> >
> > There is a pending JIRA item for this.
> https://issues.apache.org/jira/browse/BEAM-8376
> >
> >
> >
> > The team inside of the company has been working on this for a while and
> we would like to contribute!
> >
> > What is the best way to do this? Can we perhaps get a review from an
> experienced beam member?
> >
> > Let me know what do you think.
> >
> >
> >
> > All the best,
> >
> >
> >
> > Stefan Đelekar
> >
> > Software Engineer
> >
> > stefan.djele...@redbox.com
> >
> > djelekar.com
> >
> >
>
>


Re: [DISCUSS] Avoid redundant encoding and decoding between runner and harness

2019-11-06 Thread Kenneth Knowles
I think the portability framework is designed for this. The runner controls
the coder on the grpc ports and the runner controls the process bundle
descriptor.

I commented on the doc. I think what is missing is an analysis of the scope
of SDK harness changes and the risk to model consistency.

Approach 2: probably no SDK harness work / compatible with existing
Beam model so no risk of introducing inconsistency

Approach 1: what are all the details?
option a: if the SDK harness has to understand "values without
windows" then very large changes and high risk of introducing inconsistency
(I eliminated many of these inconsistencies)
option b: if the coder just puts default window/timestamp/pane info
on elements, then it is the same as approach 2, no work / no risk

Kenn

On Wed, Nov 6, 2019 at 1:09 AM jincheng sun 
wrote:

> Hi all,
>
> I am trying to make some improvements to the portability framework to make
> it usable in other projects. However, we find that the coder between runner
> and harness can only be FullWindowedValueCoder. This means each time we
> send a WindowedValue, we have to encode/decode the timestamp, windows and
> pane info. In some circumstances (such as using the portability framework in
> Flink), only the values are needed between runner and harness. So it would
> be nice if we could configure the coder and avoid redundant encoding and
> decoding between runner and harness to improve performance.
>
> There are two approaches to solve this issue:
>
> Approach 1:  Support ValueOnlyWindowedValueCoder between runner and
> harness.
> Approach 2:  Add a "constant" window coder that embeds all the
> windowing information as part of the coder that should be used to wrap the
> value during decoding.
>
> More details can be found here [1].
>
> Because of the shortcomings of “Approach 2”, which still needs to
> encode/decode the timestamp and pane info, we tend to choose “Approach 1”,
> which brings better performance and is more thorough.
>
> Welcome any feedback :)
>
> Best,
> Jincheng
>
> [1]
> https://docs.google.com/document/d/1TTKZC6ppVozG5zV5RiRKXse6qnJl-EsHGb_LkUfoLxY/edit?usp=sharing
>
>


Contributor Permission for ThriftIO

2019-11-06 Thread Christopher Larsen
Hello,

I would like to contribute to the addition of ThriftIO. Would someone be
able to add me as a contributor? I have opened a ticket at BEAM-8561 and
my Jira ID is: clarsen.

Best,
Chris Larsen



Re: Key encodings for state requests

2019-11-06 Thread Robert Bradshaw
On Wed, Nov 6, 2019 at 2:55 AM Maximilian Michels  wrote:
>
> Let me try to clarify:
>
> > The Coder used for State/Timers in a StatefulDoFn is pulled out of the
> > input PCollection. If a Runner needs to partition by this coder, it
> > should ensure the coder of this PCollection matches with the Coder
> > used to create the serialized bytes that are used for partitioning
> > (whether or not this is length-prefixed).
>
> That is essentially what I had assumed when I wrote the code. The
> problem is the coder can be "pulled out" in different ways.
>
> For example, let's say we have the following Proto PCollection coder
> with non-standard coder "CustomCoder" as the key coder:
>
>KvCoder<CustomCoder, VarIntCoder>
>
>  From the Runner side, this currently looks like the following:
>
>PCol: KvCoder<LengthPrefixCoder<bytes>, VarIntCoder>
>Key:  LengthPrefixCoder<bytes>

This is I think where the error is. If the proto references
KvCoder<CustomCoder, VarIntCoder> it should not be pulled out as
KvCoder<LengthPrefixCoder<bytes>, VarIntCoder>, as that
doesn't have the same encoding. Trying to instantiate such a coder
should be an error. Instead, the runner knows ahead of time that it
will need to instantiate this coder, and should update the bundle
processor to specify KvCoder<LengthPrefixCoder<CustomCoder>,
VarIntCoder> as the coder so both can pull it out in a consistent way.

When the coder is KvCoder<LengthPrefixCoder<CustomCoder>, VarIntCoder>,
instantiating it as KvCoder<LengthPrefixCoder<bytes>, VarIntCoder> on the
runner is of course OK as they do have the same encoding.

> At the SDK Harness, we have the coder available:
>
>PCol: KvCoder<CustomCoder, VarIntCoder>
>Key:  CustomCoder
>
> Currently, when the SDK Harness serializes a key for a state request,
> the custom coder may happen to add a length prefix, or it may not. It
> depends on the coder used. The correct behavior would be to use the same
> representation as on the Runner side.
>
> > Specifically, "We have no way of telling from the Runner side, if a length 
> > prefix has been used or not." seems false
>
> The Runner cannot inspect an unknown coder, it only has the opaque Proto
> information available which does not allow introspection of non-standard
> coders. With the current state, the Runner may think the coder adds a
> length prefix but the Python SDK worker could choose to add none. This
> produces an inconsistent key encoding. See above.

I think what's being conflated here is "the Coder has been wrapped in
a LengthPrefixCoder" vs. "the coder does length prefixing." These are
two orthogonal concepts. The runner in general only knows the former.
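
A small sketch of that distinction using the Python SDK's coders (the
printed bytes are illustrative, not normative):

    from apache_beam.coders import coders

    plain = coders.StrUtf8Coder()              # does not length-prefix itself
    wrapped = coders.LengthPrefixCoder(plain)  # runner-visible length prefix

    print(plain.encode('key'))    # b'key'
    print(wrapped.encode('key'))  # b'\x03key': var-int length, then payload

Whether `plain` happens to delimit its own output is an implementation
detail of the coder; the LengthPrefixCoder wrapping is what the runner can
see in the proto.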

> It looks like the key encoding for state requests on the Python SDK
> Harness side is broken. For transferring elements of a PCollection, the
> coders are obviously working correctly, but for encoding solely the key
> of an element, there is a consistency issue.
>
>
> -Max
>
> On 06.11.19 05:35, Kenneth Knowles wrote:
> > Specifically, "We have no way of telling from the Runner side, if a
> > length prefix has been used or not." seems false. The runner has all the
> > information since length prefix is a model coder. Didn't we agree that
> > all coders should be self-delimiting in runner/SDK interactions,
> > requiring length-prefix only when there is an opaque or dynamic-length
> > value? I assume you mean that at runtime the worker for a given engine
> > does not know?
> >
> > Kenn
> >
> > On Tue, Nov 5, 2019 at 3:19 PM Luke Cwik  wrote:
> >
> > +1 to what Robert said.
> >
> > On Tue, Nov 5, 2019 at 2:36 PM Robert Bradshaw  wrote:
> >
> > The Coder used for State/Timers in a StatefulDoFn is pulled out
> > of the
> > input PCollection. If a Runner needs to partition by this coder, it
> > should ensure the coder of this PCollection matches with the Coder
> > used to create the serialized bytes that are used for partitioning
> > (whether or not this is length-prefixed).
> >
> > Concretely, the graph looks like
> >
> >
> > Runner  SDK Harness
> >
> > WriteToGbk
> >  |
> > ReadFromGbk
> >  |
> > RunnerMapFn.processKeyValue(key, value)
> >  |
> >  WriteToDataChannel
> >  >
> >   ReadFromDataChannel
> > |
> > (pcIn)
> > |
> >MyStatefulDoFn.process(key, value)
> >
> > Now the (key part of the) Coder of pcIn, which comes from the proto
> > that the Runner sent to the SDK, must match the (key part of the)
> > encoding used in WriteToGbk and ReadFromGbk. If a LenthPrefix is
> > added
> > in one spot, it must be added in the other.
> >
> >
> > [1]
> > 
> > https://github.com/apache/beam/blob/release-2.17.0/sdks/python/apache_beam/runners/worker/bundle_processor.py#L1183
> >
> > On Tue, Nov 5, 2019 at 1:25 PM 

Re: published containers overwrite locally built containers

2019-11-06 Thread Valentyn Tymofieiev
I agree with the resolutions in the link Thomas mentioned [1]. Using the
latest tag is not reliable, and a unique tag ID should be generated when
running tests on Jenkins against the master branch.
I think pulling the latest image for the current tag is actually a desired
behavior, in case the external image was updated (due to a bug fix for
example). Our custom container documentation should reflect this behavior.
Consider continuing the conversation in [1] to keep it in one place if
there are other suggestions/opinions.

[1]
https://lists.apache.org/thread.html/07131e314e229ec60100eaa2c0cf6dfc206bf2b0f78c3cee9ebb0bda@%3Cdev.beam.apache.org%3E
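
For illustration, a tiny sketch (Python) of generating such a unique tag ID
per test run, along the lines of the date/time and $USER tags mentioned in
this discussion (the exact tag format here is an assumption):

    import datetime
    import getpass

    # e.g. '2.17.0.dev-jenkins-20191106-153045': unique per user and run.
    tag = '2.17.0.dev-%s-%s' % (
        getpass.getuser(),
        datetime.datetime.utcnow().strftime('%Y%m%d-%H%M%S'))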


On Fri, Nov 1, 2019 at 5:14 PM Thomas Weise  wrote:

> More here:
> https://lists.apache.org/thread.html/07131e314e229ec60100eaa2c0cf6dfc206bf2b0f78c3cee9ebb0bda@%3Cdev.beam.apache.org%3E
>
>
> On Fri, Nov 1, 2019 at 10:56 AM Chamikara Jayalath 
> wrote:
>
>> I think it makes sense to override published docker images with locally
>> built versions when testing HEAD.
>>
>> Thanks,
>> Cham
>>
>> On Thu, Oct 31, 2019 at 6:31 PM Heejong Lee  wrote:
>>
>>> Hi, happy halloween!
>>>
>>> I'm looking into failing cross language post commit tests:
>>> https://issues.apache.org/jira/browse/BEAM-8534
>>> 
>>>
>>> After a few runs, I've found that published SDK harness containers
>>> overwrite locally built containers when a docker pull happens. I can think
>>> of two possible solutions here: 1) remove the published images with the
>>> latest tag, so the locally built image with the latest tag remains
>>> available for testing and development purposes; 2) add a serialVersionUID
>>> to the class printing out the error.
>>>
>>> 2) doesn't sound like a fundamental solution if we're not going to
>>> attach serialVersionUID to all serializable classes. 1) might work but I'm
>>> not sure whether there's another use for the latest tag somewhere. Any
>>> better ideas?
>>>
>>> Thanks,
>>> Heejong
>>>
>>


Re: Jenkins workflow improvement question

2019-11-06 Thread Łukasz Gajowy
To me, any way of changing the seed job so that it does not create one
global configuration of all jobs and instead creates it on a per-branch
basis would be a solid improvement to our CI, so +1 if this is achievable
without losing currently used Jenkins features.
Łukasz

śr., 6 lis 2019 o 01:12 Alan Myrvold  napisał(a):

> It would be a nice improvement, provided (as Luke pointed out) the same
> functionality exists for triggering jobs on pull request, review comment,
> postcommit, or cron and filtering jobs based on paths of files modified.
>
> On Tue, Nov 5, 2019 at 3:49 PM Luke Cwik  wrote:
>
>> This would be great.
>>
>> I believe someone tried this in the past, but it turned out the plugin
>> they were using didn't support the same github hooks, so we couldn't
>> have PRs start Jenkins jobs automatically, and it may also not have
>> supported a path regex that allowed us to filter out Jenkins jobs from
>> running on all PRs.
>>
>> On Tue, Nov 5, 2019 at 6:13 AM Michał Walenia 
>> wrote:
>>
>>> Hi all,
>>>
>>> As those of you that work on Jenkins jobs know, they can be a pain to
>>> work with. Even simple changes are painful to run in the PR because of the
>>> seed job - it reloads all the jobs sequentially and runs for over 10
>>> minutes. If someone else runs it against another branch - tough luck, you
>>> need to retrigger it, because your configuration changes are lost.
>>>
>>> The reason it looks like that is Jenkins and its architecture. All job
>>> configurations are stored on the Jenkins master node in the form of XML job
>>> definitions. Job DSL is just a plugin which enables translation of Groovy
>>> scripts to XMLs.
>>>
>>> I looked into improving the situation and it seems that a good solution
>>> would be to move from Job DSL-defined Jenkins jobs to Jenkinsfiles. The
>>> move would make starting the seed job for every change in job configuration
>>> unnecessary, making the process much faster and more flexible for people
>>> working on the tests simultaneously.
>>>
>>> Jenkinsfile-based jobs are, in essence, very similar to conventional
>>> Jenkins jobs - the configuration is still stored as XML in the Jenkins
>>> master, but the main thing stored there is the path to a Jenkinsfile. This
>>> way, by using a plugin, it’s possible to interpret a Jenkinsfile on the fly
>>> without having the exact job step sequence stored on Jenkins master. Thus
>>> it’s possible to run a pipeline job with branch name as parameter and load
>>> a different Jenkinsfile for each branch the job is run against.
>>>
>>> Since it is possible to define a Jenkinsfile-based pipeline job in Job
>>> DSL, we could keep the seed job and use it only to create and/or deactivate
>>> jobs as needed not to reload every, even tiny change in .jenkins directory.
>>> Actual task configurations and job steps would be defined in Jenkinsfiles,
>>> which would be read on a per-branch basis.
>>>
>>> I’m well aware that migrating all the Jenkins jobs (88 files worth of
>>> jobs) would be a huge task, but I believe that it would make future
>>> modifications of jobs much easier, as well as reduce the bottleneck of job
>>> modification testing - the seed job. It’s also probable that
>>> Jenkinsfile-based job configs would be testable, which could further
>>> improve the job creation and modification process.
>>>
>>> What do you think about this?
>>>
>>>
>>> Have a good day,
>>>
>>>
>>> Michal
>>>
>>> --
>>>
>>> Michał Walenia
>>> Polidea  | Software Engineer
>>>
>>> M: +48 791 432 002 <+48791432002>
>>> E: michal.wale...@polidea.com
>>>
>>> Unique Tech
>>> Check out our projects! 
>>>
>>


Re: 10,000 Pull Requests

2019-11-06 Thread Łukasz Gajowy
Yay! Nice! :)

śr., 6 lis 2019 o 14:38 Maximilian Michels  napisał(a):

> Just wanted to point out, we have crossed the 10,000 PRs mark :)
>
> ...and the winner is: https://github.com/apache/beam/pull/1
>
> Seriously, I think Beam's culture to promote PRs over direct access to
> the repository is remarkable. To another 10,000 PRs!
>
> Cheers,
> Max
>


10,000 Pull Requests

2019-11-06 Thread Maximilian Michels

Just wanted to point out, we have crossed the 10,000 PRs mark :)

...and the winner is: https://github.com/apache/beam/pull/1

Seriously, I think Beam's culture to promote PRs over direct access to 
the repository is remarkable. To another 10,000 PRs!


Cheers,
Max


Re: Beam EventHubs Java connector

2019-11-06 Thread Ismaël Mejía
Hello,

We are definitely interested in this contribution. The license of the Azure
Event Hub library is MIT, so there are no issues with including it in the IO
connector. You just have to take a look at the way we write IOs in Beam and
wrap it accordingly.

Please create a JIRA and assign it to yourself; also try to share your
implementation early on for feedback (even if unfinished). Remember that you
can start with the write (sending messages) part first, which is probably easier.
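
For orientation, a minimal sketch of the general shape such a write
transform usually takes, per the PTransform style guide (shown in Python for
brevity, although the connector discussed here would be Java; EventHubsClient
is a hypothetical stand-in for the Azure SDK client):

    import apache_beam as beam

    class WriteToEventHubs(beam.PTransform):
        def __init__(self, connection_string, hub_name):
            self._connection_string = connection_string
            self._hub_name = hub_name

        def expand(self, pcoll):
            return pcoll | beam.ParDo(
                _WriteFn(self._connection_string, self._hub_name))

    class _WriteFn(beam.DoFn):
        def __init__(self, connection_string, hub_name):
            self._connection_string = connection_string
            self._hub_name = hub_name

        def setup(self):
            # Hypothetical client; a real connector wraps the Azure library.
            self._client = EventHubsClient(
                self._connection_string, self._hub_name)

        def process(self, element):
            self._client.send(element)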

Regards,
Ismaël

[1] https://github.com/Azure/azure-event-hubs-java/blob/dev/LICENSE


On Wed, Nov 6, 2019 at 12:07 PM Jonathan Perron
 wrote:
>
> Dear all,
>
> I will soon need to plug an Apache Beam pipeline into an Azure EventHubs
> service. I have not seen references to such a connector yet, so I would
> like to know if my code would be of interest to add to a future Apache
> Beam release. I have already seen that a Java library is available
> (https://docs.microsoft.com/fr-fr/azure/event-hubs/event-hubs-java-get-started-send),
> could it be used as a base to write the connector, or should I start
> from scratch?
>
> Best regards,
>
> Jonathan
>


Beam EventHubs Java connector

2019-11-06 Thread Jonathan Perron

Dear all,

I will soon need to plug an Apache Beam pipeline into an Azure EventHubs
service. I have not seen references to such a connector yet, so I would
like to know if my code would be of interest to add to a future Apache
Beam release. I have already seen that a Java library is available
(https://docs.microsoft.com/fr-fr/azure/event-hubs/event-hubs-java-get-started-send),
could it be used as a base to write the connector, or should I start
from scratch?


Best regards,

Jonathan



Re: Key encodings for state requests

2019-11-06 Thread Maximilian Michels

Let me try to clarify:


The Coder used for State/Timers in a StatefulDoFn is pulled out of the
input PCollection. If a Runner needs to partition by this coder, it
should ensure the coder of this PCollection matches with the Coder
used to create the serialized bytes that are used for partitioning
(whether or not this is length-prefixed).


That is essentially what I had assumed when I wrote the code. The 
problem is the coder can be "pulled out" in different ways.


For example, let's say we have the following Proto PCollection coder 
with non-standard coder "CustomCoder" as the key coder:


  KvCoder<CustomCoder, VarIntCoder>

From the Runner side, this currently looks like the following:

  PCol: KvCoder<LengthPrefixCoder<bytes>, VarIntCoder>
  Key:  LengthPrefixCoder<bytes>

At the SDK Harness, we have the coder available:

  PCol: KvCoder<CustomCoder, VarIntCoder>
  Key:  CustomCoder

Currently, when the SDK Harness serializes a key for a state request, 
the custom coder may happen to add a length prefix, or it may not. It 
depends on the coder used. The correct behavior would be to use the same 
representation as on the Runner side.
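
To illustrate the kind of mismatch this causes, a hedged sketch with the
Python SDK's coders (CustomCoder here is a made-up stand-in for any
non-standard coder):

    from apache_beam.coders import coders

    class CustomCoder(coders.Coder):  # stand-in for any non-standard coder
        def encode(self, value):
            return value.encode('utf-8')
        def decode(self, encoded):
            return encoded.decode('utf-8')
        def is_deterministic(self):
            return True

    key = 'user-42'
    # What the Runner partitions on: opaque bytes behind a length prefix.
    runner_bytes = coders.LengthPrefixCoder(coders.BytesCoder()).encode(
        CustomCoder().encode(key))
    # What the SDK harness sends if it uses the bare custom coder.
    sdk_bytes = CustomCoder().encode(key)
    assert runner_bytes != sdk_bytes  # b'\x07user-42' vs b'user-42'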



Specifically, "We have no way of telling from the Runner side, if a length prefix 
has been used or not." seems false


The Runner cannot inspect an unknown coder, it only has the opaque Proto 
information available which does not allow introspection of non-standard 
coders. With the current state, the Runner may think the coder adds a 
length prefix but the Python SDK worker could choose to add none. This 
produces an inconsistent key encoding. See above.



It looks like the key encoding for state requests on the Python SDK 
Harness side is broken. For transferring elements of a PCollection, the 
coders are obviously working correctly, but for encoding solely the key 
of an element, there is a consistency issue.



-Max

On 06.11.19 05:35, Kenneth Knowles wrote:
Specifically, "We have no way of telling from the Runner side, if a 
length prefix has been used or not." seems false. The runner has all the 
information since length prefix is a model coder. Didn't we agree that 
all coders should be self-delimiting in runner/SDK interactions, 
requiring length-prefix only when there is an opaque or dynamic-length 
value? I assume you mean that at runtime the worker for a given engine 
does not know?


Kenn

On Tue, Nov 5, 2019 at 3:19 PM Luke Cwik  wrote:


+1 to what Robert said.

On Tue, Nov 5, 2019 at 2:36 PM Robert Bradshaw  wrote:

The Coder used for State/Timers in a StatefulDoFn is pulled out
of the
input PCollection. If a Runner needs to partition by this coder, it
should ensure the coder of this PCollection matches with the Coder
used to create the serialized bytes that are used for partitioning
(whether or not this is length-prefixed).

Concretely, the graph looks like


Runner                          SDK Harness

WriteToGbk
     |
ReadFromGbk
     |
RunnerMapFn.processKeyValue(key, value)
     |
     WriteToDataChannel
             >
                  ReadFromDataChannel
                                |
                            (pcIn)
                                |
                   MyStatefulDoFn.process(key, value)

Now the (key part of the) Coder of pcIn, which comes from the proto
that the Runner sent to the SDK, must match the (key part of the)
encoding used in WriteToGbk and ReadFromGbk. If a LenthPrefix is
added
in one spot, it must be added in the other.


[1]

https://github.com/apache/beam/blob/release-2.17.0/sdks/python/apache_beam/runners/worker/bundle_processor.py#L1183

On Tue, Nov 5, 2019 at 1:25 PM Maximilian Michels  wrote:
 >
 > Hi,
 >
 > I wanted to get your opinion on something that I have been
struggling
 > with. It is about the coders for state requests in portable
pipelines.
 >
 > In contrast to "classic" Beam, the Runner is not guaranteed
to know
 > which coder is used by the SDK. If the SDK happens to use a
standard
 > coder (also known as model coder), we will also have it
available at the
 > Runner, i.e. if the Runner is written in one of the SDK
languages (e.g.
 > Java). However, when we do not have a standard coder, we just
treat the
 > data from the SDK as a blob and just pass it around as bytes.
 >
 > Problem
 > ===
 >
 > In the case of state requests which the SDK Harness authors
to the
 > Runner, we would like for the key associated with the state
request to
 > match the key of the element which led to initiating the
state 

RE: [EXTERNAL] Re: FirestoreIO connector [JavaSDK]

2019-11-06 Thread Stefan Djelekar
Thanks for the valuable information.

We’ve been looking at exactly that Datastore connector, but there are some
differences, of course.

I’ll make the PR in the next few days and let’s pick it up from there.

Kind regards,
Stefan


From: Chamikara Jayalath 
Sent: Tuesday, November 5, 2019 2:24 AM
To: dev 
Subject: [EXTERNAL] Re: FirestoreIO connector [JavaSDK]

Thanks for the contribution. Happy to help with the review.
Also, probably it'll be good to follow patterns employed by the existing 
Datastore connector when applicable: 
https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/datastore/DatastoreV1.java

Thanks,
Cham

On Mon, Nov 4, 2019 at 2:37 AM Ismaël Mejía  wrote:
Hello,

Please open a PR with the code of the new connector rebased against
the latest master.

It is worth to take a look at Beam's contribution guide [1] and
PTransform style guide [2] in advance.
In any case we can guide you on any issue during the PR review so
please go ahead.

Regards,
Ismaël

[1] https://beam.apache.org/contribute/
[2] 
https://beam.apache.org/contribute/ptransform-style-guide/


On Mon, Nov 4, 2019 at 10:48 AM Stefan Djelekar  wrote:
>
> Hi beam devs,
>
>
>
> I'm Stefan from Redbox. We are a customer of GCP and we are in need of beam 
> Java connector for Firestore.
>
>
>
> There is a pending JIRA item for this. 
> https://issues.apache.org/jira/browse/BEAM-8376
>
>
>
> The team inside of the company has been working on this for a while and we 
> would like to contribute!
>
> What is the best way to do this? Can we perhaps get a review from an 
> experienced beam member?
>
> Let me know what do you think.
>
>
>
> All the best,
>
>
>
> Stefan Đelekar
>
> Software Engineer
>
> stefan.djele...@redbox.com
>
> djelekar.com
>
>


[DISCUSS] Avoid redundant encoding and decoding between runner and harness

2019-11-06 Thread jincheng sun
Hi all,

I am trying to make some improvements to the portability framework to make
it usable in other projects. However, we find that the coder between runner
and harness can only be FullWindowedValueCoder. This means each time we
send a WindowedValue, we have to encode/decode the timestamp, windows and
pane info. In some circumstances (such as using the portability framework in
Flink), only the values are needed between runner and harness. So it would be
nice if we could configure the coder and avoid redundant encoding and
decoding between runner and harness to improve performance.
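
As a rough illustration of the per-element overhead (using the Python SDK's
coders; the exact byte counts depend on the window coder and SDK version):

    from apache_beam.coders import coders
    from apache_beam.transforms.window import GlobalWindow
    from apache_beam.utils.windowed_value import WindowedValue

    value_coder = coders.VarIntCoder()
    full_coder = coders.WindowedValueCoder(value_coder)

    wv = WindowedValue(42, 0, (GlobalWindow(),))
    full_bytes = full_coder.encode(wv)    # value + timestamp + windows + pane
    value_bytes = value_coder.encode(42)  # what a value-only coder would send
    print(len(full_bytes), len(value_bytes))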

There are two approaches to solve this issue:

Approach 1:  Support ValueOnlyWindowedValueCoder between runner and
harness.
Approach 2:  Add a "constant" window coder that embeds all the
windowing information as part of the coder that should be used to wrap the
value during decoding.

More details can be found here [1].

Because of the shortcomings of “Approach 2”, which still needs to
encode/decode the timestamp and pane info, we tend to choose “Approach 1”,
which brings better performance and is more thorough.

Welcome any feedback :)

Best,
Jincheng

[1]
https://docs.google.com/document/d/1TTKZC6ppVozG5zV5RiRKXse6qnJl-EsHGb_LkUfoLxY/edit?usp=sharing