from:"Maximilian Michels"

Hi everyone,

I wanted to get your opinion on the Job-Server startup [1] which is part
of the portability story.

I've created a docker container to bring up Beam's Job Server, which is
the entry point for pipeline execution. Generally, this works fine when
the backend (Flink in this case) runs externally and the Job Server
connects to it.

For tests or pipeline development we may want the backend to run
embedded (inside the Job Server) which is rather problematic because the
portability requires to spin up the SDK harness in a Docker container as
well. This would happen at runtime inside the Docker container.

Since Docker inside Docker is not desirable I'm thinking about other
options:

Option 1) Instead of a Docker container, we start a bundled Job-Server
binary (or jar) when we run the pipeline. The bundle also contains an
embedded variant of the backend. For Flink, this is basically the output
of `:beam-runners-flink_2.11-job-server:shadowJar` but it is started
during pipeline execution.

Option 2) In addition to the Job Server, we let the SDK spin up another
Docker container with the backend. This is may be most applicable to all
types of backends since not all backends offer an embedded execution mode.


Keep in mind that this is only a problem for local/test execution but it
is an important aspect of Beam's usability.

What do you think? I'm leaning towards option 2. Maybe you have other
options in mind.

Cheers,
Max

[1] https://issues.apache.org/jira/browse/BEAM-4130

Re: Beam application upgrade on Flink crashes

AFAIK the serializer used here is the CoderTypeSerializer which may not
be recoverable because of changes to the contained Coder
(TaggedKvCoder). It doesn't currently have a serialVersionUID, so even
small changes could break serialization backwards-compatibility.

As of now Beam doesn't offer the same upgrade guarantees as Flink [1].
This should be improved for the next release.

Thanks,
Max

[1]
https://ci.apache.org/projects/flink/flink-docs-stable/ops/upgrading.html#compatibility-table

On 20.08.18 17:46, Stephan Ewen wrote:
> Hi Jozef!
> 
> When restoring state, the serializer that created the state must still
> be available, so the state can be read.
> It looks like some serializer classes were removed between Beam versions
> (or changed in an incompatible manner).
> 
> Backwards compatibility of an operator implementation needs cooperation
> from the operator. Withing Flink itself, when we change the way an
> operator uses state, we keep the old codepath and classes in a
> "backwards compatibility restore" that takes the old state and brings it
> into the shape of the new state. 
> 
> I am not deeply into the of how Beam and the Flink runner implement
> their use of state, but it looks this part is not present, which could
> mean that savepoints taken from Beam applications are not backwards
> compatible.
> 
> 
> On Mon, Aug 20, 2018 at 4:03 PM, Jozef Vilcek  > wrote:
> 
> Hello,
> 
> I am attempting to upgrade  Beam app from 2.5.0 running on Flink
> 1.4.0 to Beam 2.6.0 running on Flink 1.5.0. I am not aware of any
> state migration changes needed for Flink 1.4.0 -> 1.5.0 so I am just
> starting a new App with updated libs from Flink save-point captured
> by previous version of the app.
> 
> There is not change in topology. Job is accepted without error to
> the new cluster which suggests that all operators are matched with
> state based on IDs. However, app runs only few seccons and then
> crash with:
> 
> java.lang.Exception: Exception while creating StreamOperatorStateContext.
>   at 
> org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:191)
>   at 
> org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:227)
>   at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:730)
>   at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:295)
>   at org.apache.flink.runtime.taskmanager.Task.run(Task.java:703)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.flink.util.FlinkException: Could not restore 
> operator state backend for 
> DoFnOperator_43996aa2908fa46bb50160f751f8cc09_(1/100) from any of the 1 
> provided restore options.
>   at 
> org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:137)
>   at 
> org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.operatorStateBackend(StreamTaskStateInitializerImpl.java:240)
>   at 
> org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:139)
>   ... 5 more
> Caused by: java.io.IOException: Unable to restore operator state 
> [bundle-buffer-tag]. The previous serializer of the operator state must be 
> present; the serializer could have been removed from the classpath, or its 
> implementation have changed and could not be loaded. This is a temporary 
> restriction that will be fixed in future versions.
>   at 
> org.apache.flink.runtime.state.DefaultOperatorStateBackend.restore(DefaultOperatorStateBackend.java:514)
>   at 
> org.apache.flink.runtime.state.DefaultOperatorStateBackend.restore(DefaultOperatorStateBackend.java:63)
>   at 
> org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:151)
>   at 
> org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:123)
>   ... 7 more
> 
> 
> Does this mean anything to anyone? Am I doing anything wrong or did
> FlinkRunner change in some way? The mentioned "bundle-buffer-tag"
> seems to be too deep internal in runner for my reach.
> 
> Any help is much appreciated.
> 
> Best,
> Jozo
> 
>

Re: Status of IntelliJ with Gradle

Thank you Etienne for opening the issue.

Anyone else having problems with the shaded Protobuf dependency?

On 20.08.18 16:14, Etienne Chauchot wrote:
> Hi Max,
> 
> I experienced the same, I had first opened a general ticket
> (https://issues.apache.org/jira/browse/BEAM-4418) about gradle
> improvements and I just split it in several tickets. Here is the one
> concerning the same issue: https://issues.apache.org/jira/browse/BEAM-5176
> 
> Etienne
> 
> Le lundi 20 août 2018 à 15:51 +0200, Maximilian Michels a écrit :
>> Hi Beamers,
>>
>> It's great to see the Beam build system overhauled. Thank you for all
>> the hard work.
>>
>> That said, I've just started contributing to Beam again and I feel
>> really stupid for not having a fully-functional IDE. I've closely
>> followed the IntelliJ/Gradle instructions [1]. In the terminal
>> everything works fine.
>>
>> First of all, I get warnings like the following and the build fails:
>>
>> 
>> .../beam/sdks/java/core/src/main/java/org/apache/beam/sdk/package-info.java:29:
>> warning: [deprecation] NonNull in edu.umd.cs.findbugs.annotations has
>> been deprecated
>> @DefaultAnnotation(NonNull.class)
>>^
>> error: warnings found and -Werror specified
>> 1 error
>> 89 warnings
>> =
>>
>> Somehow the "-Xlint:-deprecation" compiler flag does not get through but
>> "-Werror" does. I can get it to compile by removing the "-Werror" flag
>> from BeamModulePlugin but that's obviously not the solution.
>>
>> Further, once the build succeeds I still have to add the relocated
>> Protobuf library manually because the one in "vendor" does not get
>> picked up. I've tried clearing caches / re-adding the project /
>> upgrading IntelliJ / changing Gradle configs.
>>
>>
>> Is this just me or do you also have similar problems? If so, I would
>> like to compile a list of possible fixes to help other contributors.
>>
>>
>> Thanks,
>> Max
>>
>>
>> Tested with
>> - IntelliJ 2018.1.6 and 2018.2.1.
>> - MacOS
>> - java version "1.8.0_112"
>>
>> [1] https://beam.apache.org/contribute/intellij/
>>
>>

Re: Metrics architecture inside the runners

2018-08-17 Thread Maximilian Michels

Filed one: https://issues.apache.org/jira/browse/BEAM-5162

On 17.08.18 09:30, Etienne Chauchot wrote:
> Hi,
> I did not plan it but we could. Indeed, When I did the wiki page I
> searched for user doc and I found only the javadoc and the design doc
> from Ben Chambers. Maybe a page on the regular website would be needed.
> You're right, please fill a jira.
> 
> Etienne
> 
> Le jeudi 16 août 2018 à 18:24 +0200, Maximilian Michels a écrit :
>> Hi Etienne,
>>
>> Great overview. Thank you!
>>
>> When do we plan to document Metrics for users? Perhaps I should open a
>> JIRA issue.
>>
>> Cheers,
>> Max
>>
>> On 16.08.18 12:22, Etienne Chauchot wrote:
>> Hi folks !
>>
>> I've created a page in the new Beam wiki for contributors:
>>
>> https://cwiki.apache.org/confluence/display/BEAM/Metrics+architecture+inside+the+runners
>>
>> Can you please comment it ?
>>
>> Best
>> Etienne

Re: Discussion: Scheduling across runner and SDKHarness in Portability framework

2018-08-17 Thread Maximilian Michels

Hi Ankur,

Thanks for looking into this problem. The cause seems to be Flink's
pipelined execution mode. It runs multiple tasks in one task slot and
produces a deadlock when the pipelined operators schedule the SDK
harness DoFns in non-topological order.

The problem would be resolved if we scheduled the tasks in topological
order. Doing that is not easy because they run in separate Flink
operators and the SDK Harness would have to have insight into the
execution graph (which is not desirable).

The easiest method, which you proposed in 1) is to ensure that the
number of threads in the SDK harness matches the number of
ExecutableStage DoFn operators.

The approach in 2) is what Flink does as well. It glues together
horizontal parts of the execution graph, also in multiple threads. So I
agree with your proposed solution.

Best,
Max

On 17.08.18 03:10, Ankur Goenka wrote:
> Hi,
> 
> tl;dr Dead Lock in task execution caused by limited task parallelism on
> SDKHarness.
> 
> *Setup:*
> 
>   * Job type: /*Beam Portable Python Batch*/ Job on Flink standalone
> cluster.
>   * Only a single job is scheduled on the cluster.
>   * Everything is running on a single machine with single Flink task
> manager.
>   * Flink Task Manager Slots is 1.
>   * Flink Parallelism is 1.
>   * Python SDKHarness has 1 thread.
> 
> *Example pipeline:*
> Read -> MapA -> GroupBy -> MapB -> WriteToSink
> 
> *Issue:*
> With multi stage job, Flink schedule different dependent sub tasks
> concurrently on Flink worker as long as it can get slots. Each map tasks
> are then executed on SDKHarness.
> Its possible that MapB gets to SDKHarness before MapA and hence gets
> into the execution queue before MapA. Because we only have 1 execution
> thread on SDKHarness, MapA will never get a chance to execute as MapB
> will never release the execution thread. MapB will wait for input from
> MapA. This gets us to a dead lock in a simple pipeline.
> 
> *Mitigation:*
> Set worker_count in pipeline options more than the expected sub tasks
> in pipeline.
> 
> *Proposal:*
> 
>  1. We can get the maximum concurrency from the runner and make sure
> that we have more threads than max concurrency. This approach
> assumes that Beam has insight into runner execution plan and can
> make decision based on it.
>  2. We dynamically create thread and cache them with a high upper bound
> in SDKHarness. We can warn if we are hitting the upper bound of
> threads. This approach assumes that runner does a good job of
> scheduling and will distribute tasks more or less evenly.
> 
> We expect good scheduling from runners so I prefer approach 2. It is
> simpler to implement and the implementation is not runner specific. This
> approach better utilize resource as it creates only as many threads as
> needed instead of the peak thread requirement.
> And last but not the least, it gives runner control over managing truly
> active tasks.
> 
> Please let me know if I am missing something and your thoughts on the
> approach.
> 
> Thanks,
> Ankur
>

Re: Bootstrapping Beam's Job Server

2018-08-21 Thread Maximilian Michels

Thanks Henning and Thomas. It looks like

a) we want to keep the Docker Job Server Docker container and rely on
spinning up "sibling" SDK harness containers via the Docker socket. This 
should require little changes to the Runner code.

b) have the InProcess SDK harness as an alternative way to running user
code. This can be done independently of a).

Thomas, let's sync today on the InProcess SDK harness. I've created a
JIRA issue: https://issues.apache.org/jira/browse/BEAM-5187

Cheers,
Max

On 21.08.18 00:35, Thomas Weise wrote:
The original objective was to make test/development easier (which I 
think is super important for user experience with portable runner).

 From first hand experience I can confirm that dealing with Flink 
clusters and Docker containers for local setup is a significant hurdle 
for Python developers.

To simplify using Flink in embedded mode, the (direct) process based SDK 
harness would be a good option, especially when it can be linked to the 
same virtualenv that developers have already setup, eliminating extra 
packaging/deployment steps.

Max, I would be interested to sync up on what your thoughts are 
regarding that option since you mention you also started to work on it 
(see previous discussion [1], not sure if there is a JIRA for it yet). 
Internally we are planning to use a direct SDK harness process instead 
of Docker containers. For our specific needs it will works equally well 
for development and production, including future plans to deploy Flink 
TMs via Kubernetes.

Thanks,
Thomas

[1] 
https://lists.apache.org/thread.html/d8b81e9f74f77d74c8b883cda80fa48efdcaf6ac2ad313c4fe68795a@%3Cdev.beam.apache.org%3E

On Mon, Aug 20, 2018 at 3:00 PM Maximilian Michels <mailto:m...@apache.org>> wrote:

Thanks for your suggestions. Please see below.

 > Option 3) would be to map in the docker binary and socket to allow
 > the containerized Flink job server to start "sibling" containers on
 > the host.

Do you mean packaging Docker inside the Job Server container and
mounting /var/run/docker.sock from the host inside the container? That
looks like a bit of a hack but for testing it could be fine.

 > notably, if the runner supports auto-scaling or similar non-trivial
 > configurations, that would be difficult to manage from the SDK side.

You're right, it would be unfortunate if the SDK would have to deal with
spinning up SDK harness/backend containers. For non-trivial
configurations it would probably require an extended protocol.

 > Option 4) We are also thinking about adding process based SDKHarness.
 > This will avoid docker in docker scenario.

Actually, I had started implementing a process-based SDK harness but
figured it might be impractical because it doubles the execution path
for UDF code and potentially doesn't work with custom dependencies.

 > Process based SDKHarness also has other applications and might be
 > desirable in some of the production use cases.

True. Some users might want something more lightweight.

--
Max

Re: Beam Docs Contributor

2018-08-21 Thread Maximilian Michels

That sounds great, Rose. Welcome!

On 21.08.18 09:21, Etienne Chauchot wrote:
> Welcome Rose !
> 
> Etienne
> 
> Le lundi 30 juillet 2018 à 10:10 -0700, Thomas Weise a écrit :
>> Welcome Rose, and looking forward to the docs update!
>>
>> On Mon, Jul 30, 2018 at 9:15 AM Henning Rohde > > wrote:
>>> Welcome Rose! Great to have you here.
>>>
>>> On Mon, Jul 30, 2018 at 2:23 AM Ismaël Mejía >> > wrote:
 Welcome !
 Great to see someone new working in this important area for the project.


 On Mon, Jul 30, 2018 at 5:57 AM Kai Jiang >>> > wrote:
> Welcome Rose!
> ᐧ
>
> On Sun, Jul 29, 2018 at 8:53 PM Rui Wang  > wrote:
>> Welcome!
>>
>> -Rui
>>
>> On Sun, Jul 29, 2018 at 7:07 PM Griselda Cuevas > > wrote:
>>> Welcome Rose, very glad to have you in the community :)
>>>
>>>
>>>
>>> On Fri, 27 Jul 2018 at 16:29, Ahmet Altay >> > wrote:
 Welcome Rose! Looking forward to your contributions.

 On Fri, Jul 27, 2018 at 4:08 PM, Rose Nguyen
 mailto:rtngu...@google.com>> wrote:
> Hi all:
>
> I'm Rose! I've worked on Cloud Dataflow documentation and now
> I'm starting a project to refresh the Beam docs and improve the
> onboarding experience. We're planning on splitting up the
> programming guide into multiple pages, making the docs more
> accessible for new users. I've got lots of ideas for doc
> improvements, some of which are motivated by the UX research,
> and am excited to share them with you all and work on them. 
>
> I look forward to interacting with everybody in the community.
> I welcome comments, thoughts, feedback, etc. 
> -- 
>
>   
>   
>
> Rose Thi Nguyen
>
>   Technical Writer
>
> (281) 683-6900
>

Re: Process JobBundleFactory for portable runner

2018-08-21 Thread Maximilian Michels

For reference, here is corresponding JIRA issue for this thread:
https://issues.apache.org/jira/browse/BEAM-5187

On 16.08.18 11:15, Maximilian Michels wrote:

Makes sense to have an option to run the SDK harness in a non-dockerized
environment.

I'm in the process of creating a Docker entry point for Flink's
JobServer[1]. I suppose you would also prefer to execute that one
standalone. We should make sure this is also an option.

[1] https://issues.apache.org/jira/browse/BEAM-4130

On 16.08.18 07:42, Thomas Weise wrote:

Yes, that's the proposal. Everything that would otherwise be packaged
into the Docker container would need to be pre-installed in the host
environment. In the case of Python SDK, that could simply mean a
(frozen) virtual environment that was setup when the host was
provisioned - the SDK harness process(es) will then just utilize that.
Of course this flavor of SDK harness execution could also be useful in
the local development environment, where right now someone who already
has the Python environment needs to also install Docker and update a
container to launch a Python SDK pipeline on the Flink runner.

On Wed, Aug 15, 2018 at 12:40 PM Daniel Oliveira mailto:danolive...@google.com>> wrote:

I just want to clarify that I understand this correctly since I'm
not that familiar with the details behind all these execution
environments yet. Is the proposal to create a new JobBundleFactory
that instead of using Docker to create the environment that the new
processes will execute in, this JobBundleFactory would execute the
new processes directly in the host environment? So in practice if I
ran a pipeline with this JobBundleFactory the SDK Harness and Runner
Harness would both be executing directly on my machine and would
depend on me having the dependencies already present on my machine?

On Mon, Aug 13, 2018 at 5:50 PM Ankur Goenka mailto:goe...@google.com>> wrote:

Thanks for starting the discussion. I will be happy to help.
I agree, we should have pluggable SDKHarness environment Factory.
We can register multiple Environment factory using service
registry and use the PipelineOption to pick the right one on per
job basis.

There are a couple of things which are require to setup before
launching the process.

* Setting up the environment as done in boot.go [4]
* Retrieving and putting the artifacts in the right location.

You can probably leverage boot.go code to setup the environment.

Also, it will be useful to enumerate pros and cons of different
Environments to help users choose the right one.

On Mon, Aug 6, 2018 at 4:50 PM Thomas Weise mailto:t...@apache.org>> wrote:

Hi,

Currently the portable Flink runner only works with SDK
Docker containers for execution (DockerJobBundleFactory,
besides an in-process (embedded) factory option for testing
[1]). I'm considering adding another out of process
JobBundleFactory implementation that directly forks the
processes on the task manager host, eliminating the need for
Docker. This would work reasonably well in environments
where the dependencies (in this case Python) can easily be
tied into the host deployment (also within an application
specific Kubernetes pod).

There was already some discussion about alternative
JobBundleFactory implementation in [2]. There is also a JIRA
to make the bundle factory pluggable [3], pending
availability of runner level options.

For a "ProcessBundleFactory", in addition to the Python
dependencies the environment would also need to have the Go
boot executable [4] (or a substitute thereof) to perform the
harness initialization.

Is anyone else interested in this SDK execution option or
has already investigated an alternative implementation?

Thanks,
Thomas

[1]

https://github.com/apache/beam/blob/7958a379b0a37a89edc3a6ae4b5bc82fda41fcd6/runners/flink/src/test/java/org/apache/beam/runners/flink/PortableExecutionTest.java#L83

[2]

https://lists.apache.org/thread.html/d6b6fde764796de31996db9bb5f9de3e7aaf0ab29b99d0adb52ac508@%3Cdev.beam.apache.org%3E

[3] https://issues.apache.org/jira/browse/BEAM-4819

[4]
https://github.com/apache/beam/blob/master/sdks/python/container/boot.go

--
Max

Re: Status of IntelliJ with Gradle

2018-08-22 Thread Maximilian Michels

Thanks Lukasz. I also found that I can never fix all import errors by 
manually adding jars to the IntelliJ library list. It is also not a good 
solution because it breaks on reloading the Gradle project.

New contributors might find the errors in IntelliJ distracting. Even 
worse, they might assume the problem is on their side. If we can't fix 
them soon, I'd suggest documenting the IntelliJ limitations in the 
contributor guide.

On 20.08.18 17:58, Lukasz Cwik wrote:
Yes, I have the same issues with vendoring. These are the things that I 
have tried without success to get Intellij to import the vendored 
modules correctly:
* attempted to modify the idea.module.scopes to only include the 
vendored artifacts (for some reason this is ignored and Intellij is 
relying on the output of its own internal module, nothing I add to the 
scopes seems to impact anything)
* modify the generated iml beforehand to add the vendored jar file as 
the top dependency (jar never appears in the modules dependencies)

On Mon, Aug 20, 2018 at 8:36 AM Maximilian Michels <mailto:m...@apache.org>> wrote:

Thank you Etienne for opening the issue.

Anyone else having problems with the shaded Protobuf dependency?

On 20.08.18 16:14, Etienne Chauchot wrote:
 > Hi Max,
 >
 > I experienced the same, I had first opened a general ticket
 > (https://issues.apache.org/jira/browse/BEAM-4418) about gradle
 > improvements and I just split it in several tickets. Here is the one
 > concerning the same issue:
https://issues.apache.org/jira/browse/BEAM-5176
 >
 > Etienne
 >
 > Le lundi 20 août 2018 à 15:51 +0200, Maximilian Michels a écrit :
 >> Hi Beamers,
 >>
 >> It's great to see the Beam build system overhauled. Thank you
for all
 >> the hard work.
 >>
 >> That said, I've just started contributing to Beam again and I feel
 >> really stupid for not having a fully-functional IDE. I've closely
 >> followed the IntelliJ/Gradle instructions [1]. In the terminal
 >> everything works fine.
 >>
 >> First of all, I get warnings like the following and the build fails:
 >>
 >> 
 >>

.../beam/sdks/java/core/src/main/java/org/apache/beam/sdk/package-info.java:29:
 >> warning: [deprecation] NonNull in
edu.umd.cs.findbugs.annotations has
 >> been deprecated
 >> @DefaultAnnotation(NonNull.class)
 >>                    ^
 >> error: warnings found and -Werror specified
 >> 1 error
 >> 89 warnings
 >> =
 >>
 >> Somehow the "-Xlint:-deprecation" compiler flag does not get
through but
 >> "-Werror" does. I can get it to compile by removing the
"-Werror" flag
 >> from BeamModulePlugin but that's obviously not the solution.
 >>
 >> Further, once the build succeeds I still have to add the relocated
 >> Protobuf library manually because the one in "vendor" does not get
 >> picked up. I've tried clearing caches / re-adding the project /
 >> upgrading IntelliJ / changing Gradle configs.
 >>
 >>
 >> Is this just me or do you also have similar problems? If so, I would
 >> like to compile a list of possible fixes to help other contributors.
 >>
 >>
 >> Thanks,
 >> Max
 >>
 >>
 >> Tested with
 >> - IntelliJ 2018.1.6 and 2018.2.1.
 >> - MacOS
 >> - java version "1.8.0_112"
 >>
 >> [1] https://beam.apache.org/contribute/intellij/
 >>
 >>

--
Max

Re: Bootstrapping Beam's Job Server

> Going down this path may start to get fairly involved, with an almost
> endless list of features that could be requested. Instead, I would
> suggest we keep process-based execution very simple, and specify bash
> script (that sets up the environment and whatever else one may want to
> do) as the command line invocation.

Fair point. At the least, we will have to transfer the shell script to 
the nodes. Anything else is up to the script.

> I would also think it'd be really valuable to provide a "callback"
> environment, where an RPC call is made to trigger worker creation
> (deletion?), passing the requisite parameters (e.g. the fn api
> endpoints).

Aren't you making up more features now? :) Couldn't this be also handled 
by the shell script?

On 23.08.18 14:13, Robert Bradshaw wrote:

On Thu, Aug 23, 2018 at 1:54 PM Maximilian Michels  wrote:

Big +1. Process-based execution should be simple to reason about for
users.

+1. In fact, this is exactly what the Python local job server does,
with running Docker simply being a particular command line that's
passed down here.

https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/portability/local_job_service_main.py

The implementation should not be too involved. The user has to
ensure the environment is suitable for process-based execution.

There are some minor features that we should support:

- Activating a virtual environment for Python / Adding pre-installed
libraries to the classpath

- Staging libraries, similarly to the boot code for Docker

Going down this path may start to get fairly involved, with an almost
endless list of features that could be requested. Instead, I would
suggest we keep process-based execution very simple, and specify bash
script (that sets up the environment and whatever else one may want to
do) as the command line invocation. We could even provide a couple of
these. (The arguments to pass should be configurable).

I would also think it'd be really valuable to provide a "callback"
environment, where an RPC call is made to trigger worker creation
(deletion?), passing the requisite parameters (e.g. the fn api
endpoints). This could be useful both in a distributed system (where
it may be desirable for an external entity to actually start up the
workers) or for debugging/testing (where one could call into the same
process that submitted the job, which would execute workers on
separate threads with an already set up environment).

On 22.08.18 07:49, Henning Rohde wrote:

Agree with Luke. Perhaps something simple, prescriptive yet flexible,
such as custom command line (defined in the environment proto) rooted at
the base of the provided artifacts and either passed the same arguments
or defined in the container contract or made available through
substitution. That way, all the restrictions/assumptions of the
execution environment become implicit and runner/deployment dependent.

On Tue, Aug 21, 2018 at 2:12 PM Lukasz Cwik mailto:lc...@google.com>> wrote:

 I believe supporting a simple Process environment makes sense. It
 would be best if we didn't make the Process route solve all the
 problems that Docker solves for us. In my opinion we should limit
 the Process route to assume that the execution environment:
 * has all dependencies and libraries installed
 * is of a compatible machine architecture
 * doesn't require special networking rules to be setup

 Any other suggestions for reasonable limits on a Process environment?

 On Tue, Aug 21, 2018 at 2:53 AM Ismaël Mejía mailto:ieme...@gmail.com>> wrote:

 It is also worth to mention that apart of the
 testing/development use
 case there is also the case of supporting people running in Hadoop
 distributions. There are two extra reasons to want a process based
 version: (1) Some Hadoop distributions run in machines with
 really old
 kernels where docker support is limited or nonexistent (yes, some of
 those run on kernel 2.6!) and (2) Ops people may be reticent to the
 additional operational overhead of enabling docker in their
     clusters.
 On Tue, Aug 21, 2018 at 11:50 AM Maximilian Michels
 mailto:m...@apache.org>> wrote:
  >
  > Thanks Henning and Thomas. It looks like
  >
  > a) we want to keep the Docker Job Server Docker container and
 rely on
  > spinning up "sibling" SDK harness containers via the Docker
 socket. This
  > should require little changes to the Runner code.
  >
  > b) have the InProcess SDK harness as an alternative way to
 running user
  > code. This can be done independently of a).
  >
  > Thomas, let's sync today on the InProcess SDK harness. I've
 created a
  &

Re: [Proposal] Track non-code contributions in Jira

2018-08-24 Thread Maximilian Michels

+1 Code is just one part of a successful open-source project. As long as 
the tasks are properly labelled and actionable, I think it works to put 
them into JIRA.


On 24.08.18 15:09, Matthias Baetens wrote:


I fully agree and think it is a great idea.

I think that, next to visibility and keeping track of everything that is 
going on in the community, the other goal would be documenting best 
practices for future use.


I am also not sure, though, if JIRA is the best place to do so, as 
Austin raised.
Introducing (yet) another tool on the other hand, might also not be 
ideal. Has anyone else experience with this from other Apache projects?


On Fri, 24 Aug 2018 at 06:04 Austin Bennett > wrote:


Certainly tracking and managing these are important -- though, is
Jira the best tool for these things?

I do see it useful to put in Jira tickets in for my director to have
conversations on specific topics with people, for consensus
building, etc etc.  So, I have seen it work even for non-coding tasks.

It seems like much of #s 2-6 mentioned requires project management
applied to those specific domains and is applicable elsewhere,
wondering what constitutes "pure" project management in #1 (as it
applies here)...?  In that light I'm just getting picky about
taxonomy :-)





On Thu, Aug 23, 2018 at 3:10 PM Alan Myrvold mailto:amyrv...@google.com>> wrote:

I like the idea of recognizing non-code contributions. These
other efforts have been very helpful.

On Thu, Aug 23, 2018 at 3:07 PM Griselda Cuevas mailto:g...@google.com>> wrote:

Hi Beam Community,

I'd like to start tracking non-code contributions for Beam,
specially around these six categories:
1) Project Management
2) Community Management
3) Advocacy
4) Events & Meetups
5) Documentation
6) Training

The proposal would be to create six boards in Jira, one per
proposed category, and as part of this initiative also clean
the already existing "Project Management" component, i.e.
making sure all issues there are still relevant.

After this, I'd also create a landing page in the website
that talks about all types of contributions to the project.

The reason for doing this is mainly to give visibility to
some of the great work our community does beyond code pushes
in Github. Initiatives around Beam are starting to spark
around the world, and it'd be great to become an Apache
project recognized for our outstanding community recognition.

What are your thoughts?
G

Gris

--


--
Max

Re: [Perk] Sharing the love for Flink Forward

Just wanted to chime in here and say that Flink Forward is a great 
conference. You get to meet lots of people from the Flink community from 
all over the world, committers as well as end users. There are awesome 
talks as well. Plus, you get to travel to Berlin which, if you haven't 
been, I highly recommend.


So I expect the passes to be gone soon :)

Cheers,
Max

On 25.08.18 01:45, Griselda Cuevas wrote:

Hi Beam Community!

As you know, Apache Beam is a big supporter of Apache Flink - pun 
intended ;) - and because Google Cloud is a proud sponsor of Flink 
Forward we have 10 passes for members of our Apache Beam community to 
attend the event in Berlin this Sept. 3rd & 4th for free.


If you're interested in one of the passes please reach out to me 
(gris[at]apache[dot]org), the only requisite is obviously that you love 
the Flink+Beam combo.


Happy Friday!
G


--
Max

Re: Bootstrapping Beam's Job Server

Robert, just to be clear about the "callback" proposal. Do you mean that 
the process startup script listens for an RPC from the Runner to bring 
up SDK harnesses as needed?

I agree this would be helpful to know the required parameter, e.g. you 
mentioned the Fn Api network configuration.

On 23.08.18 17:07, Robert Bradshaw wrote:

On Thu, Aug 23, 2018 at 3:47 PM Maximilian Michels  wrote:

  > Going down this path may start to get fairly involved, with an almost
  > endless list of features that could be requested. Instead, I would
  > suggest we keep process-based execution very simple, and specify bash
  > script (that sets up the environment and whatever else one may want to
  > do) as the command line invocation.

Fair point. At the least, we will have to transfer the shell script to
the nodes. Anything else is up to the script.

  > I would also think it'd be really valuable to provide a "callback"
  > environment, where an RPC call is made to trigger worker creation
  > (deletion?), passing the requisite parameters (e.g. the fn api
  > endpoints).

Aren't you making up more features now? :) Couldn't this be also handled
by the shell script?

Good point :). I still think it'd be nice to make this option more
explicit, as it doesn't even require starting up (or managing) a
subprocess.

On 23.08.18 14:13, Robert Bradshaw wrote:

On Thu, Aug 23, 2018 at 1:54 PM Maximilian Michels  wrote:

Big +1. Process-based execution should be simple to reason about for
users.

+1. In fact, this is exactly what the Python local job server does,
with running Docker simply being a particular command line that's
passed down here.

https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/portability/local_job_service_main.py

The implementation should not be too involved. The user has to
ensure the environment is suitable for process-based execution.

There are some minor features that we should support:

- Activating a virtual environment for Python / Adding pre-installed
libraries to the classpath

- Staging libraries, similarly to the boot code for Docker

Going down this path may start to get fairly involved, with an almost
endless list of features that could be requested. Instead, I would
suggest we keep process-based execution very simple, and specify bash
script (that sets up the environment and whatever else one may want to
do) as the command line invocation. We could even provide a couple of
these. (The arguments to pass should be configurable).

I would also think it'd be really valuable to provide a "callback"
environment, where an RPC call is made to trigger worker creation
(deletion?), passing the requisite parameters (e.g. the fn api
endpoints). This could be useful both in a distributed system (where
it may be desirable for an external entity to actually start up the
workers) or for debugging/testing (where one could call into the same
process that submitted the job, which would execute workers on
separate threads with an already set up environment).

On 22.08.18 07:49, Henning Rohde wrote:

Agree with Luke. Perhaps something simple, prescriptive yet flexible,
such as custom command line (defined in the environment proto) rooted at
the base of the provided artifacts and either passed the same arguments
or defined in the container contract or made available through
substitution. That way, all the restrictions/assumptions of the
execution environment become implicit and runner/deployment dependent.

On Tue, Aug 21, 2018 at 2:12 PM Lukasz Cwik mailto:lc...@google.com>> wrote:

  I believe supporting a simple Process environment makes sense. It
  would be best if we didn't make the Process route solve all the
  problems that Docker solves for us. In my opinion we should limit
  the Process route to assume that the execution environment:
  * has all dependencies and libraries installed
  * is of a compatible machine architecture
  * doesn't require special networking rules to be setup

  Any other suggestions for reasonable limits on a Process environment?

  On Tue, Aug 21, 2018 at 2:53 AM Ismaël Mejía mailto:ieme...@gmail.com>> wrote:

  It is also worth to mention that apart of the
  testing/development use
  case there is also the case of supporting people running in Hadoop
  distributions. There are two extra reasons to want a process based
  version: (1) Some Hadoop distributions run in machines with
  really old
  kernels where docker support is limited or nonexistent (yes, some of
  those run on kernel 2.6!) and (2) Ops people may be reticent to the
  additional operational overhead of enabling docker in their
  clusters.
  On Tue, Aug 21, 2018 at 11:50 AM Maximilian Michels
  mailto:m...@apache.org>> wrote:
   >
   > Thanks Henning and Thomas.

Re: Bootstrapping Beam's Job Server

Understood, so that's a generalized abstraction for creating RPC-based
services that manage SDK harnesses. (What we discussed as "external" in
the other thread). Would prefer this REST-based, since this makes
interfacing with other systems easier. So probably a shell script would
already suffice.

On 27.08.18 11:23, Robert Bradshaw wrote:

I mean that rather than a command line (or docker image) a URL is
given that's a GRPC (or REST or ...) endpoint that's invoked to pass
what would have been passed by command line arguments (e.g. the FnAPI
control plane and logging endpoints).

This could be implemented as a script that goes and makes the call and
exits, but I think this would be common enough it'd be worth building
in, and also useful enough for testing that it should be very
lightweight.
On Mon, Aug 27, 2018 at 10:51 AM Maximilian Michels wrote:

Robert, just to be clear about the "callback" proposal. Do you mean that
the process startup script listens for an RPC from the Runner to bring
up SDK harnesses as needed?

I agree this would be helpful to know the required parameter, e.g. you
mentioned the Fn Api network configuration.

On 23.08.18 17:07, Robert Bradshaw wrote:

On Thu, Aug 23, 2018 at 3:47 PM Maximilian Michels wrote:

> Going down this path may start to get fairly involved, with an almost
> endless list of features that could be requested. Instead, I would
> suggest we keep process-based execution very simple, and specify bash
> script (that sets up the environment and whatever else one may want to
> do) as the command line invocation.

Fair point. At the least, we will have to transfer the shell script to
the nodes. Anything else is up to the script.

> I would also think it'd be really valuable to provide a "callback"
> environment, where an RPC call is made to trigger worker creation
> (deletion?), passing the requisite parameters (e.g. the fn api
> endpoints).

Aren't you making up more features now? :) Couldn't this be also handled
by the shell script?

Good point :). I still think it'd be nice to make this option more
explicit, as it doesn't even require starting up (or managing) a
subprocess.

On 23.08.18 14:13, Robert Bradshaw wrote:

On Thu, Aug 23, 2018 at 1:54 PM Maximilian Michels wrote:

Big +1. Process-based execution should be simple to reason about for
users.

+1. In fact, this is exactly what the Python local job server does,
with running Docker simply being a particular command line that's
passed down here.

https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/portability/local_job_service_main.py

The implementation should not be too involved. The user has to
ensure the environment is suitable for process-based execution.

There are some minor features that we should support:

- Activating a virtual environment for Python / Adding pre-installed
libraries to the classpath

- Staging libraries, similarly to the boot code for Docker

Going down this path may start to get fairly involved, with an almost
endless list of features that could be requested. Instead, I would
suggest we keep process-based execution very simple, and specify bash
script (that sets up the environment and whatever else one may want to
do) as the command line invocation. We could even provide a couple of
these. (The arguments to pass should be configurable).

I would also think it'd be really valuable to provide a "callback"
environment, where an RPC call is made to trigger worker creation
(deletion?), passing the requisite parameters (e.g. the fn api
endpoints). This could be useful both in a distributed system (where
it may be desirable for an external entity to actually start up the
workers) or for debugging/testing (where one could call into the same
process that submitted the job, which would execute workers on
separate threads with an already set up environment).

On 22.08.18 07:49, Henning Rohde wrote:

Agree with Luke. Perhaps something simple, prescriptive yet flexible,
such as custom command line (defined in the environment proto) rooted at
the base of the provided artifacts and either passed the same arguments
or defined in the container contract or made available through
substitution. That way, all the restrictions/assumptions of the
execution environment become implicit and runner/deployment dependent.

On Tue, Aug 21, 2018 at 2:12 PM Lukasz Cwik mailto:lc...@google.com>> wrote:

I believe supporting a simple Process environment makes sense. It
would be best if we didn't make the Process route solve all the
problems that Docker solves for us. In my opinion we should limit
the Process route to assume that the execution environment:
* has all dependencies and libraries installed
* is of a compatible machine architecture
* doesn't require special networking rules to be setup

Any other suggestions for

Re: Process JobBundleFactory for portable runner

Thanks for your proposal Henning. +1 for explicit environment messages. 
I'm not sure how important it is to support cross-platform pipelines. I 
can foresee future use but I wouldn't consider it essential. However, it 
basically comes for free if we extend the existing environment 
information for ExecutableStage. The overhead, as you said, is negligible.

Also agree that artifact staging is important even with process-based 
execution. The execution environment might be managed externally but we 
still want to be able to execute new pipelines without copying over 
required artifact. That said, a first version could come without 
artifact staging.

On 23.08.18 18:14, Henning Rohde wrote:
A process-based SDK harness does not IMO imply that the host is fully 
provisioned by the SDK/user and invoking the user command line in the 
context of the staged files is a critical aspect for it to work. So I 
consider staged artifact support needed. Also, I would like to suggest 
that we move to a concrete environment proto to crystalize what is 
actually being proposed. I'm not sure what activating a virtualenv would 
look like, for example. To start things off:

message Environment {
   string urn = 1;
   bytes payload = 2;
}

// urn == "beam:env:docker:v1"
message DockerPayload {
   string container_image = 1;  // implicitly linux_amd64.
}

// urn == "beam:env:process:v1"
message ProcessPayload {
   string os = 1;  // "linux", "darwin", ..
   string arch = 2;  // "amd64", ..
   string command_line = 3;
}

// urn == "beam:env:external:v1"
// (no payload)

A runner may support any subset and reject any unsupported 
configuration. There are 3 kinds of environments that I think are useful:
  (1) docker: works as currently. Offers the most flexibility for SDKs 
and users, especially when the runner is outside the control (such 
as hosted runners). The runner starts the SDK harnesses.
  (2) process: as discussed here. The runner starts the SDK harnesses. 
The semantics is that the shell commandline is invoked in a directory 
rooted in the staged artifacts with the container contract arguments. It 
is up to the user and runner deployment to ensure that it makes sense, 
i.e., on windows a linux binary or bash script is not specified. 
Executing the user command in a shell env (bash, zsh, cmd, ..) ensures 
that paths and so on are set up:, i.e., specifying "java -jar foo" would 
actually work. Useful for cases where the user controls both the SDK and 
runner (such as locally) or when docker is not an option. Intended to be 
minimal and SDK/language agnostic.
  (3) external: this is what I think Robert was alluding to. The runner 
does not start any SDK harnesses. Instead it waits for user-controlled 
SDK harnesses to connect. Useful for manually debugging SDK code 
(connect from code running in a debugger) or when the user code must run 
in a special or privileged environment. It's runner-specific how the SDK 
will need to connect.

Part of the idea of placing this information in the environment is that 
pipelines can potentially use multiple, such as cross-windows/linux.

Henning

On Thu, Aug 23, 2018 at 6:44 AM Thomas Weise <mailto:t...@apache.org>> wrote:

I would see support for staging libraries as optional / nice to have
since that can also be done as part of host provisioning (i.e. in
the Python case a virtual environment was already setup and just
needs to be activated).

Depending on how the command that launches the harness is
configured, additional steps such as virtualenv activate or setting
of other environment variables can be included as well.

On Thu, Aug 23, 2018 at 5:15 AM Maximilian Michels mailto:m...@apache.org>> wrote:

Just to recap:

  From this and the other thread ("Bootstraping Beam's Job
Server") we
got sufficient evidence that process-based execution is a
desired feature.

Process-based execution as an alternative to dockerized execution
https://issues.apache.org/jira/browse/BEAM-5187

Which parts are executed as a process?
=> The SDK harness for user code

What configuration options are supported?
=> Provide information about the target architecture (OS/CPU)
=> Staging libraries, as also supported by Docker
    => Activating a pre-existing environment (e.g. virutalenv)

On 23.08.18 14:13, Maximilian Michels wrote:
 >> One thing to consider that we've talked about in the past.
It might
 >> make sense to extend the environment proto and have the SDK be
 >> explicit about which kinds of environment it support
 >
 > +1 Encoding environment information there is a good idea.
 >
 >> Seems it will create a default docker url even if the
 &

Re: Process JobBundleFactory for portable runner

One thing to consider that we've talked about in the past. It might make sense 
to extend the environment proto and have the SDK be explicit about which kinds 
of environment it support

+1 Encoding environment information there is a good idea.

Seems it will create a default docker url even if the 
hardness_docker_image is set to None in pipeline options. Shall we add 
another option or honor the None in this option to support the process 
job? 

Yes, if no Docker image is set the default one will be used. Currently 
Docker is the only way to execute pipelines with the PortableRunner. If 
the docker_image is not set, execution won't succeed.

On 22.08.18 22:59, Xinyu Liu wrote:
We are also interested in this Process JobBundleFactory as we are 
planning to fork a process to run python sdk in Samza runner, instead of 
using docker container. So this change will be helpful to us too. On the 
same note, we are trying out portable_runner.py to submit a python job. 
Seems it will create a default docker url even if the 
hardness_docker_image is set to None in pipeline options. Shall we add 
another option or honor the None in this option to support the process 
job? I made some local changes right now to walk around this.

Thanks,
Xinyu

On Tue, Aug 21, 2018 at 12:25 PM, Henning Rohde <mailto:hero...@google.com>> wrote:

By "enum" in quotes, I meant the usual open URN style pattern not an
actual enum. Sorry if that wasn't clear.

On Tue, Aug 21, 2018 at 11:51 AM Lukasz Cwik mailto:lc...@google.com>> wrote:

I would model the environment to be more free form then enums
such that we have forward looking extensibility and would
suggest to follow the same pattern we use on PTransforms (using
an URN and a URN specific payload). Note that in this case we
may want to support a list of supported environments (e.g. java,
docker, python, ...).

On Tue, Aug 21, 2018 at 10:37 AM Henning Rohde
mailto:hero...@google.com>> wrote:

One thing to consider that we've talked about in the past.
It might make sense to extend the environment proto and have
the SDK be explicit about which kinds of environment it
supports:

https://github.com/apache/beam/blob/8c4f4babc0b0d55e7bddefa3f9f9ba65d21ef139/model/pipeline/src/main/proto/beam_runner_api.proto#L969

<https://github.com/apache/beam/blob/8c4f4babc0b0d55e7bddefa3f9f9ba65d21ef139/model/pipeline/src/main/proto/beam_runner_api.proto#L969>

This choice might impact what files are staged or what not.
Some SDKs, such as Go, provide a compiled binary and _need_
to know what the target architecture is. Running on a mac
with and without docker, say, requires a different worker in
each case. If we add an "enum", we can also easily add the
external idea where the SDK/user starts the SDK harnesses
instead of the runner. Each runner may not support all types
of environments.

Henning

    On Tue, Aug 21, 2018 at 2:52 AM Maximilian Michels
mailto:m...@apache.org>> wrote:

For reference, here is corresponding JIRA issue for this
thread:
https://issues.apache.org/jira/browse/BEAM-5187
<https://issues.apache.org/jira/browse/BEAM-5187>

On 16.08.18 11:15, Maximilian Michels wrote:
 > Makes sense to have an option to run the SDK harness
in a non-dockerized
 > environment.
 >
 > I'm in the process of creating a Docker entry point
for Flink's
 > JobServer[1]. I suppose you would also prefer to
execute that one
 > standalone. We should make sure this is also an option.
 >
 > [1] https://issues.apache.org/jira/browse/BEAM-4130
<https://issues.apache.org/jira/browse/BEAM-4130>
 >
 > On 16.08.18 07:42, Thomas Weise wrote:
 >> Yes, that's the proposal. Everything that would
otherwise be packaged
 >> into the Docker container would need to be
pre-installed in the host
 >> environment. In the case of Python SDK, that could
simply mean a
 >> (frozen) virtual environment that was setup when the
host was
 >> provisioned - the SDK harness process(es) will then
just utilize that.
 >> Of course this flavor of SDK harness execution could
also be useful in
 >> the local development environment, where

Re: Bootstrapping Beam's Job Server

Big +1. Process-based execution should be simple to reason about for 
users. The implementation should not be too involved. The user has to 
ensure the environment is suitable for process-based execution.

There are some minor features that we should support:

- Activating a virtual environment for Python / Adding pre-installed 
libraries to the classpath

- Staging libraries, similarly to the boot code for Docker

On 22.08.18 07:49, Henning Rohde wrote:
Agree with Luke. Perhaps something simple, prescriptive yet flexible, 
such as custom command line (defined in the environment proto) rooted at 
the base of the provided artifacts and either passed the same arguments 
or defined in the container contract or made available through 
substitution. That way, all the restrictions/assumptions of the 
execution environment become implicit and runner/deployment dependent.

On Tue, Aug 21, 2018 at 2:12 PM Lukasz Cwik <mailto:lc...@google.com>> wrote:

I believe supporting a simple Process environment makes sense. It
would be best if we didn't make the Process route solve all the
problems that Docker solves for us. In my opinion we should limit
the Process route to assume that the execution environment:
* has all dependencies and libraries installed
* is of a compatible machine architecture
* doesn't require special networking rules to be setup

Any other suggestions for reasonable limits on a Process environment?

On Tue, Aug 21, 2018 at 2:53 AM Ismaël Mejía mailto:ieme...@gmail.com>> wrote:

It is also worth to mention that apart of the
testing/development use
case there is also the case of supporting people running in Hadoop
distributions. There are two extra reasons to want a process based
version: (1) Some Hadoop distributions run in machines with
really old
kernels where docker support is limited or nonexistent (yes, some of
those run on kernel 2.6!) and (2) Ops people may be reticent to the
additional operational overhead of enabling docker in their
clusters.
On Tue, Aug 21, 2018 at 11:50 AM Maximilian Michels
mailto:m...@apache.org>> wrote:
 >
 > Thanks Henning and Thomas. It looks like
 >
 > a) we want to keep the Docker Job Server Docker container and
rely on
 > spinning up "sibling" SDK harness containers via the Docker
socket. This
 > should require little changes to the Runner code.
 >
 > b) have the InProcess SDK harness as an alternative way to
running user
 > code. This can be done independently of a).
 >
 > Thomas, let's sync today on the InProcess SDK harness. I've
created a
 > JIRA issue: https://issues.apache.org/jira/browse/BEAM-5187
 >
 > Cheers,
 > Max
 >
 > On 21.08.18 00:35, Thomas Weise wrote:
 > > The original objective was to make test/development easier
(which I
 > > think is super important for user experience with portable
runner).
 > >
 > >  From first hand experience I can confirm that dealing with
Flink
 > > clusters and Docker containers for local setup is a
significant hurdle
 > > for Python developers.
 > >
 > > To simplify using Flink in embedded mode, the (direct)
process based SDK
 > > harness would be a good option, especially when it can be
linked to the
 > > same virtualenv that developers have already setup,
eliminating extra
 > > packaging/deployment steps.
 > >
 > > Max, I would be interested to sync up on what your thoughts are
 > > regarding that option since you mention you also started to
work on it
 > > (see previous discussion [1], not sure if there is a JIRA
for it yet).
 > > Internally we are planning to use a direct SDK harness
process instead
 > > of Docker containers. For our specific needs it will works
equally well
 > > for development and production, including future plans to
deploy Flink
 > > TMs via Kubernetes.
 > >
 > > Thanks,
 > > Thomas
 > >
 > > [1]
 > >

https://lists.apache.org/thread.html/d8b81e9f74f77d74c8b883cda80fa48efdcaf6ac2ad313c4fe68795a@%3Cdev.beam.apache.org%3E
 > >
 > >
 > >
 > >
 > >
 > >
 > > On Mon, Aug 20, 2018 at 3:00 PM Maximilian Michels
mailto:m...@apache.org>
 > > <mailto:m..

Re: Process JobBundleFactory for portable runner

Just to recap:

From this and the other thread ("Bootstraping Beam's Job Server") we 
got sufficient evidence that process-based execution is a desired feature.

Process-based execution as an alternative to dockerized execution
https://issues.apache.org/jira/browse/BEAM-5187

Which parts are executed as a process?
=> The SDK harness for user code

What configuration options are supported?
=> Provide information about the target architecture (OS/CPU)
=> Staging libraries, as also supported by Docker
=> Activating a pre-existing environment (e.g. virutalenv)

On 23.08.18 14:13, Maximilian Michels wrote:
One thing to consider that we've talked about in the past. It might 
make sense to extend the environment proto and have the SDK be 
explicit about which kinds of environment it support

+1 Encoding environment information there is a good idea.

Seems it will create a default docker url even if the 
hardness_docker_image is set to None in pipeline options. Shall we add 
another option or honor the None in this option to support the process 
job? 

Yes, if no Docker image is set the default one will be used. Currently 
Docker is the only way to execute pipelines with the PortableRunner. If 
the docker_image is not set, execution won't succeed.

On 22.08.18 22:59, Xinyu Liu wrote:
We are also interested in this Process JobBundleFactory as we are 
planning to fork a process to run python sdk in Samza runner, instead 
of using docker container. So this change will be helpful to us too. 
On the same note, we are trying out portable_runner.py to submit a 
python job. Seems it will create a default docker url even if the 
hardness_docker_image is set to None in pipeline options. Shall we add 
another option or honor the None in this option to support the process 
job? I made some local changes right now to walk around this.

Thanks,
Xinyu

On Tue, Aug 21, 2018 at 12:25 PM, Henning Rohde <mailto:hero...@google.com>> wrote:

    By "enum" in quotes, I meant the usual open URN style pattern not an
    actual enum. Sorry if that wasn't clear.

    On Tue, Aug 21, 2018 at 11:51 AM Lukasz Cwik mailto:lc...@google.com>> wrote:

    I would model the environment to be more free form then enums
    such that we have forward looking extensibility and would
    suggest to follow the same pattern we use on PTransforms (using
    an URN and a URN specific payload). Note that in this case we
    may want to support a list of supported environments (e.g. java,
    docker, python, ...).

    On Tue, Aug 21, 2018 at 10:37 AM Henning Rohde
    mailto:hero...@google.com>> wrote:

    One thing to consider that we've talked about in the past.
    It might make sense to extend the environment proto and have
    the SDK be explicit about which kinds of environment it
    supports:

https://github.com/apache/beam/blob/8c4f4babc0b0d55e7bddefa3f9f9ba65d21ef139/model/pipeline/src/main/proto/beam_runner_api.proto#L969 

<https://github.com/apache/beam/blob/8c4f4babc0b0d55e7bddefa3f9f9ba65d21ef139/model/pipeline/src/main/proto/beam_runner_api.proto#L969> 

    This choice might impact what files are staged or what not.
    Some SDKs, such as Go, provide a compiled binary and _need_
    to know what the target architecture is. Running on a mac
    with and without docker, say, requires a different worker in
    each case. If we add an "enum", we can also easily add the
    external idea where the SDK/user starts the SDK harnesses
    instead of the runner. Each runner may not support all types
    of environments.

    Henning

    On Tue, Aug 21, 2018 at 2:52 AM Maximilian Michels
    mailto:m...@apache.org>> wrote:

    For reference, here is corresponding JIRA issue for this
    thread:
    https://issues.apache.org/jira/browse/BEAM-5187
    <https://issues.apache.org/jira/browse/BEAM-5187>

    On 16.08.18 11:15, Maximilian Michels wrote:
 > Makes sense to have an option to run the SDK harness
    in a non-dockerized
 > environment.
 >
 > I'm in the process of creating a Docker entry point
    for Flink's
 > JobServer[1]. I suppose you would also prefer to
    execute that one
 > standalone. We should make sure this is also an 
option.

 >
 > [1] https://issues.apache.org/jira/browse/BEAM-4130
    <https://issues.apache.org/jira/browse/BEAM-4130>
 >
 > On 16.08.18 07:42, Thomas Weise wrote:
 >> Yes, that's the proposal. Everything that would

Status of IntelliJ with Gradle

Hi Beamers,

It's great to see the Beam build system overhauled. Thank you for all
the hard work.

That said, I've just started contributing to Beam again and I feel
really stupid for not having a fully-functional IDE. I've closely
followed the IntelliJ/Gradle instructions [1]. In the terminal
everything works fine.

First of all, I get warnings like the following and the build fails:


.../beam/sdks/java/core/src/main/java/org/apache/beam/sdk/package-info.java:29:
warning: [deprecation] NonNull in edu.umd.cs.findbugs.annotations has
been deprecated
@DefaultAnnotation(NonNull.class)
   ^
error: warnings found and -Werror specified
1 error
89 warnings
=

Somehow the "-Xlint:-deprecation" compiler flag does not get through but
"-Werror" does. I can get it to compile by removing the "-Werror" flag
from BeamModulePlugin but that's obviously not the solution.

Further, once the build succeeds I still have to add the relocated
Protobuf library manually because the one in "vendor" does not get
picked up. I've tried clearing caches / re-adding the project /
upgrading IntelliJ / changing Gradle configs.


Is this just me or do you also have similar problems? If so, I would
like to compile a list of possible fixes to help other contributors.


Thanks,
Max


Tested with
- IntelliJ 2018.1.6 and 2018.2.1.
- MacOS
- java version "1.8.0_112"

[1] https://beam.apache.org/contribute/intellij/

Status of IntelliJ with Gradle

Hi Beamers,

It's great to see the Beam build system overhauled. Thank you for all
the hard work.

That said, I've just started contributing to Beam again and I feel
really stupid for not having a fully-functional IDE. I've closely
followed the IntelliJ/Gradle instructions [1]. In the terminal
everything works fine.

First of all, I get warnings like the following and the build fails:


.../beam/sdks/java/core/src/main/java/org/apache/beam/sdk/package-info.java:29:
warning: [deprecation] NonNull in edu.umd.cs.findbugs.annotations has
been deprecated
@DefaultAnnotation(NonNull.class)
   ^
error: warnings found and -Werror specified
1 error
89 warnings
=

Somehow the "-Xlint:-deprecation" compiler flag does not get through but
"-Werror" does. I can get it to compile by removing the "-Werror" flag
from BeamModulePlugin but that's obviously not the solution.

Further, once the build succeeds I still have to add the relocated
Protobuf library manually because the one in "vendor" does not get
picked up. I've tried clearing caches / re-adding the project /
upgrading IntelliJ / changing Gradle configs.


Is this just me or do you also have similar problems? If so, I would
like to compile a list of possible fixes to help other contributors.


Thanks,
Max


Tested with
- IntelliJ 2018.1.6 and 2018.2.1.
- MacOS
- java version "1.8.0_112"

[1] https://beam.apache.org/contribute/intellij/

Re: Status of IntelliJ with Gradle

Sorry, please disregard this duplicate mail. The Apache mail relay was
flaky and my client doesn't seem to handle it particularly well.

On 20.08.18 15:51, Maximilian Michels wrote:
> Hi Beamers,
> 
> It's great to see the Beam build system overhauled. Thank you for all
> the hard work.
> 
> That said, I've just started contributing to Beam again and I feel
> really stupid for not having a fully-functional IDE. I've closely
> followed the IntelliJ/Gradle instructions [1]. In the terminal
> everything works fine.
> 
> First of all, I get warnings like the following and the build fails:
> 
> 
> .../beam/sdks/java/core/src/main/java/org/apache/beam/sdk/package-info.java:29:
> warning: [deprecation] NonNull in edu.umd.cs.findbugs.annotations has
> been deprecated
> @DefaultAnnotation(NonNull.class)
>^
> error: warnings found and -Werror specified
> 1 error
> 89 warnings
> =
> 
> Somehow the "-Xlint:-deprecation" compiler flag does not get through but
> "-Werror" does. I can get it to compile by removing the "-Werror" flag
> from BeamModulePlugin but that's obviously not the solution.
> 
> Further, once the build succeeds I still have to add the relocated
> Protobuf library manually because the one in "vendor" does not get
> picked up. I've tried clearing caches / re-adding the project /
> upgrading IntelliJ / changing Gradle configs.
> 
> 
> Is this just me or do you also have similar problems? If so, I would
> like to compile a list of possible fixes to help other contributors.
> 
> 
> Thanks,
> Max
> 
> 
> Tested with
> - IntelliJ 2018.1.6 and 2018.2.1.
> - MacOS
> - java version "1.8.0_112"
> 
> [1] https://beam.apache.org/contribute/intellij/
>

Re: Status of IntelliJ with Gradle

2018-08-30 Thread Maximilian Michels

Small update, it helps to add the following to the IntelliJ properties:

Help -> Edit Custom Properties

idea.max.intellisense.filesize=5000

This gets rid of the errors due to large generated source files, e.g. 
RunnerApi.java.

-Max

On 22.08.18 23:26, Kai Jiang wrote:

I encountered same error with Xinyu when I was launching unit tests in 
Intellij. For now, I am only using gradle to test unit tests.

Thanks,
Kai

On 2018/08/22 21:11:06, Xinyu Liu  wrote:

We experienced the same issues too in intellij after switching to latest
version. I did the trick Luke mentioned before to include the
beam-model-fn-execution and beam-model-job-management jars in the dependent
modules to get around compilation. But I cannot get the vendored protobuf
working. Seems the RunnerApi is using the original protobuf package, and it
causes confusion in intellij if I added the relocated jar. As a result, I
have to run and debug only using gradle for now.

Thanks,
Xinyu

On Wed, Aug 22, 2018 at 1:45 AM, Maximilian Michels  wrote:

Thanks Lukasz. I also found that I can never fix all import errors by
manually adding jars to the IntelliJ library list. It is also not a good
solution because it breaks on reloading the Gradle project.

New contributors might find the errors in IntelliJ distracting. Even
worse, they might assume the problem is on their side. If we can't fix them
soon, I'd suggest documenting the IntelliJ limitations in the contributor
guide.

On 20.08.18 17:58, Lukasz Cwik wrote:

Yes, I have the same issues with vendoring. These are the things that I
have tried without success to get Intellij to import the vendored modules
correctly:
* attempted to modify the idea.module.scopes to only include the vendored
artifacts (for some reason this is ignored and Intellij is relying on the
output of its own internal module, nothing I add to the scopes seems to
impact anything)
* modify the generated iml beforehand to add the vendored jar file as the
top dependency (jar never appears in the modules dependencies)

On Mon, Aug 20, 2018 at 8:36 AM Maximilian Michels mailto:m...@apache.org>> wrote:

 Thank you Etienne for opening the issue.

 Anyone else having problems with the shaded Protobuf dependency?

 On 20.08.18 16:14, Etienne Chauchot wrote:
  > Hi Max,
  >
  > I experienced the same, I had first opened a general ticket
  > (https://issues.apache.org/jira/browse/BEAM-4418) about gradle
  > improvements and I just split it in several tickets. Here is the
one
  > concerning the same issue:
 https://issues.apache.org/jira/browse/BEAM-5176
  >
  > Etienne
  >
  > Le lundi 20 août 2018 à 15:51 +0200, Maximilian Michels a écrit :
  >> Hi Beamers,
  >>
  >> It's great to see the Beam build system overhauled. Thank you
 for all
  >> the hard work.
  >>
  >> That said, I've just started contributing to Beam again and I feel
  >> really stupid for not having a fully-functional IDE. I've closely
  >> followed the IntelliJ/Gradle instructions [1]. In the terminal
  >> everything works fine.
  >>
  >> First of all, I get warnings like the following and the build
fails:
  >>
  >> 
  >>
 .../beam/sdks/java/core/src/main/java/org/apache/beam/sdk/pa
ckage-info.java:29:
  >> warning: [deprecation] NonNull in
 edu.umd.cs.findbugs.annotations has
  >> been deprecated
  >> @DefaultAnnotation(NonNull.class)
  >>^
  >> error: warnings found and -Werror specified
  >> 1 error
  >> 89 warnings
  >> =
  >>
  >> Somehow the "-Xlint:-deprecation" compiler flag does not get
 through but
  >> "-Werror" does. I can get it to compile by removing the
 "-Werror" flag
  >> from BeamModulePlugin but that's obviously not the solution.
  >>
  >> Further, once the build succeeds I still have to add the relocated
  >> Protobuf library manually because the one in "vendor" does not get
  >> picked up. I've tried clearing caches / re-adding the project /
  >> upgrading IntelliJ / changing Gradle configs.
  >>
  >>
  >> Is this just me or do you also have similar problems? If so, I
would
  >> like to compile a list of possible fixes to help other
contributors.
  >>
  >>
  >> Thanks,
  >> Max
  >>
  >>
  >> Tested with
  >> - IntelliJ 2018.1.6 and 2018.2.1.
  >> - MacOS
  >> - java version "1.8.0_112"
  >>
  >> [1] https://beam.apache.org/contribute/intellij/
  >>
  >>

--
Max

Re: Python 3: final step

2018-09-07 Thread Maximilian Michels

This has been requested multiple times. Thanks for working on the Python 
3 story.


Let me know if I can help out in any way!

On 05.09.18 19:01, Valentyn Tymofieiev wrote:
This is awesome! Kudos to Robbe and Matthias who have been pushing this 
forward!


On Wed, Sep 5, 2018 at 9:45 AM Charles Chen > wrote:


This is great!  Feel free to add me as a reviewer.

On Wed, Sep 5, 2018 at 9:38 AM Andrew Pilloud mailto:apill...@google.com>> wrote:

Cool! I know very little about Python 3, but happy to help review.

Andrew

On Wed, Sep 5, 2018 at 9:21 AM Ahmet Altay mailto:al...@google.com>> wrote:

Thank you Robbe, this is great news!

On Wed, Sep 5, 2018 at 9:11 AM, Robbe Sneyders
mailto:robbe.sneyd...@ml6.eu>> wrote:

Hi everyone,

With the merging of [1], we now have Python 3 tests
running on Jenkins, which allows us to move forward with
the last step of the Python 3 porting.

You can follow the progress on the Jira Kanban Board
[2]. If you're interested in helping by porting a
module, you can assign one of the issues to yourself and
start coding. You can find the different steps outlined
in the design document [3].

We could also use some extra reviewers. If you're
interested, let us know, and we'll tag you in our PRs.

[1] https://github.com/apache/beam/pull/6266
[2]

https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=245
[3] https://s.apache.org/beam-python-3

kind regards,
Robbe
-- 


https://ml6.eu 




*Robbe Sneyders*

ML6 Gent



M: +32 474 71 31 08

Re: [NEW CONTRIBUTOR] ElasticsearchIO now supports Elasticsearch v6.x

2018-09-07 Thread Maximilian Michels


Well done. Thank you, Dat!

On 06.09.18 22:47, Trần Thành Đạt wrote:
Thank you. Etienne Chauchot and Tim Robertson helped me a lot to get 
familiar with Beam code.


On Fri, Sep 7, 2018 at 2:59 AM Thomas Weise > wrote:


Support for Elastic 6.x is really good to have. Thanks Dat and
welcome to the Beam community!

Thomas

On Thu, Sep 6, 2018 at 9:18 PM Ahmet Altay mailto:al...@google.com>> wrote:

Welcome! Thank you!

On Thu, Sep 6, 2018 at 11:40 AM, Chamikara Jayalath
mailto:chamik...@google.com>> wrote:

Welcome Dat Tran and thanks for the contribution.

On Thu, Sep 6, 2018 at 1:17 AM Alexey Romanenko
mailto:aromanenko@gmail.com>>
wrote:

Welcome to Beam community, Dat Tran! Thank you for your
work!


On 6 Sep 2018, at 10:10, Etienne Chauchot
mailto:echauc...@apache.org>>
wrote:

Hi all ,

Just to let you know that Elasticsearch IO now
supports version 6 in addition to version 2 and 5 now
that this PR is merged:

https://github.com/apache/beam/pull/6211#pullrequestreview-152477892.

At this occasion we gained a new contributor: Dat Tran

Best
Etienne

Re: PR/6343: Adding support for MustFollow

This is a great idea but I share Lukasz' doubts about this being a 
universal solution for awaiting some action in a pipeline.


I wonder, wouldn't it work to not pass in a PCollection, but instead 
wrap a DoFn which internally ensures the correct triggering behavior? 
All runners which correctly materialize the side input with the first 
window triggering should support it correctly.


Apart from that, couldn't you simply use the @Setup method of a DoFn in 
your example?


-Max

On 07.09.18 23:12, Peter Li wrote:

Thanks!  I (PR author) agree with all that.

On the unbounded triggering issue, I can see 2 reasonable desired behaviors:
   1) The collection to follow is bounded and the intent is to wait for 
the entire collection to be processed.
   2) The collection to follow has windows that in some flexible sense 
align with the windows of the collection that is supposed to be waiting, 
and the intent is for the waiting to apply within window.


If either or both of these behaviors can be supported by something like 
the proposed mechanism, then I think that's a reasonable thing to have 
provided it's well documented.


The other high-level issue I see is whether this should be A) a free 
PTransform on collections, or B) something like a call on ParDo.  In the 
PR it's (A), but I can imagine for future more native implementation on 
some runners it could be more natural as (B).


On Fri, Sep 7, 2018 at 11:58 AM Lukasz Cwik > wrote:


A contributor opened a PR[1] to add support for a PTransform that
forces one PTransform to be executed before another by using side
input readiness as a way to defer execution.

They have provided this example usage:
# Ensure that output dir is created before attempting to write
output files.
output_dir_ready = pipeline | beam.Create([output_path]) |
CreateOutputDir()
output_pcoll | MustFollow(output_dir_ready) | WriteOutput()

The example the PR author provided works since they are working in
the global window with no triggers that have early firings and they
are only processing a single element so `CreateOutputDir()` is
invoked only once. The unused side input access forces all runners
to wait till `CreateOutputDir()` is invoked before `WriteOutput()`.

Now imagine a user wanted to use `MustFollow` but incorrectly setup
the trigger for the side input and it has an early firing (such as
an after count 1 trigger) and has multiple directories they want to
create. The side input would become ready before **all** of
`CreateOutputDir()` was invoked. This would mean that `MustFollow`
would be able to access the side input and hence allow
`WriteOutput()` to happen. The example above is still a bounded
pipeline and a runner could choose to execute it as I described.

The contract for side inputs is that the side input PCollection must
have at least one firing based upon the upstream windowing strategy
for the "requested" window or the runner must be sure that the no
such window could ever be produced before it can be accessed by a
consumer.

My concern with the PR is two fold:
1) I'm not sure if this works for all runners in both bounded and
unbounded pipelines and could use insights from others:

I believe if the user is using a trigger which only fires once and
that they guarantee that the main input window mapping to side input
window mapping is 1:1 then they are likely to get the expected
behavior, WriteOutput() always will happen once CreateOutputDir()
for a given window has finished.

2) If the solution works, the documentation around the limitations
of how the PTransform works needs a lot more detail and potentially
"error" checking of the windowing strategy.

1: https://github.com/apache/beam/pull/6343

Re: SplittableDoFn


Thanks for moving forward with this, Lukasz!

Unfortunately, can't make it on Friday but I'll sync with somebody on 
the call (e.g. Ryan) about your discussion.


On 08.09.18 02:00, Lukasz Cwik wrote:
Thanks for everyone who wanted to fill out the doodle poll. The most 
popular time was Friday Sept 14th from 11am-noon PST. I'll send out a 
calendar invite and meeting link early next week.


I have received a lot of feedback on the document and have addressed 
some parts of it including:

* clarifying terminology
* processing skew due to some restrictions having their watermarks much 
further behind then others affecting scheduling of bundles by runners
* external throttling & I/O wait overhead reporting to make sure we 
don't overscale


Areas that still need additional feedback and details are:
* reporting progress around the work that is done and is active
* more examples
* unbounded restrictions being caused by an unbounded number of splits 
of existing unbounded restrictions (infinite work growth)
* whether we should be reporting this information at the PTransform 
level or at the bundle level




On Wed, Sep 5, 2018 at 1:53 PM Lukasz Cwik > wrote:


Thanks to all those who have provided interest in this topic by the
questions they have asked on the doc already and for those
interested in having this discussion. I have setup this doodle to
allow people to provide their availability:
https://doodle.com/poll/nrw7w84255xnfwqy

I'll send out the chosen time based upon peoples availability and a
Hangout link by end of day Friday so please mark your availability
using the link above.

The agenda of the meeting will be as follows:
* Overview of the proposal
* Enumerate and discuss/answer questions brought up in the meeting

Note that all questions and any discussions/answers provided will be
added to the doc for those who are unable to attend.

On Fri, Aug 31, 2018 at 9:47 AM Jean-Baptiste Onofré
mailto:j...@nanthrax.net>> wrote:

+1

Regards
JB
Le 31 août 2018, à 18:22, Lukasz Cwik mailto:lc...@google.com>> a écrit:

That is possible, I'll take people's date/time suggestions
and create a simple online poll with them.

On Fri, Aug 31, 2018 at 2:22 AM Robert Bradshaw
mailto:rober...@google.com>> wrote:

Thanks for taking this up. I added some comments to the
doc. A European-friendly time for discussion would
be great.

On Fri, Aug 31, 2018 at 3:14 AM Lukasz Cwik
mailto:lc...@google.com>> wrote:

I came up with a proposal[1] for a progress model
solely based off of the backlog and that splits
should be based upon the remaining backlog we want
the SDK to split at. I also give recommendations to
runner authors as to how an autoscaling system could
work based upon the measured backlog. A lot of
discussions around progress reporting and splitting
in the past has always been around finding an
optimal solution, after reading a lot of information
about work stealing, I don't believe there is a
general solution and it really is upto
SplittableDoFns to be well behaved. I did not do
much work in classifying what a well behaved
SplittableDoFn is though. Much of this work builds
off ideas that Eugene had documented in the past[2].

I could use the communities wide knowledge of
different I/Os to see if computing the backlog is
practical in the way that I'm suggesting and to
gather people's feedback.

If there is a lot of interest, I would like to hold
a community video conference between Sept 10th and
14th about this topic. Please reply with your
availability by Sept 6th if your interested.

1: https://s.apache.org/beam-bundles-backlog-splitting
2: https://s.apache.org/beam-breaking-fusion

On Mon, Aug 13, 2018 at 10:21 AM Jean-Baptiste
Onofré mailto:j...@nanthrax.net>> wrote:

Awesome !

Thanks Luke !

I plan to work with you and others on this one.

Regards
JB
Le 13 août 2018, à 19:14, Lukasz Cwik
mailto:lc...@google.com>> a
écrit:

I wanted to reach out that I will be
continuing from where Eugene left off

Re: [FYI] Paper of Building Beam Runner for IBM Streams


Excellent write-up. Thank you!

On 09.09.18 20:43, Jean-Baptiste Onofré wrote:

Good idea. It could also help people who wants to create runners.

Regards
JB

On 09/09/2018 13:00, Manu Zhang wrote:

Hi all,

I've spent the weekend reading Challenges and Experiences in Building an
Efficient Apache Beam Runner For IBM Streams
 from the August
proceeding of PVLDB. It's quite enjoyable and urges me to reflect on how
I (should've) implemented the Gearpump runner. I believe it will be
beneficial to have more such papers and discussions as sharing design
choices and lessons from various runners.

Enjoy it !
Manu Zhang

Re: [Call for items] September Beam Newsletter


Good stuff! Left some items for the Flink Runner.

On 08.09.18 02:14, Rose Nguyen wrote:

*bump*

Celebrate the weekend by sharing with the community your talks, 
contributions, plans, etc!


On Wed, Sep 5, 2018 at 10:25 AM Rose Nguyen > wrote:


Hi Beamers:

Here's


[1] the template for the September Beam Newsletter!

*Add the highlights from August to now (or planned events and talks)
that you want to share with the community by 9/8 11:59 p.m. **PDT.*
*
*
We will collect the notes via Google docs but send out the final
version directly to the user mailing list. If you do not know how to
format something, it is OK to just put down the info and I will
edit. I'll ship out the newsletter on 9/10.

Looking forward to your interesting stories. :)

[1]

https://docs.google.com/document/d/1PE97Cf3yoNcx_A9zPzROT_kPtujRZLJoZyqIQfzqUvY/edit
-- 
Rose Thị Nguyễn




--
Rose Thị Nguyễn

Re: PTransforms and Fusion


A) What should we do with these "empty" PTransforms?


We can't translate them, so dropping them seems the most reasonable 
choice. Should we throw an error/warning to make the user aware of this? 
Otherwise might be unexpected for the user.



A3) Handle the "empty" PTransform case within all of the shared libraries.


What can we do at this point other than dropping them?


B) What should we do with "native" PTransforms?


I support B1 and B2 as well. Non-portable PTransforms should be 
discouraged in the long run. However, the available PTransforms are not 
even consistent across the different SDKs yet (e.g. no streaming 
connectors in Python), so we should continue to provide a way to utilize 
the "native" transforms of a Runner.


-Max

On 07.09.18 19:15, Lukasz Cwik wrote:
A primitive transform is a PTransform that has been chosen to have no 
default implementation in terms of other PTransforms. A primitive 
transform therefore must be implemented directly by a pipeline runner in 
terms of pipeline-runner-specific concepts. An initial list of primitive 
PTransforms were defined in [2] and has since been updated in [3].


As part of the portability effort, libraries that are intended to be 
shared across multiple runners are being developed to support their 
migration to a portable execution model. One of these is responsible for 
fusing multiple primitive PTransforms together into a pipeline runner 
specific concept. This library made the choice that a primitive 
PTransform is a PTransform that doesn't contain any other PTransforms.


Unfortunately, while Ryan was attempting to enable testing of validates 
runner tests for Flink using the new portability libraries, he ran into 
an issue where the Apache Beam Java SDK allows for a person to construct 
a PTransform that has zero sub PTransforms and also isn't one of the 
defined Apache Beam primitives. In this case the PTransform was trivial 
as it was not applying any additional transforms to input PCollection 
and just returning it. This caused an issue within the portability 
libraries since they couldn't handle this structure.


To solve this issue, I had proposed that we modify the portability 
library that does fusion to use a whitelist of primitives preventing the 
issue from happening. This solved the problem but caused an issue for 
Thomas as he was relying on this behaviour of PTransforms with zero sub 
transforms being primitives. Thomas has a use-case where he wants to 
expose the internal Flink Kafka and Kinesis connectors and to build 
Apache Beam pipelines that use the Flink native sources/sinks. I'll call 
these "native" PTransforms, since they aren't part of the Apache Beam 
model and are runner specific.


This brings up two topics:
A) What should we do with these "empty" PTransforms?
B) What should we do with "native" PTransforms?

The typical flow of a pipeline representation for a portable pipeline is:
language specific representation -> proto representation -> job service 
-> shared libraries that simplify/replace the proto representation with 
a simplified version (e.g. fusion) -> runner specific conversion to 
native runner concepts (e.g. GBK -> runner implementation of GBK)


--

A) What should we do with these "empty" PTransforms?

To give a little more detail, these transforms typically can happen if 
people have conditional logic such as loops that would perform an 
expansion but do nothing if the condition is immediately unsatisfied. So 
allowing for PTransforms that are empty is useful when building a pipeline.


What should we do:
A1) Stick with the whitelist of primitive PTransforms.
A2) When converting the pipeline from language specific representation 
into the proto representation, drop any "empty" PTransforms. This means 
that the pipeline representation that is sent to the runner doesn't 
contain the offending type of PTransform and the shared libraries 
wouldn't have to change.

A3) Handle the "empty" PTransform case within all of the shared libraries.

I like doing both A1 and A2. A1 since it helps simplify the shared 
libraries since we know the whole list of primitives we need to 
understand and A2 because it removes noise within the pipeline shape 
from its representation.


--

B) What should we do with "native" PTransforms?

Some approaches that we could take as a community:

B1) Prevent the usage of "native" PTransforms within Apache Beam since 
they hurt portability of pipelines across runners. This can be done by 
specifically using whitelists of allowed primitive PTransforms in the 
shared libraries and explicitly not allowing for shared libraries to 
have extension points customizing this.


B2) We embrace that closed systems internal to companies will want to 
use their own extensions and enable support for "native" PTransforms but 
actively discourage "native" PTransforms in the open ecosystem.


B3) We embrace and allow for "native" PTransforms in the open ecosystem.

Re: builds.apache.org refused connections since last night

2018-08-31 Thread Maximilian Michels


Jenkins is up again! (woho!)

On 30.08.18 20:23, Thomas Weise wrote:
I would be concerned with multiple folks running the Jekyll build 
locally to end up with inconsistent results. But if Jenkins stays down 
for longer, then maybe one of us can be the Jenkins substitute :)



On Thu, Aug 30, 2018 at 10:52 AM Boyuan Zhang > wrote:


Hey Thomas,

I guess a comitter can push changes directly into
https://gitbox.apache.org/repos/asf?p=beam-site.git. Maybe can have
a try with a comitter's help.

On Thu, Aug 30, 2018 at 10:34 AM Thomas Weise mailto:t...@apache.org>> wrote:

While Jenkins is down, is there an alternative process to merge
web site changes?


On Wed, Aug 29, 2018 at 9:19 AM Boyuan Zhang mailto:boyu...@google.com>> wrote:

Thank you Andrew!

On Wed, Aug 29, 2018 at 9:17 AM Andrew Pilloud
mailto:apill...@google.com>> wrote:

Down for me too. It sounds like the disk failed and it
will be down for a while: https://status.apache.org/

Andrew

On Wed, Aug 29, 2018 at 9:13 AM Boyuan Zhang
mailto:boyu...@google.com>> wrote:

Hey all,

It seems like that builds.apache.org
 cannot be reached from
last night. Does anyone have the sam e connection
problem? Any idea what can we do?

Boyuan Zhang

Re: Beam Schemas: current status

2018-08-31 Thread Maximilian Michels

Thanks Reuven. That's an OK restriction. Apache Flink also requires 
non-final fields to be able to generate TypeInformation (~=Schema) from 
PoJos.

I agree that it's not very intuitive for Users.

I suppose it would work to assume a constructor with the same parameter 
order as the fields in the class. So if instantiation with the default 
constructor doesn't work, it would try to look up a constructor based on 
the fields of the class.

Perhaps too much magic, having a dedicated interface for construction is 
a more programmatic approach.

-Max

On 30.08.18 16:55, Reuven Lax wrote:

Max,

Nested Pojos are fully supported, as are nested array/collection and map 
types (e.g. if your Pojo contains List).

One limitation right now is that only mutable Pojos are supported. For 
example, the following Pojo would _not_ work, because the fields aren't 
mutable.

public class Pojo {
   public final String field;
}

This is an annoying restriction, because in practice Pojo types often 
have final fields. The reason for the restriction is that the most 
general way to create an instance of this Pojo (after decoding) is to 
instantiate the object and then set the fields one by one (I also assume 
that there's a default constructor).  I can remove this restriction if 
there is an appropriate constructor or builder interface that lets us 
construct the object directly.

Reuven

On Thu, Aug 30, 2018 at 6:51 AM Maximilian Michels <mailto:m...@apache.org>> wrote:

That's a cool feature. Are there any limitations for the schema
inference apart from being a Pojo/Bean? Does it supported nested PoJos,
e.g. "wrapper.field"?

-Max

On 29.08.18 07:40, Reuven Lax wrote:
 > I wanted to send a quick note to the community about the current
status
 > of schema-aware PCollections in Beam. As some might remember we
had a
 > good discussion last year about the design of these schemas,
involving
 > many folks from different parts of the community. I sent a summary
 > earlier this year explaining how schemas has been integrated into
the
 > DoFn framework. Much has happened since then, and here are some
of the
 > highlights.
 >
 >
 > First, I want to emphasize that all the schema-aware classes are
 > currently marked @Experimental. Nothing is set in stone yet, so
if you
 > have questions about any decisions made, please start a discussion!
 >
 >
 >       SQL
 >
 > The first big milestone for schemas was porting all of BeamSQL to
use
 > the framework, which was done in pr/5956. This was a lot of work,
 > exposed many bugs in the schema implementation, but now provides
great
 > evidence that schemas work!
 >
 >
 >       Schema inference
 >
 > Beam can automatically infer schemas from Java POJOs (objects with
 > public fields) or JavaBean objects (objects with getter/setter
methods).
 > Often you can do this by simply annotating the class. For example:
 >
 >
 > @DefaultSchema(JavaFieldSchema.class)
 >
 > publicclassUserEvent{
 >
 > publicStringuserId;
 >
 > publicLatLonglocation;
 >
 > PublicStringcountryCode;
 >
 > publiclongtransactionCost;
 >
 > publicdoubletransactionDuration;
 >
 > publicListtraceMessages;
 >
 > };
 >
 >
 > @DefaultSchema(JavaFieldSchema.class)
 >
 > publicclassLatLong{
 >
 > publicdoublelatitude;
 >
 > publicdoublelongitude;
 >
 > }
 >
 >
 > Beam will automatically infer schemas for these classes! So if
you have
 > a PCollection, it will automatically get the following
schema:
 >
 >
 > UserEvent:
 >
 >   userId: STRING
 >
 >   location: ROW(LatLong)
 >
 >   countryCode: STRING
 >
 >   transactionCost: INT64
 >
 >   transactionDuration: DOUBLE
 >
 >   traceMessages: ARRAY[STRING]]
 >
 >
 > LatLong:
 >
 >   latitude: DOUBLE
 >
 >   longitude: DOUBLE
 >
 >
 > Now it’s not always possible to annotate the class like this (you
may
 > not own the class definition), so you can also explicitly
register this
 > using Pipeline:getSchemaRegistry:registerPOJO, and the same for
JavaBeans.
 >
 >
 >       Coders
 >
 > Beam has a built-in coder for any schema-aware PCollection, largely
 > removing the need for users to care about coders. We generate
low-level
 > bytecode (using ByteBuddy) to implement the coder for each
schema, so
 > these coders are quite performant

Re: Beam Schemas: current status

2018-08-31 Thread Maximilian Michels

Good point with identical types. You definitely want to avoid the following:

class Pojo {
  final String param1;
  final String param2;

  Pojo(String param2, String param1) {
this.param1 = param1;
this.param2 = param2;
  }
}

This would change the Pojo after deserialization. So this should only do 
its magic if there is only one possible way to feed data to the 
constructor. That's why a dedicated interface would be the easier and 
safer way to opt-in.

On 31.08.18 11:27, Robert Bradshaw wrote:
On Fri, Aug 31, 2018 at 11:22 AM Maximilian Michels <mailto:m...@apache.org>> wrote:

Thanks Reuven. That's an OK restriction. Apache Flink also requires
non-final fields to be able to generate TypeInformation (~=Schema) from
PoJos.

I agree that it's not very intuitive for Users.

I suppose it would work to assume a constructor with the same parameter
order as the fields in the class. So if instantiation with the default
constructor doesn't work, it would try to look up a constructor
based on
the fields of the class.

I think this would make a lot of sense, but it would require some 
assumptions (e.g. the declared field order is the same as the 
constructor argument order (and/or the schema order), especially if 
there are fields of the same type). Probably still worth doing, either 
under a more limited set of constraints (all fields are of a different 
type), or as opt-in.

Perhaps too much magic, having a dedicated interface for
construction is
a more programmatic approach.

-Max

On 30.08.18 16:55, Reuven Lax wrote:
 > Max,
 >
 > Nested Pojos are fully supported, as are nested array/collection
and map
 > types (e.g. if your Pojo contains List).
 >
 > One limitation right now is that only mutable Pojos are
supported. For
 > example, the following Pojo would _not_ work, because the fields
aren't
 > mutable.
 >
 > public class Pojo {
 >    public final String field;
 > }
 >
 > This is an annoying restriction, because in practice Pojo types
often
 > have final fields. The reason for the restriction is that the most
 > general way to create an instance of this Pojo (after decoding)
is to
 > instantiate the object and then set the fields one by one (I also
assume
 > that there's a default constructor).  I can remove this
restriction if
 > there is an appropriate constructor or builder interface that
lets us
 > construct the object directly.
 >
 > Reuven
 >
 > On Thu, Aug 30, 2018 at 6:51 AM Maximilian Michels
mailto:m...@apache.org>
 > <mailto:m...@apache.org <mailto:m...@apache.org>>> wrote:
 >
 >     That's a cool feature. Are there any limitations for the schema
 >     inference apart from being a Pojo/Bean? Does it supported
nested PoJos,
 >     e.g. "wrapper.field"?
 >
 >     -Max
 >
 >     On 29.08.18 07:40, Reuven Lax wrote:
 >      > I wanted to send a quick note to the community about the
current
 >     status
 >      > of schema-aware PCollections in Beam. As some might
remember we
 >     had a
 >      > good discussion last year about the design of these schemas,
 >     involving
 >      > many folks from different parts of the community. I sent a
summary
 >      > earlier this year explaining how schemas has been
integrated into
 >     the
 >      > DoFn framework. Much has happened since then, and here are
some
 >     of the
 >      > highlights.
 >      >
 >      >
 >      > First, I want to emphasize that all the schema-aware
classes are
 >      > currently marked @Experimental. Nothing is set in stone
yet, so
 >     if you
 >      > have questions about any decisions made, please start a
discussion!
 >      >
 >      >
 >      >       SQL
 >      >
 >      > The first big milestone for schemas was porting all of
BeamSQL to
 >     use
 >      > the framework, which was done in pr/5956. This was a lot
of work,
 >      > exposed many bugs in the schema implementation, but now
provides
 >     great
 >      > evidence that schemas work!
 >      >
 >      >
 >      >       Schema inference
 >      >
 >      > Beam can automatically infer schemas from Java POJOs
(objects with
 >      > public fields) or JavaBean objects (objects with getter/setter
 >     methods).
 >      > Often you can do this by simply annotating the class. For
example:
 >

Re: Gradle Races in beam-examples-java, beam-runners-apex

2018-09-11 Thread Maximilian Michels

Do we have inotifywait available on Travis and could set it up to log 
concurrent access to the relevant Jar files?


On 10.09.18 22:41, Lukasz Cwik wrote:
I had originally suggested to use some Linux kernel tooling such as 
inotifywait[1] to watch what is happening.


It is likely that we have some Gradle task which is running something in 
parallel to a different Gradle task when it shouldn't which means that 
the jar file is being changed/corrupted. I believe fixing our Gradle 
task dependency tree wrt to this would solve the problem. This crash 
does not reproduce on my desktop after 20 runs which makes it hard for 
me to test for.


1: https://www.linuxjournal.com/content/linux-filesystem-events-inotify

On Mon, Sep 10, 2018 at 1:15 PM Ryan Williams > wrote:


this continues to be an issue locally (cf. some discussion in #beam
slack)

commands like `./gradlew javaPreCommit` or `./gradlew build`
reliably fail with a range of different


JVM crashes


in a few different tasks, with messages that suggest filing a bug
against the Java compiler

what do we know about the actual race condition that is allowing one
task to attempt to read from a JAR that is being overwritten by
another task? presumably this is just a bug in our Gradle configs?

On Mon, Aug 27, 2018 at 2:28 PM Andrew Pilloud mailto:apill...@google.com>> wrote:

It appears that there is no one working on a fix for the flakes,
so I've merged the change to disable parallel tasks on precommit.

Andrew

On Fri, Aug 24, 2018 at 1:30 PM Andrew Pilloud
mailto:apill...@google.com>> wrote:

I'm seeing failures due to this on 12 of the last 16
PostCommits. Precommits take about 22 minutes run in
parallel, so at a 25% pass rate that puts the expected time
to a good test run at 264 minutes assuming you immediately
restart on each failure. We are looking at 56 minutes for a
precommit that isn't run in parallel:
https://builds.apache.org/job/beam_PreCommit_Java_Phrase/266/ I'd
rather have tests take a little longer then have to monitor
them for several hours.

I've opened a PR: https://github.com/apache/beam/pull/6274

Andrew

On Fri, Aug 24, 2018 at 10:47 AM Lukasz Cwik
mailto:lc...@google.com>> wrote:

I believe it would mitigate the issue but also make the
jobs take much longer to complete.

On Thu, Aug 23, 2018 at 2:44 PM Andrew Pilloud
mailto:apill...@google.com>> wrote:

There seems to be a misconfiguration of gradle that
is causing a high rate of failure for the last
several weeks in building beam-examples-java and
beam-runners-apex. It appears to be some sort of
race condition in building dependencies. Given that
no one has made progress on fixing the root cause,
is this something we could mitigate by running jobs
with `--no-parallel` flag?

https://issues.apache.org/jira/browse/BEAM-5035
https://issues.apache.org/jira/browse/BEAM-5207

Andrew

Re: Is there any way to ask the runner to call finalizeCheckpoint() method before it closed the Reader?

2018-10-05 Thread Maximilian Michels

Restoring from a checkpoint is something different. You asked about 
acking pending CheckpointMarks.

If you look in the PubsubUnboundedSource, it doesn't close the client 
connection if the there are still unacked checkpoints. a) Closing the 
reader and b) closing the client connection, these are two different 
actions which do not have to depend on each other.

On 05.10.18 16:06, flyisland wrote:
I've checked the PubsubUnboundedSource, it just throws an exception if 
it's a "restored checkpoint".

Actually, for most MQ system, it's no way to ack a message if the 
Reader(connection) is closed. So it makes no sense to call the 
finalizeCheckpoint() method after closed the Reader.

On Fri, Oct 5, 2018 at 9:01 PM Maximilian Michels <mailto:m...@apache.org>> wrote:

Hi,

Not sure whether I'm a guru but I'll try to answer your question ;)

 > Is there any way to ask the runner to call finalizeCheckpoint()
method before it closed the Reader?
Not that I'm aware of.

The point the comment is trying to make is that CheckpointMark should
not depend on the life cycle of the Reader. So you should find a way to
encode all necessary information in your CheckpointMark to acknowledge
even after `close()` has been called on the Reader.

Perhaps you want to check out out PubsubUnboundedSource for an example
of how to do that?

Cheers,
Max

On 05.10.18 14:47, flyisland wrote:
 > Hi Gurus,
 >
 > I'm building a new IO connector now, and I try to ack messages in
the
 > "UnboundedSource.CheckpointMark.finalizeCheckpoint()" method as
MqttIO
 > and JmsIO did, but I found in the javadoc it said
 >
 >  >  It is NOT safe to assume the UnboundedSource.UnboundedReader
from
 > which this checkpoint was created still exists at the time this
method
 > is called.
 >
 > I do encounter this situation in my testing with the Direct
Runner, the
 > "msg.ack()" method failed when the finalizeCheckpoint() method is
called
 > since the related reader has already been closed!
 >
 > Is there any way to ask the runner to call finalizeCheckpoint()
method
 > before it closed the Reader?

Beam Summit community feedback

2018-10-05 Thread Maximilian Michels


Hi,

What do you think about collecting some of the feedback from the 
community at Beam Summit last week? Here's what I've come across:



* The Kubernetes / Docker Story

Multiple users reported that they would like a Beam-Kubernetes story. 
What is the best way to deploy Beam with Kubernetes? Will there be 
built-in support?


Especially with regards to the portability, there are some unsolved 
problems, e.g. how to start Beam containerized and bootstrap the SDK 
Harness container from within a container? For local testing with the 
JobServer we support that via mounting the Docker socket, but this will 
be too fragile in production scenarios. Now that we have process-based 
execution, we could just use that inside the main container.


Deployment is a very important topic for users and we should try to 
reduce complexity as much as possible.


* External SDKs / Scio

Users have asked why Scio is not part of the main repository. Generally, 
I don't think that has to be the case, same for the Runners which are 
not part of the main repo. However, it does raise the question, what 
will be the future model for maintaining SDKs/IOs/Runners? How do we 
ensure easy development and a consistent quality of internal/external 
components?


* Documenting Timers & State

These two have excellent blog posts but are not part of the official 
documentation. Since they are part of the model, it would be good to 
eventually update the docs.


* Better Debuggability of pipelines

Even a simple WordCount in Beam leads to a quite complex Flink execution 
graph (due to the the involved I/O logic). How can we make pipelines 
easier to understand? Will we provide a way to visualize the 
architecture of high-level Beam pipelines? If so, do we provide a way to 
gain insight into how it is mapped to the Runner execution model? Users 
would like to have more insight.


* Current Roadmap

This was asked in the context of portability. By the end of the year we 
should have at least the FlinkRunner in a ready state, with the rest 
following up. There are a lot of others threads in Beam. The newsletter 
is a great way to keep up with the project development.



Looking forward to any other points you might have.

Best,
Max

Re: Portable Flink runner: Generator source for testing

2018-10-08 Thread Maximilian Michels

This is correct. However, the example code is only part of Lyft's code 
base. Until timer support is done, we would have to do something similar 
in our code base.


On 08.10.18 02:34, Łukasz Gajowy wrote:

Hi,

just to clarify, judging from the above snippets: it seems that we are 
able now to run tests that use a native source for data generation and 
use them in this form until the Timers are supported. When Timers are 
there, we should consider switching to the Impulse + PTransform based 
solution (described above) because it's more portable - the current is 
dedicated to Flink only (which still is really cool). Is this correct or 
am I missing something?


Łukasz

pt., 5 paź 2018 o 14:04 Maximilian Michels <mailto:m...@apache.org>> napisał(a):


Thanks for sharing your setup. You're right that we need timers to
continuously ingest data to the testing pipeline.

Here is the Flink source which generates the data:

https://github.com/mwylde/beam/commit/09c62991773c749bc037cc2b6044896e2d34988a#diff-b2fc8d680d9c1da86ba23345f3bc83d4R42

On 04.10.18 19:31, Thomas Weise wrote:
 > FYI here is an example with native generator for portable Flink
runner:
 >
 > https://github.com/mwylde/beam/tree/micah_memory_leak
 >

https://github.com/mwylde/beam/blob/22f7099b071e65a76110ecc5beda0636ca07e101/sdks/python/apache_beam/examples/streaming_leak.py
 >
 > You can use it to run the portable Flink runner in streaming mode
 > continuously for testing purposes.
 >
 >
 > On Mon, Oct 1, 2018 at 9:50 AM Thomas Weise mailto:t...@apache.org>
 > <mailto:t...@apache.org <mailto:t...@apache.org>>> wrote:
     >
     >
 >
 >     On Mon, Oct 1, 2018 at 8:29 AM Maximilian Michels
mailto:m...@apache.org>
 >     <mailto:m...@apache.org <mailto:m...@apache.org>>> wrote:
 >
 >          > and then have Flink manage the parallelism for stages
 >         downstream from that?@Pablo Can you clarify what you mean
by that?
 >
 >         Let me paraphrase this just to get a clear understanding.
There
 >         are two
 >         approaches to test portable streaming pipelines:
 >
 >         a) Use an Impulse followed by a test PTransform which
generates
 >         testing
 >         data. This is similar to how streaming sources work which
don't
 >         use the
 >         Read Transform. For basic testing this should work, even
without
 >         support
 >         for Timers.
 >
 >
 >     AFAIK this works for bounded sources and batch mode of the Flink
 >     runner (staged execution).
 >
 >     For streaming we need small bundles, we cannot have a Python
ParDo
 >     block to emit records periodically.
 >
 >     (With timers, the ParDo wouldn't block but instead schedule
itself
 >     as needed.)
 >
 >         b) Introduce a new URN which gets translated to a native
 >         Flink/Spark/xy
 >         testing transform.
 >
 >         We should go for a) as this will make testing easier across
 >         portable
 >         runners. We previously discussed native transforms will be an
 >         option in
 >         Beam, but it would be preferable to leave them out of testing
 >         for now.
 >
 >         Thanks,
 >         Max
 >
 >
 >         On 28.09.18 21:14, Thomas Weise wrote:
 >          > Thanks for sharing the link, this looks very promising!
 >          >
 >          > For the synthetic source, if we need a runner native
trigger
 >         mechanism,
 >          > then it should probably just emit an empty byte array like
 >         the impulse
 >          > implementation does, and everything else could be left
to SDK
 >         specific
 >          > transforms that are downstream. We don't have support for
 >         timers in the
 >          > portable Flink runner yet. With timers, there would not be
 >         the need for
 >          > a runner native URN and it could work just like Pablo
described.
 >          >
 >          >
 >          > On Fri, Sep 28, 2018 at 3:09 AM Łukasz Gajowy
 >         mailto:lukasz.gaj...@gmail.com>
<mailto:lukasz.gaj...@gmail.com <mailto:lukasz.gaj...@gmail.com>>
 >          > <mailto:lukasz.gaj...@gmail.com
<mailto:lukasz.gaj...@gmail.com>
 >         <mailto:lukasz.gaj...@gmail.com
<mailto:lukasz.gaj...@gmail.com>>>> wrote:
 >

Re: Is there any way to ask the runner to call finalizeCheckpoint() method before it closed the Reader?

2018-10-08 Thread Maximilian Michels

For the UnboundedSource interface, this depends on the Runner. 
Generally, close() should be called when no more data will be read from 
the Reader. The FlinkRunner calls `close()` on the Readers when it 
closes the operator (see UnboundedSourceWrapper).

The best documentation we have for this are the JavaDoc comments.

Thanks,
Max

On 06.10.18 15:30, flyisland wrote:

Got it, thanks, will double check the PubsubUnboundedSource.

btw, I'd like to learn more about the lifecycle of IO connector(for 
example, when will the runner call the reader's close() method?), could 
you recommend some documents, thanks!

On Fri, Oct 5, 2018 at 11:01 PM Maximilian Michels <mailto:m...@apache.org>> wrote:

Restoring from a checkpoint is something different. You asked about
acking pending CheckpointMarks.

If you look in the PubsubUnboundedSource, it doesn't close the client
connection if the there are still unacked checkpoints. a) Closing the
reader and b) closing the client connection, these are two different
actions which do not have to depend on each other.

On 05.10.18 16:06, flyisland wrote:
 > I've checked the PubsubUnboundedSource, it just throws an
exception if
 > it's a "restored checkpoint".
 >
 > Actually, for most MQ system, it's no way to ack a message if the
 > Reader(connection) is closed. So it makes no sense to call the
 > finalizeCheckpoint() method after closed the Reader.
 >
 > On Fri, Oct 5, 2018 at 9:01 PM Maximilian Michels mailto:m...@apache.org>
 > <mailto:m...@apache.org <mailto:m...@apache.org>>> wrote:
 >
 >     Hi,
 >
 >     Not sure whether I'm a guru but I'll try to answer your
question ;)
 >
 >      > Is there any way to ask the runner to call
finalizeCheckpoint()
 >     method before it closed the Reader?
 >     Not that I'm aware of.
 >
 >     The point the comment is trying to make is that
CheckpointMark should
 >     not depend on the life cycle of the Reader. So you should
find a way to
 >     encode all necessary information in your CheckpointMark to
acknowledge
 >     even after `close()` has been called on the Reader.
 >
 >     Perhaps you want to check out out PubsubUnboundedSource for
an example
 >     of how to do that?
 >
 >     Cheers,
 >     Max
 >
 >     On 05.10.18 14:47, flyisland wrote:
 >      > Hi Gurus,
 >      >
 >      > I'm building a new IO connector now, and I try to ack
messages in
 >     the
 >      > "UnboundedSource.CheckpointMark.finalizeCheckpoint()"
method as
 >     MqttIO
 >      > and JmsIO did, but I found in the javadoc it said
 >      >
 >      >  >  It is NOT safe to assume the
UnboundedSource.UnboundedReader
 >     from
 >      > which this checkpoint was created still exists at the time
this
 >     method
 >      > is called.
 >      >
 >      > I do encounter this situation in my testing with the Direct
 >     Runner, the
 >      > "msg.ack()" method failed when the finalizeCheckpoint()
method is
 >     called
 >      > since the related reader has already been closed!
 >      >
 >      > Is there any way to ask the runner to call
finalizeCheckpoint()
 >     method
 >      > before it closed the Reader?
 >

Re: [DISCUSS] - Separate JIRA notifications to a new mailing list

2018-10-11 Thread Maximilian Michels


+1

I guess most people have already filters in place to separate commits 
and JIRA issues. JIRA really has nothing to do in the commits list.


On 11.10.18 15:53, Kenneth Knowles wrote:

+1

I've suggested the same. Canonical.

On Thu, Oct 11, 2018, 06:19 Thomas Weise > wrote:


+1


On Thu, Oct 11, 2018 at 6:18 AM Etienne Chauchot
mailto:echauc...@apache.org>> wrote:

+1 for me also, my gmail filters list is kind of overflowed :)

Etienne

Le jeudi 11 octobre 2018 à 14:44 +0200, Robert Bradshaw a écrit :

Huge +1 from me too.
On Thu, Oct 11, 2018 at 2:42 PM Jean-Baptiste Onofré mailto:j...@nanthrax.net>> wrote:

+1

We are doing the same in Karaf as well.

Regards
JB

On 11/10/2018 14:35, Colm O hEigeartaigh wrote:
Hi all,

Apologies in advance if this has already been discussed (and rejected).
I was wondering if it would be a good idea to create a new mailing list
and divert the JIRA notifications to it? Currently
"comm...@beam.apache.org   
>" receives both
the git and JIRA notifications, and has a huge volume of traffic as a
result.

Separating JIRA notifications from commit messages would allow users to
subscribe to whichever are of interest without having to write a mail
filter if e.g. they are not interested in JIRA notifications. It also
seems a bit unintuitive to me to expect JIRA notifications to go to an
email list called "commits".

As a reference point - Apache CXF maintains a "commits" list for git
notifications and "issues" for JIRA notifications:

http://cxf.apache.org/mailing-lists.html

Thanks!

Colm.

--
Colm O hEigeartaigh

Talend Community Coder
http://coders.talend.com

--
Jean-Baptiste Onofré
jbono...@apache.org 
http://blog.nanthrax.net
Talend -http://www.talend.com

Re: Beam Samza Runner status update

2018-10-12 Thread Maximilian Michels

Thanks for the updating, Xinyu and Hai! Great to see another Running 
emerging :)


I'm on the FlinkRunner. Looking forward to working together with you to 
make the Beam Runners even better. Particularly, we should sync on the 
portability, as some things are still to be fleshed out. In Flink, we 
are starting to integrate portable State.


Best,
Max

On 11.10.18 05:14, Jesse Anderson wrote:

Interesting

On Wed, Oct 10, 2018, 3:49 PM Kenneth Knowles > wrote:


Welcome, Hai!

On Wed, Oct 10, 2018 at 3:46 PM Hai Lu mailto:lhai...@gmail.com>> wrote:

Hi, all

This is Hai from LinkedIn. As Xinyu mentioned, I have been
working on portable API for Samza runner and made some solid
progress. It's been a very smooth process (although not
effortless for sure) and I'm really grateful for the great
platform that you all have built. I'm very impressed. Bravo!

Excited to work with everyone on Beam. Do expect more questions
from me down the road.

Thanks,
Hai

On Wed, Oct 10, 2018 at 12:36 PM Kenneth Knowles
mailto:k...@apache.org>> wrote:

Clarification: Thomas Groh wrote the fuser, not me!

Thanks for the sharing all this. Really cool.

Kenn

On Wed, Oct 10, 2018 at 11:17 AM Rui Wang mailto:ruw...@google.com>> wrote:

Thanks for sharing! it's so exciting to hear that Beam
is being used on Samza in production @LinkedIn! Your
feedback will be helpful to Beam community!

Besides, Beam supports SQL right now and hopefully Beam
community could also receive feedback on BeamSQL
 in
the future.

-Rui

On Wed, Oct 10, 2018 at 11:10 AM Jean-Baptiste Onofré
mailto:j...@nanthrax.net>> wrote:

Thanks for sharing and congrats for this great work !

Regards
JB
Le 10 oct. 2018, à 20:23, Xinyu Liu mailto:%3Ca>@gmail.com 
target=_blank>xinyuliu.us
@gmail.com > a
écrit:

Hi, All,

It's been over four months since we added the
Samza Runner to Beam, and we've been making a
lot of progress after that. Here I would like to
update your guys and share some really good news
happening here at LinkedIn:

1) First Beam job in production @LInkedIn!
After a few rounds of testing and benchmarking,
we finally rolled out our first Beam job here!
The job uses quite a few features, such as event
time, fixed/session windowing, early triggering,
and stateful processing. Our first customer is
very happy and they highly appraise the
easy-to-use Beam API as well as powerful
processing model. Due to the limited resources
here, we put our full trust in the work you guys
are doing, and we didn't run into any surprises.
We see extremely attention to details as well as
non-compromise in any user experience everywhere
in the code base. We would like to thank
everyone in the Beam community to contribute to
such an amazing framework!

2) A portable Samza Runner prototype
We are also starting the work in making Samza
Runner portable. So far we just got the python
word count example working using portable Samza
Runner. Please look out for the PR for this very
soon :). Again, this work is not possible
without the great Beam portability framework,
and the developers like Luke and Ahmet, just to
name a few, behind it. The ReferenceRunner has
been extremely useful to us to figure out what's
needed and how it works. Kudos to Thomas Groh,
Ben Sidhom and all the others who makes this
available to us. And to Kenn, your fuse work rocks.

3) More contributors in Samza Runner
The runner has been Chris and my personal
project for a while and now it's not the case.
We got Hai Lu

Re: [DISCUSS] Beam public roadmap

2018-10-12 Thread Maximilian Michels

Great idea, Kenn!

How about putting the roadmap in the Confluent wiki? We can link the
page from the web site.

The timeline should not be too specific but should give users an idea of
what to expect.

On 10.10.18 22:43, Romain Manni-Bucau wrote:
What about a link in the menu. It should contain a list of features and
estimate date with probable error (like "in 5 months +- 1 months)
otherwise it does not bring much IMHO.

Le mer. 10 oct. 2018 23:32, Kenneth Knowles > a écrit :

Hi all,

We made an attempt at putting together a sort of roadmap [1] in the
past and also some wide-ranging threads about what could be on it
[2]. and I think we should pick it up again. The description I
really liked was "strategic and user impacting initiatives (ongoing
and future) in an easy to consume format" [3]. It seems that we had
feedback asking for a Roadmap at the London summit [4].

I would like to first focus on meta-questions rather than what would
be on it:

- What style / format should it have to be most useful for users?
- Where should it be presented?

I asked a couple people to try to find the roadmap on the web site,
as a test, and they didn't really know which tab to click on first,
so that's a starting problem. They didn't even find Works In
Progress [5] after clicking Contribute. The level of detail of that
list varies widely.

I'd also love to see hypothetical formats for it, to see how to
balance pithiness with crucial details.

Kenn

[1]

https://lists.apache.org/thread.html/4e1fffa2fde8e750c6d769bf4335853ad05b360b8bd248ad119cc185@%3Cdev.beam.apache.org%3E
[2]

https://lists.apache.org/thread.html/f750f288af8dab3f468b869bf5a3f473094f4764db419567f33805d0@%3Cdev.beam.apache.org%3E
[3]

https://lists.apache.org/thread.html/60d0333fd9e2c7be2f55e33b0d145f2908e3fe645c008636c86e1133@%3Cdev.beam.apache.org%3E
[4]

https://lists.apache.org/thread.html/aa1306da25029dff12a49ba3ce63f2caf6a5f8ba73eda879c8403f3f@%3Cdev.beam.apache.org%3E

[5] https://beam.apache.org/contribute/#works-in-progress

Re: [DISCUSS] - Separate JIRA notifications to a new mailing list

2018-10-15 Thread Maximilian Michels

The list has been created. You can already subscribe to the list by 
sending an empty mail to: issues-subscr...@beam.apache.org

Could we announce the time when we plan to make the switch to the new 
list? Also, the mailing list page needs to be updated: 
https://beam.apache.org/community/contact-us/

On 11.10.18 23:21, Rui Wang wrote:

+1

-Rui

On Thu, Oct 11, 2018 at 12:53 PM Ankur Goenka <mailto:goe...@google.com>> wrote:

+1

On Thu, Oct 11, 2018 at 12:14 PM Suneel Marthi
mailto:suneel.mar...@gmail.com>> wrote:

+1

Sent from my iPhone

On Oct 11, 2018, at 8:03 PM, Łukasz Gajowy
mailto:lukasz.gaj...@gmail.com>> wrote:

This is a good idea. +1

Łukasz

czw., 11 paź 2018, 18:01 użytkownik Udi Meiri
mailto:eh...@google.com>> napisał:

+1 to split JIRA notifications

On Thu, Oct 11, 2018 at 9:13 AM Kenneth Knowles
mailto:k...@apache.org>> wrote:

On Thu, Oct 11, 2018 at 9:10 AM Mikhail Gryzykhin
mailto:gryzykhin.mikh...@gmail.com>> wrote:

+1.
Should we separate Jenkins notifications as well?

I'm worried this question will get buried in the
thread. Would you mind separating it into another
thread if you would like to discuss?

Kenn

On Thu, Oct 11, 2018, 08:59 Scott Wegner
mailto:sc...@apache.org> wrote:

+1, commits@ is too noisy to be useful currently.

    On Thu, Oct 11, 2018 at 8:04 AM Maximilian
Michels mailto:m...@apache.org>> wrote:

+1

I guess most people have already filters
in place to separate commits
and JIRA issues. JIRA really has nothing
to do in the commits list.

On 11.10.18 15:53, Kenneth Knowles wrote:
> +1
>
> I've suggested the same. Canonical.
>
> On Thu, Oct 11, 2018, 06:19 Thomas Weise
mailto:t...@apache.org>
> <mailto:t...@apache.org
<mailto:t...@apache.org>>> wrote:
>
>     +1
>
>
>     On Thu, Oct 11, 2018 at 6:18 AM
Etienne Chauchot
>     mailto:echauc...@apache.org>
<mailto:echauc...@apache.org
<mailto:echauc...@apache.org>>> wrote:
>
>         +1 for me also, my gmail filters
list is kind of overflowed :)
>
>         Etienne
>
>         Le jeudi 11 octobre 2018 à 14:44
+0200, Robert Bradshaw a écrit :
>>         Huge +1 from me too.
>>         On Thu, Oct 11, 2018 at 2:42 PM
Jean-Baptiste Onofré mailto:j...@nanthrax.net>
<mailto:j...@nanthrax.net
<mailto:j...@nanthrax.net>>> wrote:
>>
>>         +1
>>
>>         We are doing the same in Karaf
as well.
>>
>>         Regards
>>         JB
>>
>>         On 11/10/2018 14:35, Colm O
hEigeartaigh wrote:
>>         Hi all,
>>
>>         Apologies in advance if this
has already been discussed (and rejected).
>>         I was wondering if it would be
a good idea to create a new mailing list
>>         and divert the JIRA
notifications to it? Currently
>>         "comm...@beam.apache.org
<mailto:comm...@beam.apache.org>

Re: [BEAM-5442] Store duplicate unknown (runner) options in a list argument

2018-10-15 Thread Maximilian Michels

I agree that the current approach breaks the pipeline options contract 
because "unknown" options get parsed in the same way as options which 
have been defined by the user.


I'm not sure the `experiments` flag works for us. AFAIK it only allows 
true/false flags. We want to pass all types of pipeline options to the 
Runner.


How to solve this?

1) Add all options of all Runners to each SDK
We added some of the FlinkRunner options to the Python SDK but realized 
syncing is rather cumbersome in the long term. However, we want the most 
important options to be validated on the client side.


2) Pass "unknown" options via a separate list in the Proto which can 
only be accessed internally by the Runners. This still allows passing 
arbitrary options but we wouldn't leak unknown options and display them 
as top-level options.


-Max

On 13.10.18 02:34, Charles Chen wrote:
The current release branch 
(https://github.com/apache/beam/commits/release-2.8.0) was cut after the 
revert went in.  Sent out https://github.com/apache/beam/pull/6683 as a 
revert of the revert.  Regarding your comment above, I can help out with 
the design / PR reviews for common Python code as you suggest.


On Fri, Oct 12, 2018 at 4:48 PM Thomas Weise > wrote:


Thanks, will tag you and looking forward to feedback so we can
ensure that changes work for everyone.

Looking at the PR, I see agreement from Max to revert the change on
the release branch, but not in master. Would you mind to restore it
in master?

Thanks

On Fri, Oct 12, 2018 at 4:40 PM Ahmet Altay mailto:al...@google.com>> wrote:



On Fri, Oct 12, 2018 at 11:31 AM, Charles Chen mailto:c...@google.com>> wrote:

What I mean is that a user may find that it works for them
to pass "--myarg blah" and access it as "options.myarg"
without explicitly defining a "my_arg" flag due to the added
logic.  This is not the intended behavior and we may want to
change this implementation detail in the future.  However,
having this logic in a released version makes it hard to
change this behavior since users may erroneously depend on
this undocumented behavior.  Instead, we should namespace /
scope this so that it is obvious that this is meant for
runner (and not Beam user) consumption.

On Fri, Oct 12, 2018 at 10:48 AM Thomas Weise
mailto:t...@apache.org>> wrote:

Can you please elaborate more what practical problems
this introduces for users?

I can see that this change allows a user to specify a
runner specific option, which in the future may change
because we decide to scope differently. If this only
affects users of the portable Flink runner (like us),
then no need to revert, because at this early stage we
prefer something that works over being blocked.

It would also be really great if some of the core Python
SDK developers could help out with the design aspects
and PR reviews of changes that affect common Python
code. Anyone who specifically wants to be tagged on
relevant JIRAs and PRs?


I would be happy to be tagged, and I can also help with
including other relevant folks whenever possible. In general I
think Robert, Charles, myself are good candidates.


Thanks


On Fri, Oct 12, 2018 at 10:20 AM Ahmet Altay
mailto:al...@google.com>> wrote:



On Fri, Oct 12, 2018 at 10:11 AM, Charles Chen
mailto:c...@google.com>> wrote:

For context, I made comments on
https://github.com/apache/beam/pull/6600 noting
that the changes being made were not good for
Beam backwards-compatibility.  The change as is
allows users to use pipeline options without
explicitly defining them, which is not the type
of usage we would like to encourage since we
prefer to be explicit whenever possible.  If
users write pipelines with this sort of pattern,
they will potentially encounter pain when
upgrading to a later version since this is an
implementation detail and not an officially
supported pattern.  I agree with the comments
above that this is ultimately a scoping issue. 
I would not have a problem with these changes if

they were explicitly scoped under either a
runner or unparsed options namespace.

Re: [DISCUSS] Separate Jenkins notifications to a new mailing list

2018-10-16 Thread Maximilian Michels

+1 I can switch of all my filters then, and people new here will be less 
overwhelmed by email.


On 16.10.18 12:46, Alexey Romanenko wrote:

+1

On 16 Oct 2018, at 00:02, Chamikara Jayalath > wrote:


+1 for new lists.

Thanks,
Cham

On Mon, Oct 15, 2018 at 12:09 PM Ismaël Mejía > wrote:


+1
On Mon, Oct 15, 2018 at 8:14 PM Mikhail Gryzykhin
mailto:mig...@google.com>> wrote:
>
> +1 also suggested that in another thread.
>
> --Mikhail
>
> Have feedback?
>
>
> On Mon, Oct 15, 2018 at 11:10 AM Rui Wang mailto:ruw...@google.com>> wrote:
>>
>> +1
>>
>> Agree that people might be only interested in JIRA activities
but not in the build.
>>
>>
>> -Rui
>>
>> On Mon, Oct 15, 2018 at 10:27 AM Andrew Pilloud
mailto:apill...@google.com>> wrote:
>>>
>>> +1. A bunch of Jenkins spam goes to dev as well. Would be good
to it all to a new list.
>>>
>>> Andrew
>>>
>>> On Mon, Oct 15, 2018 at 10:00 AM Colm O hEigeartaigh
mailto:cohei...@apache.org>> wrote:

 As a rationale, some users might be interested in seeing JIRA
activity but might not care about whether the build is broken or
not :-)

 Colm.

 On Mon, Oct 15, 2018 at 5:56 PM Kenneth Knowles
mailto:k...@apache.org>> wrote:
>
> Separating this out from the Jira notification thread.
>
> Colm suggests that we also separate build notifications.
>
> WDYT?
>
> Kenn



 --
 Colm O hEigeartaigh

 Talend Community Coder
 http://coders.talend.com

Re: [BEAM-5442] Store duplicate unknown (runner) options in a list argument

2018-10-16 Thread Maximilian Michels

er if we should use urns for namespacing, and
 >> assigning semantic meaning to strings, here.
 >>
 >> > 4) we use a string which must be a specific format such as
JSON (allows the SDK to do simple validation):
 >> > --runner_option='{"blah": "bar", "foo": ["baz1", "baz2"]}'
 >>
 >> I like this in that at least some validation can be
performed, and
 >> expectations of how to format richer types. On the other
hand it gets
 >> a bit verbose, given that most (I'd imagine) options will be
simple.
 >> As with normal options,
 >>
 >>     --option1=value1 --option2=value2
 >>
 >> is shorthand for {"option1": value1, "option2": value2}.
 >>
 > I lean to 4 the most. With 2, you run into issues of what
does --runner_option=foo=["a", "b"] --runner_option=foo=["c",
"d"] mean?
 > Is it an error or list of lists or concatenated. Similar
issues for map types represented via JSON object {...}

We can err to be on the safe side unless/until an argument can
be made
that merging is more natural. I just think this will be excessively
verbose to use.

 >> > I would strongly suggest that we go with the "fetch"
approach, since this makes the set of options discoverable and
helps users find errors much earlier in their pipeline.
 >>
 >> This seems like an advanced feature that SDKs may want to
support, but
 >> I wouldn't want to require this complexity for bootstrapping
an SDK.
 >>
 > SDKs that are starting off wouldn't need to "fetch" options,
they could choose to not support runner options or they could
choose to pass all options through to the runner blindly.
Fetching the options only provides the SDK the ability to
provide error checking upfront and useful error/help messages.

But how to even pass all options through blindly is exactly the
difficulty we're running into here.

 >> Regarding always keeping runner options separate, +1, though
I'm not
 >> sure the line is always clear.
 >>
 >>
 >> > On Mon, Oct 15, 2018 at 8:04 AM Robert Bradshaw
mailto:rober...@google.com>> wrote:
 >> >>
 >> >> On Mon, Oct 15, 2018 at 3:58 PM Maximilian Michels
mailto:m...@apache.org>> wrote:
 >> >> >
 >> >> > I agree that the current approach breaks the pipeline
options contract
 >> >> > because "unknown" options get parsed in the same way as
options which
 >> >> > have been defined by the user.
 >> >>
 >> >> FWIW, I think we're already breaking this "contract."
Unknown options
 >> >> are silently ignored; with this change we just change how
we record
 >> >> them. It still feels a bit hacky though.
 >> >>
 >> >> > I'm not sure the `experiments` flag works for us. AFAIK
it only allows
 >> >> > true/false flags. We want to pass all types of pipeline
options to the
 >> >> > Runner.
 >> >>
 >> >> Experiments is an arbitrary set of strings, which can be
of the form
 >> >> "param=value" if that's useful. (Dataflow does this.)
There is, again,
 >> >> no namespacing on the param names, but we could user urns
or impose
 >> >> some other structure here.
 >> >>
 >> >> > How to solve this?
 >> >> >
 >> >> > 1) Add all options of all Runners to each SDK
 >> >> > We added some of the FlinkRunner options to the Python
SDK but realized
 >> >> > syncing is rather cumbersome in the long term. However,
we want the most
 >> >> > important options to be validated on the client side.
 >> >>
 >> >> I don't think this is sustainable in the long run.
However, thinking
 >> >> about this, in the worse case validation happens after
construction
 >> >> but before execution (as with much of our other
vali

Re: a new contributor


Hi Heejong,

Thanks for introducing yourself! Welcome :)

-Max

On 19.10.18 21:45, Ankur Goenka wrote:

Welcome Heejong!

On Fri, Oct 19, 2018 at 12:27 PM Rui Wang > wrote:


Welcome!

-Rui

On Fri, Oct 19, 2018 at 11:55 AM Robin Qiu mailto:robi...@google.com>> wrote:

Welcome, Heejong!

On Fri, Oct 19, 2018 at 11:55 AM Ahmet Altay mailto:al...@google.com>> wrote:

Welcome!

On Fri, Oct 19, 2018 at 11:48 AM, Heejong Lee
mailto:heej...@google.com>> wrote:

Hi,

I just wanted to introduce myself as a new contributor.
I'm a new member of Apache Beam team at Google and will
be working on IO modules. Happy to meet you all!

Thanks,
Heejong

Re: [DISCUSS] Move beam_SeedJob notifications to another email address


Hi Rui,

The seed job being broken is sort of a big deal because it prevents 
updates to our Jenkins jobs. However, it doesn't stop the existing test 
configurations from running. I haven't found the mails annoying but I'm 
ok with moving them to the builds@ list.


-Max

On 22.10.18 11:05, Colm O hEigeartaigh wrote:
We had a discussion recently about splitting the JIRA notifications to a 
new list "issues@b.a.o", and also splitting the Jenkins mails to 
(potentially) "builds@b.a.o". So I guess the issue you raised can be 
done once the latter mailing list is set up and active.


Colm.

On Fri, Oct 19, 2018 at 11:29 PM Rui Wang > wrote:


Hi Community,

I have seen some Jenkins build failure/back-to-normal emails in dev@
in last several months. Seems to me that this setting is coded in

https://github.com/apache/beam/blob/master/.test-infra/jenkins/job_00_seed.groovy#L100.

In the link above, the comment says the seed job is very important
so the notification emails should be sent to dev@.

I am wondering if this is still true that we always want to see such
notifications in dev@? If such notifications have become spams to
dev@, can we move it to either commits@ or another dedicated
email address (maybe create a new one)?

-Rui



--
Colm O hEigeartaigh

Talend Community Coder
http://coders.talend.com

Re: [ANNOUNCE] New committers, October 2018


Congrats Ankur and Xinyu!

On 19.10.18 21:27, Rui Wang wrote:

Congrats and thanks for your contributions!

-Rui

On Fri, Oct 19, 2018 at 11:55 AM Ahmet Altay > wrote:


Congratulations to both of you! :)

On Fri, Oct 19, 2018 at 11:52 AM, Robin Qiu mailto:robi...@google.com>> wrote:

Congrats, Xinyu and Ankur!

On Fri, Oct 19, 2018 at 11:51 AM Daniel Oliveira
mailto:danolive...@google.com>> wrote:

Congratulations!

On Fri, Oct 19, 2018 at 8:27 AM Thomas Weise mailto:t...@apache.org>> wrote:

Congrats!


On Fri, Oct 19, 2018 at 7:24 AM Ismaël Mejía
mailto:ieme...@gmail.com>> wrote:

Congratulations guys and welcome !
On Fri, Oct 19, 2018 at 4:12 PM Jean-Baptiste Onofré
mailto:j...@nanthrax.net>> wrote:
 >
 > Congrats and welcome aboard !
 >
 > Regards
 > JB
 >
 > On 19/10/2018 16:09, Kenneth Knowles wrote:
 > > Hi all,
 > >
 > > Hot on the tail of the summer announcement
comes our pre-Hallowe'en
 > > celebration.
 > >
 > > Please join me and the rest of the Beam PMC in
welcoming the following
 > > new committers:
 > >
 > >  - Xinyu Liu, author/maintainer of the Samza runner
 > >  - Ankur Goenka, major contributor to
portability efforts
 > >
 > > And, as before, while I've noted some areas of
contribution for each,
 > > most important is that they are a valued part
of our Beam community that
 > > the PMC trusts with the responsibilities of a
Beam committer [1].
 > >
 > > A big thanks to both for their contributions.
 > >
 > > Kenn
 > >
 > > [1]

https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
 >
 > --
 > Jean-Baptiste Onofré
 > jbono...@apache.org 
 > http://blog.nanthrax.net
 > Talend - http://www.talend.com

Re: [DISCUSS] Move beam_SeedJob notifications to another email address

Ah good point. That's why we were frequently seeing these mails. Would 
be nice to be able to test Jenkins DSL changes without generating an 
email to a mailing list.

On 22.10.18 17:25, Kenneth Knowles wrote:
Another thing about the seed job: for jobs like 'Java Postcommit" there 
is a separate job for whether it is run as an actual postcommit or 
whether it is run against a PR. The seed job has not been split this 
way. So when someone is testing the seed job on a PR failures look the 
same as if it is broken on master.

Kenn

On Mon, Oct 22, 2018 at 2:15 AM Maximilian Michels <mailto:m...@apache.org>> wrote:

Hi Rui,

The seed job being broken is sort of a big deal because it prevents
updates to our Jenkins jobs. However, it doesn't stop the existing test
configurations from running. I haven't found the mails annoying but I'm
ok with moving them to the builds@ list.

-Max

On 22.10.18 11:05, Colm O hEigeartaigh wrote:
 > We had a discussion recently about splitting the JIRA
notifications to a
 > new list "issues@b.a.o", and also splitting the Jenkins mails to
 > (potentially) "builds@b.a.o". So I guess the issue you raised can be
 > done once the latter mailing list is set up and active.
 >
 > Colm.
 >
 > On Fri, Oct 19, 2018 at 11:29 PM Rui Wang mailto:ruw...@google.com>
 > <mailto:ruw...@google.com <mailto:ruw...@google.com>>> wrote:
 >
 >     Hi Community,
 >
 >     I have seen some Jenkins build failure/back-to-normal emails
in dev@
 >     in last several months. Seems to me that this setting is coded in
 >

https://github.com/apache/beam/blob/master/.test-infra/jenkins/job_00_seed.groovy#L100.
 >
 >     In the link above, the comment says the seed job is very
important
 >     so the notification emails should be sent to dev@.
 >
 >     I am wondering if this is still true that we always want to
see such
 >     notifications in dev@? If such notifications have become spams to
 >     dev@, can we move it to either commits@ or another dedicated
 >     email address (maybe create a new one)?
 >
 >     -Rui
 >
 >
 >
 > --
 > Colm O hEigeartaigh
 >
 > Talend Community Coder
 > http://coders.talend.com

Re: Python docs build error


Correction for the footnote:

[1] https://github.com/apache/beam/pull/6637

On 22.10.18 15:24, Maximilian Michels wrote:

Hi Colm,

This [1] got merged recently and broke the "docs" target which 
apparently is not part of our Python PreCommit tests.


See the following PR for a fix: https://github.com/apache/beam/pull/6774

Best,
Max

[1] https://github.com/apache/beam/pull/6737

On 22.10.18 12:55, Colm O hEigeartaigh wrote:

Hi all,

The following command: ./gradlew :beam-sdks-python:docs gives me the 
following error:


/home/coheig/src/apache/beam/sdks/python/apache_beam/io/flink/flink_streaming_impulse_source.py:docstring 
of 
apache_beam.io.flink.flink_streaming_impulse_source.FlinkStreamingImpulseSource.from_runner_api_parameter:11: 
WARNING: Unexpected indentation.

Command exited with non-zero status 1
42.81user 4.02system 0:16.27elapsed 287%CPU (0avgtext+0avgdata 
141036maxresident)k

0inputs+47792outputs (0major+727274minor)pagefaults 0swaps
ERROR: InvocationError for command '/usr/bin/time 
/home/coheig/src/apache/beam/sdks/python/scripts/generate_pydoc.sh' 
(exited with code 1)
___ summary 


ERROR:   docs: commands failed

 > Task :beam-sdks-python:docs FAILED

FAILURE: Build failed with an exception.

Am I missing something or is there an issue here?

Thanks,

Colm.


--
Colm O hEigeartaigh

Talend Community Coder
http://coders.talend.com

Re: [DISCUSS] Publish vendored dependencies independently

+1 for publishing vendored Jars independently. It will improve build 
time and ease IntelliJ integration.


Flink also publishes shaded dependencies separately:

- https://github.com/apache/flink-shaded
- https://issues.apache.org/jira/browse/FLINK-6529

AFAIK their main motivation was to get rid of duplicate shaded classes 
on the classpath. We don't appear to have that problem because we 
already have a separate "vendor" project.



 - With shading, it is hard (impossible?) to step into dependency code in 
IntelliJ's debugger, because the actual symbol at runtime does not match what 
is in the external jars


This would be solved by releasing the sources of the shaded jars. From a 
legal perspective, this could be problematic as alluded to here: 
https://github.com/apache/flink-shaded/issues/25


-Max

On 20.10.18 01:11, Lukasz Cwik wrote:
I have tried several times to improve the build system and intellij 
integration and each attempt ended with little progress when dealing 
with vendored code. My latest attempt has been the most promising where 
I take the vendored classes/jars and decompile them generating the 
source that Intellij can then use. I have a branch[1] that demonstrates 
the idea. It works pretty well (and up until a change where we started 
vendoring gRPC, was impractical to do. Instructions to try it out are:


// Clean up any remnants of prior builds/intellij projects
git clean -fdx
// Generated the source for vendored/shaded modules
./gradlew decompile

// Remove the "generated" Java sources for protos so they don't conflict with 
the decompiled sources.
rm -rf model/pipeline/build/generated/source/proto
rm -rf model/job-management/build/generated/source/proto
rm -rf model/fn-execution/build/generated/source/proto
// Import the project into Intellij, most code completion now works still some 
issues with a few classes.
// Note that the Java decompiler doesn't generate valid source so still need to 
delegate to Gradle for build/run/test actions
// Other decompilers may do a better/worse job but haven't tried them.


The problems that I face are that the generated Java source from the 
protos and the decompiled source from the compiled version of that 
source post shading are both being imported as content roots and then 
conflict. Also, the CFR decompiler isn't producing valid source, if 
people could try others and report their mileage, we may find one that 
works and then we would be able to use intellij to build/run our code 
and not need to delegate all our build/run/test actions to Gradle.


After all these attempts I have done, vendoring the dependencies outside 
of the project seems like a sane approach and unless someone wants to 
take a stab at the best progress I have made above, I would go with what 
Kenn is suggesting even though it will mean that we will need to perform 
releases every time we want to change the version of one of our vendored 
dependencies.


1: https://github.com/lukecwik/incubator-beam/tree/intellij


On Fri, Oct 19, 2018 at 10:43 AM Kenneth Knowles > wrote:


Another reason to push on this is to get build times down. Once only
generated proto classes use the shadow plugin we'll cut the build
time in ~half? And there is no reason to constantly re-vendor.

Kenn

On Fri, Oct 19, 2018 at 10:39 AM Kenneth Knowles mailto:k...@google.com>> wrote:

Hi all,

A while ago we had pretty good consensus that we should vendor
("pre-shade") specific dependencies, and there's start on it
here: https://github.com/apache/beam/tree/master/vendor

IntelliJ notes:

  - With shading, it is hard (impossible?) to step into
dependency code in IntelliJ's debugger, because the actual
symbol at runtime does not match what is in the external jars


Intellij can step through the classes if they were published outside the 
project since it can decompile them. The source won't be legible. 
Decompiling the source as in my example effectively shows the same issue.


  - With vendoring, if the vendored dependencies are part of the
project then IntelliJ gets confused because it operates on
source, not the produced jars


Yes, I tried several ways to get intellij to ignore the source and use 
the output jars but no luck.


The second one has a quick fix for most cases*: don't make the
vendored dep a subproject, but just separately build and publish
it. Since a vendored dependency should change much more
infrequently (or if we bake the version into the name, ~never)
this means we publish once and save headache and build time forever.

WDYT? Have I overlooked something? How about we set up vendored
versions of guava, protobuf, grpc, and publish them. We don't
have to actually start using them yet, and can do it incrementally.


Currently we are relocating code depending on the version string. If the

Re: Python docs build error


Hi Colm,

This [1] got merged recently and broke the "docs" target which 
apparently is not part of our Python PreCommit tests.


See the following PR for a fix: https://github.com/apache/beam/pull/6774

Best,
Max

[1] https://github.com/apache/beam/pull/6737

On 22.10.18 12:55, Colm O hEigeartaigh wrote:

Hi all,

The following command: ./gradlew :beam-sdks-python:docs gives me the 
following error:


/home/coheig/src/apache/beam/sdks/python/apache_beam/io/flink/flink_streaming_impulse_source.py:docstring 
of 
apache_beam.io.flink.flink_streaming_impulse_source.FlinkStreamingImpulseSource.from_runner_api_parameter:11: 
WARNING: Unexpected indentation.

Command exited with non-zero status 1
42.81user 4.02system 0:16.27elapsed 287%CPU (0avgtext+0avgdata 
141036maxresident)k

0inputs+47792outputs (0major+727274minor)pagefaults 0swaps
ERROR: InvocationError for command '/usr/bin/time 
/home/coheig/src/apache/beam/sdks/python/scripts/generate_pydoc.sh' 
(exited with code 1)
___ summary 


ERROR:   docs: commands failed

 > Task :beam-sdks-python:docs FAILED

FAILURE: Build failed with an exception.

Am I missing something or is there an issue here?

Thanks,

Colm.


--
Colm O hEigeartaigh

Talend Community Coder
http://coders.talend.com

Re: [ANNOUNCE] New committers & PMC members, Summer 2018 edition


Great to see the community growing!

On 16.10.18 18:20, Scott Wegner wrote:
Congrats all! And thanks Kenn and the PMC for recognizing these 
contributions.


On Mon, Oct 15, 2018 at 9:45 AM Kenneth Knowles > wrote:


Hi all,

Since our last announcement in May, we have added many more
committers and a new PMC member. Some of these may have been in the
monthly newsletter or mentioned elsewhere, but I wanted to be sure
to have a loud announcement on the list directly.

Please join me in belatedly welcoming...

New PMC member: Thomas Weise
  - Author of the ApexRunner, the first additional runner after Beam
began incubation.
  - Recently heavily involved in Python-on-Flink efforts.
  - Outside his contributions to Beam, Thomas is PMC chair for
Apache Apex.

New committers:

  - Charles Chen, longtime contributor to Python SDK, Python direct
runner, state & timers
  - Łukasz  Gajowy, testing infrastructure, benchmarks, build system
improvements
  - Anton Kedin, contributor to SQL and schemas, helper on StackOverflow
  - Andrew Pilloud, contributor to SQL, very active on dev@, infra
and release help
  - Tim Robertson, contributor to many IOs, major code health work
  - Alexey Romanenko, contributor to many IOs, Nexmark benchmarks
  - Henning Rohde, contributor to Go SDK, incl. ip fun, and
portability protos and design
  - Scott Wegner, one of our longest contributors, major infra
improvements

And while I've noted some areas of contribution for each, most
importantly everyone on this list is a valued member of the Beam
community that the PMC trusts with the responsibilities of a Beam
committer [1].

A big thanks to all for their contributions.

Kenn

[1]

https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer



--




Got feedback? tinyurl.com/swegner-feedback

Re: [PROPOSAL] allow the users to anticipate the support of features in the targeted runner.

This is a good idea. It needs to be fleshed out how the capability of a 
Runner would be visible to the user (apart from the compatibility matrix).


A dry-run feature would be useful, i.e. the user can run an inspection 
on the pipeline to see if it contains any features which are not 
supported by the Runner.


On 17.10.18 00:03, Rui Wang wrote:

Sounds like a good idea.

Sounds like while coding, user gets a list to show if a feature is 
supported on different runners. User can check the list for the answer. 
Is my understanding correct? Will this approach become slow as number of 
runner grows? (it's just a question as I am not familiar the performance 
of combination of a long list, annotation and IDE)



-Rui

On Sat, Oct 13, 2018 at 11:56 PM Reuven Lax > wrote:


Sounds like a good idea. I don't think it will work for all
capabilities (e.g. some of them such as "exactly once" apply to all
of the API surface), but useful for the ones that we can capture.

On Thu, Oct 4, 2018 at 2:43 AM Etienne Chauchot
mailto:echauc...@apache.org>> wrote:

Hi guys,
As part of our user experience improvement to attract new Beam
users, I would like to suggest something:

Today we only have the capability matrix to inform users about
features support among runners. But, they might discover only
when the pipeline runs, when they receive an exception, that a
given feature is not supported by the targeted runner.
I would like to suggest to translate the capability matrix into
the API with annotations for example, so that, while coding, the
user could know that, for now, a given feature is not supported
on the runner he targets.

I know that the runner is only specified at pipeline runtime,
and that adding code would be a leak of runner implementation
and against portability. So it could be just informative
annotations like @Experimental for example with no annotation
processor.

WDYT?

Etienne

Re: [PROPOSAL] allow the users to anticipate the support of features in the targeted runner.

2018-10-18 Thread Maximilian Michels

Plugins for IDEs would be amazing because they could provide feedback 
already during pipeline construction, but I'm not sure about the effort 
required to develop/maintain such plugins.


Ultimately, Runners have to decide whether they can translate the given 
pipeline or not. So I'm leaning more towards an approach to make 
intermediate checking of the pipeline translation easier, e.g. by


- providing the target Runner already during development
- running check of the Runner alongside with the DirectRunner (which is 
typically used when developing pipelines)


On 17.10.18 15:57, Etienne Chauchot wrote:

Hey Max, Kenn,

Thanks for your feedback !

Yes the idea was to inform the user as soon as possible, ideally while 
coding the pipeline. It could be done with a IDE plugin (like 
checkstyle) that is configured with the targeted runner; that way the 
targeted runner conf is not part of the pipeline code in an annotation 
which would be against Beam portability philosophy. Such a plugin could 
color unsupported features while coding.


Another way could be to implement it as a javadoc but it seems weak 
because not automatic enough.
Another way could be to implement it as a validation plugin in the build 
system but IMHO it is already too late for the user.


So, long story short, I'm more in favor of an IDE plugin or similar 
coding-time solution.


Best
Etienne

Le mercredi 17 octobre 2018 à 12:11 +0200, Maximilian Michels a écrit :

This is a good idea. It needs to be fleshed out how the capability of a
Runner would be visible to the user (apart from the compatibility matrix).

A dry-run feature would be useful, i.e. the user can run an inspection
on the pipeline to see if it contains any features which are not
supported by the Runner.

On 17.10.18 00:03, Rui Wang wrote:
Sounds like a good idea.

Sounds like while coding, user gets a list to show if a feature is
supported on different runners. User can check the list for the answer.
Is my understanding correct? Will this approach become slow as number of
runner grows? (it's just a question as I am not familiar the performance
of combination of a long list, annotation and IDE)


-Rui

On Sat, Oct 13, 2018 at 11:56 PM Reuven Lax mailto:re...@google.com>  
<mailto:re...@google.com <mailto:re...@google.com>>> wrote:


 Sounds like a good idea. I don't think it will work for all
 capabilities (e.g. some of them such as "exactly once" apply to all
 of the API surface), but useful for the ones that we can capture.

 On Thu, Oct 4, 2018 at 2:43 AM Etienne Chauchot
 mailto:echauc...@apache.org>  <mailto:echauc...@apache.org 
<mailto:echauc...@apache.org>>> wrote:

 Hi guys,
 As part of our user experience improvement to attract new Beam
 users, I would like to suggest something:

 Today we only have the capability matrix to inform users about
 features support among runners. But, they might discover only
 when the pipeline runs, when they receive an exception, that a
 given feature is not supported by the targeted runner.
 I would like to suggest to translate the capability matrix into
 the API with annotations for example, so that, while coding, the
 user could know that, for now, a given feature is not supported
 on the runner he targets.

 I know that the runner is only specified at pipeline runtime,
 and that adding code would be a leak of runner implementation
 and against portability. So it could be just informative
 annotations like @Experimental for example with no annotation
 processor.

 WDYT?

 Etienne

Re: [PROPOSAL] allow the users to anticipate the support of features in the targeted runner.

2018-10-18 Thread Maximilian Michels

Similar to how we have `validate()` on the Pipeline to check the 
pipeline specification, dry-run would check the pipeline translation and 
report errors back to the user.


Assuming that Runners throw errors for unsupported features, that would 
already give users confidence that they will be able to run their 
pipelines with a specific Runner.


On 17.10.18 15:28, Robert Bradshaw wrote:
On Wed, Oct 17, 2018 at 3:17 PM Kenneth Knowles <mailto:k...@apache.org>> wrote:


On Wed, Oct 17, 2018 at 3:12 AM Maximilian Michels mailto:m...@apache.org>> wrote:

A dry-run feature would be useful, i.e. the user can run an
inspection
on the pipeline to see if it contains any features which are not
supported by the Runner.


This seems extremely useful independent of an annotation processor
(which also seems useful), and pretty easy to get done quickly.


+1, this would be very useful. (It could also be useful for cheaper 
testing of the dataflow, or other non-local, runners.)


On 17.10.18 00:03, Rui Wang wrote:
 > Sounds like a good idea.
 >
 > Sounds like while coding, user gets a list to show if a
feature is
 > supported on different runners. User can check the list for
the answer.
 > Is my understanding correct? Will this approach become slow
as number of
 > runner grows? (it's just a question as I am not familiar the
performance
 > of combination of a long list, annotation and IDE)
 >
 >
 > -Rui
 >
 > On Sat, Oct 13, 2018 at 11:56 PM Reuven Lax mailto:re...@google.com>
 > <mailto:re...@google.com <mailto:re...@google.com>>> wrote:
 >
 >     Sounds like a good idea. I don't think it will work for all
 >     capabilities (e.g. some of them such as "exactly once"
apply to all
 >     of the API surface), but useful for the ones that we can
capture.
 >
 >     On Thu, Oct 4, 2018 at 2:43 AM Etienne Chauchot
 >     mailto:echauc...@apache.org>
<mailto:echauc...@apache.org <mailto:echauc...@apache.org>>> wrote:
 >
 >         Hi guys,
 >         As part of our user experience improvement to attract
new Beam
 >         users, I would like to suggest something:
 >
 >         Today we only have the capability matrix to inform
users about
 >         features support among runners. But, they might
discover only
 >         when the pipeline runs, when they receive an
exception, that a
 >         given feature is not supported by the targeted runner.
 >         I would like to suggest to translate the capability
matrix into
 >         the API with annotations for example, so that, while
coding, the
 >         user could know that, for now, a given feature is not
supported
 >         on the runner he targets.
 >
 >         I know that the runner is only specified at pipeline
runtime,
 >         and that adding code would be a leak of runner
implementation
 >         and against portability. So it could be just informative
 >         annotations like @Experimental for example with no
annotation
 >         processor.
 >
 >         WDYT?
 >
 >         Etienne
 >

Re: [Call for items] October Beam Newsletter

2018-10-16 Thread Maximilian Michels


Hi Rose,

A bit late but since the newsletter does not seem to be out yet, I added 
some items for the Portable Flink Runner.


Cheers,
Max

On 08.10.18 18:59, Rose Nguyen wrote:

Hi Beamers:

So much has been going on that it's time to sync up again in the October 
Beam Newsletter [1]! :)


*Add the highlights from September to now (or planned events and talks) 
that you want to share with the community by 10/14 11:59 p.m. **PDT.*

*
*
We will collect the notes via Google docs but send out the final version 
directly to the user mailing list. If you do not know how to format 
something, it is OK to just put down the info and I will edit. I'll ship 
out the newsletter on 10/15.


[1] 
https://docs.google.com/document/d/1KWk-pgq0_UR8PrJFstuRPb-dYtW4WspBwMJhfEPGYIM


--
Rose Thị Nguyễn

Re: Integrating Stateful DoFns from the Python SDK


Type hints turn out to be not so predictable:

1) WORKS
  p | beam.Impulse() \
| beam.ParDo(MyCreate()).with_output_types(typehints.KV[K, V]) \
| "statefulParDo" >> beam.ParDo(AddIndex())

2) DOES NOT (no KvCoder)
  p | beam.Create(inputs).with_output_types(typehints.KV[K, V]) \
| "statefulParDo" >> beam.ParDo(AddIndex())

Do you know a way to make 2) work, i.e. set the KvCoder for the Create?


In the first example, the Create runs in a ParDo, in the second example
On 17.10.18 15:34, Maximilian Michels wrote:
Thanks Robert. I was able to get it working by adding this to the 
transform before my stateful DoFn:


   .with_output_types(typehints.KV[K, V])

For some reason `.with_input_types(typehints.KV[K, V])` on my stateful 
DoFn did not work.


Until we enforce KV during pipeline construction, we will have to throw 
an informative exception in the Runner.


On 17.10.18 15:03, Robert Bradshaw wrote:
Yes, we should be enforcing keyness (and use of KeyCoder with) 
stateful DoFns, similar to what we do for GBKs. See e.g. 
https://github.com/apache/beam/pull/6304#issuecomment-421935375


(This possibly relates to a long-standing issue that the coder 
inference should be moved up into construction, or at least before we 
pass the graph to the runner.)


On Wed, Oct 17, 2018 at 2:52 PM Maximilian Michels <mailto:m...@apache.org>> wrote:


    Hi everyone,

    While integrating portable state with the FlinkRunner, I hit a 
problem

    and wanted to get your opinion.

    Stateful DoFns require their input to be KV records. The reason for
    this
    is that state is isolated by key. The (non-portable) FlinkRunner uses
    Flink's `keyBy(key)` construct to partition state by key [1].

    That works fine for portable Java pipelines where we enforce the `KV`
    class for Stateful DoFns. After running tests with the Python SDK, I
    came to the conclusion that tuples, e.g. `(key, value)` which are 
used

    for KV functionality, do not go through the KvCoder but are encoded
    using a byte array encoder.

    How do we infer the key in the Runner from an opaque sequence of 
bytes?
    Should we also require the KvCoder for stateful DoFns in the 
Python SDK?


    Thanks,
    Max

    [1]

https://github.com/apache/beam/blob/22f59deaf2a0aa77d98f2f024ce4b2a399014382/runners/flink/src/main/java/org/apache/beam/runners/flink/FlinkStreamingTransformTranslators.java#L471

Re: Integrating Stateful DoFns from the Python SDK

Thanks Robert. I was able to get it working by adding this to the
transform before my stateful DoFn:

.with_output_types(typehints.KV[K, V])

For some reason `.with_input_types(typehints.KV[K, V])` on my stateful
DoFn did not work.

Until we enforce KV during pipeline construction, we will have to throw
an informative exception in the Runner.

On 17.10.18 15:03, Robert Bradshaw wrote:
Yes, we should be enforcing keyness (and use of KeyCoder with) stateful
DoFns, similar to what we do for GBKs. See e.g.
https://github.com/apache/beam/pull/6304#issuecomment-421935375

(This possibly relates to a long-standing issue that the coder inference
should be moved up into construction, or at least before we pass the
graph to the runner.)

On Wed, Oct 17, 2018 at 2:52 PM Maximilian Michels <mailto:m...@apache.org>> wrote:

Hi everyone,

While integrating portable state with the FlinkRunner, I hit a problem
and wanted to get your opinion.

Stateful DoFns require their input to be KV records. The reason for
this
is that state is isolated by key. The (non-portable) FlinkRunner uses
Flink's `keyBy(key)` construct to partition state by key [1].

That works fine for portable Java pipelines where we enforce the `KV`
class for Stateful DoFns. After running tests with the Python SDK, I
came to the conclusion that tuples, e.g. `(key, value)` which are used
for KV functionality, do not go through the KvCoder but are encoded
using a byte array encoder.

How do we infer the key in the Runner from an opaque sequence of bytes?
Should we also require the KvCoder for stateful DoFns in the Python SDK?

Thanks,
Max

[1]

https://github.com/apache/beam/blob/22f59deaf2a0aa77d98f2f024ce4b2a399014382/runners/flink/src/main/java/org/apache/beam/runners/flink/FlinkStreamingTransformTranslators.java#L471

Integrating Stateful DoFns from the Python SDK


Hi everyone,

While integrating portable state with the FlinkRunner, I hit a problem 
and wanted to get your opinion.


Stateful DoFns require their input to be KV records. The reason for this 
is that state is isolated by key. The (non-portable) FlinkRunner uses 
Flink's `keyBy(key)` construct to partition state by key [1].


That works fine for portable Java pipelines where we enforce the `KV` 
class for Stateful DoFns. After running tests with the Python SDK, I 
came to the conclusion that tuples, e.g. `(key, value)` which are used 
for KV functionality, do not go through the KvCoder but are encoded 
using a byte array encoder.


How do we infer the key in the Runner from an opaque sequence of bytes? 
Should we also require the KvCoder for stateful DoFns in the Python SDK?


Thanks,
Max

[1] 
https://github.com/apache/beam/blob/22f59deaf2a0aa77d98f2f024ce4b2a399014382/runners/flink/src/main/java/org/apache/beam/runners/flink/FlinkStreamingTransformTranslators.java#L471

Re: Build failed in Jenkins: beam_Release_Gradle_NightlySnapshot #216

2018-10-23 Thread Maximilian Michels


I don't get the error locally when running:

  gradle :beam-sdks-python:lintPy27

Seems like there is a different configuration on Jenkins?

On 23.10.18 10:16, Apache Jenkins Server wrote:

See 


Changes:

[david.moravek] [BEAM-5297] Add propdeps-idea plugin.

[25622840+adude3141] remove usage of deprecated Task.leftShift(Closure) method

[25622840+adude3141] remove usage of deprecated TaskInputs.dir() with something 
that doesn't

[25622840+adude3141] remove usage of deprecated 
FileCollection.stopExecutionIfEmpty() method

[25622840+adude3141] slightly update spotless gradle plugin to get rid of 
deprectated call to

[robertwb] [BEAM-5792] Implement Create in terms of Impulse + Reshuffle.

[robertwb] [BEAM-5791] Improve Python SDK progress counters.

[github] [BEAM-5779] Increase pubsub IT pipeline duration

[amaliujia] [BEAM-5796] Test window_end of TUMBLE, HOP, SESSION

[github] [BEAM-5617] Fix Python 3 incompatibility in pickler.

[25622840+adude3141] remove javacc bad option warning 'grammar_encoding'

[gleb] [BEAM-5675] Fix RowCoder#verifyDeterministic

[mxm] [BEAM-5707] Fix ':beam-sdks-python:docs' target

[gleb] [BEAM-5675] Simplify RowCoder#verifyDeterministic

[scott] Export Grafana testing dashboards and improve README

[mwylde] [BEAM-5797] Ensure bundle factory is always closed on dispose()

--
[...truncated 32.66 MB...]
Task ':beam-website:buildDockerImage' is not up-to-date because:
   Task has not declared any outputs despite executing actions.
Starting process 'command 'docker''. Working directory: 

 Command: docker build -t beam-website .
Successfully started process 'command 'docker''
Sending build context to Docker daemon  26.11MB
Step 1/7 : FROM ruby:2.5
2.5: Pulling from library/ruby
Digest: sha256:1952c6e03a10bf878f078ba93af5ea92fe0338ba6ad546dfa0a4a7203213f6ac
Status: Downloaded newer image for ruby:2.5
  ---> 1f6aca1e0959
Step 2/7 : WORKDIR /ruby
  ---> Using cache
  ---> 1887c501933e
Step 3/7 : RUN gem install bundler
  ---> Using cache
  ---> f05e3e3557d0
Step 4/7 : ADD Gemfile Gemfile.lock /ruby/
  ---> Using cache
  ---> 492b55665e3d
Step 5/7 : RUN bundle install --deployment --path $GEM_HOME
  ---> Using cache
  ---> 111d53a3a581
Step 6/7 : ENV LC_ALL C.UTF-8
  ---> Using cache
  ---> 7cc4fd4065c4
Step 7/7 : CMD sleep 3600
  ---> Using cache
  ---> d632a0311fe7
Successfully built d632a0311fe7
Successfully tagged beam-website:latest
:beam-website:buildDockerImage (Thread[Task worker for ':' Thread 9,5,main]) 
completed. Took 0.991 secs.
:beam-website:createDockerContainer (Thread[Task worker for ':' Thread 
9,5,main]) started.


Task :beam-website:createDockerContainer

Caching disabled for task ':beam-website:createDockerContainer': Caching has 
not been enabled for the task
Task ':beam-website:createDockerContainer' is not up-to-date because:
   Task has not declared any outputs despite executing actions.
Starting process 'command '/bin/bash''. Working directory: 
 
Command: /bin/bash -c docker create -v 
:/repo -u 
$(id -u):$(id -g)  beam-website
Successfully started process 'command '/bin/bash''
:beam-website:createDockerContainer (Thread[Task worker for ':' Thread 
9,5,main]) completed. Took 1.119 secs.
:beam-website:startDockerContainer (Thread[Task worker for ':' Thread 
9,5,main]) started.


Task :beam-website:startDockerContainer

Caching disabled for task ':beam-website:startDockerContainer': Caching has not 
been enabled for the task
Task ':beam-website:startDockerContainer' is not up-to-date because:
   Task has not declared any outputs despite executing actions.
Starting process 'command 'docker''. Working directory: 

 Command: docker start 
ba98d579d35e3e7850bbe5584354dd1ef45fee53d084fa367f0e43f14069fc6f
Successfully started process 'command 'docker''
ba98d579d35e3e7850bbe5584354dd1ef45fee53d084fa367f0e43f14069fc6f
:beam-website:startDockerContainer (Thread[Task worker for ':' Thread 
9,5,main]) completed. Took 0.304 secs.
:beam-website:buildLocalWebsite (Thread[Task worker for ':' Thread 9,5,main]) 
started.


Task :beam-website:buildLocalWebsite

Build cache key for task ':beam-website:buildLocalWebsite' is 
948e2017be85825f2d2abfa21c0edcc3
Caching disabled for task ':beam-website:buildLocalWebsite': Caching has not 
been enabled for the task
Task ':beam-website:buildLocalWebsite' is not up-to-date because:
   No history is available.
Starting process 'command 'docker''. Working directory: 
 
Command: docker exec

Re: Python docs build error

2018-10-23 Thread Maximilian Michels

It looks like now the build is broken on Jenkins but runs fine on MacOs.

There is some inconsistency in how `:pylint27` runs across the two 
platforms.

Broken build: 
https://builds.apache.org/job/beam_Release_Gradle_NightlySnapshot/216/

On 22.10.18 19:01, Ruoyun Huang wrote:

To Colm's question.

We observed this issue as well and had discussions in a separate thread 
<https://issues.apache.org/jira/browse/BEAM-5793>, with Scott and Micah.

This issue was only reproduced on certain Linux environment.  MacOS does 
not have this error.  We also specifically ran the test on Jenkins, but 
could not reproduce it either.

On Mon, Oct 22, 2018 at 7:49 AM Colm O hEigeartaigh <mailto:cohei...@apache.org>> wrote:

Great, thanks! Out of curiosity, did the jenkins job for the initial
PR not detect the build failure?

Colm.

On Mon, Oct 22, 2018 at 2:29 PM Maximilian Michels mailto:m...@apache.org>> wrote:

Correction for the footnote:

[1] https://github.com/apache/beam/pull/6637

    On 22.10.18 15:24, Maximilian Michels wrote:
 > Hi Colm,
 >
 > This [1] got merged recently and broke the "docs" target which
 > apparently is not part of our Python PreCommit tests.
 >
 > See the following PR for a fix:
https://github.com/apache/beam/pull/6774
 >
 > Best,
 > Max
 >
 > [1] https://github.com/apache/beam/pull/6737
 >
 > On 22.10.18 12:55, Colm O hEigeartaigh wrote:
 >> Hi all,
 >>
 >> The following command: ./gradlew :beam-sdks-python:docs
gives me the
 >> following error:
 >>
 >>

/home/coheig/src/apache/beam/sdks/python/apache_beam/io/flink/flink_streaming_impulse_source.py:docstring

 >> of
 >>

apache_beam.io.flink.flink_streaming_impulse_source.FlinkStreamingImpulseSource.from_runner_api_parameter:11:

 >> WARNING: Unexpected indentation.
 >> Command exited with non-zero status 1
 >> 42.81user 4.02system 0:16.27elapsed 287%CPU (0avgtext+0avgdata
 >> 141036maxresident)k
 >> 0inputs+47792outputs (0major+727274minor)pagefaults 0swaps
 >> ERROR: InvocationError for command '/usr/bin/time
 >>
/home/coheig/src/apache/beam/sdks/python/scripts/generate_pydoc.sh'
 >> (exited with code 1)
 >> ___ summary
 >> 
 >> ERROR:   docs: commands failed
 >>
 >>  > Task :beam-sdks-python:docs FAILED
 >>
 >> FAILURE: Build failed with an exception.
 >>
 >> Am I missing something or is there an issue here?
 >>
 >> Thanks,
 >>
 >> Colm.
 >>
 >>
 >> --
 >> Colm O hEigeartaigh
 >>
 >> Talend Community Coder
 >> http://coders.talend.com

-- 
Colm O hEigeartaigh

Talend Community Coder
http://coders.talend.com

--

Ruoyun  Huang

Re: [DISCUSS] Publish vendored dependencies independently

2018-10-23 Thread Maximilian Michels


Looks great, Kenn!


Max: what is the story behind having a separate flink-shaded repo? Did that 
make it easier to manage in some way?


Better separation of concerns, but I don't think releasing the shaded 
artifacts from the main repo is a problem. I'd even prefer not to split 
up the repo because it makes updates to the vendored dependencies 
slightly easier.


On 23.10.18 03:25, Kenneth Knowles wrote:
OK, I've filed https://issues.apache.org/jira/browse/BEAM-5819 to 
collect sub-tasks. This has enough upsides throughout lots of areas of 
the project that even though it is not glamorous it seems pretty 
valuable to start on immediately. And I want to find out if there's a 
pitfall lurking.


Max: what is the story behind having a separate flink-shaded repo? Did 
that make it easier to manage in some way?


Kenn

On Mon, Oct 22, 2018 at 2:55 AM Maximilian Michels <mailto:m...@apache.org>> wrote:


+1 for publishing vendored Jars independently. It will improve build
time and ease IntelliJ integration.

Flink also publishes shaded dependencies separately:

- https://github.com/apache/flink-shaded
- https://issues.apache.org/jira/browse/FLINK-6529

AFAIK their main motivation was to get rid of duplicate shaded classes
on the classpath. We don't appear to have that problem because we
already have a separate "vendor" project.

 >  - With shading, it is hard (impossible?) to step into dependency
code in IntelliJ's debugger, because the actual symbol at runtime
does not match what is in the external jars

This would be solved by releasing the sources of the shaded jars.
 From a
legal perspective, this could be problematic as alluded to here:
https://github.com/apache/flink-shaded/issues/25

-Max

On 20.10.18 01:11, Lukasz Cwik wrote:
 > I have tried several times to improve the build system and intellij
 > integration and each attempt ended with little progress when dealing
 > with vendored code. My latest attempt has been the most promising
where
 > I take the vendored classes/jars and decompile them generating the
 > source that Intellij can then use. I have a branch[1] that
demonstrates
 > the idea. It works pretty well (and up until a change where we
started
 > vendoring gRPC, was impractical to do. Instructions to try it out
are:
 >
 > // Clean up any remnants of prior builds/intellij projects
 > git clean -fdx
 > // Generated the source for vendored/shaded modules
 > ./gradlew decompile
 >
 > // Remove the "generated" Java sources for protos so they don't
conflict with the decompiled sources.
 > rm -rf model/pipeline/build/generated/source/proto
 > rm -rf model/job-management/build/generated/source/proto
 > rm -rf model/fn-execution/build/generated/source/proto
 > // Import the project into Intellij, most code completion now
works still some issues with a few classes.
 > // Note that the Java decompiler doesn't generate valid source so
still need to delegate to Gradle for build/run/test actions
 > // Other decompilers may do a better/worse job but haven't tried
them.
 >
 >
 > The problems that I face are that the generated Java source from the
 > protos and the decompiled source from the compiled version of that
 > source post shading are both being imported as content roots and
then
 > conflict. Also, the CFR decompiler isn't producing valid source, if
 > people could try others and report their mileage, we may find one
that
 > works and then we would be able to use intellij to build/run our
code
 > and not need to delegate all our build/run/test actions to Gradle.
 >
 > After all these attempts I have done, vendoring the dependencies
outside
 > of the project seems like a sane approach and unless someone
wants to
 > take a stab at the best progress I have made above, I would go
with what
 > Kenn is suggesting even though it will mean that we will need to
perform
 > releases every time we want to change the version of one of our
vendored
 > dependencies.
 >
 > 1: https://github.com/lukecwik/incubator-beam/tree/intellij
 >
 >
 > On Fri, Oct 19, 2018 at 10:43 AM Kenneth Knowles mailto:k...@apache.org>
 > <mailto:k...@apache.org <mailto:k...@apache.org>>> wrote:
 >
 >     Another reason to push on this is to get build times down.
Once only
 >     generated proto classes use the shadow plugin we'll cut the build
 >     time in ~half? And there is no reason to constantly re-vendor.
 >
 >     Kenn
 >
 >     On Fri, Oct 19, 2018 at 10:39 AM Kenneth Knowles
mailto:k...

Re: [DISCUSS] Publish vendored dependencies independently

Would also keep it simple and optimize for the JAR size only if necessary.

On 24.10.18 00:06, Kenneth Knowles wrote:
I think it makes sense for each vendored dependency to be self-contained 
as much as possible. It should keep it fairly simple. Things that cross 
their API surface cannot be hidden, of course. Jar size is not a concern 
IMO.

Kenn

On Tue, Oct 23, 2018 at 9:05 AM Lukasz Cwik <mailto:lc...@google.com>> wrote:

How should we handle the transitive dependencies of the things we
want to vendor?

For example we use gRPC which depends on Guava 20 and we also use
Calcite which depends on Guava 19.

Should the vendored gRPC/Calcite/... be self-contained so it
contains all its dependencies, hence vendored gRPC would contain
Guava 20 and vendored Calcite would contain Guava 19 (both under
different namespaces)?
This leads to larger jars but less vendored dependencies to maintain.

Or should we produce a vendored library for those that we want to
share, e.g. Guava 20 that could be reused across multiple vendored
libraries?
Makes the vendoring process slightly more complicated, more
dependencies to maintain, smaller jars.

Or should we produce a vendored library for each dependency?
Lots of vendoring needed, likely tooling required to be built to
maintain this.

On Tue, Oct 23, 2018 at 8:46 AM Kenneth Knowles mailto:k...@google.com>> wrote:

I actually created the subtasks by finding things shaded by at
least one module. I think each one should definitely have an on
list discussion that clarifies the target artifact, namespace,
version, possible complications, etc.

My impression is that many many modules shade only Guava. So for
build time and simplification that is a big win.

Kenn

On Tue, Oct 23, 2018, 08:16 Thomas Weise mailto:t...@apache.org>> wrote:

+1 for separate artifacts

I would request that we explicitly discuss and agree which
dependencies we vendor though.

Not everything listed in the JIRA subtasks is currently
relocated.

Thomas

On Tue, Oct 23, 2018 at 8:04 AM David Morávek
mailto:david.mora...@gmail.com>>
wrote:

+1 This should improve build times a lot. It would be
great if vendored deps could stay in the main repository.

D.

On Tue, Oct 23, 2018 at 12:21 PM Maximilian Michels
mailto:m...@apache.org>> wrote:

Looks great, Kenn!

 > Max: what is the story behind having a separate
flink-shaded repo? Did that make it easier to manage
in some way?

Better separation of concerns, but I don't think
releasing the shaded
artifacts from the main repo is a problem. I'd even
prefer not to split
up the repo because it makes updates to the vendored
dependencies
slightly easier.

On 23.10.18 03:25, Kenneth Knowles wrote:
 > OK, I've filed
https://issues.apache.org/jira/browse/BEAM-5819 to
 > collect sub-tasks. This has enough upsides
throughout lots of areas of
 > the project that even though it is not glamorous
it seems pretty
 > valuable to start on immediately. And I want to
find out if there's a
 > pitfall lurking.
 >
 > Max: what is the story behind having a separate
flink-shaded repo? Did
 > that make it easier to manage in some way?
 >
 > Kenn
 >
 > On Mon, Oct 22, 2018 at 2:55 AM Maximilian
Michels mailto:m...@apache.org>
 > <mailto:m...@apache.org <mailto:m...@apache.org>>>
wrote:
 >
 >     +1 for publishing vendored Jars
independently. It will improve build
 >     time and ease IntelliJ integration.
 >
 >     Flink also publishes shaded dependencies
separately:
 >
 >     - https://github.com/apache/flink-shaded
 >     -
https://issues.apache.org/jira/browse/FLINK-6529
 >
 >     AFAIK their main motivation was to get rid of
duplicate s

Re: Unbalanced FileIO writes on Flink

The FlinkRunner uses a hash function (MurmurHash) on each key which 
places keys somewhere in the hash space. The hash space (2^32) is split 
among the partitions (5 in your case). Given enough keys, the chance 
increases they are equally spread.

This should be similar to what the other Runners do.

On 24.10.18 10:58, Jozef Vilcek wrote:

So if I run 5 workers with 50 shards, I end up with:

DurationBytes receivedRecords received
  2m 39s        900 MB            465,525
  2m 39s       1.76 GB            930,720
  2m 39s        789 MB            407,315
  2m 39s       1.32 GB            698,262
  2m 39s        788 MB            407,310

Still not good but better than with 5 shards where some workers did not 
participate at all.

So, problem is in some layer which distributes keys / shards among workers?

On Wed, Oct 24, 2018 at 9:37 AM Reuven Lax <mailto:re...@google.com>> wrote:

withNumShards(5) generates 5 random shards. It turns out that
statistically when you generate 5 random shards and you have 5
works, the probability is reasonably high that some workers will get
more than one shard (and as a result not all workers will
participate). Are you able to set the number of shards larger than 5?

On Wed, Oct 24, 2018 at 12:28 AM Jozef Vilcek mailto:jozo.vil...@gmail.com>> wrote:

cc (dev)

I tried to run the example with FlinkRunner in batch mode and
received again bad data spread among the workers.

When I tried to remove number of shards for batch mode in above
example, pipeline crashed before launch

Caused by: java.lang.IllegalStateException: Inputs to Flatten
had incompatible triggers:

AfterWatermark.pastEndOfWindow().withEarlyFirings(AfterPane.elementCountAtLeast(4)).withLateFirings(AfterFirst.of(Repeatedly.forever(AfterPane.elem
entCountAtLeast(1)),

Repeatedly.forever(AfterProcessingTime.pastFirstElementInPane().plusDelayOf(1
hour,

AfterWatermark.pastEndOfWindow().withEarlyFirings(AfterPane.elementCountAtLeast(1)).withLateFirings(AfterFirst.of(Repeatedly.fo
rever(AfterPane.elementCountAtLeast(1)),

Repeatedly.forever(AfterSynchronizedProcessingTime.pastFirstElementInPane(

On Tue, Oct 23, 2018 at 12:01 PM Jozef Vilcek
mailto:jozo.vil...@gmail.com>> wrote:

Hi Max,

I forgot to mention that example is run in streaming mode,
therefore I can not do writes without specifying shards.
FileIO explicitly asks for them.

I am not sure where the problem is. FlinkRunner is only one
I used.

On Tue, Oct 23, 2018 at 11:27 AM Maximilian Michels
mailto:m...@apache.org>> wrote:

Hi Jozef,

This does not look like a FlinkRunner related problem,
but is caused by
the `WriteFiles` sharding logic. It assigns keys and
does a Reshuffle
which apparently does not lead to good data spread in
your case.

Do you see the same behavior without `withNumShards(5)`?

Thanks,
Max

On 22.10.18 11:57, Jozef Vilcek wrote:
 > Hello,
 >
 > I am having some trouble to get a balanced write via
FileIO. Workers at
 > the shuffle side where data per window fire are
written to the
 > filesystem receive unbalanced number of events.
 >
 > Here is a naive code example:
 >
 >      val read = KafkaIO.read()
 >          .withTopic("topic")
 >          .withBootstrapServers("kafka1:9092")
 > 
.withKeyDeserializer(classOf[ByteArrayDeserializer])
 > 
.withValueDeserializer(classOf[ByteArrayDeserializer])

 >          .withProcessingTime()
 >
 >      pipeline
 >          .apply(read)
 >          .apply(MapElements.via(new
 > SimpleFunction[KafkaRecord[Array[Byte], Array[Byte]],
String]() {
 >            override def apply(input:
KafkaRecord[Array[Byte],
 > Array[Byte]]): String = {
 >              new String(input.getKV.getValue, "UTF-8")
 >            }
 >          }))
 >
 >
 >

.apply(Window.into[String](FixedWindows.of(Duration.standardHours(1)))
 >              .triggeri

Re: Data Preprocessing in Beam

Welcome Alejandro! Interesting work. The sketching extension looks like 
a good place for your algorithms.


-Max

On 23.10.18 19:05, Lukasz Cwik wrote:
Arnoud Fournier (afourn...@talend.com ) 
started by adding a library to support sketching 
(https://github.com/apache/beam/tree/master/sdks/java/extensions/sketching), 
I feel as those some of these could be added there or possibly within 
another extension.


On Tue, Oct 23, 2018 at 9:54 AM Austin Bennett 
mailto:whatwouldausti...@gmail.com>> wrote:


Hi Beam Devs,

Alejandro, copied, is an enthusiastic developer, who recently coded up:
https://github.com/elbaulp/DPASF (associated paper found:
https://arxiv.org/abs/1810.06021).

He had been looking to contribute that code to FlinkML, at which
point I found him and alerted him to Beam.  He has been learning a
bit on Beam recently.  Would this data-preprocessing be a welcome
contribution to the project.  If yes, perhaps others better versed
in internals (I'm not there yet -- though could follow along!) would
be willing to provide feedback to shape this to be a suitable Beam
contribution.

Cheers,
Austin

Re: [VOTE] Release 2.8.0, release candidate #1

I've run WordCount using Quickstart with the FlinkRunner (locally and 
against a Flink cluster).


Would give a +1 but waiting what Kenn finds.

-Max

On 23.10.18 07:11, Ahmet Altay wrote:



On Mon, Oct 22, 2018 at 10:06 PM, Kenneth Knowles > wrote:


You two did so much verification I had a hard time finding something
where my help was meaningful! :-)

I did run the Nexmark suite on the DirectRunner against 2.7.0 and
2.8.0 following

https://beam.apache.org/documentation/sdks/java/nexmark/#running-smoke-suite-on-the-directrunner-local

.

It is admittedly a very silly test - the instructions leave
immutability enforcement on, etc. But it does appear that there is a
30% degradation in query 8 and 15% in query 9. These are the pure
Java tests, not the SQL variants. The rest of the queries are close
enough that differences are not meaningful.


(It would be a good improvement for us to have alerts on daily 
benchmarks if we do not have such a concept already.)



I would ask a little more time to see what is going on here - is it
a real performance issue or an artifact of how the tests are
invoked, or ...?


Thank you! Much appreciated. Please let us know when you are done with 
your investigation.



Kenn

On Mon, Oct 22, 2018 at 6:20 PM Ahmet Altay mailto:al...@google.com>> wrote:

Hi all,

Did you have a chance to review this RC? Between me and Robert
we ran a significant chunk of the validations. Let me know if
you have any questions.

Ahmet

On Thu, Oct 18, 2018 at 5:26 PM, Ahmet Altay mailto:al...@google.com>> wrote:

Hi everyone,

Please review and vote on the release candidate #1 for the
version 2.8.0, as follows:
[ ] +1, Approve the release
[ ] -1, Do not approve the release (please provide specific
comments)

The complete staging area is available for your review,
which includes:
* JIRA release notes [1],
* the official Apache source release to be deployed to
dist.apache.org  [2], which is
signed with the key with fingerprint 6096FA00 [3],
* all artifacts to be deployed to the Maven Central
Repository [4],
* source code tag "v2.8.0-RC1" [5],
* website pull request listing the release and publishing
the API reference manual [6].
* Python artifacts are deployed along with the source
release to the dist.apache.org  [2].
* Validation sheet with a tab for 2.8.0 release to help with
validation [7].

The vote will be open for at least 72 hours. It is adopted
by majority approval, with at least 3 PMC affirmative votes.

Thanks,
Ahmet

[1]

https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527=12343985


[2] https://dist.apache.org/repos/dist/dev/beam/2.8.0

[3] https://dist.apache.org/repos/dist/dev/beam/KEYS

[4]

https://repository.apache.org/content/repositories/orgapachebeam-1049/


[5] https://github.com/apache/beam/tree/v2.8.0-RC1

[6] https://github.com/apache/beam-site/pull/583
 and
https://github.com/apache/beam/pull/6745

[7]

https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=1854712816

Re: Unbalanced FileIO writes on Flink

Actually, I don't think setting the number of shards by the Runner will 
solve the problem. The shuffling logic still remains. And, as observed 
by Jozef, it doesn't necessarily lead to balanced shards.

The sharding logic of the Beam IO is handy but it shouldn't be strictly 
necessary when the data is already partitioned nicely.

It seems the sharding logic is primarily necessary because there is no 
notion of a worker's ID in Beam. In Flink, you can retrieve the worker 
ID at runtime and every worker just directly writes its results to a 
file, suffixed by its worker id. This avoids any GroupByKey or Reshuffle.

Robert, don't we already have Reshuffle which can be overriden? However, 
it is not used by the WritesFiles code.

-Max

On 26.10.18 11:41, Robert Bradshaw wrote:
I think it's worth adding a URN for the operation of distributing 
"evenly" into an "appropriate" number of shards. A naive implementation 
would add random keys and to a ReshufflePerKey, but runners could 
override this to do a reshuffle and then key by whatever notion of 
bundle/worker/shard identifier they have that lines up with the number 
of actual workers.

On Fri, Oct 26, 2018 at 11:34 AM Jozef Vilcek <mailto:jozo.vil...@gmail.com>> wrote:

Thanks for the JIRA. If I understand it correctly ... so runner
determined sharding will avoid extra shuffle? Will it just write
worker local available data to it's shard? Something similar to
coalesce in Spark?

On Fri, Oct 26, 2018 at 11:26 AM Maximilian Michels mailto:m...@apache.org>> wrote:

Oh ok, thanks for the pointer. Coming from Flink, the default is
that
the sharding is determined by the runtime distribution. Indeed,
we will
have to add an overwrite to the Flink Runner, similar to this one:

https://github.com/apache/beam/commit/cbb922c8a72680c5b8b4299197b515abf650bfdf#diff-a79d5c3c33f6ef1c4894b97ca907d541R347

Jira issue: https://issues.apache.org/jira/browse/BEAM-5865

Thanks,
Max

On 25.10.18 22:37, Reuven Lax wrote:
 > FYI the Dataflow runner automatically sets the default number
of shards
 > (I believe to be 2 * num_workers). Probably we should do
something
 > similar for the Flink runner.
 >
 > This needs to be done by the runner, as # of workers is a runner
 > concept; the SDK itself has no concept of workers.
 >
 > On Thu, Oct 25, 2018 at 3:28 AM Jozef Vilcek
mailto:jozo.vil...@gmail.com>
 > <mailto:jozo.vil...@gmail.com
<mailto:jozo.vil...@gmail.com>>> wrote:
 >
 >     If I do not specify shards for unbounded collection, I get
 >
 >     Caused by: java.lang.IllegalArgumentException: When applying
 >     WriteFiles to an unbounded PCollection, must specify
number of
 >     output shards explicitly
 >              at
 >     org.apache.beam.repackaged.beam_sdks_java_core.com

<http://beam_sdks_java_core.com>.google.common.base.Preconditions.checkArgument(Preconditions.java:191)
 >              at
 > org.apache.beam.sdk.io
<http://org.apache.beam.sdk.io>.WriteFiles.expand(WriteFiles.java:289)
 >
 >     Around same lines in WriteFiles is also a check for
windowed writes.
 >     I believe FileIO enables it explicitly when windowing is
present. In
 >     filesystem written files are per window and shard.
 >
 >     On Thu, Oct 25, 2018 at 12:01 PM Maximilian Michels
mailto:m...@apache.org>
 >     <mailto:m...@apache.org <mailto:m...@apache.org>>> wrote:
 >
 >         I agree it would be nice to keep the current
distribution of
 >         elements
 >         instead of doing a shuffle based on an artificial
shard key.
 >
 >         Have you tried `withWindowedWrites()`? Also, why do
you say you
 >         need to
 >         specify the number of shards in streaming mode?
 >
 >         -Max
 >
 >         On 25.10.18 10:12, Jozef Vilcek wrote:
 >          > Hm, yes, this makes sense now, but what can be
done for my
 >         case? I do
 >          > not want to end up with too many files on disk.
 >          >
 >          > I think what I am looking for is to instruct IO
that do not
 >         do again
 >          > random shard and reshuffle but just assume number
of shards
     >         equal to
 >          > number of workers and shard ID is a

Re: Beam Community Metrics

Hi Scott,

Thanks for sharing the progress. The test metrics are super helpful. I'm
particularly looking forward to the PR metrics which could be useful for
improving interaction within the community and with new contributors.

-Max

On 26.10.18 07:36, Scott Wegner wrote:
I want to summarize some of the great work done this summer by Mikhail,
Udi, and Huygaa to visualize and track some project/community health
metrics for Beam. Specifically, they've helped to build dashboards for:

* Test suite health (pre-commit speed, post-commit reliability)
* Pull Request health (code review latency, PR load per reviewer)

Check it out here: https://s.apache.org/beam-community-metrics, and
please leave feedback on this thread or under our umbrella JIRA item:
BEAM-5862.

There's some new infrastructure behind this which is hosted alongside
our Jenkins resources on Google Cloud. I want to ensure this doesn't
become a burden for the community, so I've written up a maintenance plan
here: https://s.apache.org/beam-community-metrics-infra. That link
contains more details on the metrics pipeline architecture components,
the design discussions which lead to building them, and my proposal for
documenting and monitoring the infrastructure.

There was a ton of discussion [1][2][3] that helped shape the dashboards
we've come up with. There's a whole lot we didn't get to, but the source
code is documented and checked-in [4], and I encourage others in the
community to add to it.

Thanks,
Scott

[1]
https://lists.apache.org/thread.html/b73cc4f0f05f4654eed2250aa95f205e7ab45253d98add0240911031@%3Cdev.beam.apache.org%3E
[2]
https://lists.apache.org/thread.html/3bb4aa51da2e2d7e22666aa6a2e18ae31891cb09d91718b75e74@%3Cdev.beam.apache.org%3E
[3]
https://lists.apache.org/thread.html/6cc942a34867ce7603392246c518c35410e828e9d2f17fdc547576ea@%3Cdev.beam.apache.org%3E

[4] https://github.com/apache/beam/tree/master/.test-infra/metrics

Got feedback? tinyurl.com/swegner-feedback

Re: Growing Beam -- A call for ideas? What is missing? What would be good to see?


Hi Austin,

Great initiative. I think there are already some materials out there but 
they are not consolidated:


Cookbook with examples: 
https://github.com/apache/beam/tree/master/examples/java/src/main/java/org/apache/beam/examples/cookbook


An interactive tutorial would be a great addition, perhaps the examples 
also need an update to reflect more typical use cases.


Documentation resources: https://beam.apache.org/documentation/resources/

IMHO there are three types of resources that would be useful:

1) Learning to write Beam pipelines
2) Beam Success Stories
3) Contributing to Beam

1) and 3) could need an overhaul. We clearly lack 2), not that there are 
no success stories but they are not collected yet.


Cheers,
Max

On 26.10.18 06:25, Austin Bennett wrote:

Hi Beam Devs and Users,

Trying to get a sense from the community on the sorts of things we think 
would be useful to build the community (I am thinking not from an angle 
of specific code/implementation/functionality, but from a user/usability 
-- I want to dive in and make real contributions with the code, too, but 
know I also have the interest and skills to help with education and 
community aspects, hence my focus on this).


I had previously suggested a sort of cookbook for focused and curated 
examples (code and explination) to help people get started, on-boarding, 
using Beam to aid getting up and running and accomplishing something 
worthwhile (and quickly), that seems one way to help grow our user base 
(and maybe future dev base afterwards those users become enamored), 
which did get some positive feedback when first put out there.


There are many other areas where featuring others sharing successes from 
having used Beam or little tips can be valuable, Pablo's Awesome Beam is 
one example of such a collection: 
https://github.com/pabloem/awesome-beam or even centralizing a general 
place to find any/all Beam blogs/shared-code/writeups/etc.


Certainly there is a place for all sorts of contributions and 
resources.  What do people on these lists think would be particularly 
useful?  Trying to get a more focused sense of where we think efforts 
might be best focused.


Please share anything (even semi-)related!?

Thanks,
Austin


P.S.  I realize that those following this list are rather self selecting 
as well, so this might not be the best forum to figure out what 
new/novice users need, but I would like to hear what everyone else here 
thinks could be useful.

Re: Python profiling

This looks very helpful for debugging performance of portable pipelines. 
Great work!


Enabling local directories for Flink or other portable Runners would be 
useful for debugging, e.g. per 
https://issues.apache.org/jira/browse/BEAM-5440


On 26.10.18 18:08, Robert Bradshaw wrote:

Now that we've (mostly) moved from features to performance for
BeamPython-on-Flink, I've been doing some profiling of Python code,
and thought it may be useful for others as well (both those working on
the SDK, and users who want to understand their own code), so I've
tried to wrap this up into something useful.

Python already had some existing profile options that we used with
Dataflow, specifically --profile_cpu and --profile_location. I've
hooked these up to both the DirectRunner and the SDK Harness Worker.
One can now run commands like

 python -m apache_beam.examples.wordcount
--output=counts.txt--profile_cpu --profile_location=path/to/directory

and get nice graphs like the one attached. (Here the bulk of the time
is spent reading from the default input in gcs. Another hint for
reading the graph is that due to fusion the call graph is cyclic,
passing through operations:86:receive for every output.)

The raw python profile stats [1] are produced in that directory, along
with a dot graph and an svg if both dot and gprof2dot are installed.
There is also an important option --direct_runner_bundle_repeat which
can be set to gain more accurate profiles on smaller data sets by
re-playing the bundle without the (non-trivial) one-time setup costs.

These flags also work on portability runners such as Flink, where the
directory must be set to a distributed filesystem. Each bundle
produces its own profile in that directory, and they can be
concatenated and manually fed into tools like below. In that case
there is a --profile_sample_rate which can be set to avoid profiling
every single bundle (e.g. for a production job).

The PR is up at https://github.com/apache/beam/pull/6847 Hope it's useful.

- Robert


[1] https://docs.python.org/2/library/profile.html
[2] https://github.com/jrfonseca/gprof2dot

Re: Unbalanced FileIO writes on Flink

I was suggesting a transform like reshuffle that can avoid the actual reshuffle 
if the data is already well distributed

How do we know if the data is already well-distributed? Can't we simply 
give the user control over the shuffling behavior?

and also provides some kind of unique key

Yes, that what I meant with the "subtask index" in Flink.

I don't recall why we made the choice of shard counts required in streaming 
mode. Perhaps because the bundles were to small (per key?) by default and we 
wanted to force more grouping?

The issue https://issues.apache.org/jira/browse/BEAM-1438 mentions too 
many files as the reason.

-Max

On 26.10.18 15:44, Robert Bradshaw wrote:

We can't use Reshuffle for this, as there may be other reasons the
user wants to actually force a reshuffle, but I was suggesting a
transform like reshuffle that can avoid the actual reshuffle if the
data is already well distributed, and also provides some kind of
unique key (though perhaps just choosing a random nonce in
start_bundle would be sufficient).

For sinks where we may need to retry writes, Reshuffle has been
(ab)used to provide stable inputs, but for file-based sinks, this does
not seem necessary. I don't recall why we made the choice of shard
counts required in streaming mode. Perhaps because the bundles were to
small (per key?) by default and we wanted to force more grouping?

On Fri, Oct 26, 2018 at 3:32 PM Maximilian Michels  wrote:

Actually, I don't think setting the number of shards by the Runner will
solve the problem. The shuffling logic still remains. And, as observed
by Jozef, it doesn't necessarily lead to balanced shards.

The sharding logic of the Beam IO is handy but it shouldn't be strictly
necessary when the data is already partitioned nicely.

It seems the sharding logic is primarily necessary because there is no
notion of a worker's ID in Beam. In Flink, you can retrieve the worker
ID at runtime and every worker just directly writes its results to a
file, suffixed by its worker id. This avoids any GroupByKey or Reshuffle.

Robert, don't we already have Reshuffle which can be overriden? However,
it is not used by the WritesFiles code.

-Max

On 26.10.18 11:41, Robert Bradshaw wrote:

I think it's worth adding a URN for the operation of distributing
"evenly" into an "appropriate" number of shards. A naive implementation
would add random keys and to a ReshufflePerKey, but runners could
override this to do a reshuffle and then key by whatever notion of
bundle/worker/shard identifier they have that lines up with the number
of actual workers.

On Fri, Oct 26, 2018 at 11:34 AM Jozef Vilcek mailto:jozo.vil...@gmail.com>> wrote:

 Thanks for the JIRA. If I understand it correctly ... so runner
 determined sharding will avoid extra shuffle? Will it just write
 worker local available data to it's shard? Something similar to
 coalesce in Spark?

     On Fri, Oct 26, 2018 at 11:26 AM Maximilian Michels mailto:m...@apache.org>> wrote:

 Oh ok, thanks for the pointer. Coming from Flink, the default is
 that
 the sharding is determined by the runtime distribution. Indeed,
 we will
 have to add an overwrite to the Flink Runner, similar to this one:

https://github.com/apache/beam/commit/cbb922c8a72680c5b8b4299197b515abf650bfdf#diff-a79d5c3c33f6ef1c4894b97ca907d541R347

 Jira issue: https://issues.apache.org/jira/browse/BEAM-5865

 Thanks,
 Max

 On 25.10.18 22:37, Reuven Lax wrote:
  > FYI the Dataflow runner automatically sets the default number
 of shards
  > (I believe to be 2 * num_workers). Probably we should do
 something
  > similar for the Flink runner.
  >
  > This needs to be done by the runner, as # of workers is a runner
  > concept; the SDK itself has no concept of workers.
  >
  > On Thu, Oct 25, 2018 at 3:28 AM Jozef Vilcek
 mailto:jozo.vil...@gmail.com>
  > <mailto:jozo.vil...@gmail.com
 <mailto:jozo.vil...@gmail.com>>> wrote:
  >
  > If I do not specify shards for unbounded collection, I get
  >
  > Caused by: java.lang.IllegalArgumentException: When applying
  > WriteFiles to an unbounded PCollection, must specify
 number of
  > output shards explicitly
  >  at
  > org.apache.beam.repackaged.beam_sdks_java_core.com

<http://beam_sdks_java_core.com>.google.common.base.Preconditions.checkArgument(Preconditions.java:191)
  >  at
  > org.apache.beam.sdk.io
 <http://org.apache.beam.sdk.io>.WriteFiles.expand(WriteFiles.java:289)
  >
  > Around same lines in WriteFiles is also a check for
 wi

Accessing keyed state in portable timer callbacks

2018-10-31 Thread Maximilian Michels


Hi,

I have a question regarding user state during timer callback in the 
FnApiDoFnRunner (Java SDK Harness).


I've started implementing Timers for the portable Flink Runner. I can 
register a timer via the timer output collection and fire the timer via 
the timer input of the SDK Harness. But when I try to access state in 
the Timer callback, I get the exception below.


Is this a bug or if not, how is the timer's key supposed to be set? I 
assume that it should be set from the timer element which contains the key.


Thanks,
Max


Caused by: java.util.concurrent.ExecutionException: 
java.lang.RuntimeException: Error received from SDK harness for 
instruction 72: java.util.concurrent.ExecutionException: 
java.lang.NullPointerException
    at 
java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
    at 
java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
    at 
org.apache.beam.sdk.fn.data.CompletableFutureInboundDataClient.awaitCompletion(CompletableFutureInboundDataClient.java:49)
    at 
org.apache.beam.sdk.fn.data.BeamFnDataInboundObserver.awaitCompletion(BeamFnDataInboundObserver.java:90)
    at 
org.apache.beam.fn.harness.BeamFnDataReadRunner.blockTillReadFinishes(BeamFnDataReadRunner.java:185)
    at 
org.apache.beam.fn.harness.control.ProcessBundleHandler.processBundle(ProcessBundleHandler.java:292)
    at 
org.apache.beam.fn.harness.control.BeamFnControlClient.delegateOnInstructionRequestType(BeamFnControlClient.java:161)
    at 
org.apache.beam.fn.harness.control.BeamFnControlClient.lambda$processInstructionRequests$0(BeamFnControlClient.java:145)
    at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)

    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NullPointerException
    at 
org.apache.beam.model.fnexecution.v1.BeamFnApi$StateKey$BagUserState$Builder.setKey(BeamFnApi.java:49694)
    at 
org.apache.beam.fn.harness.state.FnApiStateAccessor.createBagUserStateKey(FnApiStateAccessor.java:451)
    at 
org.apache.beam.fn.harness.state.FnApiStateAccessor.bindBag(FnApiStateAccessor.java:244)
    at 
org.apache.beam.sdk.state.StateSpecs$BagStateSpec.bind(StateSpecs.java:487)
    at 
org.apache.beam.sdk.state.StateSpecs$BagStateSpec.bind(StateSpecs.java:477)
    at 
org.apache.beam.fn.harness.FnApiDoFnRunner$OnTimerContext.state(FnApiDoFnRunner.java:671)
    at 
StateTest$5$OnTimerInvoker$expiry$ZXhwaXJ5.invokeOnTimer(Unknown Source)
    at 
org.apache.beam.sdk.transforms.reflect.ByteBuddyDoFnInvokerFactory$DoFnInvokerBase.invokeOnTimer(ByteBuddyDoFnInvokerFactory.java:187)
    at 
org.apache.beam.fn.harness.FnApiDoFnRunner.processTimer(FnApiDoFnRunner.java:244)
    at 
org.apache.beam.fn.harness.DoFnPTransformRunnerFactory.lambda$createRunnerForPTransform$0(DoFnPTransformRunnerFactory.java:134)
    at 
org.apache.beam.sdk.fn.data.BeamFnDataInboundObserver.accept(BeamFnDataInboundObserver.java:81)
    at 
org.apache.beam.sdk.fn.data.BeamFnDataInboundObserver.accept(BeamFnDataInboundObserver.java:32)
    at 
org.apache.beam.sdk.fn.data.BeamFnDataGrpcMultiplexer$InboundObserver.onNext(BeamFnDataGrpcMultiplexer.java:139)
    at 
org.apache.beam.sdk.fn.data.BeamFnDataGrpcMultiplexer$InboundObserver.onNext(BeamFnDataGrpcMultiplexer.java:125)
    at 
org.apache.beam.sdk.fn.stream.ForwardingClientResponseObserver.onNext(ForwardingClientResponseObserver.java:50)
    at 
org.apache.beam.vendor.grpc.v1.io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onMessage(ClientCalls.java:407)
    at 
org.apache.beam.vendor.grpc.v1.io.grpc.ForwardingClientCallListener.onMessage(ForwardingClientCallListener.java:33)
    at 
org.apache.beam.vendor.grpc.v1.io.grpc.ForwardingClientCallListener.onMessage(ForwardingClientCallListener.java:33)
    at 
org.apache.beam.vendor.grpc.v1.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1MessagesAvailable.runInContext(ClientCallImpl.java:519)
    at 
org.apache.beam.vendor.grpc.v1.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
    at 
org.apache.beam.vendor.grpc.v1.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)

Re: Accessing keyed state in portable timer callbacks

2018-11-01 Thread Maximilian Michels


Hi Lukasz,

Thanks for promptly fixing this [1]. I saw that the current element was 
not set correctly when timers are processed, but wanted to make sure any 
changes would be aligned with the harness processing model.


I think I favor the currentElementOrTimer approach because it makes 
things more explicit, but the solution is fine for now.


Thanks,
Max

[1] https://github.com/apache/beam/pull/6902

On 31.10.18 19:09, Lukasz Cwik wrote:

I filed https://issues.apache.org/jira/browse/BEAM-5930.

On Wed, Oct 31, 2018 at 10:22 AM Lukasz Cwik <mailto:lc...@google.com>> wrote:


That looks like a bug in the FnApiDoFnRunner.java

The FnApiStateAccessor is given a callback to get the current
element and it is not handling the case where the current element is
a timer.

callback:

https://github.com/apache/beam/blob/29c443162a2fe4c89d26336b30aa6e3a3bfbade8/sdks/java/harness/src/main/java/org/apache/beam/fn/harness/FnApiDoFnRunner.java#L212
where the current "element" gets set:

https://github.com/apache/beam/blob/29c443162a2fe4c89d26336b30aa6e3a3bfbade8/sdks/java/harness/src/main/java/org/apache/beam/fn/harness/FnApiDoFnRunner.java#L220
where the current "timer" gets set:

https://github.com/apache/beam/blob/29c443162a2fe4c89d26336b30aa6e3a3bfbade8/sdks/java/harness/src/main/java/org/apache/beam/fn/harness/FnApiDoFnRunner.java#L237

The easiest fix would be to have the callback return the first non
null from currentElement/currentTimer but longer term I think we'll
want a different solution. Alternatively, we could collapse
currentElement and currentTimer to be currentElementOrTimer which
would solve the accessor issue.

On Wed, Oct 31, 2018 at 9:50 AM Maximilian Michels mailto:m...@apache.org>> wrote:

Hi,

I have a question regarding user state during timer callback in
the FnApiDoFnRunner (Java SDK Harness).

I've started implementing Timers for the portable Flink Runner.
I can register a timer via the timer output collection and fire
the timer via the timer input of the SDK Harness. But when I try
to access state in the Timer callback, I get the exception below.

Is this a bug or if not, how is the timer's key supposed to be
set? I assume that it should be set from the timer element which
contains the key.

Thanks,
Max


Caused by: java.util.concurrent.ExecutionException:
java.lang.RuntimeException: Error received from SDK harness for
instruction 72: java.util.concurrent.ExecutionException:
java.lang.NullPointerException
     at

java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
     at
java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
     at

org.apache.beam.sdk.fn.data.CompletableFutureInboundDataClient.awaitCompletion(CompletableFutureInboundDataClient.java:49)
     at

org.apache.beam.sdk.fn.data.BeamFnDataInboundObserver.awaitCompletion(BeamFnDataInboundObserver.java:90)
     at

org.apache.beam.fn.harness.BeamFnDataReadRunner.blockTillReadFinishes(BeamFnDataReadRunner.java:185)
     at

org.apache.beam.fn.harness.control.ProcessBundleHandler.processBundle(ProcessBundleHandler.java:292)
     at

org.apache.beam.fn.harness.control.BeamFnControlClient.delegateOnInstructionRequestType(BeamFnControlClient.java:161)
     at

org.apache.beam.fn.harness.control.BeamFnControlClient.lambda$processInstructionRequests$0(BeamFnControlClient.java:145)
     at

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
     at

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
     at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NullPointerException
     at

org.apache.beam.model.fnexecution.v1.BeamFnApi$StateKey$BagUserState$Builder.setKey(BeamFnApi.java:49694)
     at

org.apache.beam.fn.harness.state.FnApiStateAccessor.createBagUserStateKey(FnApiStateAccessor.java:451)
     at

org.apache.beam.fn.harness.state.FnApiStateAccessor.bindBag(FnApiStateAccessor.java:244)
     at

org.apache.beam.sdk.state.StateSpecs$BagStateSpec.bind(StateSpecs.java:487)
     at

org.apache.beam.sdk.state.StateSpecs$BagStateSpec.bind(StateSpecs.java:477)
     at

org.apache.beam.fn.harness.FnApiDoFnRunner$OnTimerContext.state(FnApiDoFnRunner.java:671)
     at
StateTest$5$OnTimerInvoker$expiry$ZXhwaXJ5.invokeOnTimer(Unknown
Source)
     at

org.apache.beam.sdk.transforms.reflect.ByteBuddyDoFnInvokerFa

Re: [VOTE] Release 2.8.0, release candidate #1


+1 (binding)

On 26.10.18 17:45, Kenneth Knowles wrote:

Nice. Thanks.

+1


On Fri, Oct 26, 2018 at 8:44 AM Robert Bradshaw <mailto:rober...@google.com>> wrote:


Thanks Tim!

This was my only hesitation, and sounds like we're in the clear here.

+1 (binding)
On Fri, Oct 26, 2018 at 5:05 PM Tim Robertson
mailto:timrobertson...@gmail.com>> wrote:
 >
 > A colleague and I tested on 2.7.0 and 2.8.0RC1:
 >
 > 1. Quickstart on Spark/YARN/HDFS (CDH 5.12.0) (commented in
spreadsheet)
 > 2. Our Avro to Avro pipelines on Spark/YARN/HDFS (note we
backport the un-merged BEAM-5036 fix in our code)
 > 3. Our Avro to Elasticsearch pipelines on Spark/YARN/HDFS
 >
 > Everything worked, and performance was similar on both.
 > We built using maven pointing at
https://repository.apache.org/content/repositories/orgapachebeam-1049/
 >
 > Based on this limited testing: +1
 >
 > Thank you to the release managers,
 > Tim
 >
 >
 > On Thu, Oct 25, 2018 at 7:21 PM Tim mailto:timrobertson...@gmail.com>> wrote:
 >>
 >> I can do some tests on Spark / YARN tomorrow (CEST timezone).
Sorry I’ve just been too busy to assist.
 >>
 >> Tim
 >>
 >> On 25 Oct 2018, at 18:59, Kenneth Knowles mailto:k...@apache.org>> wrote:
 >>
 >> I tried to do a more thorough job on this.
 >>
 >>  - I could not reproduce the slowdown in Query 9. I believe the
variance was simply high given the parameters and environment
 >>  - I saw the same slowdown in Query 8 when running as part of
the suite, but it vanished when I ran repeatedly on its own, so
again it is not good methodology probably
 >>
 >> We do have the dashboard at
https://apache-beam-testing.appspot.com/dashboard-admin though no
anomaly detection set up AFAIK.
 >>
 >>  - There is no issue easily visible in DirectRunner:
https://apache-beam-testing.appspot.com/explore?dashboard=5084698770407424
 >>  - There is a notable degradation in Spark runner on 10/5 for
many queries.
https://apache-beam-testing.appspot.com/explore?dashboard=5138380291571712
 >>  - Something minor happened for Dataflow around 10/1:
https://apache-beam-testing.appspot.com/explore?dashboard=5670405876482048
 >>  - Flink runner seems to have had some fantastic improvements
:-)
https://apache-beam-testing.appspot.com/explore?dashboard=5699257587728384
 >>
 >> So if there is a blocker it would really be the Spark runner
perf changes. Of course, all these except Dataflow are using local
instances so may not be representative of larger scale AFAIK.
 >>
 >> Kenn
 >>
 >> On Wed, Oct 24, 2018 at 9:48 AM Maximilian Michels
mailto:m...@apache.org>> wrote:
 >>>
 >>> I've run WordCount using Quickstart with the FlinkRunner
(locally and
 >>> against a Flink cluster).
 >>>
 >>> Would give a +1 but waiting what Kenn finds.
 >>>
 >>> -Max
 >>>
 >>> On 23.10.18 07:11, Ahmet Altay wrote:
 >>> >
 >>> >
 >>> > On Mon, Oct 22, 2018 at 10:06 PM, Kenneth Knowles
mailto:k...@apache.org>
 >>> > <mailto:k...@apache.org <mailto:k...@apache.org>>> wrote:
 >>> >
 >>> >     You two did so much verification I had a hard time
finding something
 >>> >     where my help was meaningful! :-)
 >>> >
 >>> >     I did run the Nexmark suite on the DirectRunner against
2.7.0 and
 >>> >     2.8.0 following
 >>> >

https://beam.apache.org/documentation/sdks/java/nexmark/#running-smoke-suite-on-the-directrunner-local
 >>> >   
  <https://beam.apache.org/documentation/sdks/java/nexmark/#running-smoke-suite-on-the-directrunner-local>.

 >>> >
 >>> >     It is admittedly a very silly test - the instructions leave
 >>> >     immutability enforcement on, etc. But it does appear that
there is a
 >>> >     30% degradation in query 8 and 15% in query 9. These are
the pure
 >>> >     Java tests, not the SQL variants. The rest of the queries
are close
 >>> >     enough that differences are not meaningful.
 >>> >
 >>> >
 >>> > (It would be a good improvement for us to have alerts on daily
 >>> > benchmarks if we do not have such a concept already.)
 >&g

Re: [BEAM-5442] Store duplicate unknown (runner) options in a list argument

e2"]
 > --foo="string value"

Due to shell escaping, one would have to write

--foo=\"string value\"

or actually, due to the space

--foo='"string value"'

or some other variation on that, which is really
unfortunate. (The JSON list/objects would need similar
quoting, but that's less surprising.) Also, does this mean
we'd only have one kind of number (not integer vs. float,
i.e. --parallelism=5.0 works)? I suppose that is JSON. 

Yes, I was suspecting that users would need to type the second
variant as \"...\" I found more burdensome then '"..."'

 > --foo=3.5 --foo=-4
 > --foo=true --foo=false
 > --foo=null
 > This also works if the flag is repeated, so --foo=3.5
--foo=-4 is [3.5, -4]

The thing that sparked this discussion was what to do when
unknown foo is repeated, but only one value given.

If the person only specifies one value, then they have to
disambiguate and put it in a list, only if they specify more
then one value will they have to turn it into a list.

I believe we could come up with other schemes on how to convert
unknown options to JSON where we prefer strings over non-string
types like null/true/false/numbers/list/object and require the
user to escape out of the string default but anything that is
too different from strict JSON would cause headaches when
attempting to explain the format to users. I think a happy
middle ground would be that we will only require escaping for
strings which are ambiguous, so things like true, null, false,
... to be treated as strings would require the user to escape them.

I'd prefer to avoid inferring the type of an unknown argument based
on its contents, which can lead to surprises. We could declare every
unknown type to be repeated string, and let any parsing/validation
occur on the runner. If desired, we could pass these around as a
single "runner options" dict that runners could inspect and use to
populate the actual dict rather than mixing parsed and unparsed
options.

 > On Tue, Oct 16, 2018 at 7:56 AM Thomas Weise
mailto:t...@apache.org>> wrote:
 >>
 >> Discovering options from the job server seems preferable
over replicating runner options in SDKs.
 >>
 >> Runners evolve on their own, and with portability the
SDK does not need to know anything about the runner.
 >>
 >> Regarding --runner-option. It is true that this looks
less user friendly. On the other hand it eliminates the
possibility of name collisions.
 >>
 >> But if options are discovered, the SDK can perform full
validation. It would only be necessary to use explicit
scoping when there is ambiguity.
 >>
 >> Thomas
 >>
 >>
 >> On Tue, Oct 16, 2018 at 3:48 AM Maximilian Michels
mailto:m...@apache.org>> wrote:
 >>>
 >>> Fetching options directly from the Runner's JobServer
seems like the
 >>> ideal solution. I agree with Robert that it creates
additional
 >>> complexity for SDK authors, so the `--runner-option`
flag would be an
 >>> easy and explicit way to specify additional Runner options.
 >>>
 >>> The format I prefer would be: --runner_option=key1=val1
 >>> --runner_option=key2=val2
 >>>
 >>> Now, from the perspective of end users, I think it is
neither convenient
 >>> nor reasonable to require the use of the
`--runner-option` flag. To the
 >>> user it seems nebulous why some pipeline options live
in the top-level
 >>> option namespace while others need to be nested within
an option. This
 >>> is amplified by there being two Runners the user needs
to be aware of,
 >>> i.e. PortableRunner and the actual Runner
(Dataflow/Flink/Spark..).
 >>>
 >>> I feel like we would eventually replicate all options
in the SDK because
 >>> otherwise users have to use the `--runner-option`, but
at least we can

Re: Flink 1.6 Support

2018-10-31 Thread Maximilian Michels


Hi Jins,

As Thomas mentioned, the Flink Runner has already been prepared for 
Flink 1.6, you just have to change the Flink version in the Gradle build 
file.


Of course this is not convenient because you can't fetch this version 
via Maven Central. So we're planning to release both versions:


https://issues.apache.org/jira/browse/BEAM-5419

Thanks,
Max

On 31.10.18 06:28, Jins George wrote:
Thank you Thomas.  Idea of providing different build targets  for 
runners is great, as it enables users to pick from a list of runner 
versions.


Thanks

Jins George

On 10/30/18 12:36 PM, Thomas Weise wrote:
There has not been any decision to move to 1.6.x for the next release 
yet.


There has been related general discussion about upgrading runners 
recently [1]


Overall we need to consider the support for newer Flink versions that 
users find (the Flink version in distributions and what users 
typically have in their deployment stacks). These upgrades are not 
automatic/cheap/fast, so there is a balance to strike.


The good news is that with Beam 2.8.0 you should be able to make a 
build for 1.6.x with just a version number change [2]  (Other compile 
differences have been cleaned up.)


[1] 
https://lists.apache.org/thread.html/0588ed783767991aa36b00b8529bbd29b3a8958ee6e82fca83ac2938@%3Cdev.beam.apache.org%3E
[2] 
https://github.com/apache/beam/blob/v2.8.0/runners/flink/build.gradle#L49



On Tue, Oct 30, 2018 at 10:50 AM Lukasz Cwik > wrote:


+dev 

On Tue, Oct 30, 2018 at 10:30 AM Jins George
mailto:jins.geo...@aeris.net>> wrote:

Hi Community,

Noticed that the Beam 2.8 release comes with flink  1.5.x
dependency.
Are there any plans to upgrade flink to  1.6.x  in next beam
release. (
I am looking for the better k8s  support in Flink 1.6)

Thanks,

Jins George

Re: [DISCUSS] Publish vendored dependencies independently

n Maven as follows:
groupId: org.apache.beam
artifactId: beam-vendor-grpc-v1_13_1
version: 1.0.0 (first version and subsequent versions such as
1.0.1 are only for patch upgrades that fix any shading issues we
may have had when producing the vendored jar)


On Wed, Oct 24, 2018 at 6:01 AM Maximilian Michels
mailto:m...@apache.org>> wrote:

Would also keep it simple and optimize for the JAR size only
if necessary.

On 24.10.18 00:06, Kenneth Knowles wrote:
 > I think it makes sense for each vendored dependency to be
self-contained
 > as much as possible. It should keep it fairly simple.
Things that cross
 > their API surface cannot be hidden, of course. Jar size
is not a concern
 > IMO.
 >
 > Kenn
 >
 > On Tue, Oct 23, 2018 at 9:05 AM Lukasz Cwik
mailto:lc...@google.com>
 > <mailto:lc...@google.com <mailto:lc...@google.com>>> wrote:
 >
 >     How should we handle the transitive dependencies of
the things we
 >     want to vendor?
 >
 >     For example we use gRPC which depends on Guava 20 and
we also use
 >     Calcite which depends on Guava 19.
 >
 >     Should the vendored gRPC/Calcite/... be
self-contained so it
 >     contains all its dependencies, hence vendored gRPC
would contain
 >     Guava 20 and vendored Calcite would contain Guava 19
(both under
 >     different namespaces)?
 >     This leads to larger jars but less vendored
dependencies to maintain.
 >
 >     Or should we produce a vendored library for those
that we want to
 >     share, e.g. Guava 20 that could be reused across
multiple vendored
 >     libraries?
 >     Makes the vendoring process slightly more
complicated, more
 >     dependencies to maintain, smaller jars.
 >
 >     Or should we produce a vendored library for each
dependency?
 >     Lots of vendoring needed, likely tooling required to
be built to
 >     maintain this.
 >
 >
 >
 >
 >     On Tue, Oct 23, 2018 at 8:46 AM Kenneth Knowles
mailto:k...@google.com>
 >     <mailto:k...@google.com <mailto:k...@google.com>>> wrote:
 >
 >         I actually created the subtasks by finding things
shaded by at
 >         least one module. I think each one should
definitely have an on
 >         list discussion that clarifies the target
artifact, namespace,
 >         version, possible complications, etc.
 >
 >         My impression is that many many modules shade
only Guava. So for
 >         build time and simplification that is a big win.
 >
 >         Kenn
 >
 >         On Tue, Oct 23, 2018, 08:16 Thomas Weise
mailto:t...@apache.org>
 >         <mailto:t...@apache.org <mailto:t...@apache.org>>>
wrote:
 >
 >             +1 for separate artifacts
 >
 >             I would request that we explicitly discuss
and agree which
 >             dependencies we vendor though.
 >
 >             Not everything listed in the JIRA subtasks is
currently
 >             relocated.
 >
 >             Thomas
 >
 >
 >             On Tue, Oct 23, 2018 at 8:04 AM David Morávek
 >             mailto:david.mora...@gmail.com>
<mailto:david.mora...@gmail.com
<mailto:david.mora...@gmail.com>>>
 >             wrote:
 >
 >                 +1 This should improve build times a lot.
It would be
 >                 great if vendored deps could stay in the
main repository.
 >
 >                 D.
 >
 >                 On Tue, Oct 23, 2018 at 12:21 PM
Maximilian Michels
 >                 mailto:m...@apache.org>
<mailto:

Re: [DISCUSS] Publish vendored dependencies independently


On 25.10.18 19:23, Lukasz Cwik wrote:



On Thu, Oct 25, 2018 at 9:59 AM Maximilian Michels <mailto:m...@apache.org>> wrote:


Question: How would a user end up with the same shaded dependency
twice?
The shaded dependencies are transitive dependencies of Beam and thus,
this shouldn't happen. Is this a safe-guard when running different
versions of Beam in the same JVM?


What I was referring to was that they aren't exactly the same dependency 
but slightly different versions of the same dependency. Since we are 
planning to vendor each dependency and its transitive dependencies as 
part of the same jar, we can have  vendor-A that contains shaded 
transitive-C 1.0 and vendor-B that contains transitive-C 2.0 both with 
different package prefixes. It can be that transitive-C 1.0 and 
transitive-C 2.0 can't be on the same classpath because they can't be 
perfectly shaded due to JNI, java reflection, magical property 
files/strings, ...




Ah yes. Get it. Thanks!

Re: New Edit button on beam.apache.org pages


Cool!

I guess the underlying change is that the website can now be edited 
through the main repository and we don't have to go through "beam-site"?


-Max

On 25.10.18 12:20, Alexey Romanenko wrote:

This is really cool feature! With a tab “Preview changes” it makes 
documentation updating much more easier to do.
Thanks a lot to Alan and Scott!


On 25 Oct 2018, at 09:48, Robert Bradshaw  wrote:

Very cool! Thanks!
On Thu, Oct 25, 2018 at 9:38 AM Connell O'Callaghan  wrote:


Alan and Scott thank you for this enhancement and for reducing the obstacles to 
entry/contribute for new (and existing) members of the BEAM community

On Wed, Oct 24, 2018 at 8:50 PM Kenneth Knowles  wrote:


This is a genius way to involve everyone who lands on the site! My first PR is 
about to open... :-)

Kenn

On Wed, Oct 24, 2018 at 8:47 PM Jean-Baptiste Onofré  wrote:


Sweet !!

Thanks !

Regards
JB

On 24/10/2018 23:24, Alan Myrvold wrote:

To make small documentation changes easier, there is now an Edit button
at the top right of the pages on https://beam.apache.org. This button
opens the source .md file on the master branch of the beam repository in
the github web editor. After making changes you can create a pull
request to ask to have it merged.

Thanks to Scott for the suggestion to add this in [BEAM-4431]


Let me know if you run into any issues.

Alan

Re: What is required for LTS releases? (was: [PROPOSAL] Prepare Beam 2.8.0 release)

2018-11-05 Thread Maximilian Michels

The result shows that there is a demand for an LTS release.

+1 for using an existing release. How about six months for the initial
LTS release? I think it shouldn't be too long for the first one to give
us a chance to make changes to the model.

-Max

On 02.11.18 17:26, Ahmet Altay wrote:

Twitter vote concluded with 52 votes with the following results:
- 52% Stable LTS releases
- 46% Upgrade to latest release
- 2% Keep using older releases

This reads like another supporting evidence for making LTS releases. In
the light of this, what do you all think about Kenn's proposal of making
existing branch an LTS branch?

On Thu, Oct 25, 2018 at 4:18 PM, Ahmet Altay > wrote:

On Tue, Oct 23, 2018 at 3:03 PM, Kenneth Knowles mailto:k...@apache.org>> wrote:

Yes, user@ cannot reach new users, really. Twitter might, if we
have enough of adjacent followers to get it in front of the
right people. On the other hand, I find testimonials from
experience convincing in this case.

I agree I am not sure how much additional input we will get from a
twitter poll. Started one anyway
(https://twitter.com/ApacheBeam/status/1055598972423684096
). I used
Thomas's version as the basis and had to shorten it to fit the
character limits.

Kenn

On Tue, Oct 23, 2018 at 2:59 PM Ahmet Altay mailto:al...@google.com>> wrote:

On Tue, Oct 23, 2018 at 9:16 AM, Thomas Weise
mailto:t...@apache.org>> wrote:

On Mon, Oct 22, 2018 at 2:42 PM Ahmet Altay
mailto:al...@google.com>> wrote:

We attempted to collect feedback on the mailing
lists but did not get much input. From my experience
(mostly based on dataflow) there is a sizeable group
of users who are less interested in new features and
want a version that is stable, that does not have
security issues, major data integrity issues etc. In
Beam's existing release model that corresponds to
the latest release.

It would help a lot if we can hear the perspectives
of other users who are not present here through the
developers who work with them.

Perhaps user@ and Twitter are good ways to reach
relevant audience.

We tried user@ before did not get any feedback [1]. Polling
on twitter sounds like a good idea. Unless there is an
objection, I can start a poll with Thomas's proposed text as
is on Beam's twitter account.

[1]

https://lists.apache.org/thread.html/7d890d6ed221c722a95d9c773583450767b79ee0c0c78f48a56c7eba@%3Cuser.beam.apache.org%3E

A poll could look like this:

The Beam community is considering LTS (Long Term
Support) for selected releases. LTS releases would only
contain critical bug fixes (security, data integrity
etc.) and offer an alternative to upgrading to latest
Beam release with new features. Please indicate your
preference for Beam upgrades:

1) Always upgrading to the latest release because I need
latest features along with bug fixes
2) Interested to switch to LTS releases to obtain
critical fixes
3) Not upgrading (using older release for other reasons)

Re: [ANNOUNCE] New committer announcement, Euphoria edition

2018-11-02 Thread Maximilian Michels

Congrats David! Looking forward to seeing more awesome work on 
Euphoria/Beam.


-Max

On 02.11.18 09:23, Ismaël Mejía wrote:

Congratulations, and thanks for all the hard work on making Euphoria
Beam ready !
On Fri, Nov 2, 2018 at 12:06 AM Scott Wegner  wrote:


Congrats David!

On Thu, Nov 1, 2018 at 2:21 PM Thomas Weise  wrote:


Congrats!


On Thu, Nov 1, 2018 at 12:05 PM Łukasz Gajowy  wrote:


Congratulations! :)

czw., 1 lis 2018 o 17:52 Alan Myrvold  napisał(a):


Congrats! Euphoria for beam looks awesome.

On Thu, Nov 1, 2018 at 9:49 AM Ahmet Altay  wrote:


Congratulations!

On Thu, Nov 1, 2018 at 9:36 AM, Tim  wrote:


Congratulations and welcome!

Tim

On 1 Nov 2018, at 17:06, Matthias Baetens  wrote:

Congrats David!!!

On Thu, Nov 1, 2018, 16:04 Kenneth Knowles  wrote:


Hi all,

Please join me and the rest of the Beam PMC in welcoming a new committer:

  - David Morávek: of, but not limited to, the new Euphoria API

Through his work with us merging the Euphoria API, community outreach, and 
other contributions to Beam, the PMC trusts David with the responsibilities of 
a Beam committer [1].

Kenn

[1] 
https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer


--







--




Got feedback? tinyurl.com/swegner-feedback

Re: Follow up ideas, to simplify creating MonitoringInfos.

2018-11-02 Thread Maximilian Michels


I was unable to get this to compile and could find no examples of this on the 
proto github.


Would it be helpful to post the compiler output?

-Max

On 31.10.18 19:19, Lukasz Cwik wrote:
I see and don't know how to help you beyond what your already 
suggesting. From what I remember, maps were added as syntactic sugar of 
lists of key value pairs.


On Tue, Oct 30, 2018 at 5:37 PM Alex Amato > wrote:


I am not sure on the correct syntax to populate the instances of my
MonitoringInfoSpec messages

message MonitoringInfoSpec {

string urn = 1;

string type_urn = 2;

repeated string required_labels = 3;

*map annotations = 4;*

}


Notice how the annotations field is not used anywhere. I was unable
to get this to compile and could find no examples of this on the
proto github. Perhaps I'll have to reach out to them. I was
wondering if anyone here was familiar first.


message MonitoringInfoSpecs {

enum MonitoringInfoSpecsEnum {

   USER_COUNTER = 0 [(monitoring_info_spec) = {

 urn: "beam:metric:user",

 type_urn: "beam:metrics:sum_int_64",

   }];


   ELEMENT_COUNT = 1 [(monitoring_info_spec) = {

 urn: "beam:metric:element_count:v1",

 type_urn: "beam:metrics:sum_int_64",

 required_labels: ["PTRANSFORM"],

   }];


   START_BUNDLE_MSECS = 2 [(monitoring_info_spec) = {

 urn: "beam:metric:pardo_execution_time:start_bundle_msecs:v1",

 type_urn: "beam:metrics:sum_int_64",

 required_labels: ["PTRANSFORM"],

   }];


   PROCESS_BUNDLE_MSECS = 3 [(monitoring_info_spec) = {

 urn: "beam:metric:pardo_execution_time:process_bundle_msecs:v1",

 type_urn: "beam:metrics:sum_int_64",

 required_labels: ["PTRANSFORM"],

   }];


   FINISH_BUNDLE_MSECS = 4 [(monitoring_info_spec) = {

 urn: "beam:metric:pardo_execution_time:finish_bundle_msecs:v1",

 type_urn: "beam:metrics:sum_int_64",

 required_labels: ["PTRANSFORM"],

   }];


   TOTAL_MSECS = 5 [(monitoring_info_spec) = {

 urn: "beam:metric:ptransform_execution_time:total_msecs:v1",

 type_urn: "beam:metrics:sum_int_64",

 required_labels: ["PTRANSFORM"],

   }];

}

}




On Tue, Oct 30, 2018 at 2:01 PM Lukasz Cwik mailto:lc...@google.com>> wrote:

I'm not sure what you mean by "Using a map in an option."

For your second issue, the google docs around this show[1]:

Note that if you want to use a custom option in a package other
than the one in which it was defined, you must prefix the option
name with the package name, just as you would for type names.
For example:

// foo.proto
import "google/protobuf/descriptor.proto";
package foo;
extend google.protobuf.MessageOptions {
   optional string my_option = 51234;
}

// bar.proto
import "foo.proto";
package bar;
message MyMessage {
   option (foo.my_option) = "Hello world!";
}


1:
https://developers.google.com/protocol-buffers/docs/proto#customoptions


On Mon, Oct 29, 2018 at 5:19 PM Alex Amato mailto:ajam...@google.com>> wrote:

Hi Robert and community, :)

I was starting to code this up, but I wasn't sure exactly
how to do some of the proto syntax. Would you mind taking a
look at this doc


and let me know if you know how to resolve any of these issues:

  * Using a map in an option.
  * Referring to string "constants" from other enum options
in .proto files.

Please see the comments I have listed in the doc

,
and a few alternative suggestions.
Thanks

On Wed, Oct 24, 2018 at 10:08 AM Alex Amato
mailto:ajam...@google.com>> wrote:

Okay. That makes sense. Using runtime validation and
protos is what I was thinking as well.
I'll include you as a reviewer in my PRs.

As for the choice of a builder/constructor/factory. If
we go with factory methods/constructor then we will need
a method for each metric type (intCounter, latestInt64,
...). Plus, then I don't think we can have any
constructor parameters for labels, we will just need to
accept a dictionary for label keys+values like you say.
This is because we cannot offer a method for each URN
and its combination of labels, and we should avoid such
static

Re: Never get spotless errors with this one weird trick

2018-11-02 Thread Maximilian Michels


Scott just separated the spotless check from the Java unit test precommit job, 
so you get faster feedback on spotless errors.


Nice!

+1 for the pre-commit hook. Have it set up. Unfortunately, it doesn't 
work with the GitHub merge button.


Cheers,
Max

On 02.11.18 09:26, Ismaël Mejía wrote:

Nice idea, thanks for sharing and thanks Scott for separating this in the build.

On Thu, Nov 1, 2018 at 11:51 PM Alan Myrvold  wrote:


Thanks for the trick. I added it to 
https://cwiki.apache.org/confluence/display/BEAM/Java+Tips

On Thu, Nov 1, 2018 at 2:26 PM Ankur Goenka  wrote:


Thanks for sharing the trick.


On Thu, Nov 1, 2018 at 9:30 AM Kenneth Knowles  wrote:


Hi all,

Scott just separated the spotless check from the Java unit test precommit job, 
so you get faster feedback on spotless errors.

I wondered if there was a good place to just always reformat, and whether it 
was fast enough to be OK. The answer is yes, and yes.

You can set up a git precommit hook to always autoformat code, by putting this 
in .git/hooks/pre-commit and setting the executable bit.

 #!/bin/sh
 set -e
 ./gradlew spotlessApply

If you haven't used git hooks, the docs are here: 
https://git-scm.com/docs/githooks. I'll call out that --no-verify will skip it 
and `chmod u-x` will disable it.

Then testing the time:

  - From a fresh checkout ./gradlew spotlessJavaApply took 24s configuration 
and 49s spotlessApply
  - Then I modified one file in nexmark, messed up the formatting, and committed
  - The re-run took 1s in configuration and 4s in spotlessApply

So this will add ~5s of waiting each time you `git commit`. You can decide if it is worth 
it to you. If you are a "push a bunch of commits to be squashed" GitHub user, 
you could amortize it by making it a pre-push hook that adds a spotless commit (`git 
commit --fixup HEAD`).

Kenn

Re: Unbalanced FileIO writes on Flink

Oh ok, thanks for the pointer. Coming from Flink, the default is that 
the sharding is determined by the runtime distribution. Indeed, we will 
have to add an overwrite to the Flink Runner, similar to this one:


https://github.com/apache/beam/commit/cbb922c8a72680c5b8b4299197b515abf650bfdf#diff-a79d5c3c33f6ef1c4894b97ca907d541R347

Jira issue: https://issues.apache.org/jira/browse/BEAM-5865

Thanks,
Max

On 25.10.18 22:37, Reuven Lax wrote:
FYI the Dataflow runner automatically sets the default number of shards 
(I believe to be 2 * num_workers). Probably we should do something 
similar for the Flink runner.


This needs to be done by the runner, as # of workers is a runner 
concept; the SDK itself has no concept of workers.


On Thu, Oct 25, 2018 at 3:28 AM Jozef Vilcek <mailto:jozo.vil...@gmail.com>> wrote:


If I do not specify shards for unbounded collection, I get

Caused by: java.lang.IllegalArgumentException: When applying
WriteFiles to an unbounded PCollection, must specify number of
output shards explicitly
         at

org.apache.beam.repackaged.beam_sdks_java_core.com.google.common.base.Preconditions.checkArgument(Preconditions.java:191)
         at
org.apache.beam.sdk.io.WriteFiles.expand(WriteFiles.java:289)

Around same lines in WriteFiles is also a check for windowed writes.
I believe FileIO enables it explicitly when windowing is present. In
filesystem written files are per window and shard.

On Thu, Oct 25, 2018 at 12:01 PM Maximilian Michels mailto:m...@apache.org>> wrote:

I agree it would be nice to keep the current distribution of
elements
instead of doing a shuffle based on an artificial shard key.

Have you tried `withWindowedWrites()`? Also, why do you say you
need to
specify the number of shards in streaming mode?

-Max

On 25.10.18 10:12, Jozef Vilcek wrote:
 > Hm, yes, this makes sense now, but what can be done for my
case? I do
 > not want to end up with too many files on disk.
 >
 > I think what I am looking for is to instruct IO that do not
do again
 > random shard and reshuffle but just assume number of shards
equal to
 > number of workers and shard ID is a worker ID.
 > Is this doable in beam model?
 >
 > On Wed, Oct 24, 2018 at 4:07 PM Maximilian Michels
mailto:m...@apache.org>
 > <mailto:m...@apache.org <mailto:m...@apache.org>>> wrote:
 >
 >     The FlinkRunner uses a hash function (MurmurHash) on each
key which
 >     places keys somewhere in the hash space. The hash space
(2^32) is split
 >     among the partitions (5 in your case). Given enough keys,
the chance
 >     increases they are equally spread.
 >
 >     This should be similar to what the other Runners do.
 >
 >     On 24.10.18 10:58, Jozef Vilcek wrote:
 >      >
 >      > So if I run 5 workers with 50 shards, I end up with:
 >      >
 >      > DurationBytes receivedRecords received
 >      >   2m 39s        900 MB            465,525
 >      >   2m 39s       1.76 GB            930,720
 >      >   2m 39s        789 MB            407,315
 >      >   2m 39s       1.32 GB            698,262
 >      >   2m 39s        788 MB            407,310
 >      >
 >      > Still not good but better than with 5 shards where
some workers
 >     did not
 >      > participate at all.
 >      > So, problem is in some layer which distributes keys /
shards
 >     among workers?
 >      >
 >      > On Wed, Oct 24, 2018 at 9:37 AM Reuven Lax
mailto:re...@google.com>
 >     <mailto:re...@google.com <mailto:re...@google.com>>
 >      > <mailto:re...@google.com <mailto:re...@google.com>
<mailto:re...@google.com <mailto:re...@google.com>>>> wrote:
 >      >
 >      >     withNumShards(5) generates 5 random shards. It
turns out that
 >      >     statistically when you generate 5 random shards
and you have 5
 >      >     works, the probability is reasonably high that
some workers
 >     will get
 >      >     more than one shard (and as a result not all
workers will
 >      >     participate). Are you able to set the number of
shards larger
 >     than 5?
 >      >
 >      >     On Wed, Oct 24, 2018 at 12:28 AM Jozef Vilcek

Re: Unbalanced FileIO writes on Flink