Re: Use Coder message for cross-lang ExternalConfigurationPayload?

2020-08-03 Thread Brian Hulette
I've opened BEAM-10571 [1] for this, and I'm most of the way to an
implementation now. Aiming to have it done before the 2.24.0 cut since it
will be the last release with Python 2 support.

[1] https://issues.apache.org/jira/browse/BEAM-10571

On Wed, Jul 15, 2020 at 9:03 AM Chamikara Jayalath 
wrote:

>
>
> On Fri, Jul 10, 2020 at 4:47 PM Robert Bradshaw 
> wrote:
>
>> On Fri, Jul 10, 2020 at 4:36 PM Brian Hulette 
>> wrote:
>> >
>> > Ah yes I'm +1 for that approach too - it would let us leverage all the
>> schema-inference already in the Java SDK for translating configuration
>> objects which would be great.
>> > Things on the Python side would be trickier as schemas don't formally
>> support all the types you can use in the PayloadBuilder implementations [1]
>> yet, just NamedTuple. For now we could just make the PayloadBuilder
>> implementations generate Rows without making that translation available for
>> use in PCollections.
>>
>
> This will be a good opportunity to add some sort of a minimal Python type
> to Beam schema mapping :)
>
>
>>
>> Yes, though eventually it might be nice to support all of these
>> various types as schema'd PCollection elements as well.
>>
>> > Do we need to worry about update compatibility for
>> ExternalConfigurationPayload?
>>
>> Technically, each URN defines its own payload, and the fact that we've
>> settled on ExternalConfigurationPayload is a convention. On a
>> practical note, we haven't declared these protos stable yet. (I would
>> like to do so before we drop support for Python 2, as external
>> transforms are a possible escape hatch and the first strong motivation
>> to have external transforms that span Beam versions).
>>
>
> +1
>
>
>>
>> > [1]
>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/external.py
>> >
>> > On Fri, Jul 10, 2020 at 4:23 PM Robert Bradshaw 
>> wrote:
>> >>
>> >> I would be in favor of just using a schema to store the entire
>> >> configuration. The reason we went with what we have today is that we
>> >> didn't have cross-language schemas yet.
>> >>
>> >> On Fri, Jul 10, 2020 at 12:24 PM Brian Hulette 
>> wrote:
>> >> >
>> >> > Hi everyone,
>> >> > I noticed that currently the ExternalConfigurationPayload uses a
>> list of coder URNs to represent the coder that was used to serialize each
>> configuration field [1]. This seems acceptable at first blush, but there's
>> one notable issue: it has no place to store a payload for the coder. Most
>> standard coders don't use a payload so it's not a problem, but row coder
>> does use a payload to store it's schema, which means it can't be used in an
>> ExternalConfigurationPayload today.
>> >> >
>> >> > Is there a reason not to just use the Coder message [2] in
>> ExternalConfigurationPayload instead of a list of coder URNs? That would
>> work with row coder, and it would also make it easier to re-use logic for
>> translating Pipeline protos.
>> >> >
>> >> > I'd be happy to make this change, but I wanted to ask on dev@ in
>> case there's something I'm missing here.
>> >> >
>> >> > Brian
>> >> >
>> >> > [1]
>> https://github.com/apache/beam/blob/c54a0b7f49f2eb4a15df115205e2fa455116ccbe/model/pipeline/src/main/proto/external_transforms.proto#L34-L35
>> >> > [2]
>> https://github.com/apache/beam/blob/c54a0b7f49f2eb4a15df115205e2fa455116ccbe/model/pipeline/src/main/proto/beam_runner_api.proto#L542-L555
>>
>
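
For reference, the Coder message in [2] above pairs a FunctionSpec (a URN plus
an opaque bytes payload) with component coder ids, which is what lets row coder
carry its schema. A minimal sketch of building one via the generated Java
protos (schemaProto stands in for an actual serialized Schema proto; this is
illustrative only, not code from the proposal):

    import org.apache.beam.model.pipeline.v1.RunnerApi;

    // A row coder reference: the URN identifies the coder, and the payload
    // carries the serialized schema that a bare list of URNs cannot hold.
    RunnerApi.Coder rowCoder =
        RunnerApi.Coder.newBuilder()
            .setSpec(
                RunnerApi.FunctionSpec.newBuilder()
                    .setUrn("beam:coder:row:v1")
                    .setPayload(schemaProto.toByteString()))
            .build();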


Re: Unknown accumulator coder error when running cross-language SpannerIO Write

2020-08-03 Thread Brian Hulette
The DirectRunner error looks like it's because the FnApiRunner doesn't
support SDF.

What is the coder id for the Flink error? It looks like the full stack
trace should contain it.

On Mon, Aug 3, 2020 at 10:09 AM Piotr Szuberski 
wrote:

> I'm writing a SpannerIO.Write cross-language transform and when I try to run
> it from Python I receive errors:
>
> On Flink:
> apache_beam.utils.subprocess_server: INFO: b'Caused by:
> java.lang.IllegalArgumentException: Transform external_1HolderCoder uses
> unknown accumulator coder id %s'
> apache_beam.utils.subprocess_server: INFO: b'\tat
> org.apache.beam.vendor.guava.v26_0_jre.com.google.common.base.Preconditions.checkArgument(Preconditions.java:216)'
> apache_beam.utils.subprocess_server: INFO: b'\tat
> org.apache.beam.runners.core.construction.graph.PipelineValidator.validateCombine(PipelineValidator.java:273)'
>
> On DirectRunner:
>   File
> "/Users/piotr/beam/sdks/python/apache_beam/runners/portability/fn_api_runner/fn_runner.py",
> line 181, in run_via_runner_api
> self._validate_requirements(pipeline_proto)
>   File
> "/Users/piotr/beam/sdks/python/apache_beam/runners/portability/fn_api_runner/fn_runner.py",
> line 264, in _validate_requirements
> raise ValueError(
> ValueError: Missing requirement declaration:
> {'beam:requirement:pardo:splittable_dofn:v1'}
>
> I suppose that SpannerIO.Write uses a transform that cannot be translated
> for cross-language usage? I'm not sure whether there is something I can do
> about it.
>
>


Re: Request Throttling in OSSIO

2020-08-03 Thread Luke Cwik
Since you are working on a new connector, I would very strongly
suggest writing it as a splittable DoFn instead of an UnboundedSource. See
this thread [1] for additional details and some caveats on the
recommendation.

1) You can return false from advance() and the runner will call advance()
again at some point in time instead of sleeping. This is also the correct
thing to do if you hit a throttling error. With a splittable DoFn you can
return a process continuation, allowing you to suggest an amount of time to
wait before being resumed (see the sketch after the links below).

2) It looks like null was returned as the checkpoint mark coder [2]; a fix is
also sketched below.

1:
https://lists.apache.org/thread.html/r76bac40fd22ebf96f379efbaef36fc27c65bdb859f504e19da76ff01%40%3Cdev.beam.apache.org%3E
2:
https://github.com/apache/beam/blob/fa3ca2b11e2ca031232245814389d29c805f79e7/runners/direct-java/src/main/java/org/apache/beam/runners/direct/UnboundedReadEvaluatorFactory.java#L223
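
To make both points concrete, two short sketches (OssCheckpointMark and the
surrounding classes are hypothetical placeholders, not from the actual
connector):

    import org.apache.beam.sdk.coders.Coder;
    import org.apache.beam.sdk.coders.SerializableCoder;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.joda.time.Duration;

    // (1) In a splittable DoFn, back off on a throttling error by returning
    // a ProcessContinuation instead of sleeping:
    @DoFn.ProcessElement
    public DoFn.ProcessContinuation processElement(/* element, tracker, ... */) {
      // ... on "next request can be made in 200 ms":
      return DoFn.ProcessContinuation.resume().withResumeDelay(Duration.millis(200));
    }

    // (2) In an UnboundedSource, getCheckpointMarkCoder() must not return
    // null; the DirectRunner uses it to clone checkpoint marks (see [2]).
    // Assumes OssCheckpointMark implements Serializable.
    @Override
    public Coder<OssCheckpointMark> getCheckpointMarkCoder() {
      return SerializableCoder.of(OssCheckpointMark.class);
    }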

On Thu, Jul 30, 2020 at 3:41 PM Praveen K Viswanathan <
harish.prav...@gmail.com> wrote:

> Hello Dev team,
>
> We are giving our first shot in writing Beam IO connector for Oracle
> Streaming Service (OSS). The plan is to first implement it for enterprise
> use and based on the feedback and stability make it available open source.
> This is our first attempt in developing a Beam IO connector and so far we
> have progressed with the help of Beam documentation and other related IOs
> like KafkaIO, KinesisIO. Thanks to the community on that front.
>
> Now OSS *has a request rate limit of one read per 200 ms*, so when we read
> the data as shown below in our UnboundedReader's *advance()* method
>
> // Get Messages
>
> GetMessagesResponse getResponse =
> this.streamClient.getMessages(getRequest);
>
> We are able to read around five messages, but after that we get a *request
> throttling error*:
>
> Request was throttled because requests limit exhausted, next request can
> be made in 200 ms
>
> As an initial workaround, we tried introducing *Thread.sleep(200)* before
> the getMessages call, and this time we were *able to read around 300+
> messages*. With the expertise available in this forum, I would like to hear
> input on two points.
>
>1.
>
>How to implement this throttling properly, rather than just with a
>one-line Thread.sleep(200)?
>2.
>
>After adding Thread.sleep(200) and reading 300+ messages, the pipeline
>encountered the error below. I do not see any implementation-specific detail
>in the stack trace. Can I get an insight into what this error could be and
>how to handle it?
>
>java.lang.NullPointerException
>at org.apache.beam.sdk.util.CoderUtils.encodeToSafeStream 
> (CoderUtils.java:82)
>at org.apache.beam.sdk.util.CoderUtils.encodeToByteArray 
> (CoderUtils.java:66)
>at org.apache.beam.sdk.util.CoderUtils.encodeToByteArray 
> (CoderUtils.java:51)
>at org.apache.beam.sdk.util.CoderUtils.clone (CoderUtils.java:141)
>at 
> org.apache.beam.runners.direct.UnboundedReadEvaluatorFactory$UnboundedReadEvaluator.getReader
>  (UnboundedReadEvaluatorFactory.java:224)
>at 
> org.apache.beam.runners.direct.UnboundedReadEvaluatorFactory$UnboundedReadEvaluator.processElement
>  (UnboundedReadEvaluatorFactory.java:132)
>at 
> org.apache.beam.runners.direct.DirectTransformExecutor.processElements 
> (DirectTransformExecutor.java:160)
>at org.apache.beam.runners.direct.DirectTransformExecutor.run 
> (DirectTransformExecutor.java:124)
>at java.util.concurrent.Executors$RunnableAdapter.call 
> (Executors.java:511)
>at java.util.concurrent.FutureTask.run (FutureTask.java:266)
>at java.util.concurrent.ThreadPoolExecutor.runWorker 
> (ThreadPoolExecutor.java:1149)
>at java.util.concurrent.ThreadPoolExecutor$Worker.run 
> (ThreadPoolExecutor.java:624)
>at java.lang.Thread.run (Thread.java:748)
>
>
>
> --
> Thanks,
> Praveen K Viswanathan
>


Re: [External] Re: DISCUSS: Sorted MapState API

2020-08-03 Thread Catlyn Kong
Hey folks,

Sorry I'm late to this thread, but this might be very helpful for the problem
we're dealing with. Do we have a design doc or a Jira ticket I can follow?

Cheers,
Catlyn

On Thu, Jun 18, 2020 at 1:11 PM Jan Lukavský  wrote:

> My questions were just an example. I fully agree there is a fundamental
> need for a sorted state (of some form, and I also think this links to
> efficient implementation of retractions) - I was reacting to Kenn's question
> about a BIP. This one would be a pretty nice example of why it would be good
> to have such a "process" - not everything can be solved on the ML and there
> are fundamental decisions that might need closer attention.
> On 6/18/20 5:28 PM, Reuven Lax wrote:
>
> Jan - my proposal is exactly TimeSortedBagState (more accurately -
> TimeSortedListState), though I went a bit further and also proposed a way
> to have a dynamic number of tagged TimeSortedBagStates.
>
> You are correct that the runner doesn't really have to store the data time
> sorted - what's actually needed is the ability to fetch and remove
> timestamp ranges of data (though that does include fetching the entire
> list); TimeOrderedState is probably a more accurate name than
> TimeSortedState. I don't think we could get away with operations that only
> act on the smallest timestamp, however we could limit the API to only being
> able to fetch and remove prefixes of data (ordered by timestamp). However
> if we support prefixes, we might as well support arbitrary subranges.
>
> On Thu, Jun 18, 2020 at 7:26 AM Jan Lukavský  wrote:
>
>> Big +1 for a BIP, as this might really help clarify all the pros and cons
>> of all possibilities. There seem to be questions that need answering and
>> motivating use cases - do we need sorted map state or can we solve our use
>> cases by something simpler - e.g. the mentioned TimeSortedBagState? Does
>> that really have to be time-sorted structure, or does it "only" have to
>> have operations that can efficiently find and remove element with smallest
>> timestamp (like a PriorityQueue)?
>>
>> Jan
>> On 6/18/20 5:32 AM, Kenneth Knowles wrote:
>>
>> Zooming in from generic philosophy to be clear: adding a time-ordered
>> buffer to the Fn state API is *not* a shortcut. It has benefits that will
>> not be achieved by SDK-side implementation on top of either ordered or
>> unordered multimap. Are those benefits worth expanding the API? I don't
>> know.
>>
>> A change to allow a runner to have a specialized implementation for
>> time-buffered state would be one or more StateKey types, right? Reuven,
>> maybe put this and your Java API in a doc? A BIP? Seems like there's at
>> least the following to explore:
>>
>>  - how that Java API would map to an SDK-side implementation on top of
>> multimap state key
>>  - how that Java API would map to a new StateKey
>>  - whether there's actually more than one relevant implementation of that
>> StateKey
>>  - whether SDK-side implementation on some other state key would be
>> performant enough in all SDK languages (present and future)
>>
>> Zooming back out to generic philosophy: Proliferation of StateKey
>> types tuned by runners (which can very easily still share implementation)
>> is probably better than proliferation of complex SDK-side implementations
>> with varying completeness and performance.
>>
>> Kenn
>>
>> On Wed, Jun 17, 2020 at 3:24 PM Reuven Lax  wrote:
>>
>>> It might help for me to describe what I have in mind. I'm still
>>> proposing that we build multimap, just not a globally-sorted multimap.
>>>
>>> My previous proposal was that we provide a Multimap state
>>> type, sorted by key. This would have two additional operations -
>>> multimap.getRange(startKey, endKey) and multimap.deleteRange(startKey,
>>> endKey). The primary use case was timestamp sorting, but I felt that a
>>> sorted multimap provided a nice generalization - after all, you can simply
>>> key the multimap by timestamp to get timestamp sorting.
>>>
>>> This approach had some issues immediately that would take some work to
>>> solve. Since a multimap key can have any type and a runner will only be
>>> able to sort by encoded type, we would need to introduce a concept of
>>> order-preserving coders into Beam and plumb that through. Robert pointed
>>> out that even our existing standard coders for simple integral types don't
>>> preserve order, so there will likely be surprises here.
>>>
>>> My current proposal is for a multimap that is not sorted by key, but
>>> that can support ordered values for a single key. Remember that a multimap
>>> maps K -> Iterable<V>, so this means that each individual Iterable is
>>> ordered, but the keys have no specific order relative to each other. This
>>> is not too different from many multimap implementations where the keys are
>>> unordered, but the list of values for a single key at least has a stable
>>> order.
>>>
>>> The interface would look like this:
>>>
>>> public interface MultimapState<K, V> extends State {
>>>   // Add a 
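
(The digest truncates the interface here. Purely as an illustration of the
operations described earlier in the thread; the names and signatures below
are guesses, not the actual proposal:)

    // Hypothetical sketch based on the thread's description: per-key value
    // lists with a stable insertion order, keys unordered.
    public interface MultimapState<K, V> extends State {
      void put(K key, V value);   // append value to key's ordered list
      Iterable<V> get(K key);     // values for key, in insertion order
      Iterable<K> keys();         // keys, in no particular order
      void remove(K key);         // remove key and all of its values
    }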

Re: Monitoring performance for releases

2020-08-03 Thread Robert Bradshaw
I have to admit I still have some qualms about making the detection and
fixing of performance regressions part of the release process (which is onerous
enough as it is). Instead, I think we'd be better off with a
separate process to detect and triage performance issues, which, when they
occur, may merit filing a blocker which will require fixing before the
release just like any other blocker would. Hopefully this would result in
issues being detected (and resolved) sooner.

That being said, if a release is known to have performance regressions,
that should be called out when the RCs are cut, and if not resolved,
probably as part of the release notes as well.

On Mon, Aug 3, 2020 at 9:40 AM Maximilian Michels  wrote:

> Here a first version of the updated release guide:
> https://github.com/apache/beam/pull/12455
>
> Feel free to comment.
>
> -Max
>
> On 29.07.20 17:27, Maximilian Michels wrote:
> > Thanks! I'm following up with this PR to display the Flink Pardo
> > streaming data: https://github.com/apache/beam/pull/12408
> >
> > Streaming data appears to be missing for Dataflow. We can revise the
> > Jenkins jobs to add those.
> >
> > -Max
> >
> > On 29.07.20 17:01, Tyson Hamilton wrote:
> >> Max,
> >>
> >> The runner dimensions are present when hovering over a particular
> >> graph. For some more info, the load test configurations can be found
> >> here [1]. I didn't get a chance to look into them, but there are tests
> >> for all the runners there, possibly not for every load test.
> >>
> >> [1]: https://github.com/apache/beam/tree/master/.test-infra/jenkins
> >>
> >> -Tyson
> >>
> >> On Wed, Jul 29, 2020 at 3:46 AM Maximilian Michels wrote:
> >>
> >> Looks like the permissions won't be necessary because backup data
> >> gets
> >> loaded into the local InfluxDb instance which makes writing queries
> >> locally possible.
> >>
> >> On 29.07.20 12:21, Maximilian Michels wrote:
> >>  > Thanks Michał!
> >>  >
> >>  > It is a bit tricky to verify the exported query works if I don't
> >> have
> >>  > access to the data stored in InfluxDb.
> >>  >
> >>  > ==> Could somebody give me permissions to max.mich...@gmail.com
> >>  for
> >>  > apache-beam-testing such that I can setup a ssh port-forwarding
> >> from the
> >>  > InfluxDb pod to my machine? I do have access to see the pods but
> >> that is
> >>  > not enough.
> >>  >
> >>  >> I think that the only test data is from Python streaming tests,
> >> which
> >>  >> are not implemented right now (check out
> >>  >>
> >>
> >>
> http://metrics.beam.apache.org/d/MOi-kf3Zk/pardo-load-tests?orgId=1=batch=python
> >>
> >>
> >>  >> )
> >>  >
> >>  > Additionally, there is an entire dimension missing: Runners. I'm
> >>  > assuming this data is for Dataflow?
> >>  >
> >>  > -Max
> >>  >
> >>  > On 29.07.20 11:55, Michał Walenia wrote:
> >>  >> Hi there,
> >>  >>
> >>  >>  > Indeed the Python load test data appears to be missing:
> >>  >>  >
> >>  >>
> >>
> >>
> http://metrics.beam.apache.org/d/MOi-kf3Zk/pardo-load-tests?orgId=1=streaming=python
> >>
> >>
> >>  >>
> >>  >>
> >>  >> I think that the only test data is from Python streaming tests,
> >> which
> >>  >> are not implemented right now (check out
> >>  >>
> >>
> >>
> http://metrics.beam.apache.org/d/MOi-kf3Zk/pardo-load-tests?orgId=1=batch=python
> >>
> >>
> >>  >> )
> >>  >>
> >>  >> As for updating the dashboards, the manual for doing this is
> >> here:
> >>  >>
> >>
> >>
> https://cwiki.apache.org/confluence/display/BEAM/Community+Metrics#CommunityMetrics-UpdatingDashboards
> >>
> >>
> >>  >>
> >>  >>
> >>  >> I hope this helps,
> >>  >>
> >>  >> Michal
> >>  >>
> >>  >> On Mon, Jul 27, 2020 at 4:31 PM Maximilian Michels <m...@apache.org> wrote:
> >>  >>
> >>  >> Indeed the Python load test data appears to be missing:
> >>  >>
> >>  >>
> >>
> >>
> http://metrics.beam.apache.org/d/MOi-kf3Zk/pardo-load-tests?orgId=1=streaming=python
> >>
> >>
> >>  >>
> >>  >>
> >>  >> How do we typically modify the dashboards?
> >>  >>
> >>  >> It looks like we need to edit this json file:
> >>  >>
> >>  >>
> >>
> >>
> https://github.com/apache/beam/blob/8d460db620d2ff1257b0e092218294df15b409a1/.test-infra/metrics/grafana/dashboards/perftests_metrics/ParDo_Load_Tests.json#L81
> >>
> >>
> >>  >>
> >>  >>
> >>  >> I found some documentation on the deployment:
> >>  >>
> >>  >>
> >>
> >>
> https://cwiki.apache.org/confluence/display/BEAM/Test+Results+Monitoring
> >>  >>
> >>  >>
> >>  >> +1 for alerting or weekly emails including performance
> >> numbers for
> >>  >> fixed
> >>  >> 

Re: Stateful Pardo Question

2020-08-03 Thread jmac...@godaddy.com
Yeah, unless I am misunderstanding something. The output from my repro code
shows the event timestamp and the context timestamp every time we process an
event.

Receiving event at: 2000-01-01T00:00:00.000Z
Resetting timer to : 2000-01-01T00:15:00.000Z
Receiving event at: 2000-01-01T00:05:00.000Z
Resetting timer to : 2000-01-01T00:20:00.000Z <-- Shouldn’t the timer have 
fired before we processed the next event?
Receiving event at: 2000-01-01T00:40:00.000Z
Why didn't the timer fire?
Resetting timer to : 2000-01-01T00:55:00.000Z
Receiving event at: 2000-01-01T00:45:00.000Z
Resetting timer to : 2000-01-01T01:00:00.000Z
Receiving event at: 2000-01-01T00:50:00.000Z
Resetting timer to : 2000-01-01T01:05:00.000Z
Timer firing at: 2000-01-01T01:05:00.000Z

From: Reuven Lax 
Reply-To: "dev@beam.apache.org" 
Date: Monday, August 3, 2020 at 10:02 AM
To: dev 
Subject: Re: Stateful Pardo Question


Are you sure that there is a 15 minute gap in your data?

On Mon, Aug 3, 2020 at 6:20 AM jmac...@godaddy.com wrote:
I am confused about the behavior of timers on a simple stateful pardo. I have 
put together a little repro here: https://github.com/randomsamples/pardo_repro

I basically want to build something like a session window, accumulating events 
until quiescence of the stream for a given key and gap time, then output 
results. But it appears that the timer is not firing when the watermark
passes its expiration time, so the event stream is not being split as I would
have expected. Would love some help getting this to work; the behavior is for
a project I’m working on.
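
For readers following along, a minimal sketch of the timer-reset pattern being
discussed (Event and the 15-minute gap are placeholders; the actual repro code
is at the link above):

    import org.apache.beam.sdk.state.TimeDomain;
    import org.apache.beam.sdk.state.Timer;
    import org.apache.beam.sdk.state.TimerSpec;
    import org.apache.beam.sdk.state.TimerSpecs;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.values.KV;
    import org.joda.time.Duration;

    class GapTimerFn extends DoFn<KV<String, Event>, String> {
      @TimerId("gap")
      private final TimerSpec gapSpec = TimerSpecs.timer(TimeDomain.EVENT_TIME);

      @ProcessElement
      public void process(ProcessContext c, @TimerId("gap") Timer gap) {
        // Reset the gap timer to 15 minutes past this element's timestamp.
        gap.set(c.timestamp().plus(Duration.standardMinutes(15)));
      }

      @OnTimer("gap")
      public void onGap(OnTimerContext c) {
        // Fires when the input watermark passes the last timestamp set above.
      }
    }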


Re: Development on Windows

2020-08-03 Thread Luke Cwik
It looks like Windows file locking is getting in your way: one Gradle task
is reading a temporary file produced by another Gradle test task while that
test task is trying to delete it.

Try running the build with --no-parallel to disable parallel builds [1].
This will make your build a lot slower, but it will avoid the issue of
multiple Gradle tasks stepping on each other's toes due to file locking.
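
For example, from the repository root (using whatever task you were already
running):

    ./gradlew build --no-parallel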

For a longer-term fix, it seems as though Spotless shouldn't care about
these temp files being produced, so feel free to contribute a patch that
stops Spotless from checking anything that isn't a source file.

1:
https://docs.gradle.org/current/userguide/multi_project_builds.html#sec:parallel_execution



On Sun, Aug 2, 2020 at 6:17 AM Matthew Ouyang 
wrote:

> Hello,
>
> I am working on a small patch in a Windows environment, but I'm unable to
> build the entire project end-to-end.  Here are a few of the errors I am
> receiving.
>
> Is there any sort of setup, Windows-specific or otherwise, required in
> order to build the project completely?
>
> Matt
>
>
> ./gradlew build
> FAILURE: Build completed with 5 failures.
>
> 1: Task failed with an exception.
> ---
> * What went wrong:
> Execution failed for task ':beam-test-infra-metrics:spotlessGroovyGradle'.
> > The following files had format violations:
>   .test-infra\metrics\build.gradle
>   @@ -1,120 +1,120 @@
>
>
>
> org.apache.beam.sdk.Pipeline$PipelineExecutionException:
> java.nio.file.FileSystemException:
> C:\Users\maouyan\AppData\Local\Temp\junit1348014844526437497\BigQueryExtractTemp\99a851333c89412b8cf0fe0d53a7c084\.avro:
> The process cannot access the file because it is being used by another
> process.
>
>
> ---
>
> Class org.apache.beam.sdk.io.gcp.bigquery.BigQueryIOReadTest
> all > org.apache.beam.sdk.io.gcp.bigquery > BigQueryIOReadTest
>
> testReadFromTableWithTemplateCompatibility
> org.apache.beam.sdk.Pipeline$PipelineExecutionException:
> java.nio.file.FileSystemException:
> C:\Users\maouyan\AppData\Local\Temp\junit1348014844526437497\BigQueryExtractTemp\99a851333c89412b8cf0fe0d53a7c084\.avro:
> The process cannot access the file because it is being used by another
> process.
>
>
> ---
>
> Class org.apache.beam.sdk.io.gcp.bigquery.BigQueryIOWriteTest
> all > org.apache.beam.sdk.io.gcp.bigquery > BigQueryIOWriteTest
>
> testWritePartitionSinglePartition
> java.nio.file.InvalidPathException: Illegal char <:> at index 10:
> project-id:dataset-id.tables0_files0
>


Re: Beam Dependency Check Report (2020-08-03)

2020-08-03 Thread Damian Gadomski
That's probably caused by this PR [1]; the workspace had been deleted before
the email was sent.

+Udi Meiri: moving the workspace cleanup to the very end
of post-build actions should help.

[1] https://github.com/apache/beam/pull/12326

On Mon, Aug 3, 2020 at 5:42 PM Brian Hulette  wrote:

> Does anyone know what went wrong here? It looks like the
> associated jenkins job [1] succeeded, and produced
> beam-dependency-check-report.html
>
> [1] https://ci-beam.apache.org/job/beam_Dependency_Check/279/
>
> On Mon, Aug 3, 2020 at 5:28 AM Apache Jenkins Server <
> jenk...@builds.apache.org> wrote:
>
>> ERROR: File
>> 'src/build/dependencyUpdates/beam-dependency-check-report.html' does not
>> exist
>
>


Fwd: Contributor permission for Beam Jira tickets

2020-08-03 Thread Varun Sharma
Hi,

This is Varun from THG. I have created a couple of tickets related to
improvements in Beam Java SDK. Could someone add me as a contributor
for Beam's Jira issue tracker? I would like to assign tickets for my work.

My JIRA id- varun.sharma

Regards
Varun Sharma


-- Forwarded message -
From: Varun Sharma 
Date: Fri, Jul 24, 2020 at 12:10 AM
Subject: Contributor permission for Beam Jira tickets
To: 


Hi,

This is Varun from THG. I have created a couple of tickets related to
improvements in Beam Java SDK. Could someone add me as a contributor
for Beam's Jira issue tracker? I would like to assign tickets for my work.

My JIRA id- varun.sharma


Regards
Varun Sharma


Unknown accumulator coder error when running cross-language SpannerIO Write

2020-08-03 Thread Piotr Szuberski
I'm writing a SpannerIO.Write cross-language transform and when I try to run it
from Python I receive errors:

On Flink:
apache_beam.utils.subprocess_server: INFO: b'Caused by: 
java.lang.IllegalArgumentException: Transform external_1HolderCoder uses 
unknown accumulator coder id %s'
apache_beam.utils.subprocess_server: INFO: b'\tat 
org.apache.beam.vendor.guava.v26_0_jre.com.google.common.base.Preconditions.checkArgument(Preconditions.java:216)'
apache_beam.utils.subprocess_server: INFO: b'\tat 
org.apache.beam.runners.core.construction.graph.PipelineValidator.validateCombine(PipelineValidator.java:273)'

On DirectRunner:
  File 
"/Users/piotr/beam/sdks/python/apache_beam/runners/portability/fn_api_runner/fn_runner.py",
 line 181, in run_via_runner_api
self._validate_requirements(pipeline_proto)
  File 
"/Users/piotr/beam/sdks/python/apache_beam/runners/portability/fn_api_runner/fn_runner.py",
 line 264, in _validate_requirements
raise ValueError(
ValueError: Missing requirement declaration: 
{'beam:requirement:pardo:splittable_dofn:v1'}

I suppose that SpannerIO.Write uses a transform that cannot be translated for
cross-language usage? I'm not sure whether there is something I can do about it.



Re: Stateful Pardo Question

2020-08-03 Thread Reuven Lax
Are you sure that there is a 15 minute gap in your data?

On Mon, Aug 3, 2020 at 6:20 AM jmac...@godaddy.com 
wrote:

> I am confused about the behavior of timers on a simple stateful pardo. I
> have put together a little repro here:
> https://github.com/randomsamples/pardo_repro
>
>
>
> I basically want to build something like a session window, accumulating
> events until quiescence of the stream for a given key and gap time, then
> output results. But it appears that the timer is not firing when the
> watermark passes its expiration time, so the event stream is not being
> split as I would have expected. Would love some help getting this to work;
> the behavior is for a project I’m working on.
>


[BEAM-10292] change proposal to DefaultFilenamePolicy.ParamsCoder

2020-08-03 Thread David Janíček
Hello everyone,

I've reported an issue https://issues.apache.org/jira/browse/BEAM-10292
which is about broken DefaultFilenamePolicy.ParamsCoder behavior.
DefaultFilenamePolicy.ParamsCoder loses the information about whether
DefaultFilenamePolicy.Params's baseFilename resource is a file or a directory
on some filesystems, at least on local FS and HDFS.

After discussion with @dmvk and @lukecwik, we have agreed that the best
solution could be to take the breaking change and use ResourceIdCoder for
encoding/decoding DefaultFilenamePolicy.Params's baseFilename; this way the
file/directory information is preserved.
The solution is implemented in pull request
https://github.com/apache/beam/pull/12050.
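
To illustrate the difference, a sketch against the public filesystem APIs
(not code from the PR; the local path is arbitrary):

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import org.apache.beam.sdk.io.FileSystems;
    import org.apache.beam.sdk.io.fs.ResourceId;
    import org.apache.beam.sdk.io.fs.ResourceIdCoder;

    static void roundTripDemo() throws Exception {
      // A ResourceId carries an explicit isDirectory bit that a plain string
      // round-trip has to guess at on decode.
      ResourceId dir =
          FileSystems.matchNewResource("/tmp/out", /* isDirectory= */ true);

      ByteArrayOutputStream out = new ByteArrayOutputStream();
      ResourceIdCoder.of().encode(dir, out);
      ResourceId roundTripped =
          ResourceIdCoder.of().decode(new ByteArrayInputStream(out.toByteArray()));
      // roundTripped.isDirectory() is still true, which is exactly the
      // information the string-based encoding was losing.
    }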

I'd like to ask if there is a consensus on this breaking change. Is
everyone OK with this?
Thanks in advance for answers.

Best regards,
David


Re: Email about build runs on my fork.

2020-08-03 Thread Alex Amato
Thanks, haven't seen any emails since rebasing to master.

On Sun, Aug 2, 2020 at 5:09 AM Tobiasz Kędzierski <
tobiasz.kedzier...@polidea.com> wrote:

> Hi Alex,
>
> After rebasing on the latest master, the scheduled workflow should not run;
> the condition for the scheduled job was extended to prevent situations like
> this. As mentioned by Robert, you can disable GitHub Actions in case you
> don't need them.
>
> BR
> Tobiasz
>
> On Thu, Jul 30, 2020 at 9:18 PM Robert Burke  wrote:
>
>> You can disable GitHub Actions on your own repos via the UI or via a code
>> change
>>
>> https://github.community/t/how-can-i-disable-a-github-action/17049
>>
>> On Thu, Jul 30, 2020, 12:15 PM Ahmet Altay  wrote:
>>
>>> /cc +tobiasz.kedzier...@polidea.com +Emily
>>> Ye -- this question is related to one of the
>>> recent GitHub Actions PRs.
>>>
>>> On Thu, Jul 30, 2020 at 10:23 AM Alex Amato  wrote:
>>>
 Hi,

 I received this email indicating some build was running on my fork,
 though I had not been doing any work on that fork for the last few weeks.

 I don't really need to run these builds on my fork and don't think we
 need to waste resources on this. Is there some way to prevent forks from
 doing this?

 I just rebased my own fork from Apache Beam master now. I am not sure
 if that will stop it or not, but it should now be up to date.

 -- Forwarded message -
 From: Alex Amato 
 Date: Wed, Jul 29, 2020 at 7:47 PM
 Subject: [ajamato/beam] Run failed: Build python wheels - master
 (9ca80ae)
 To: ajamato/beam 
 Cc: Ci activity 


 Run failed for master (9ca80ae)

 Repository: ajamato/beam
 Workflow: Build python wheels
 Duration: 29 minutes and 27.0 seconds
 Finished: 2020-07-30 02:47:17 UTC

 View results
 Jobs:

- build_source: succeeded (0 annotations)
- Build wheels on ubuntu-latest: succeeded (0 annotations)
- Build wheels on macos-latest: succeeded (0 annotations)
- Prepare GCS: succeeded (0 annotations)
- Upload source to GCS bucket: failed (1 annotation)
- Tag repo nightly: succeeded (0 annotations)
- Upload wheels to GCS bucket (ubuntu-latest): cancelled (2 annotations)
- Upload wheels to GCS bucket (macos-latest): failed (1 annotation)
- List files on Google Cloud Storage Bucket: skipped (0 annotations)

 —
 You are receiving this because this workflow ran on your branch.
 Manage your GitHub Actions notifications here.

>>>


Re: Monitoring performance for releases

2020-08-03 Thread Maximilian Michels
Here a first version of the updated release guide: 
https://github.com/apache/beam/pull/12455


Feel free to comment.

-Max

On 29.07.20 17:27, Maximilian Michels wrote:
Thanks! I'm following up with this PR to display the Flink Pardo 
streaming data: https://github.com/apache/beam/pull/12408


Streaming data appears to be missing for Dataflow. We can revise the 
Jenkins jobs to add those.


-Max

On 29.07.20 17:01, Tyson Hamilton wrote:

Max,

The runner dimensions are present when hovering over a particular
graph. For some more info, the load test configurations can be found
here [1]. I didn't get a chance to look into them, but there are tests
for all the runners there, possibly not for every load test.


[1]: https://github.com/apache/beam/tree/master/.test-infra/jenkins

-Tyson

On Wed, Jul 29, 2020 at 3:46 AM Maximilian Michels wrote:


    Looks like the permissions won't be necessary because backup data 
gets

    loaded into the local InfluxDb instance which makes writing queries
    locally possible.

    On 29.07.20 12:21, Maximilian Michels wrote:
 > Thanks Michał!
 >
 > It is a bit tricky to verify the exported query works if I don't
    have
 > access to the data stored in InfluxDb.
 >
 > ==> Could somebody give me permissions to max.mich...@gmail.com
     for
 > apache-beam-testing such that I can setup a ssh port-forwarding
    from the
 > InfluxDb pod to my machine? I do have access to see the pods but
    that is
 > not enough.
 >
 >> I think that the only test data is from Python streaming tests,
    which
 >> are not implemented right now (check out
 >>

http://metrics.beam.apache.org/d/MOi-kf3Zk/pardo-load-tests?orgId=1=batch=python 



 >> )
 >
 > Additionally, there is an entire dimension missing: Runners. I'm
 > assuming this data is for Dataflow?
 >
 > -Max
 >
 > On 29.07.20 11:55, Michał Walenia wrote:
 >> Hi there,
 >>
 >>  > Indeed the Python load test data appears to be missing:
 >>  >
 >>

http://metrics.beam.apache.org/d/MOi-kf3Zk/pardo-load-tests?orgId=1=streaming=python 



 >>
 >>
 >> I think that the only test data is from Python streaming tests,
    which
 >> are not implemented right now (check out
 >>

http://metrics.beam.apache.org/d/MOi-kf3Zk/pardo-load-tests?orgId=1=batch=python 



 >> )
 >>
 >> As for updating the dashboards, the manual for doing this is 
here:

 >>

https://cwiki.apache.org/confluence/display/BEAM/Community+Metrics#CommunityMetrics-UpdatingDashboards 



 >>
 >>
 >> I hope this helps,
 >>
 >> Michal
 >>
 >> On Mon, Jul 27, 2020 at 4:31 PM Maximilian Michels <m...@apache.org> wrote:
 >>
 >>     Indeed the Python load test data appears to be missing:
 >>
 >>

http://metrics.beam.apache.org/d/MOi-kf3Zk/pardo-load-tests?orgId=1=streaming=python 



 >>
 >>
 >>     How do we typically modify the dashboards?
 >>
 >>     It looks like we need to edit this json file:
 >>
 >>

https://github.com/apache/beam/blob/8d460db620d2ff1257b0e092218294df15b409a1/.test-infra/metrics/grafana/dashboards/perftests_metrics/ParDo_Load_Tests.json#L81 



 >>
 >>
 >>     I found some documentation on the deployment:
 >>
 >>

https://cwiki.apache.org/confluence/display/BEAM/Test+Results+Monitoring

 >>
 >>
 >>     +1 for alerting or weekly emails including performance
    numbers for
 >>     fixed
 >>     intervals (1d, 1w, 1m, previous release).
 >>
 >>     +1 for linking the dashboards in the release guide to allow
    for a
 >>     comparison as part of the release process.
 >>
 >>     As a first step, consolidating all the data seems like the 
most

 >>     pressing
 >>     problem to solve.
 >>
 >>     @Kamil I could need some advice regarding how to proceed
    updating the
 >>     dashboards.
 >>
 >>     -Max
 >>
 >>     On 22.07.20 20:20, Robert Bradshaw wrote:
 >>  > On Tue, Jul 21, 2020 at 9:58 AM Thomas Weise <t...@apache.org> wrote:
 >>  >     It appears that there is coverage missing in the 
Grafana

 >>     dashboards
 >>  >     (it could also be that I just don't find it).
 >>  >
 >>  >     For example:
 >>  >
 >>
 >>

https://apache-beam-testing.appspot.com/explore?dashboard=5751884853805056 



 >>
 >>  >
 >>  >     The GBK and ParDo tests have a selection for {batch,
 >>     streaming} and
 >>  >     SDK. 

Re: Beam Dependency Check Report (2020-08-03)

2020-08-03 Thread Brian Hulette
Does anyone know what went wrong here? It looks like the associated jenkins
job [1] succeeded, and produced beam-dependency-check-report.html

[1] https://ci-beam.apache.org/job/beam_Dependency_Check/279/

On Mon, Aug 3, 2020 at 5:28 AM Apache Jenkins Server <
jenk...@builds.apache.org> wrote:

> ERROR: File
> 'src/build/dependencyUpdates/beam-dependency-check-report.html' does not
> exist


Stateful Pardo Question

2020-08-03 Thread jmac...@godaddy.com
I am confused about the behavior of timers on a simple stateful pardo. I have 
put together a little repro here: https://github.com/randomsamples/pardo_repro

I basically want to build something like a session window, accumulating events 
until quiescence of the stream for a given key and gap time, then output 
results. But it appears that the timer is not firing when the watermark
passes its expiration time, so the event stream is not being split as I would
have expected. Would love some help getting this to work; the behavior is for
a project I’m working on.


Beam Dependency Check Report (2020-08-03)

2020-08-03 Thread Apache Jenkins Server
ERROR: File 'src/build/dependencyUpdates/beam-dependency-check-report.html' does not exist