date:20210520

Beam Learning Month is here!

2021-05-20 Thread Mara Ruvalcaba

#BeamLearningMonth is here, and you can now register for the following
meetups:


June 9th, Dataflow templates: https://sg1.run/ataflowemplates


June 16th, Advanced Beam use cases: https://sg1.run/dvancedpacheeamseases


June 23rd, Multi-language pipeline frameworks: https://sg1.run/pacheeamipelines




--
Mara Ruvalcaba
COO, SG Software Guru & Nearshore Link
USA: 512 296 2884
MX: 55 5239 5502

Re: Some questions around GroupIntoBatches

2021-05-20 Thread Reuven Lax

One more question about GroupIntoBatches. The current code has a "prefetch"
frequency built in that calls readLater on the bag state every time we
reach 20% of the batch size. Does anyone know what this is for? Was this
intended to be an optimization, and if so did it help?
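
For readers who have not seen the pattern, the cadence described above can be modelled roughly as follows. This is a toy Python model, not the GroupIntoBatches source: `FakeBagState`, `prefetch_frequency`, and `add_element` are hypothetical names, and the only behaviour taken from the message is "call readLater each time another 20% of the batch size has arrived".

```python
class FakeBagState:
    """Hypothetical stand-in for a bag state; records when read_later() is requested."""

    def __init__(self):
        self.items = []
        self.prefetch_points = []

    def add(self, item):
        self.items.append(item)

    def read_later(self):
        # A real runner might start fetching state asynchronously here;
        # this toy version only records the element count at the time of the hint.
        self.prefetch_points.append(len(self.items))


def prefetch_frequency(batch_size):
    """Elements between prefetch hints: 20% of the batch size, at least 1 (assumed)."""
    return max(1, batch_size // 5)


def add_element(bag, element, batch_size):
    """Buffer one element and issue a prefetch hint at every 20% boundary."""
    bag.add(element)
    if len(bag.items) % prefetch_frequency(batch_size) == 0:
        bag.read_later()


bag = FakeBagState()
for i in range(10):
    add_element(bag, i, batch_size=10)
print(bag.prefetch_points)
```

Whether such a hint pays off presumably depends on how a given runner implements readLater, which is exactly the open question in this message.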

On Thu, May 20, 2021 at 10:08 AM Reuven Lax  wrote:

> I don't think that would work well. It's entirely possible that the input
> watermark will already have passed the timestamp of the hold, in which case
> an event-time timer would fire immediately. You could make it a looping
> timer, but the new timer would also fire immediately after being set, and a
> tight timer loop isn't a good idea.
>
> A Timer.get() is one solution, though I think the only way we have to
> implement it is to store the timer's timestamp in a ValueState; doing this
> for every timer would add a lot of cost to pipelines.
>
> Another option is a Timer.getCurrentTime() method, that would return the
> current base time that setRelative is based off of. It seems like a strange
> function to add to Timer though.
>
> Another option is to use TimerMap to bucket timers. Every minute we round
> the current processing time to the nearest minute and set a timer with an
> expiration of that minute (and with the minute timestamp as its tag as
> well). This way we would have a continuous sequence of expiring timers, and
> we wouldn't have to set just the first one. The biggest problem with this
> approach is that we would also have to use MapState to store the desired
> watermark hold per processing-time bucket. MapState is not supported by
> many runners yet, so I don't want to use it in a basic transform like
> GroupIntoBatches (furthermore - a transform that works on most runners
> today).
>
> Reuven
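
The bucketing idea in the last option above might look roughly like this. It is a toy Python model under stated assumptions: plain dicts stand in for TimerMap and MapState, `BucketedHolds` and `bucket_for` are hypothetical names, and rounding *up* to the next minute boundary is an assumption (the message only says "nearest minute").

```python
MINUTE_MS = 60_000


def bucket_for(processing_time_ms):
    """Round a processing time up to the next whole-minute boundary (assumed direction)."""
    return -(-processing_time_ms // MINUTE_MS) * MINUTE_MS


class BucketedHolds:
    """Toy stand-in for TimerMap + MapState: one timer and one hold per minute bucket."""

    def __init__(self):
        self.timers = {}  # tag (bucket as string) -> expiration in ms
        self.holds = {}   # bucket -> minimum element timestamp seen in that bucket

    def on_element(self, processing_time_ms, element_timestamp_ms):
        bucket = bucket_for(processing_time_ms)
        # Re-setting the same tag to the same expiration is idempotent,
        # so no "is this the first element?" check is needed.
        self.timers[str(bucket)] = bucket
        previous = self.holds.get(bucket, element_timestamp_ms)
        self.holds[bucket] = min(previous, element_timestamp_ms)

    def on_timer(self, tag):
        """Fire one bucket: drop its timer and release its watermark hold."""
        bucket = self.timers.pop(tag)
        return self.holds.pop(bucket)


state = BucketedHolds()
state.on_element(processing_time_ms=61_000, element_timestamp_ms=500)
state.on_element(processing_time_ms=90_000, element_timestamp_ms=200)
state.on_element(processing_time_ms=125_000, element_timestamp_ms=900)
print(state.holds)
```

The per-bucket dict of holds is the part that would need MapState, which is the portability concern raised above.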
>
>
> On Thu, May 20, 2021 at 2:11 AM Jan Lukavský  wrote:
>
>> Sounds like you could solve that using a second event-time timer that
>> would actually be used only to hold the output timestamp (watermark hold).
>> Something like
>>
>>
>> eventTimer.withOutputTimestamp(currentBufferMinimalStamp).offset(Duration.ZERO).setRelative()
>>
>> when the timer fires, you would only reset the minimum.
>>
>> It is sort of ugly, though. It would be cool to have a way to get the
>> current timestamp a timer is set to (if any).
>>
>>  Jan
>> On 5/20/21 3:12 AM, Reuven Lax wrote:
>>
>> 100% - the contract should not change because things are in a bundle.
>> IIRC there are some open bugs in Beam around this that really should be
>> fixed.
>>
>> My issue with GroupIntoBatches is different. This transform works as
>> follows:
>>
>> if (this is the first element in batch - checked by reading a count
>> stored in a ValueState)
>>   timer.offset(bufferingDuration).setRelative()
>>
>> This makes it tricky to use setTimer.withOutputTimestamp. Inputs are not
>> guaranteed to be in order, so simply adding a withOutputTimestamp would set
>> the timestamp to be whatever the first element happened to be; it really
>> should be the minimum timestamp of all elements in the buffer. If we
>> started setting the timer on every element, then timer.offset.setRelative
>> would keep bumping the (processing-time) timer into the future and it would
>> never expire.
>>
>> One solution would be to store the timer timestamp in a ValueState, and
>> use Timer.set to set the timer to an absolute timestamp. This would allow
>> us to always reset the timer to the same expiration target, just modifying
>> the output timestamp each time. However, this will break DirectRunner
>> tests. The DirectRunner allows the user to control the advancement of
>> processing time when using TestStream, but this facility doesn't work well
>> if the transform sets the processing-time timer using absolute set() calls.
>>
>> I'm not sure how to solve this using the existing Timer API.
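
The two behaviours contrasted above can be illustrated with a small simulation. This is a sketch only: `FakeProcessingTimer` is a hypothetical stand-in, not the Beam Timer API, and times are plain integers. It shows that re-arming with a relative offset on every element keeps sliding the deadline, while storing the first deadline and re-setting it absolutely keeps the deadline fixed and still lets the output timestamp track the minimum.

```python
class FakeProcessingTimer:
    """Hypothetical processing-time timer with an attached output timestamp (hold)."""

    def __init__(self):
        self.fire_at = None
        self.output_timestamp = None

    def set_relative(self, now, offset, output_timestamp):
        self.fire_at = now + offset
        self.output_timestamp = output_timestamp

    def set_absolute(self, fire_at, output_timestamp):
        self.fire_at = fire_at
        self.output_timestamp = output_timestamp


def buffer_relative_every_element(arrivals, buffering_duration):
    """Re-arm with a relative offset on every element: the deadline keeps sliding."""
    timer = FakeProcessingTimer()
    min_ts = None
    for now, element_ts in arrivals:
        min_ts = element_ts if min_ts is None else min(min_ts, element_ts)
        timer.set_relative(now, buffering_duration, min_ts)
    return timer


def buffer_absolute_with_stored_target(arrivals, buffering_duration):
    """Remember the first deadline (as a ValueState would) and only update the hold."""
    timer = FakeProcessingTimer()
    stored_target = None
    min_ts = None
    for now, element_ts in arrivals:
        if stored_target is None:
            stored_target = now + buffering_duration
        min_ts = element_ts if min_ts is None else min(min_ts, element_ts)
        timer.set_absolute(stored_target, min_ts)
    return timer


# (processing time of arrival, event timestamp of element); inputs are out of order.
arrivals = [(0, 50), (4, 20), (8, 35)]
sliding = buffer_relative_every_element(arrivals, buffering_duration=10)
fixed = buffer_absolute_with_stored_target(arrivals, buffering_duration=10)
print(sliding.fire_at, sliding.output_timestamp)
print(fixed.fire_at, fixed.output_timestamp)
```

The simulation says nothing about the TestStream issue mentioned above; it only shows why the absolute variant is attractive for the hold.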
>>
>> On Wed, May 19, 2021 at 4:39 PM Robert Bradshaw 
>> wrote:
>>
>>> +1. It was my understanding as well that consensus was that timers
>>> must be delivered in timestamp order, and "within bundle"
>>> resetting/clearing of timers should be respected (as if each timer was
>>> in its own bundle).
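
One way to picture the contract described above (timestamp order, with resets and clears made during firing being respected) is the toy scheduler below. It is a Python sketch under assumptions, not runner code: `fire_timers` and the callback signature are hypothetical, and a heap with lazy invalidation is just one possible technique.

```python
import heapq


def fire_timers(initial, on_fire):
    """Fire timers in timestamp order, honouring sets and clears made while firing.

    initial: dict of timer id -> timestamp.
    on_fire: callback(timer_id, timestamp, timers) that may mutate `timers`.
    """
    timers = dict(initial)
    heap = [(ts, tid) for tid, ts in timers.items()]
    heapq.heapify(heap)
    fired = []
    while heap:
        ts, tid = heapq.heappop(heap)
        if timers.get(tid) != ts:
            continue  # stale entry: the timer was cleared or reset after being queued
        del timers[tid]
        fired.append((tid, ts))
        before = dict(timers)
        on_fire(tid, ts, timers)
        # Queue every timer that the callback added or moved.
        for other, other_ts in timers.items():
            if before.get(other) != other_ts:
                heapq.heappush(heap, (other_ts, other))
    return fired


def callback(timer_id, timestamp, timers):
    if timer_id == "a":
        timers.pop("b", None)  # clear a later timer from within the same "bundle"
        timers["c"] = 15       # move another timer earlier


print(fire_timers({"a": 10, "b": 20, "c": 30}, callback))
```

Here timer "b" never fires and "c" fires at its new timestamp, as if each timer had been delivered in its own bundle.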
>>>
>>> On Wed, May 19, 2021 at 3:01 PM Kenneth Knowles  wrote:
>>> >
>>> > Reading over the other thread, there was consensus to implement.
>>> >
>>> > Reading commentary on the PR, there were good questions raised about
>>> the semantics. Questions which I feel able to have an opinion about :-)
>>> >
>>> > The questions surrounded bundling and timers in the same bundle
>>> clearing each other. Actually the same questions apply to timers re-setting
>>> later timers and +Jan Lukavský has raised this already (among other people)
>>> so we kind of know the answer now, and I think +Boyuan Zhang's code was good
>>> (from my quick read). What has changed is that we have a better idea of the
>>> contract with the runner. I'm not sure if portability makes this more
>>> complex. I will share all my thoughts on this:
>>> >
>>> > I think one key to the Beam model is that

Fwd: Inclusive terminology in Apache Software - a practical approach

2021-05-20 Thread Kenneth Knowles

Check out this tool. I tried it, adjusted Beam's settings a bit.

https://clcdemo.net/analysis.html?project=beam.git

Kenn

-- Forwarded message -
From: Daniel Gruno 
Date: Thu, May 20, 2021 at 6:22 AM
Subject: Inclusive terminology in Apache Software - a practical approach
To: 

Hi folks,
following the discussion on inclusive naming, I got to thinking and
discussing a bit more with Rich and others about practical approaches we
could explore under the EDI banner.

And thus, I started work on a scanner service for repositories that
could identify places where we can improve wording. The resulting
service I have dubbed CLC (Conscious Language Checker), and a demo is up
and running at https://clcdemo.net/

I picked a few Apache projects to test it on, and you can see individual
analyses by clicking on the charts, for instance
https://clcdemo.net/analysis.html?project=httpd.git

Anyone can currently edit settings and see the updated result once a new
scan runs (twice daily currently). This will eventually be committers only.

I am proposing we put this under the EDI banner with a public repository
for the service and a .apache.org hostname, for instance
clc.diversity.apache.org and then invite projects to participate once
the service is production ready. I believe this could be a good
practical goal to achieve at EDI, and that it could help projects more
easily adjust their terminologies.

WDYT?

With regards,
Daniel.

-
To unsubscribe, e-mail: diversity-unsubscr...@apache.org
For additional commands, e-mail: diversity-h...@apache.org

Flaky test issue report

2021-05-20 Thread Beam Jira Bot

This is your daily summary of Beam's current flaky tests. These are P1 issues 
because they have a major negative impact on the community and make it hard to 
determine the quality of the software.

BEAM-12322: FnApiRunnerTestWithGrpcAndMultiWorkers flaky (py precommit) 
(https://issues.apache.org/jira/browse/BEAM-12322)
BEAM-12309: PubSubIntegrationTest.test_streaming_data_only flake 
(https://issues.apache.org/jira/browse/BEAM-12309)
BEAM-12307: PubSubBigQueryIT.test_file_loads flake 
(https://issues.apache.org/jira/browse/BEAM-12307)
BEAM-12303: Flake in PubSubIntegrationTest.test_streaming_with_attributes 
(https://issues.apache.org/jira/browse/BEAM-12303)
BEAM-12293: FlinkSavepointTest.testSavepointRestoreLegacy flakes due to 
FlinkJobNotFoundException (https://issues.apache.org/jira/browse/BEAM-12293)
BEAM-12291: 
org.apache.beam.runners.flink.ReadSourcePortableTest.testExecution[streaming: 
false] is flaky (https://issues.apache.org/jira/browse/BEAM-12291)
BEAM-12200: SamzaStoreStateInternalsTest is flaky 
(https://issues.apache.org/jira/browse/BEAM-12200)
BEAM-12163: Python GHA PreCommits flake with grpc.FutureTimeoutError on SDK 
harness startup (https://issues.apache.org/jira/browse/BEAM-12163)
BEAM-12061: beam_PostCommit_SQL failing on 
KafkaTableProviderIT.testFakeNested 
(https://issues.apache.org/jira/browse/BEAM-12061)
BEAM-12019: 
apache_beam.runners.portability.flink_runner_test.FlinkRunnerTestOptimized.test_flink_metrics
 is flaky (https://issues.apache.org/jira/browse/BEAM-12019)
BEAM-11792: Python precommit failed (flaked?) installing package  
(https://issues.apache.org/jira/browse/BEAM-11792)
BEAM-11666: 
apache_beam.runners.interactive.recording_manager_test.RecordingManagerTest.test_basic_execution
 is flaky (https://issues.apache.org/jira/browse/BEAM-11666)
BEAM-11662: elasticsearch tests failing 
(https://issues.apache.org/jira/browse/BEAM-11662)
BEAM-11661: hdfsIntegrationTest flake: network not found (py38 postcommit) 
(https://issues.apache.org/jira/browse/BEAM-11661)
BEAM-11645: beam_PostCommit_XVR_Flink failing 
(https://issues.apache.org/jira/browse/BEAM-11645)
BEAM-11541: testTeardownCalledAfterExceptionInProcessElement flakes on 
direct runner. (https://issues.apache.org/jira/browse/BEAM-11541)
BEAM-11540: Linter sometimes flakes on apache_beam.dataframe.frames_test 
(https://issues.apache.org/jira/browse/BEAM-11540)
BEAM-10995: Java + Universal Local Runner: 
WindowingTest.testWindowPreservation fails 
(https://issues.apache.org/jira/browse/BEAM-10995)
BEAM-10987: stager_test.py::StagerTest::test_with_main_session flaky on 
windows py3.6,3.7 (https://issues.apache.org/jira/browse/BEAM-10987)
BEAM-10968: flaky test: 
org.apache.beam.sdk.metrics.MetricsTest$AttemptedMetricTests.testAttemptedDistributionMetrics
 (https://issues.apache.org/jira/browse/BEAM-10968)
BEAM-10955: Flink Java Runner test flake: Could not find Flink job  
(https://issues.apache.org/jira/browse/BEAM-10955)
BEAM-10866: PortableRunnerTestWithSubprocesses.test_register_finalizations 
flaky on macOS (https://issues.apache.org/jira/browse/BEAM-10866)
BEAM-10504: Failure / flake in ElasticSearchIOTest > 
testWriteFullAddressing and testWriteWithIndexFn 
(https://issues.apache.org/jira/browse/BEAM-10504)
BEAM-10501: CheckGrafanaStalenessAlerts and PingGrafanaHttpApi fail with 
Connection refused (https://issues.apache.org/jira/browse/BEAM-10501)
BEAM-10485: Failure / flake: ElasticsearchIOTest > testWriteWithIndexFn 
(https://issues.apache.org/jira/browse/BEAM-10485)
BEAM-9649: beam_python_mongoio_load_test started failing due to mismatched 
results (https://issues.apache.org/jira/browse/BEAM-9649)
BEAM-9392: TestStream tests are all flaky 
(https://issues.apache.org/jira/browse/BEAM-9392)
BEAM-9232: BigQueryWriteIntegrationTests is flaky coercing to Unicode 
(https://issues.apache.org/jira/browse/BEAM-9232)
BEAM-9119: 
apache_beam.runners.portability.fn_api_runner_test.FnApiRunnerTest[...].test_large_elements
 is flaky (https://issues.apache.org/jira/browse/BEAM-9119)
BEAM-8101: Flakes in 
ParDoLifecycleTest.testTeardownCalledAfterExceptionInStartBundleStateful for 
Direct, Spark, Flink (https://issues.apache.org/jira/browse/BEAM-8101)
BEAM-8035: [beam_PreCommit_Java_Phrase] 
[WatchTest.testMultiplePollsWithManyResults]  Flake: Outputs must be in 
timestamp order (https://issues.apache.org/jira/browse/BEAM-8035)
BEAM-7992: Unhandled type_constraint in 
apache_beam.io.gcp.bigquery_write_it_test.BigQueryWriteIntegrationTests.test_big_query_write_new_types
 (https://issues.apache.org/jira/browse/BEAM-7992)
BEAM-7827: MetricsTest$AttemptedMetricTests.testAllAttemptedMetrics is 
flaky on DirectRunner (https://issues.apache.org/jira/browse/BEAM-7827)
BEAM-7752: Java Validates DirectRunner: 
testTeardownCalledAfterExceptionInFinishBundleStateful flaky

P1 issues report

2021-05-20 Thread Beam Jira Bot

This is your daily summary of Beam's current P1 issues, not including flaky 
tests.

See https://beam.apache.org/contribute/jira-priorities/#p1-critical for the 
meaning and expectations around P1 issues.

BEAM-12374: Spark postcommit failing ResumeFromCheckpointStreamingTest 
(https://issues.apache.org/jira/browse/BEAM-12374)
BEAM-12316: LGPL in bundled dependencies 
(https://issues.apache.org/jira/browse/BEAM-12316)
BEAM-12310: beam_PostCommit_Java_DataflowV2 failing 
(https://issues.apache.org/jira/browse/BEAM-12310)
BEAM-12279: Implement destination-dependent sharding in FileIO.writeDynamic 
(https://issues.apache.org/jira/browse/BEAM-12279)
BEAM-12258: SQL postcommit timing out 
(https://issues.apache.org/jira/browse/BEAM-12258)
BEAM-12256: PubsubIO.readAvroGenericRecord creates SchemaCoder that fails 
to decode some Avro logical types 
(https://issues.apache.org/jira/browse/BEAM-12256)
BEAM-12231: beam_PostRelease_NightlySnapshot failing 
(https://issues.apache.org/jira/browse/BEAM-12231)
BEAM-11959: Python Beam SDK Harness hangs when installing pip packages 
(https://issues.apache.org/jira/browse/BEAM-11959)
BEAM-11906: No trigger early repeatedly for session windows 
(https://issues.apache.org/jira/browse/BEAM-11906)
BEAM-11875: XmlIO.Read does not handle XML encoding per spec 
(https://issues.apache.org/jira/browse/BEAM-11875)
BEAM-11828: JmsIO is not acknowledging messages correctly 
(https://issues.apache.org/jira/browse/BEAM-11828)
BEAM-11755: Cross-language consistency (RequiresStableInputs) is quietly 
broken (at least on portable flink runner) 
(https://issues.apache.org/jira/browse/BEAM-11755)
BEAM-11578: `dataflow_metrics` (python) fails with TypeError (when int 
overflowing?) (https://issues.apache.org/jira/browse/BEAM-11578)
BEAM-11576: Go ValidatesRunner failure: TestFlattenDup on Dataflow Runner 
(https://issues.apache.org/jira/browse/BEAM-11576)
BEAM-11434: Expose Spanner admin/batch clients in Spanner Accessor 
(https://issues.apache.org/jira/browse/BEAM-11434)
BEAM-11227: Upgrade beam-vendor-grpc-1_26_0-0.3 to fix CVE-2020-27216 
(https://issues.apache.org/jira/browse/BEAM-11227)
BEAM-11148: Kafka commitOffsetsInFinalize OOM on Flink 
(https://issues.apache.org/jira/browse/BEAM-11148)
BEAM-11017: Timer with dataflow runner can be set multiple times (dataflow 
runner) (https://issues.apache.org/jira/browse/BEAM-11017)
BEAM-10670: Make non-portable Splittable DoFn the only option when 
executing Java "Read" transforms 
(https://issues.apache.org/jira/browse/BEAM-10670)
BEAM-10617: python CombineGlobally().with_fanout() cause duplicate combine 
results for sliding windows (https://issues.apache.org/jira/browse/BEAM-10617)
BEAM-10569: SpannerIO tests don't actually assert anything. 
(https://issues.apache.org/jira/browse/BEAM-10569)
BEAM-10288: Quickstart documents are out of date 
(https://issues.apache.org/jira/browse/BEAM-10288)
BEAM-10244: Populate requirements cache fails on poetry-based packages 
(https://issues.apache.org/jira/browse/BEAM-10244)
BEAM-10100: FileIO writeDynamic with AvroIO.sink not writing all data 
(https://issues.apache.org/jira/browse/BEAM-10100)
BEAM-9564: Remove insecure ssl options from MongoDBIO 
(https://issues.apache.org/jira/browse/BEAM-9564)
BEAM-9455: Environment-sensitive provisioning for Dataflow 
(https://issues.apache.org/jira/browse/BEAM-9455)
BEAM-9293: Python direct runner doesn't emit empty pane when it should 
(https://issues.apache.org/jira/browse/BEAM-9293)
BEAM-8986: SortValues may not work correct for numerical types 
(https://issues.apache.org/jira/browse/BEAM-8986)
BEAM-8985: SortValues should fail if SecondaryKey coder is not 
deterministic (https://issues.apache.org/jira/browse/BEAM-8985)
BEAM-8407: [SQL] Some Hive tests throw NullPointerException, but get marked 
as passing (Direct Runner) (https://issues.apache.org/jira/browse/BEAM-8407)
BEAM-7717: PubsubIO watermark tracking hovers near start of epoch 
(https://issues.apache.org/jira/browse/BEAM-7717)
BEAM-7716: PubsubIO returns empty message bodies for all messages read 
(https://issues.apache.org/jira/browse/BEAM-7716)
BEAM-7195: BigQuery - 404 errors for 'table not found' when using dynamic 
destinations - sometimes, new table fails to get created 
(https://issues.apache.org/jira/browse/BEAM-7195)
BEAM-6839: User reports protobuf ClassChangeError running against 2.6.0 or 
above (https://issues.apache.org/jira/browse/BEAM-6839)
BEAM-6466: KafkaIO doesn't commit offsets while being used as bounded 
source (https://issues.apache.org/jira/browse/BEAM-6466)

Re: [DISCUSS] Enable automatic dependency updates with Github's dependabot

2021-05-20 Thread Brian Hulette

> So the question is how do we proceed? Do I contact INFRA to enable it for
the main repo?

No objections from me, is this something we should vote on though? Perhaps
we already have lazy consensus :)

> more concretely how do we deal with these PRs in a practical sense? Do we
rename them and create an associated JIRA for tracking?

Yes I wonder about this too. I only have more questions though:

Would we set reviewers [1] to make sure these PRs get attention? Who should
they go to? It would be nice if we could have reviewers per dependency,
similar to [2], but it looks like it has to be the same person or group for
all PRs.
What about our existing dependency tracking which sends the dependency
update message to dev@ and files JIRAs? Probably we could just leave that
on in parallel for now, and consider turning it down later if dependabot is
working well.

Brian

[1]
https://docs.github.com/en/code-security/supply-chain-security/keeping-your-dependencies-updated-automatically/configuration-options-for-dependency-updates#reviewers
[2]
https://github.com/apache/beam/tree/243128a8fc52798e1b58b0cf1a271d95ee7aa241/ownership

On Wed, May 12, 2021 at 5:43 AM Ismaël Mejía  wrote:

> My apologies, Brian, I had not seen your question:
>
> > - Will dependabot work ok with the version ranges that we specify? For
> example some Python dependencies have upper bounds for the next major
> version, some for the next minor version. Is dependabot smart enough to try
> bumping the appropriate version number?
>
> Yes, it does, and we can also explicitly set it to ignore certain versions
> (or all versions) of a dependency if we don't want any upgrade PRs for it.
>
> As a follow-up on this, I received an email from my Beam fork this morning
> reporting a CVE issue in one of the website dependencies. It is a moderate
> issue (this is a dep for the website generation, so it won't affect
> Beam users), but it is a clear example of the utility of dependabot.
>
> So the question is how do we proceed? Do I contact INFRA to enable it for
> the main repo? and more concretely how do we deal with these PRs in a
> practical sense? Do we rename them and create an associated JIRA for
> tracking?
>
> Other opinions?
>
> Ismaël
>
>
>
> On Fri, Apr 16, 2021 at 5:36 PM Brian Hulette  wrote:
>
>> Yeah I can see the advantage in tooling like this for easy upgrades. I
>> suspect many of the outdated Python dependencies fall under this category,
>> but the toil of creating a PR and verifying it passes tests is enough of a
>> barrier that we just haven't done it. Having a bot create the PR and
>> trigger CI to verify it would be helpful IMO.
>>
>> Some questions/concerns I have:
>> - I think many python upgrades will still require manual work:
>>   - We also have pinned versions for some Python dependencies in
>> base_image_requirements.txt [1]
>>   - We test with multiple major versions of pyarrow. We'd want to add a
>> new test environment [2] when bumping to the next major version
>> - Will dependabot work ok with the version ranges that we specify? For
>> example some Python dependencies have upper bounds for the next major
>> version, some for the next minor version. Is dependabot smart enough to try
>> bumping the appropriate version number?
>>
>> Brian
>>
>> [1]
>> https://github.com/apache/beam/blob/master/sdks/python/container/base_image_requirements.txt
>>
>> [2]
>> https://github.com/apache/beam/blob/985e2f095d150261e998f58cf048e48a909d5b2b/sdks/python/tox.ini#L231
>>
>> On Fri, Apr 16, 2021 at 7:16 AM Ismaël Mejía  wrote:
>>
>>> Oh forgot to mention one alternative that we do in the Avro project,
>>> it is that we don't create issues for the dependabot PRs and then we
>>> search all the commits authored by dependabot and include them in the
>>> release notes to track dependency upgrades.
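
The release-notes approach described above could be scripted along these lines. This is a minimal sketch, assuming the bot's commits can be matched with `git log --author=dependabot`; the function names and the tag names passed in are hypothetical, and only the formatting helper is exercised here because the git call needs a real checkout.

```python
import subprocess


def dependabot_log_command(previous_tag, release_tag):
    """Build a git command listing commit subjects authored by dependabot in a range."""
    return [
        "git", "log",
        "--author=dependabot",
        "--pretty=format:%s",
        f"{previous_tag}..{release_tag}",
    ]


def format_release_notes(log_output):
    """Turn raw `git log` subject lines into a bulleted, de-duplicated list."""
    seen = []
    for line in log_output.splitlines():
        subject = line.strip()
        if subject and subject not in seen:
            seen.append(subject)
    return "\n".join(f"* {subject}" for subject in seen)


def dependency_upgrades(previous_tag, release_tag, repo_dir="."):
    """Run git in repo_dir and return the formatted list (requires a git checkout)."""
    result = subprocess.run(
        dependabot_log_command(previous_tag, release_tag),
        cwd=repo_dir, capture_output=True, text=True, check=True,
    )
    return format_release_notes(result.stdout)


# Made-up sample input standing in for real `git log` output.
sample = "Bump foo from 1.0 to 1.1\nBump bar from 2.3 to 2.4\n"
print(format_release_notes(sample))
```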
>>>
>>> On Fri, Apr 16, 2021 at 4:02 PM Ismaël Mejía  wrote:
>>> >
>>> > > Quite often, dependency upgrade to latest versions leads to either
>>> compilation errors or failed tests and it should be resolved manually or
>>> declined. Having this, maybe I miss something, but I don’t see what kind of
>>> advantages automatic upgrade will bring to us except that we don’t need to
>>> create a PR manually (which is not a big deal).
>>> >
>>> > The advantage is exactly that, that we don't have to create and track
>>> > dependency updates manually, it will be done by the bot and we will
>>> > only have to do the review and guarantee that no issues are
>>> > introduced. I forgot to mention but we can create exception rules so
>>> > no further upgrades will be proposed for some dependencies e.g.
>>> > Hadoop, Netty (Java 11 flavor) etc. I forgot to mention another big
>>> > advantage that is the detailed security report that will help us
>>> > prioritize dependency upgrades.
>>> >
>>> > > Regarding another issue - it’s already a problem, imho. Since we
>>> have a one Jira per package upgrade now and usually it “accumulates” all
>>> package upgrades and it’s not

Re: Proposal: Generalize S3FileSystem

2021-05-20 Thread Charles Chen

Is it feasible to keep the endpoint information in the path?  It seems
pretty desirable to keep URIs "universal" so that it's possible to
understand what is being pointed to without explicit service configuration,
so maybe you can have a scheme like
"s3+endpoint=api.example.com://my/bucket/path"?
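
To make the suggestion concrete, a path of that shape could be taken apart as sketched below. This is only an illustration in Python: `split_endpoint_uri` is a hypothetical helper, and the `+endpoint=` convention is the one floated in this message, not an existing Beam feature. The split is done by hand because, as far as I know, "=" is outside the character set RFC 3986 allows in a scheme, so generic URI parsers may not treat the prefix as a scheme.

```python
def split_endpoint_uri(uri):
    """Split 's3+endpoint=HOST://bucket/key' into (scheme, endpoint, bucket, key)."""
    scheme_part, sep, rest = uri.partition("://")
    if not sep:
        raise ValueError(f"not a URI with '://': {uri!r}")
    # The endpoint is optional; a plain 's3://...' path yields endpoint None.
    scheme, _, endpoint = scheme_part.partition("+endpoint=")
    bucket, _, key = rest.partition("/")
    return scheme, endpoint or None, bucket, key


print(split_endpoint_uri("s3+endpoint=api.example.com://my/bucket/path"))
print(split_endpoint_uri("s3://my/bucket/path"))
```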

On Thu, May 20, 2021 at 12:31 PM Kenneth Knowles  wrote:

> $.02
>
> Most important is community to maintain it. It cannot be a separate
> project or subproject (lots of ASF projects have this, so they share
> governance) without that.
>
> To add additional friction of separate release and dependency in build
> before you have community, it should be extremely stable so you upgrade
> rarely. See the process of upgrading our vendored deps. It is considerable.
>
> Kenn
>
> On Thu, May 20, 2021 at 12:07 PM Stephan Hoyer  wrote:
>
>> On Thu, May 20, 2021 at 10:12 AM Chad Dombrova  wrote:
>>
>>> Hi Brian,
>>> I think the main goal would be to make a python package that could be
>>> pip installed independently of apache_beam.  That goal could be
>>> accomplished with option 3, thus preserving all of the benefits of a
>>> monorepo. If it gains enough popularity and contributors outside of the
>>> Beam community, then options 1 and 2 could be considered to make it easier
>>> to foster a new community of contributors.
>>>
>>
>> This sounds like a lovely goal!
>>
>> I'll just mention the "fsspec" Python project, which came out of Dask:
>> https://filesystem-spec.readthedocs.io/en/latest/
>>
>> As far as I can tell, it serves basically this exact same purpose
>> (generic filesystems with high-performance IO), and has started to get some
>> traction in other projects, e.g., it's now used in pandas. I don't know if
>> it would be suitable for Beam, but it might be worth a try.
>>
>> Cheers,
>> Stephan
>>
>>
>>> Beam has a lot of great tech in it, and it makes me think of Celery,
>>> which is a much older python project of a similar ilk that spawned a series
>>> of useful independent projects: kombu [1], an AMQP messaging library, and
>>> billiard [2], a multiprocessing library.
>>>
>>> Obviously, there are a number of pros and cons to consider.  The cons
>>> are pretty clear: even within a monorepo it will make the Beam build more
>>> complicated.  The pros are a bit more abstract.  The fileIO project could
>>> appeal to a broader audience, and act as a signpost for Beam (on PyPI,
>>> etc), thereby increasing awareness of Beam amongst the types of
>>> cloud-friendly python developers who would need the fileIO package.
>>>
>>> -chad
>>>
>>> [1] https://github.com/celery/kombu
>>> [2] https://github.com/celery/billiard
>>>
>>>
>>>
>>>
>>> On Thu, May 20, 2021 at 7:57 AM Brian Hulette 
>>> wrote:
>>>
>>>> That's an interesting idea. What do you mean by its own project? A
>>>> couple of possibilities:
>>>> - Spinning off a new ASF project
>>>> - A separate Beam-governed repository (e.g. apache/beam-filesystems)
>>>> - More clearly separate it in the current build system and release
>>>> artifacts that allow it to be used independently
>>>>
>>>> Personally I'd be resistant to the first two (I am a Google engineer
>>>> and I like monorepos after all), but I don't see a major problem with the
>>>> last one, except that it gives us another surface to maintain.
>>>>
>>>> Brian
>>>>
>>>> On Wed, May 19, 2021 at 8:38 PM Chad Dombrova
>>>> wrote:
>>>>
>>>>> This is a random idea, but the whole file IO system inside Beam would
>>>>> actually be awesome to extract into its own project.  IIRC, it’s not
>>>>> particularly tied to Beam.
>>>>>
>>>>> I’m not saying this should be done now, but it’d be nice to keep it
>>>>> in mind for a future goal.
>>>>>
>>>>> -chad
>>>>>
>>>>>
>>>>>
>>>>> On Wed, May 19, 2021 at 10:23 AM Pablo Estrada
>>>>> wrote:
>>>>>
>>>>>> That would be great to add, Matt. Of course it's important to make
>>>>>> this backwards compatible, but other than that, the addition would be
>>>>>> very welcome.
>>>>>>
>>>>>> On Wed, May 19, 2021 at 9:41 AM Matt Rudary
>>>>>> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> This is a quick sketch of a proposal – I wanted to get a sense of
>>>>>>> whether there’s general support for this idea before fleshing it out
>>>>>>> further, getting internal approvals, etc.
>>>>>>>
>>>>>>> I’m working with multiple storage systems that speak the S3 api. I
>>>>>>> would like to support FileIO operations for these storage systems, but
>>>>>>> S3FileSystem hardcodes the s3 scheme (the various systems use different
>>>>>>> URI schemes) and it is in any case impossible to instantiate more than
>>>>>>> one in the current design.
>>>>>>>
>>>>>>> I’d like to refactor the code in org.apache.beam.sdk.io.aws.s3 (and
>>>>>>> maybe …aws.options) somewhat to enable this use-case. I haven’t worked
>>>>>>> out the details yet, but it will take some thought to make this work in
>>>>>>> a non-hacky way.
>>>>>>>

[PROPOSAL] Preparing for Beam 2.31.0 release

2021-05-20 Thread Andrew Pilloud

Hi All,

Beam 2.31.0 release is scheduled to be cut in two weeks, on June 2
according to the release calendar [1].

I'd like to volunteer myself to be the release manager for this release. I
plan on cutting the release branch on the scheduled date. That week starts
with a US holiday and I will be on vacation the entire week so I don't
expect to make any progress on the release until June 7th. If there is a
desire I could delay the release cut by a few days.

Any comments or objections?

Andrew

[1]
https://calendar.google.com/calendar/u/0/embed?src=0p73sl034k80oob7seouani...@group.calendar.google.com=America/Los_Angeles

Re: Proposal: Generalize S3FileSystem

2021-05-20 Thread Kenneth Knowles

$.02

Most important is community to maintain it. It cannot be a separate project
or subproject (lots of ASF projects have this, so they share governance)
without that.

To add additional friction of separate release and dependency in build
before you have community, it should be extremely stable so you upgrade
rarely. See the process of upgrading our vendored deps. It is considerable.

Kenn

On Thu, May 20, 2021 at 12:07 PM Stephan Hoyer  wrote:

> On Thu, May 20, 2021 at 10:12 AM Chad Dombrova  wrote:
>
>> Hi Brian,
>> I think the main goal would be to make a python package that could be pip
>> installed independently of apache_beam.  That goal could be accomplished
>> with option 3, thus preserving all of the benefits of a monorepo. If it
>> gains enough popularity and contributors outside of the Beam community,
>> then options 1 and 2 could be considered to make it easier to foster a new
>> community of contributors.
>>
>
> This sounds like a lovely goal!
>
> I'll just mention the "fsspec" Python project, which came out of Dask:
> https://filesystem-spec.readthedocs.io/en/latest/
>
> As far as I can tell, it serves basically this exact same purpose (generic
> filesystems with high-performance IO), and has started to get some traction
> in other projects, e.g., it's now used in pandas. I don't know if it would
> be suitable for Beam, but it might be worth a try.
>
> Cheers,
> Stephan
>
>
>> Beam has a lot of great tech in it, and it makes me think of Celery,
>> which is a much older python project of a similar ilk that spawned a series
>> of useful independent projects: kombu [1], an AMQP messaging library, and
>> billiard [2], a multiprocessing library.
>>
>> Obviously, there are a number of pros and cons to consider.  The cons are
>> pretty clear: even within a monorepo it will make the Beam build more
>> complicated.  The pros are a bit more abstract.  The fileIO project could
>> appeal to a broader audience, and act as a signpost for Beam (on PyPI,
>> etc), thereby increasing awareness of Beam amongst the types of
>> cloud-friendly python developers who would need the fileIO package.
>>
>> -chad
>>
>> [1] https://github.com/celery/kombu
>> [2] https://github.com/celery/billiard
>>
>>
>>
>>
>> On Thu, May 20, 2021 at 7:57 AM Brian Hulette 
>> wrote:
>>
>>> That's an interesting idea. What do you mean by its own project? A
>>> couple of possibilities:
>>> - Spinning off a new ASF project
>>> - A separate Beam-governed repository (e.g. apache/beam-filesystems)
>>> - More clearly separate it in the current build system and release
>>> artifacts that allow it to be used independently
>>>
>>> Personally I'd be resistant to the first two (I am a Google engineer and
>>> I like monorepos after all), but I don't see a major problem with the last
>>> one, except that it gives us another surface to maintain.
>>>
>>> Brian
>>>
>>> On Wed, May 19, 2021 at 8:38 PM Chad Dombrova  wrote:
>>>
>>>> This is a random idea, but the whole file IO system inside Beam would
>>>> actually be awesome to extract into its own project.  IIRC, it’s not
>>>> particularly tied to Beam.
>>>>
>>>> I’m not saying this should be done now, but it’d be nice to keep it
>>>> in mind for a future goal.
>>>>
>>>> -chad
>>>>
>>>>
>>>>
>>>> On Wed, May 19, 2021 at 10:23 AM Pablo Estrada
>>>> wrote:
>>>>
>>>>> That would be great to add, Matt. Of course it's important to make
>>>>> this backwards compatible, but other than that, the addition would be very
>>>>> welcome.
>>>>>
>>>>> On Wed, May 19, 2021 at 9:41 AM Matt Rudary
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> This is a quick sketch of a proposal – I wanted to get a sense of
>>>>>> whether there’s general support for this idea before fleshing it out
>>>>>> further, getting internal approvals, etc.
>>>>>>
>>>>>> I’m working with multiple storage systems that speak the S3 api. I
>>>>>> would like to support FileIO operations for these storage systems, but
>>>>>> S3FileSystem hardcodes the s3 scheme (the various systems use different
>>>>>> URI schemes) and it is in any case impossible to instantiate more than
>>>>>> one in the current design.
>>>>>>
>>>>>> I’d like to refactor the code in org.apache.beam.sdk.io.aws.s3 (and
>>>>>> maybe …aws.options) somewhat to enable this use-case. I haven’t worked
>>>>>> out the details yet, but it will take some thought to make this work in
>>>>>> a non-hacky way.
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> Matt Rudary
>>>>>>
>>>>>
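
As a rough illustration of what the quoted proposal asks for (several S3-compatible systems, each under its own URI scheme), a scheme-keyed registry might look like the sketch below. It is Python and purely hypothetical: `S3CompatibleConfig`, `FileSystemRegistry`, the `store2` scheme, and `storage.example.com` are invented for the example and do not reflect the actual S3FileSystem design or any agreed refactoring.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class S3CompatibleConfig:
    """Hypothetical per-scheme settings for one S3-compatible storage system."""
    scheme: str
    endpoint: str


class FileSystemRegistry:
    """Maps URI schemes to configurations so several S3-like systems can coexist."""

    def __init__(self):
        self._by_scheme = {}

    def register(self, config):
        if config.scheme in self._by_scheme:
            raise ValueError(f"scheme already registered: {config.scheme}")
        self._by_scheme[config.scheme] = config

    def resolve(self, uri):
        """Return (config, bucket, key) for a URI such as 'scheme://bucket/key'."""
        scheme, sep, rest = uri.partition("://")
        if not sep or scheme not in self._by_scheme:
            raise KeyError(f"no filesystem registered for {uri!r}")
        bucket, _, key = rest.partition("/")
        return self._by_scheme[scheme], bucket, key


registry = FileSystemRegistry()
registry.register(S3CompatibleConfig("s3", "s3.amazonaws.com"))
registry.register(S3CompatibleConfig("store2", "storage.example.com"))
config, bucket, key = registry.resolve("store2://data/2021/05/file.avro")
print(config.endpoint, bucket, key)
```

Keeping the existing `s3` scheme registered by default is one way such a change could stay backwards compatible, which is the concern Pablo raises in the quoted reply.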

Re: [VOTE] Vendored Dependencies Release Byte Buddy 1.11.0

2021-05-20 Thread Ismaël Mejía

I'm happy to announce that we have unanimously approved this release.

There are 7 approving votes, 4 of which are binding:
* Pablo Estrada
* Etienne Chauchot
* Jean-Baptiste Onofre
* Ismaël Mejía

There are no disapproving votes.

Thanks everyone!

On Thu, May 20, 2021 at 9:17 PM Ismaël Mejía  wrote:

>
> +1 (binding)
>

Re: [VOTE] Vendored Dependencies Release Byte Buddy 1.11.0

2021-05-20 Thread Ismaël Mejía

+1 (binding)

Re: Proposal: Generalize S3FileSystem

2021-05-20 Thread Stephan Hoyer

On Thu, May 20, 2021 at 10:12 AM Chad Dombrova  wrote:

> Hi Brian,
> I think the main goal would be to make a python package that could be pip
> installed independently of apache_beam.  That goal could be accomplished
> with option 3, thus preserving all of the benefits of a monorepo. If it
> gains enough popularity and contributors outside of the Beam community,
> then options 1 and 2 could be considered to make it easier to foster a new
> community of contributors.
>

This sounds like a lovely goal!

I'll just mention the "fsspec" Python project, which came out of Dask:
https://filesystem-spec.readthedocs.io/en/latest/

As far as I can tell, it serves basically this exact same purpose (generic
filesystems with high-performance IO), and has started to get some traction
in other projects, e.g., it's now used in pandas. I don't know if it would
be suitable for Beam, but it might be worth a try.

Cheers,
Stephan


> Beam has a lot of great tech in it, and it makes me think of Celery, which
> is a much older python project of a similar ilk that spawned a series of
> useful independent projects: kombu [1], an AMQP messaging library, and
> billiard [2], a multiprocessing library.
>
> Obviously, there are a number of pros and cons to consider.  The cons are
> pretty clear: even within a monorepo it will make the Beam build more
> complicated.  The pros are a bit more abstract.  The fileIO project could
> appeal to a broader audience, and act as a signpost for Beam (on PyPI,
> etc), thereby increasing awareness of Beam amongst the types of
> cloud-friendly python developers who would need the fileIO package.
>
> -chad
>
> [1] https://github.com/celery/kombu
> [2] https://github.com/celery/billiard
>
>
>
>
> On Thu, May 20, 2021 at 7:57 AM Brian Hulette  wrote:
>
>> That's an interesting idea. What do you mean by its own project? A couple
>> of possibilities:
>> - Spinning off a new ASF project
>> - A separate Beam-governed repository (e.g. apache/beam-filesystems)
>> - More clearly separate it in the current build system and release
>> artifacts that allow it to be used independently
>>
>> Personally I'd be resistant to the first two (I am a Google engineer and
>> I like monorepos after all), but I don't see a major problem with the last
>> one, except that it gives us another surface to maintain.
>>
>> Brian
>>
>> On Wed, May 19, 2021 at 8:38 PM Chad Dombrova  wrote:
>>
>>> This is a random idea, but the whole file IO system inside Beam would
>>> actually be awesome to extract into its own project.  IIRC, it’s not
>>> particularly tied to Beam.
>>>
>>> I’m not saying this should be done now, but it’d be nice to keep it in mind
>>> for a future goal.
>>>
>>> -chad
>>>
>>>
>>>
>>> On Wed, May 19, 2021 at 10:23 AM Pablo Estrada 
>>> wrote:
>>>
>>>> That would be great to add, Matt. Of course it's important to make this
>>>> backwards compatible, but other than that, the addition would be very
>>>> welcome.
>>>>
>>>> On Wed, May 19, 2021 at 9:41 AM Matt Rudary wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> This is a quick sketch of a proposal – I wanted to get a sense of
>>>>> whether there’s general support for this idea before fleshing it out
>>>>> further, getting internal approvals, etc.
>>>>>
>>>>> I’m working with multiple storage systems that speak the S3 API. I
>>>>> would like to support FileIO operations for these storage systems, but
>>>>> S3FileSystem hardcodes the s3 scheme (the various systems use different
>>>>> URI schemes) and it is in any case impossible to instantiate more than
>>>>> one in the current design.
>>>>>
>>>>> I’d like to refactor the code in org.apache.beam.sdk.io.aws.s3 (and
>>>>> maybe …aws.options) somewhat to enable this use-case. I haven’t worked out
>>>>> the details yet, but it will take some thought to make this work in a
>>>>> non-hacky way.
>>>>>
>>>>> Thanks
>>>>>
>>>>> Matt Rudary
>>>>>

Apache Beam Access Request for Anthony Zhu

2021-05-20 Thread Anthony Zhu

Hi,

I am requesting permission to access the Apache Beam JIRA for Anthony Zhu
(username aqzhu) for my work with Rui Wang.

Thanks,
Anthony

-- 
*Anthony Zhu*
University of Michigan Class of 2022
College of Engineering | CSE
978.496.4537 | aq...@umich.edu

Re: Proposal: Generalize S3FileSystem

2021-05-20 Thread Chad Dombrova

Hi Brian,
I think the main goal would be to make a python package that could be pip
installed independently of apache_beam.  That goal could be accomplished
with option 3, thus preserving all of the benefits of a monorepo. If it
gains enough popularity and contributors outside of the Beam community,
then options 1 and 2 could be considered to make it easier to foster a new
community of contributors.

Beam has a lot of great tech in it, and it makes me think of Celery, which
is a much older python project of a similar ilk that spawned a series of
useful independent projects: kombu [1], an AMQP messaging library, and
billiard [2], a multiprocessing library.

Obviously, there are a number of pros and cons to consider.  The cons are
pretty clear: even within a monorepo it will make the Beam build more
complicated.  The pros are a bit more abstract.  The fileIO project could
appeal to a broader audience, and act as a signpost for Beam (on PyPI,
etc), thereby increasing awareness of Beam amongst the types of
cloud-friendly python developers who would need the fileIO package.

-chad

[1] https://github.com/celery/kombu
[2] https://github.com/celery/billiard

On Thu, May 20, 2021 at 7:57 AM Brian Hulette  wrote:

> That's an interesting idea. What do you mean by its own project? A couple
> of possibilities:
> - Spinning off a new ASF project
> - A separate Beam-governed repository (e.g. apache/beam-filesystems)
> - More clearly separate it in the current build system and release
> artifacts that allow it to be used independently
>
> Personally I'd be resistant to the first two (I am a Google engineer and I
> like monorepos after all), but I don't see a major problem with the last
> one, except that it gives us another surface to maintain.
>
> Brian
>
> On Wed, May 19, 2021 at 8:38 PM Chad Dombrova  wrote:
>
>> This is a random idea, but the whole file IO system inside Beam would
>> actually be awesome to extract into its own project.  IIRC, it’s not
>> particularly tied to Beam.
>>
>> I’m not saying this should be done now, but it’d be nice to keep it in mind
>> for a future goal.
>>
>> -chad
>>
>>
>>
>> On Wed, May 19, 2021 at 10:23 AM Pablo Estrada 
>> wrote:
>>
>>> That would be great to add, Matt. Of course it's important to make this
>>> backwards compatible, but other than that, the addition would be very
>>> welcome.
>>>
>>> On Wed, May 19, 2021 at 9:41 AM Matt Rudary 
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> This is a quick sketch of a proposal – I wanted to get a sense of
>>>> whether there’s general support for this idea before fleshing it out
>>>> further, getting internal approvals, etc.
>>>>
>>>> I’m working with multiple storage systems that speak the S3 API. I
>>>> would like to support FileIO operations for these storage systems, but
>>>> S3FileSystem hardcodes the s3 scheme (the various systems use different URI
>>>> schemes) and it is in any case impossible to instantiate more than one in
>>>> the current design.
>>>>
>>>> I’d like to refactor the code in org.apache.beam.sdk.io.aws.s3 (and
>>>> maybe …aws.options) somewhat to enable this use-case. I haven’t worked out
>>>> the details yet, but it will take some thought to make this work in a
>>>> non-hacky way.
>>>>
>>>> Thanks
>>>>
>>>> Matt Rudary

>>>

Re: Some questions around GroupIntoBatches

2021-05-20 Thread Reuven Lax

I don't think that would work well. It's entirely possible that the input
watermark will already have passed the timestamp of the hold, in which case
an event-time timer would fire immediately. You could make it a looping
timer, but the new timer would also fire immediately after being set, and a
tight timer loop isn't a good idea.

A Timer.get() is one solution, though I think the only way we have to
implement it is to store the timer's timestamp in a ValueState; doing this
for every timer would add a lot of cost to pipelines.

Another option is a Timer.getCurrentTime() method, that would return the
current base time that setRelative is based off of. It seems like a strange
function to add to Timer though.

Another option is to use TimerMap to bucket timers. Every minute we round
the current processing time to the nearest minute and set a timer with an
expiration of that minute (and with the minute timestamp as its tag as
well). This way we would have a continuous sequence of expiring timers, and
we wouldn't have to set just the first one. The biggest problem with this
approach is that we would also have to use MapState to store the desired
watermark hold per processing-time bucket. MapState is not supported by
many runners yet, so I don't want to use it in a basic transform like
GroupIntoBatches (which, furthermore, is a transform that works on most
runners today).

Reuven
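
The TimerMap bucketing idea above can be sketched in plain Python, without
any Beam API. This is only an illustration under stated assumptions: the
names bucket_for and BucketedTimers are hypothetical, a dict stands in for
the MapState of watermark holds, and the sketch rounds up to the next minute
boundary (the message says "nearest minute") so that a bucket is never in
the past.

```python
BUCKET_SECONDS = 60


def bucket_for(now_s: int) -> int:
    """Round a processing time (epoch seconds) up to a minute boundary.

    Assumption: rounding up, so the bucket's expiry is never in the past.
    """
    return -(-now_s // BUCKET_SECONDS) * BUCKET_SECONDS


class BucketedTimers:
    """Hypothetical stand-in for a TimerMap plus a MapState of holds."""

    def __init__(self):
        # bucket expiry -> minimum element timestamp seen (the desired hold)
        self.holds = {}

    def on_element(self, now_s: int, element_ts: int) -> int:
        bucket = bucket_for(now_s)
        # Re-setting the same bucket never moves its expiry; only the
        # hold can move, and only backwards to the smaller timestamp.
        current = self.holds.get(bucket)
        self.holds[bucket] = element_ts if current is None else min(current, element_ts)
        return bucket

    def fire_due(self, now_s: int):
        """Pop every bucket whose expiry has been reached, oldest first."""
        due = sorted(b for b in self.holds if b <= now_s)
        return [(b, self.holds.pop(b)) for b in due]
```

Elements arriving within the same minute share one timer and one hold, so no
element has to be treated as "the first one"; the cost is the per-bucket map,
which is the MapState dependency described above.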


On Thu, May 20, 2021 at 2:11 AM Jan Lukavský  wrote:

> Sounds like you could solve that using a second event-time timer that would
> actually be used only to hold the output timestamp (watermark hold).
> Something like
>
>
> eventTimer.withOutputTimestamp(currentBufferMinimalStamp).offset(Duration.ZERO).setRelative()
>
> when the timer fires, you would only reset the minimum.
>
> It is sort of ugly, though. It would be cool to have a way to get the
> current timestamp a timer is set to (if any).
>
>  Jan
> On 5/20/21 3:12 AM, Reuven Lax wrote:
>
> 100% - the contract should not change because things are in a bundle. IIRC
> there are some open bugs in Beam around this that really should be fixed.
>
> My issue with GroupIntoBatches is different. This transform works as
> follows:
>
> if (this is the first element in batch - checked by reading a count stored
> in a ValueState)
>timer.offset(bufferingDuration).setRelative()
>
> This makes it tricky to use setTimer.withOutputTimestamp. Inputs are not
> guaranteed to be in order, so simply adding a withOutputTimestamp would set
> the timestamp to be whatever the first element happened to be; it really
> should be the minimum timestamp of all elements in the buffer. If we
> started setting the timer on every element, then timer.offset.setRelative
> would keep bumping the (processing-time) timer into the future and it would
> never expire.
>
> One solution would be to store the timer timestamp in a ValueState, and
> use Timer.set to set the timer to an absolute timestamp. This would allow
> us to always reset the timer to the same expiration target, just modifying
> the output timestamp each time. However, this will break DirectRunner
> tests. The DirectRunner allows the user to control the advancement of
> processing time when using TestStream, but this facility doesn't work well
> if the transform sets the processing-time timer using absolute set() calls.
>
> I'm not sure how to solve this using the existing Timer API.
>
> On Wed, May 19, 2021 at 4:39 PM Robert Bradshaw 
> wrote:
>
>> +1. It was my understanding as well that consensus was that timers
>> must be delivered in timestamp order, and "within bundle"
>> resetting/clearing of timers should be respected (as if each timer was
>> in its own bundle).
>>
>> On Wed, May 19, 2021 at 3:01 PM Kenneth Knowles  wrote:
>> >
>> > Reading over the other thread, there was consensus to implement.
>> >
>> > Reading commentary on the PR, there were good questions raised about
>> > the semantics. Questions which I feel able to have an opinion about :-)
>> >
>> > The questions surrounded bundling and timers in the same bundle
>> > clearing each other. Actually the same questions apply to timers re-setting
>> > later timers and +Jan Lukavský has raised this already (among other people)
>> > so we kind of know the answer now, and I think +Boyuan Zhang's code was good
>> > (from my quick read). What has changed is that we have a better idea of the
>> > contract with the runner. I'm not sure if portability makes this more
>> > complex. I will share all my thoughts on this:
>> >
>> > I think one key to the Beam model is that bundles are for performance
>> > and also intended as the unit of commitment (until FinishBundle is called,
>> > there may be unfinished work). They can affect *behavior* (what the program
>> > does - including what can be observed) ... but not *semantics* (what the
>> > output means).
>> >
>> > So, for example, bundling affects how many files are written but you
>> > are expected to read all the files, and you must not depend on their number
>> > or ordering. The

Re: Proposal: Generalize S3FileSystem

2021-05-20 Thread Brian Hulette

That's an interesting idea. What do you mean by its own project? A couple
of possibilities:
- Spinning off a new ASF project
- A separate Beam-governed repository (e.g. apache/beam-filesystems)
- More clearly separate it in the current build system and release
artifacts that allow it to be used independently

Personally I'd be resistant to the first two (I am a Google engineer and I
like monorepos after all), but I don't see a major problem with the last
one, except that it gives us another surface to maintain.

Brian

On Wed, May 19, 2021 at 8:38 PM Chad Dombrova  wrote:

> This is a random idea, but the whole file IO system inside Beam would
> actually be awesome to extract into its own project.  IIRC, it’s not
> particularly tied to Beam.
>
> I’m not saying this should be done now, but it’d be nice to keep it in mind
> for a future goal.
>
> -chad
>
>
>
> On Wed, May 19, 2021 at 10:23 AM Pablo Estrada  wrote:
>
>> That would be great to add, Matt. Of course it's important to make this
>> backwards compatible, but other than that, the addition would be very
>> welcome.
>>
>> On Wed, May 19, 2021 at 9:41 AM Matt Rudary 
>> wrote:
>>
>>> Hi,
>>>
>>>
>>>
>>> This is a quick sketch of a proposal – I wanted to get a sense of
>>> whether there’s general support for this idea before fleshing it out
>>> further, getting internal approvals, etc.
>>>
>>>
>>>
>>> I’m working with multiple storage systems that speak the S3 API. I would
>>> like to support FileIO operations for these storage systems, but
>>> S3FileSystem hardcodes the s3 scheme (the various systems use different URI
>>> schemes) and it is in any case impossible to instantiate more than one in
>>> the current design.
>>>
>>>
>>>
>>> I’d like to refactor the code in org.apache.beam.sdk.io.aws.s3 (and
>>> maybe …aws.options) somewhat to enable this use-case. I haven’t worked out
>>> the details yet, but it will take some thought to make this work in a
>>> non-hacky way.
>>>
>>>
>>>
>>> Thanks
>>>
>>> Matt Rudary
>>>
>>
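
For readers following the thread, the limitation in the quoted proposal (one
hardcoded "s3" scheme, one instance) can be illustrated with a small
scheme-keyed registry. This is a plain-Python sketch, not Beam code and not
the design Matt proposes: S3CompatibleConfig, FileSystemRegistry, the
"altstore" scheme and the example endpoints are all hypothetical.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class S3CompatibleConfig:
    """Hypothetical settings for one storage system that speaks the S3 API."""
    scheme: str    # URI scheme this system is addressed by, e.g. "s3"
    endpoint: str  # service endpoint (made-up values in the usage below)


class FileSystemRegistry:
    """Maps URI schemes to configurations so several systems can coexist."""

    def __init__(self):
        self._by_scheme = {}

    def register(self, config: S3CompatibleConfig) -> None:
        if config.scheme in self._by_scheme:
            raise ValueError(f"scheme already registered: {config.scheme}")
        self._by_scheme[config.scheme] = config

    def resolve(self, uri: str):
        """Return (config, bucket, key) for a URI such as scheme://bucket/key."""
        parsed = urlparse(uri)
        config = self._by_scheme[parsed.scheme]  # KeyError if unknown scheme
        return config, parsed.netloc, parsed.path.lstrip("/")


registry = FileSystemRegistry()
registry.register(S3CompatibleConfig("s3", "https://s3.example.com"))
registry.register(S3CompatibleConfig("altstore", "https://storage.example.internal"))
```

With both schemes registered, a path is routed by its scheme alone, which is
the property a single hardcoded scheme cannot provide.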

Re: [VOTE] Vendored Dependencies Release Byte Buddy 1.11.0

2021-05-20 Thread Jean-Baptiste Onofre

+1 (binding)

Regards
JB

> On May 20, 2021, at 10:56, Etienne Chauchot wrote:
> 
> +1 (binding) on releasing vendored bytebuddy for testing in 
> https://github.com/apache/beam/pull/14824 
> 
> Etienne
> 
> On 19/05/2021 23:43, Kai Jiang wrote:
>> +1 (non-binding)
>> 
>> On Wed, May 19, 2021 at 12:23 PM Jan Lukavský wrote:
>> +1 (non-binding)
>> 
>> verified correct shading.
>> 
>>  Jan
>> 
>> On 5/19/21 8:53 PM, Ismaël Mejía wrote:
>>> This release is only to publish the vendored dependency artifacts. We need 
>>> those to integrate it and be able to verify if it causes problems or not. 
>>> The PR for this is already open but it needs the artifacts of this vote
>>> to be run.
>>> https://github.com/apache/beam/pull/14824 
>>> 
>>> 
>>> For ref there is a document on how to release and validate releases of 
>>> Beam's vendored dependencies that can be handy to anyone wishing to help 
>>> validate:
>>> https://s.apache.org/beam-release-vendored-artifacts 
>>> 
>>> On Wed, May 19, 2021 at 8:45 PM Tyson Hamilton wrote:
>>> I'd like to help, but I don't know how to determine whether this upgrade is 
>>> going to cause problems or not. Are there tests I should look at, or some 
>>> validation I should perform?
>>> 
>>> On Wed, May 19, 2021 at 11:29 AM Ismaël Mejía wrote:
>>> Kind reminder, the vote is ongoing
>>> 
>>> On Mon, May 17, 2021 at 5:32 PM Ismaël Mejía wrote:
>>> Please review the release of the following artifacts that we vendor:
>>>  * beam-vendor-bytebuddy-1_11_0
>>> 
>>> Hi everyone,
>>> Please review and vote on the release candidate #1 for the version 0.1, as 
>>> follows:
>>> [ ] +1, Approve the release
>>> [ ] -1, Do not approve the release (please provide specific comments)
>>> 
>>> The complete staging area is available for your review, which includes:
>>> * the official Apache source release to be deployed to dist.apache.org 
>>>  [1], which is signed with the key with 
>>> fingerprint 3415631729E15B33051ADB670A9DAF6713B86349 [2],
>>> * all artifacts to be deployed to the Maven Central Repository [3],
>>> * commit hash "d93c591deb21237ddb656583d7ef7a4debba" [4],
>>> 
>>> The vote will be open for at least 72 hours. It is adopted by majority 
>>> approval, with at least 3 PMC affirmative votes.
>>> 
>>> Thanks,
>>> Release Manager
>>> 
>>> [1] https://dist.apache.org/repos/dist/dev/beam/vendor/ 
>>> 
>>> [2] https://dist.apache.org/repos/dist/release/beam/KEYS 
>>> 
>>> [3] https://repository.apache.org/content/repositories/orgapachebeam-1166/ 
>>> 
>>> [4] 
>>> https://github.com/apache/beam/commit/d93c591deb21237ddb656583d7ef7a4debba
>>>  
>>> 
>>>

Re: Some questions around GroupIntoBatches

2021-05-20 Thread Jan Lukavský

Sounds like you could solve that using a second event-time timer that
would actually be used only to hold the output timestamp (watermark
hold). Something like

eventTimer.withOutputTimestamp(currentBufferMinimalStamp).offset(Duration.ZERO).setRelative()

when the timer fires, you would only reset the minimum.

It is sort of ugly, though. It would be cool to have a way to get the 
current timestamp a timer is set to (if any).

 Jan

On 5/20/21 3:12 AM, Reuven Lax wrote:
100% - the contract should not change because things are in a bundle. 
IIRC there are some open bugs in Beam around this that really should 
be fixed.

My issue with GroupIntoBatches is different. This transform works as 
follows:

if (this is the first element in batch - checked by reading a count 
stored in a ValueState)

   timer.offset(bufferingDuration).setRelative()

This makes it tricky to use setTimer.withOutputTimestamp. Inputs are 
not guaranteed to be in order, so simply adding a withOutputTimestamp 
would set the timestamp to be whatever the first element happened to 
be; it really should be the minimum timestamp of all elements in the 
buffer. If we started setting the timer on every element, then 
timer.offset.setRelative would keep bumping the (processing-time) 
timer into the future and it would never expire.
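
The "never expire" problem described above can be reproduced with a small
plain-Python simulation. It does not use the Beam Timer API; simulate is a
hypothetical helper, times are plain integers, and the timer is only checked
when an element arrives, which is a simplification.

```python
def simulate(arrivals, buffering, rearm_every_element):
    """Replay element arrival times against one relative processing-time timer.

    Returns (fire_times, pending_expiry). The timer is checked on each
    arrival, which is enough to show the effect.
    """
    expiry = None
    fired = []
    for now in arrivals:
        # Fire the timer if it expired before this element arrived.
        if expiry is not None and expiry <= now:
            fired.append(expiry)
            expiry = None
        # Either arm only for the first element of a batch, or re-arm
        # relative to "now" on every element.
        if expiry is None or rearm_every_element:
            expiry = now + buffering
    return fired, expiry


arrivals = [0, 4, 8, 12, 16]  # elements keep arriving faster than the buffer
first_only = simulate(arrivals, buffering=5, rearm_every_element=False)
every_time = simulate(arrivals, buffering=5, rearm_every_element=True)
```

With steady input, arming on the first element fires once per buffering
window, while re-arming on every element keeps pushing the expiry ahead of
the clock and nothing fires.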

One solution would be to store the timer timestamp in a ValueState, 
and use Timer.set to set the timer to an absolute timestamp. This 
would allow us to always reset the timer to the same expiration 
target, just modifying the output timestamp each time. However, this 
will break DirectRunner tests. The DirectRunner allows the user to 
control the advancement of processing time when using TestStream, but 
this facility doesn't work well if the transform sets the 
processing-time timer using absolute set() calls.
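
A minimal plain-Python sketch of the "store the timer timestamp" idea above:
the expiry target is fixed when the first element arrives, and every later
element only lowers the output timestamp to the buffer minimum.
BatchTimerState is a hypothetical name and its two fields merely stand in for
a ValueState and the timer's output timestamp; no Beam API is called.

```python
class BatchTimerState:
    """Hypothetical per-key state for one batch."""

    def __init__(self):
        self.target = None     # stands in for the ValueState holding the expiry
        self.output_ts = None  # stands in for the timer's output timestamp

    def on_element(self, now, element_ts, buffering):
        # The absolute expiry is chosen once, by the first element.
        if self.target is None:
            self.target = now + buffering
        # Inputs may be out of order, so track the minimum timestamp
        # rather than whatever the first element happened to carry.
        if self.output_ts is None or element_ts < self.output_ts:
            self.output_ts = element_ts
        # A real transform would now re-set the timer to self.target
        # with self.output_ts as its output timestamp.
        return self.target, self.output_ts
```

The expiry stays put no matter how many elements arrive, which is what an
absolute set() gives and a repeated setRelative() does not; the TestStream
caveat mentioned above still applies to the absolute form.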

I'm not sure how to solve this using the existing Timer API.

On Wed, May 19, 2021 at 4:39 PM Robert Bradshaw wrote:

+1. It was my understanding as well that consensus was that timers
must be delivered in timestamp order, and "within bundle"
resetting/clearing of timers should be respected (as if each timer was
in its own bundle).

On Wed, May 19, 2021 at 3:01 PM Kenneth Knowles wrote:
>
> Reading over the other thread, there was consensus to implement.
>
> Reading commentary on the PR, there were good questions raised
> about the semantics. Questions which I feel able to have an
> opinion about :-)
>
> The questions surrounded bundling and timers in the same
> bundle clearing each other. Actually the same questions apply to
> timers re-setting later timers and +Jan Lukavský has raised this
> already (among other people) so we kind of know the answer now,
> and I think +Boyuan Zhang's code was good (from my quick read). What
> has changed is that we have a better idea of the contract with the
> runner. I'm not sure if portability makes this more complex. I
> will share all my thoughts on this:
>
> I think one key to the Beam model is that bundles are for
> performance and also intended as the unit of commitment (until
> FinishBundle is called, there may be unfinished work). They can
> affect *behavior* (what the program does - including what can be
> observed) ... but not *semantics* (what the output means).
>
> So, for example, bundling affects how many files are written but
> you are expected to read all the files, and you must not depend on
> their number or ordering. The behavior is different, but the
> semantics are the same.
>
> When it comes to timers, behavior and semantics are very tightly
> coupled; timers are like a self loop. The firing of a timer is a
> behavior w.r.t. the output of the stateful ParDo but it is
> semantics of the timer itself (my definitions don't actually apply
> so simply, so don't dig too hard :-). So to get bundling-invariant
> semantics, we should try for bundling-invariant behavior. When
> some clock moves from T to T+D between bundles then all the timers
> in the range [T, T+D) may fire so they are delivered in the
> bundle. I believe we have in the prior re-setting issue agreed
> that timers should always be called back in timestamp order.
> Bundling-invariance then implies that earlier timers can clear
> later timers just like they can re-set them. So even though a
> timer is delivered in a bundle, the local state of the timer wins.
> This matches how state works as well; no matter how things are
> bundled, the state you read is always whatever was written last.
>
> Kenn
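
The rule Kenn describes (timers called back in timestamp order, with the
timer's current local state winning over what was "delivered" in the bundle)
can be sketched as a small plain-Python loop. fire_timers and the callback
shape are hypothetical and do not correspond to any Beam runner API.

```python
def fire_timers(timers, callbacks, now):
    """Fire every timer with timestamp < now, earliest first.

    timers:    dict of timer name -> timestamp (mutated in place)
    callbacks: dict of timer name -> function(timers), which may clear
               or re-set other timers
    The set of due timers is re-read after every callback, so a timer
    cleared by an earlier callback is never fired.
    """
    fired = []
    while True:
        due = [(ts, name) for name, ts in timers.items() if ts < now]
        if not due:
            return fired
        ts, name = min(due)   # earliest timestamp first
        del timers[name]      # a timer is consumed when it fires
        fired.append((name, ts))
        callbacks[name](timers)


timers = {"a": 1, "b": 2, "c": 3}
callbacks = {
    "a": lambda t: t.pop("b", None),  # the earlier timer clears a later one
    "b": lambda t: None,
    "c": lambda t: None,
}
fired = fire_timers(timers, callbacks, now=4)
```

All three timers fall in the same window, but "b" never fires because "a" ran
first and cleared it; the result is the same as if each timer had been
delivered in its own bundle.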
>
> On Tue, May 11, 2021 at 1:24 PM Siyuan Chen wrote:
>>
>> On Tue, May 11, 2021 at 11:08 AM Reuven Lax wrote:
>>>
>>> On Tue, May 11, 2021 at 9:01 AM Kenneth Knowles wrote:

Re: [VOTE] Vendored Dependencies Release Byte Buddy 1.11.0

2021-05-20 Thread Etienne Chauchot

+1 (binding) on releasing vendored bytebuddy for testing in 
https://github.com/apache/beam/pull/14824


Etienne

On 19/05/2021 23:43, Kai Jiang wrote:

+1 (non-binding)

On Wed, May 19, 2021 at 12:23 PM Jan Lukavský wrote:


+1 (non-binding)

verified correct shading.

 Jan

On 5/19/21 8:53 PM, Ismaël Mejía wrote:

This release is only to publish the vendored dependency
artifacts. We need those to integrate it and be able to verify if
it causes problems or not. The PR for this is already open but
it needs the artifacts of this vote to be run.
https://github.com/apache/beam/pull/14824


For ref there is a document on how to release and validate
releases of Beam's vendored dependencies that can be handy to
anyone wishing to help validate:
https://s.apache.org/beam-release-vendored-artifacts


On Wed, May 19, 2021 at 8:45 PM Tyson Hamilton wrote:

I'd like to help, but I don't know how to determine whether
this upgrade is going to cause problems or not. Are there
tests I should look at, or some validation I should perform?

On Wed, May 19, 2021 at 11:29 AM Ismaël Mejía wrote:

Kind reminder, the vote is ongoing

On Mon, May 17, 2021 at 5:32 PM Ismaël Mejía wrote:

Please review the release of the following artifacts
that we vendor:
 * beam-vendor-bytebuddy-1_11_0

Hi everyone,
Please review and vote on the release candidate #1
for the version 0.1, as follows:
[ ] +1, Approve the release
[ ] -1, Do not approve the release (please provide
specific comments)

The complete staging area is available for your
review, which includes:
* the official Apache source release to be deployed
to dist.apache.org  [1],
which is signed with the key with fingerprint
3415631729E15B33051ADB670A9DAF6713B86349 [2],
* all artifacts to be deployed to the Maven Central
Repository [3],
* commit hash
"d93c591deb21237ddb656583d7ef7a4debba" [4],

The vote will be open for at least 72 hours. It is
adopted by majority approval, with at least 3 PMC
affirmative votes.

Thanks,
Release Manager

[1]
https://dist.apache.org/repos/dist/dev/beam/vendor/

[2]
https://dist.apache.org/repos/dist/release/beam/KEYS

[3]

https://repository.apache.org/content/repositories/orgapachebeam-1166/


[4]

https://github.com/apache/beam/commit/d93c591deb21237ddb656583d7ef7a4debba
