Re: [PROPOSAL] Vendored bytebuddy dependency release

2021-05-14 Thread Reuven Lax
Microbenchmarks are tough for these benchmarks. In the past, we've had
changes that increase the time it took to generate bytcode. While this his
minimal impact on real pipelines (since bytecode is generated on worker
startup), it has an outsized impact on microbencmark run time.

On Wed, May 12, 2021 at 5:55 AM Ismaël Mejía  wrote:

> Testing this particular kind of PR for perf would be tricky, I think the
> easiest thing we can notice is if the runtime of the CI tests differs a lot.
> I really don't think the generated bytecode with the new version would
> differ much but is for sure something we should pay attention to.
> And worse case scenario reversing the upgrade should not be that difficult
> given Beam's well confined dependency on bytebuddy.
>
> Other ideas/comments?
>
>
> On Mon, May 10, 2021 at 7:16 PM Reuven Lax  wrote:
>
>> What's the best way to test a PR for perf?
>>
>> On Mon, May 10, 2021 at 8:59 AM Kenneth Knowles  wrote:
>>
>>> If nothing breaks, and we check perf, then absolutely this seems good.
>>>
>>> Kenn
>>>
>>> On Mon, May 10, 2021 at 12:38 AM Ismaël Mejía  wrote:
>>>
 Most issues on the previous migration were related to changes on
 behavior of class-loading on Java 11. It seems Oracle is taking a more
 backwards compatible on latest releases, so let's hope everything will go
 well. In the meantime I tested the upgrade locally and tests are passing ok
 so we should be good to go. I opened a PR [1] for the version upgrade and
 assuming consensus on this proposal I expect we can pass to vote soon.

 [1] https://github.com/apache/beam/pull/14766


 On Sun, May 9, 2021 at 6:13 PM Reuven Lax  wrote:

> We've had some issues in the past with semantic changes in ByteBuddy
> (I think related to new Java versions) that required rewriting code in
> Beam.
>
> On Sat, May 8, 2021 at 10:46 PM Ismaël Mejía 
> wrote:
>
>> What were the issues last time Reuven? I remember that the release
>> and upgrade PR were pretty smooth, were there unintended consequences 
>> from
>> the library changes themselves?
>>
>>
>> On Sun, May 9, 2021 at 12:36 AM Reuven Lax  wrote:
>>
>>> Sounds good. Based on previous experience though, this might be a
>>> difficult upgrade to do.
>>>
>>> On Sat, May 8, 2021 at 12:57 AM Ismaël Mejía 
>>> wrote:
>>>
 The version of bytebuddy Beam is vendoring (1.10.8) is already 16
 months old and
 it is not compatible with more recent versions of Java. I would
 like to propose
 that we upgrade it [1] to the most recent version (1.11.0) [2] so
 we can benefit
 of the latest improvements for Java 16/17 and upgraded ASM.

 If everyone agrees I would like to volunteer as the release manager
 for this
 upgrade.

 [1] https://issues.apache.org/jira/browse/BEAM-12241
 [2]
 https://github.com/raphw/byte-buddy/blob/master/release-notes.md




Re: Removing deprecated oauth2client dependency for Python SDK

2021-05-14 Thread Ahmet Altay
+Valentyn Tymofieiev  might have an idea.

On Mon, May 3, 2021 at 4:12 PM Chuck Yang  wrote:

> Hi Beam devs,
>
> I saw there has been some previous discussion [1][2] around removing
> the deprecated oauth2client dependency and using the supported
> google-auth dependency instead. A portion of this work seems to
> involve migrating off of google-apitools since this call [3] is not
> supported by credentials objects emitted by google-auth.
>
> Does anyone have any experience/insights on how much work migrating
> off of oauth2client would involve? I might be able to help out but
> wanted to see a) if anyone is already looking at this and b) if there
> are any hidden obstacles beyond needing to move from google-apitools
> to the google-cloud-* libraries. Any pointers are appreciated!
>
> We're interested in this migration because of the need to use custom
> token URIs for issuing service account tokens--it's supported by
> google-auth but not oauth2client.
>
> [1] https://issues.apache.org/jira/browse/BEAM-7352
> [2] https://github.com/google/apitools/issues/225#issuecomment-434884589
> [3]
> https://github.com/google/apitools/blob/v0.5.31/apitools/base/py/base_api.py#L266
>
> Thanks!
> Chuck
>
> --
>
>
> *Confidentiality Note:* We care about protecting our proprietary
> information, confidential material, and trade secrets. This message may
> contain some or all of those things. Cruise will suffer material harm if
> anyone other than the intended recipient disseminates or takes any action
> based on this message. If you have received this message (including any
> attachments) in error, please delete it immediately and notify the sender
> promptly.
>


Re: [PROPOSAL] Preparing for Beam 2.30.0 release

2021-05-14 Thread Chamikara Jayalath
Please note that I created https://issues.apache.org/jira/browse/BEAM-12338
as a potential blocker.
Fix was merged to master and sent a cherry-pick:
https://github.com/apache/beam/pull/14821

Thanks,
Cham

On Fri, May 14, 2021 at 7:59 AM Evan Galpin  wrote:

> Not sure if it's on the radar but there was mention on this Reshuffle
> PR[1] of cherry-picking it to 2.30.0
>
> [1] https://github.com/apache/beam/pull/14720
>
> Thanks,
> Evan
>
> On Thu, May 13, 2021 at 6:50 PM Heejong Lee  wrote:
>
>> UPDATE:
>>
>> All precommit and postcommit tests are passed now:
>> https://github.com/apache/beam/pull/14711
>>
>> We only have one open issue for Fix Version 2.30.0:
>> https://issues.apache.org/jira/browse/BEAM-12242?jql=project%20%3D%20BEAM%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20%22In%20Implementation%22%2C%20%22Triage%20Needed%22)%20AND%20fixVersion%20%3D%202.30.0
>>
>> I will start building the RC release after we cherry-pick the last
>> blocker.
>>
>> On Thu, Apr 29, 2021 at 12:48 AM Heejong Lee  wrote:
>>
>>> We have 10 open issues for Fix Version 2.30.0:
>>> https://issues.apache.org/jira/browse/BEAM-12242?jql=project%20%3D%20BEAM%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20%22In%20Implementation%22%2C%20%22Triage%20Needed%22)%20AND%20fixVersion%20%3D%202.30.0
>>>
>>> On Thu, Apr 29, 2021 at 12:30 AM Heejong Lee  wrote:
>>>
 FYI, I just cut the 2.30.0 release branch. From now on, late commits
 for 2.30.0 need to be cherry-picked. If you have any late commits, please
 make sure that their Jira issues have the correct Fix Version, 2.30.0.

 On Tue, Apr 27, 2021 at 7:52 AM Kenneth Knowles 
 wrote:

> SGTM. Thanks!
>
> On Mon, Apr 26, 2021 at 2:33 PM Heejong Lee 
> wrote:
>
>>
>>
>> On Mon, Apr 26, 2021 at 10:24 AM Robert Bradshaw 
>> wrote:
>>
>>> Confirming that the cut date is 4/28/2021 (in two days), right?
>>>
>>
>> Yes, 2.30.0 branch is scheduled to be cut on April 28.
>>
>>
>>>
>>> On Wed, Apr 21, 2021 at 4:41 PM Tomo Suzuki 
>>> wrote:
>>> >
>>> > Thank you for the preparation!
>>> >
>>> > > a few responses that some high priority changes
>>> >
>>> > Would you be willing to share the items for visibility?
>>>
>>> There are several PRs in flight (or recently merged) to get
>>> portability working well with Dataflow for this release.
>>>
>>
>> We can still cherry-pick them by importance after the branch cut.
>>
>>
>>>
>>> >
>>> > On Wed, Apr 21, 2021 at 7:21 PM Kenneth Knowles 
>>> wrote:
>>> > >
>>> > > Also the 2.29.0 was re-cut.
>>> > >
>>> > > Usually a delay in one release should not delay the next
>>> release, because each release represents a certain quantity of changes. 
>>> But
>>> in this case, the actual quantity of changes is affected by the re-cut, 
>>> too.
>>> > >
>>> > > On Wed, Apr 21, 2021 at 4:12 PM Heejong Lee 
>>> wrote:
>>> > >>
>>> > >> Update on the 2.30.0 branch cut schedule:
>>> > >>
>>> > >> I'm thinking of delaying the branch cut a week since I've got a
>>> few responses that some high priority changes are still ongoing.
>>> > >>
>>> > >> The new cut date is April 28.
>>> > >>
>>> > >>
>>> > >> On Tue, Apr 20, 2021 at 6:07 PM Ahmet Altay 
>>> wrote:
>>> > >>>
>>> > >>> +1 and thank you!
>>> > >>>
>>> > >>> On Tue, Apr 20, 2021 at 4:55 PM Heejong Lee <
>>> heej...@google.com> wrote:
>>> > 
>>> >  Hi All,
>>> > 
>>> >  Beam 2.30.0 release is scheduled to be cut on April 21
>>> according to the release calendar [1]
>>> > 
>>> >  I'd like to volunteer myself to be the release manager for
>>> this release. I plan on cutting the release branch on the scheduled 
>>> date.
>>> > 
>>> >  Any comments or objections ?
>>> > 
>>> >  Thanks,
>>> >  Heejong
>>> > 
>>> >  [1]
>>> https://calendar.google.com/calendar/u/0/embed?src=0p73sl034k80oob7seouani...@group.calendar.google.com&ctz=America/Los_Angeles
>>> >
>>> >
>>> >
>>> > --
>>> > Regards,
>>> > Tomo
>>>
>>


Re: [EXT] Re: [EXT] Re: [EXT] Re: [EXT] Re: [EXT] Re: Beam Dataframe - sort and grouping

2021-05-14 Thread Wenbing Bai
Thank you for the clarification! Is there a way to control the number of
shards, i.e. the bundle? I know in pure Beam IO connectors, we have
num_shards supported, for example, WriteToParquet

.

Wenbing

On Thu, May 13, 2021 at 4:35 PM Robert Bradshaw  wrote:

> Sharding is determined by the distribution of work. Each worker writes to
> its own shard, and in the case of dynamic partitioning, etc. workers may
> end up processing more than one "bundle" of items and hence produce more
> than one shard. See also
> https://beam.apache.org/documentation/runtime/model/
>
> On Thu, May 13, 2021 at 3:58 PM Wenbing Bai 
> wrote:
>
>> Hi team,
>>
>> I have another question when using Beam Dataframe IO connector. I tried
>> to_parquet, and my data are written to several different files. I am
>> wondering how I can control the number of files (shards) or how the
>> sharding is done for to_parquet and other Beam Dataframe IO APIs?
>>
>> Thank you!
>> Wenbing
>>
>> On Tue, May 11, 2021 at 12:20 PM Kenneth Knowles  wrote:
>>
>>> +dev 
>>>
>>> In the Beam Java ecosystem, this functionality is provided by the Sorter
>>> library (
>>> https://beam.apache.org/documentation/sdks/java-extensions/#sorter).
>>> I'm curious what people think about various options:
>>>
>>>  - Python version of the transform(s)
>>>  - Expose sorter as xlang transform(s)
>>>  - Convenience transforms (that use pandas in DoFns?) to just do it for
>>> small data per key to achieve compatibility
>>>  - Beam model extension so that runners can do it as part of GBK
>>>
>>> Kenn
>>>
>>> On Mon, May 10, 2021 at 5:26 PM Wenbing Bai 
>>> wrote:
>>>
 Hi Robert and Brian,

 I don't know why I didn't catch your replies. But thank you so much for
 looking at this.

 My parquet files will be consumed by downstreaming processes which
 require data points with the same "key1" that are sorted by "key2". The
 downstreaming process, for example, will make a rolling window with size N
 that reads N records together at one time. But note, the rolling window
 will not cross different "key1".

 So that is saying, 1) I don't need to sort the whole dataset. 2) all
 data with the same "key1" should be located together.

 I am not sure if I explain the use case clearly. Let me know what you
 think.

 Wenbing


 On Tue, Apr 20, 2021 at 5:01 PM Robert Bradshaw 
 wrote:

> It would also be helpful to understand what your overall objective is
> with this output. Is there a reason you need it sorted/partitioned in a
> certain way?
>
> On Tue, Apr 20, 2021 at 4:51 PM Brian Hulette 
> wrote:
>
>> Hi Wenbing,
>> Sorry for taking so long to get back to you on this.
>> I discussed this with Robert offline and we came up with a potential
>> workaround - you could try writing out the Parquet file from within the
>> groupby.apply method. You can use beam's FileSystems abstraction to open 
>> a
>> Python file object referencing a cloud storage file, and pass that file
>> object directly to the pandas to_parquet. It would look something like 
>> this:
>>
>>   df.groupby('key1').apply(lambda df:
>> df.sort_values(by='key2').to_parquet(FileSystems.open("gs://bucket/file.pq"))
>>
>> If writing out sorted, partitioned parquet files is a common use-case
>> we should think about making this easier though. At the very least
>> partition_cols should work, I filed BEAM-12201 [1] for this. That alone
>> won't be enough as our implementation will likely reshuffle the dataset 
>> to
>> enforce the partitioning, removing any sorting that you've applied, so 
>> we'd
>> also need to think about how to optimize the pipeline to avoid that 
>> shuffle.
>>
>> Brian
>>
>> [1] https://issues.apache.org/jira/browse/BEAM-12201
>>
>> On Wed, Apr 7, 2021 at 9:02 PM Wenbing Bai 
>> wrote:
>>
>>> Thank you, Brian. I tried `partition_cols`, but it is not working. I
>>> tried pure pandas, it does work, so I am not sure if anything wrong with
>>> Beam.
>>>
>>> Wenbing
>>>
>>> On Wed, Apr 7, 2021 at 2:56 PM Brian Hulette 
>>> wrote:
>>>
 Hm, to_parquet does have a `partition_cols` argument [1] which we
 pass through [2]. It would be interesting to see what
 `partition_cols='key1'` does - I suspect it won't work perfectly 
 though.

 Do you have any thoughts here Robert?

 [1]
 https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_parquet.html
 [2]
 https://github.com/apache/beam/blob/a8cd05932bed9b2480316fb8518409636cb2733b/sdks/python/apache_beam/dataframe/io.py#L525

 On Wed, Apr 7, 2021 at 2:22 PM Wen

Flaky test issue report

2021-05-14 Thread Beam Jira Bot
This is your daily summary of Beam's current flaky tests. These are P1 issues 
because they have a major negative impact on the community and make it hard to 
determine the quality of the software.

BEAM-12322: FnApiRunnerTestWithGrpcAndMultiWorkers flaky (py precommit) 
(https://issues.apache.org/jira/browse/BEAM-12322)
BEAM-12309: PubSubIntegrationTest.test_streaming_data_only flake 
(https://issues.apache.org/jira/browse/BEAM-12309)
BEAM-12307: PubSubBigQueryIT.test_file_loads flake 
(https://issues.apache.org/jira/browse/BEAM-12307)
BEAM-12303: Flake in PubSubIntegrationTest.test_streaming_with_attributes 
(https://issues.apache.org/jira/browse/BEAM-12303)
BEAM-12293: FlinkSavepointTest.testSavepointRestoreLegacy flakes due to 
FlinkJobNotFoundException (https://issues.apache.org/jira/browse/BEAM-12293)
BEAM-12291: 
org.apache.beam.runners.flink.ReadSourcePortableTest.testExecution[streaming: 
false] is flaky (https://issues.apache.org/jira/browse/BEAM-12291)
BEAM-12200: SamzaStoreStateInternalsTest is flaky 
(https://issues.apache.org/jira/browse/BEAM-12200)
BEAM-12163: Python GHA PreCommits flake with grpc.FutureTimeoutError on SDK 
harness startup (https://issues.apache.org/jira/browse/BEAM-12163)
BEAM-12061: beam_PostCommit_SQL failing on 
KafkaTableProviderIT.testFakeNested 
(https://issues.apache.org/jira/browse/BEAM-12061)
BEAM-12019: 
apache_beam.runners.portability.flink_runner_test.FlinkRunnerTestOptimized.test_flink_metrics
 is flaky (https://issues.apache.org/jira/browse/BEAM-12019)
BEAM-11792: Python precommit failed (flaked?) installing package  
(https://issues.apache.org/jira/browse/BEAM-11792)
BEAM-11666: 
apache_beam.runners.interactive.recording_manager_test.RecordingManagerTest.test_basic_execution
 is flaky (https://issues.apache.org/jira/browse/BEAM-11666)
BEAM-11662: elasticsearch tests failing 
(https://issues.apache.org/jira/browse/BEAM-11662)
BEAM-11661: hdfsIntegrationTest flake: network not found (py38 postcommit) 
(https://issues.apache.org/jira/browse/BEAM-11661)
BEAM-11646: beam_PostCommit_XVR_Spark failing 
(https://issues.apache.org/jira/browse/BEAM-11646)
BEAM-11645: beam_PostCommit_XVR_Flink failing 
(https://issues.apache.org/jira/browse/BEAM-11645)
BEAM-11541: testTeardownCalledAfterExceptionInProcessElement flakes on 
direct runner. (https://issues.apache.org/jira/browse/BEAM-11541)
BEAM-11540: Linter sometimes flakes on apache_beam.dataframe.frames_test 
(https://issues.apache.org/jira/browse/BEAM-11540)
BEAM-10995: Java + Universal Local Runner: 
WindowingTest.testWindowPreservation fails 
(https://issues.apache.org/jira/browse/BEAM-10995)
BEAM-10987: stager_test.py::StagerTest::test_with_main_session flaky on 
windows py3.6,3.7 (https://issues.apache.org/jira/browse/BEAM-10987)
BEAM-10968: flaky test: 
org.apache.beam.sdk.metrics.MetricsTest$AttemptedMetricTests.testAttemptedDistributionMetrics
 (https://issues.apache.org/jira/browse/BEAM-10968)
BEAM-10955: Flink Java Runner test flake: Could not find Flink job  
(https://issues.apache.org/jira/browse/BEAM-10955)
BEAM-10923: Python requirements installation in docker container is flaky 
(https://issues.apache.org/jira/browse/BEAM-10923)
BEAM-10899: test_FhirIO_exportFhirResourcesGcs flake with OOM 
(https://issues.apache.org/jira/browse/BEAM-10899)
BEAM-10866: PortableRunnerTestWithSubprocesses.test_register_finalizations 
flaky on macOS (https://issues.apache.org/jira/browse/BEAM-10866)
BEAM-10590: BigQueryQueryToTableIT flaky: test_big_query_new_types 
(https://issues.apache.org/jira/browse/BEAM-10590)
BEAM-10519: 
MultipleInputsAndOutputTests.testParDoWithSideInputsIsCumulative flaky on Samza 
(https://issues.apache.org/jira/browse/BEAM-10519)
BEAM-10504: Failure / flake in ElasticSearchIOTest > 
testWriteFullAddressing and testWriteWithIndexFn 
(https://issues.apache.org/jira/browse/BEAM-10504)
BEAM-10501: CheckGrafanaStalenessAlerts and PingGrafanaHttpApi fail with 
Connection refused (https://issues.apache.org/jira/browse/BEAM-10501)
BEAM-10485: Failure / flake: ElasticsearchIOTest > testWriteWithIndexFn 
(https://issues.apache.org/jira/browse/BEAM-10485)
BEAM-10272: Failure in CassandraIOTest init: cannot create cluster due to 
netty link error (https://issues.apache.org/jira/browse/BEAM-10272)
BEAM-9649: beam_python_mongoio_load_test started failing due to mismatched 
results (https://issues.apache.org/jira/browse/BEAM-9649)
BEAM-9392: TestStream tests are all flaky 
(https://issues.apache.org/jira/browse/BEAM-9392)
BEAM-9232: BigQueryWriteIntegrationTests is flaky coercing to Unicode 
(https://issues.apache.org/jira/browse/BEAM-9232)
BEAM-9119: 
apache_beam.runners.portability.fn_api_runner_test.FnApiRunnerTest[...].test_large_elements
 is flaky (https://issues.apache.org/jira/browse/BEAM-9119)
BEAM-8101: Flakes in 
ParDoLifecycleTest.testTeardownCalledAfterExce

P1 issues report

2021-05-14 Thread Beam Jira Bot
This is your daily summary of Beam's current P1 issues, not including flaky 
tests.

See https://beam.apache.org/contribute/jira-priorities/#p1-critical for the 
meaning and expectations around P1 issues.

BEAM-12338: Some BQ DisplayData are missing for the Python FnAPI path 
(https://issues.apache.org/jira/browse/BEAM-12338)
BEAM-12324: TranslationsTest.test_run_packable_combine_* failing on 
PostCommit_Py_VR_Dataflow (https://issues.apache.org/jira/browse/BEAM-12324)
BEAM-12320: PubsubTableProviderIT.testSQLSelectsArrayAttributes[0] failing 
in SQL PostCommit (https://issues.apache.org/jira/browse/BEAM-12320)
BEAM-12316: LGPL in bundled dependencies 
(https://issues.apache.org/jira/browse/BEAM-12316)
BEAM-12310: beam_PostCommit_Java_DataflowV2 failing 
(https://issues.apache.org/jira/browse/BEAM-12310)
BEAM-12308: CrossLanguageKafkaIOTest.test_kafkaio flake 
(https://issues.apache.org/jira/browse/BEAM-12308)
BEAM-12290: TestPubsub.assertThatSubscriptionEventuallyCreated timeout does 
not work (https://issues.apache.org/jira/browse/BEAM-12290)
BEAM-12287: beam_PerformanceTests_Kafka_IO failing due to 
:sdks:java:container:pullLicenses failure 
(https://issues.apache.org/jira/browse/BEAM-12287)
BEAM-12279: Implement destination-dependent sharding in FileIO.writeDynamic 
(https://issues.apache.org/jira/browse/BEAM-12279)
BEAM-12258: SQL postcommit timing out 
(https://issues.apache.org/jira/browse/BEAM-12258)
BEAM-12256: PubsubIO.readAvroGenericRecord creates SchemaCoder that fails 
to decode some Avro logical types 
(https://issues.apache.org/jira/browse/BEAM-12256)
BEAM-12231: beam_PostRelease_NightlySnapshot failing 
(https://issues.apache.org/jira/browse/BEAM-12231)
BEAM-11959: Python Beam SDK Harness hangs when installing pip packages 
(https://issues.apache.org/jira/browse/BEAM-11959)
BEAM-11906: No trigger early repeatedly for session windows 
(https://issues.apache.org/jira/browse/BEAM-11906)
BEAM-11875: XmlIO.Read does not handle XML encoding per spec 
(https://issues.apache.org/jira/browse/BEAM-11875)
BEAM-11828: JmsIO is not acknowledging messages correctly 
(https://issues.apache.org/jira/browse/BEAM-11828)
BEAM-11755: Cross-language consistency (RequiresStableInputs) is quietly 
broken (at least on portable flink runner) 
(https://issues.apache.org/jira/browse/BEAM-11755)
BEAM-11578: `dataflow_metrics` (python) fails with TypeError (when int 
overflowing?) (https://issues.apache.org/jira/browse/BEAM-11578)
BEAM-11576: Go ValidatesRunner failure: TestFlattenDup on Dataflow Runner 
(https://issues.apache.org/jira/browse/BEAM-11576)
BEAM-11434: Expose Spanner admin/batch clients in Spanner Accessor 
(https://issues.apache.org/jira/browse/BEAM-11434)
BEAM-11227: Upgrade beam-vendor-grpc-1_26_0-0.3 to fix CVE-2020-27216 
(https://issues.apache.org/jira/browse/BEAM-11227)
BEAM-11148: Kafka commitOffsetsInFinalize OOM on Flink 
(https://issues.apache.org/jira/browse/BEAM-11148)
BEAM-11017: Timer with dataflow runner can be set multiple times (dataflow 
runner) (https://issues.apache.org/jira/browse/BEAM-11017)
BEAM-10861: Adds URNs and payloads to PubSub transforms 
(https://issues.apache.org/jira/browse/BEAM-10861)
BEAM-10670: Make non-portable Splittable DoFn the only option when 
executing Java "Read" transforms 
(https://issues.apache.org/jira/browse/BEAM-10670)
BEAM-10617: python CombineGlobally().with_fanout() cause duplicate combine 
results for sliding windows (https://issues.apache.org/jira/browse/BEAM-10617)
BEAM-10569: SpannerIO tests don't actually assert anything. 
(https://issues.apache.org/jira/browse/BEAM-10569)
BEAM-10288: Quickstart documents are out of date 
(https://issues.apache.org/jira/browse/BEAM-10288)
BEAM-10244: Populate requirements cache fails on poetry-based packages 
(https://issues.apache.org/jira/browse/BEAM-10244)
BEAM-10100: FileIO writeDynamic with AvroIO.sink not writing all data 
(https://issues.apache.org/jira/browse/BEAM-10100)
BEAM-9564: Remove insecure ssl options from MongoDBIO 
(https://issues.apache.org/jira/browse/BEAM-9564)
BEAM-9455: Environment-sensitive provisioning for Dataflow 
(https://issues.apache.org/jira/browse/BEAM-9455)
BEAM-9293: Python direct runner doesn't emit empty pane when it should 
(https://issues.apache.org/jira/browse/BEAM-9293)
BEAM-8986: SortValues may not work correct for numerical types 
(https://issues.apache.org/jira/browse/BEAM-8986)
BEAM-8985: SortValues should fail if SecondaryKey coder is not 
deterministic (https://issues.apache.org/jira/browse/BEAM-8985)
BEAM-8407: [SQL] Some Hive tests throw NullPointerException, but get marked 
as passing (Direct Runner) (https://issues.apache.org/jira/browse/BEAM-8407)
BEAM-7717: PubsubIO watermark tracking hovers near start of epoch 
(https://issues.apache.org/jira/browse/BEAM-7717)
BEAM-7716: PubsubIO returns empty message bodies for

Beam Learning Month - Looking for speakers

2021-05-14 Thread Mara Ruvalcaba

Hi Apache Beam Community! :beam's-mascot-1:

We are organizing a Beam learning month, that will happen during the 
Wednesdays of June, and we are looking for speakers who want to share 
their use cases on Apache Beam.


Remember that to share your experience will help others to learn and 
solve their issues, will give you more experience as speakers, and it 
will make you a better contributor =)


Please contact @Brittany Hermann 
  if you are interested in 
sharing your knowledge with the community.


--
Mara Ruvalcaba
COO, SG Software Guru & Nearshore Link
USA: 512 296 2884
MX: 55 5239 5502



Re: [DISCUSS] Client SDK/Job Server/Worker Pool Lifecycle Management on Kubernetes

2021-05-14 Thread Kyle Weaver
1. and 2. are both facilitated by GRPC, which takes care of most of the
retry/wait logic. In some places we have a configurable timeout (which
defaults to 60s) [1], while in other places we block [2][3].

[1] https://issues.apache.org/jira/browse/BEAM-7933
[2]
https://github.com/apache/beam/blob/51541a595b09751dd3dde2c50caf2a968ac01b68/sdks/python/apache_beam/runners/portability/portable_runner.py#L238-L242
[3]
https://github.com/apache/beam/blob/9601bdef8870bc6acc7895c06252e43ec040bd8c/runners/java-fn-execution/src/main/java/org/apache/beam/runners/fnexecution/environment/ExternalEnvironmentFactory.java#L115

On Fri, May 14, 2021 at 10:51 AM Ke Wu  wrote:

> Hello All,
>
> I came across this question when I am reading Beam on Flink on Kubernetes
> 
>  and
> flink-on-k8s-operator
> 
>  and
> realized that there seems no retry/wait logic built in PortableRunner
> nor ExternalEnvironmentFactory, (correct me if I am wrong) which creates
> implications that:
>
> 1. Job Server needs to be ready to accept request before SDK Client could
> submit request.
> 2. External Worker Pool Service needs to be ready to accept start/stop
> worker request before runner starts to request.
>
> This may bring some challenges on k8s since Flink opt to use multi
> containers pattern when bringing up a beam portable pipeline, in addition,
> I don’t find any special lifecycle management in place to guarantee the
> order, e.g. External Worker Pool Service container to start and ready
> before the task manager container to start making requests.
>
> I am wondering if I missed anything to guarantee the readiness of the
> dependent service or we are relying on that dependent containers are much
> lighter weigh so it should, in most time, be ready before the other
> container start to make requests.
>
> Best,
> Ke
>
>


Re: BEAM-11838: An x-lang wrapper for DebeziumIO

2021-05-14 Thread Benjamin Gonzalez Delgado
Hi Pablo!
Did you have a chance to execute the test to find something rare?
Do you know if there is a way to use breakpoints when running this kind of
test I mean breakpoints in java code when running a pipeline from python?


On Fri, May 7, 2021 at 4:54 PM Benjamin Gonzalez Delgado <
benjamin.gonza...@wizeline.com> wrote:

> Hi Pablo!,
>
>
> I run the gradle task
>
> ./gradlew :sdks:python:test-suites:portable:py37:postCommitPy37IT
>
>
> You can comment the other test in
> sdks/python/test-suites/portable/common.gradle
> 
>  for
> a faster run.
>
> Thanks.
>
> On Fri, May 7, 2021 at 4:42 PM Pablo Estrada  wrote:
>
>> Hi Benjamin!
>> How do I run the test on your PR? I can try to run it to see what's going
>> on.[
>> Best
>> -P.
>>
>> On Thu, May 6, 2021 at 1:46 PM Pablo Estrada  wrote:
>>
>>> Hi Benjamin!
>>> That's strange if the Java test is working well that the Python/xlang
>>> test would run into issues. Let me take a look at your PR and see if I can
>>> see anything suspicious.
>>> Best
>>> -P.
>>>
>>> On Thu, May 6, 2021 at 10:49 AM Benjamin Gonzalez Delgado <
>>> benjamin.gonza...@wizeline.com> wrote:
>>>
 Hi team,
 I am working on BEAM-11838[1], but I am a little stuck right now[2], my
 issue is that when I set param maxNumberOfRecords through
 ExternalTransformBuilder(L117) its sets the value correctly to the
 constructor on KafkaSourceConsumerFn (L111), but when it arrives at
 tryClaim method(L307) the value for maxNumberOfRecords is null again and it
 keeps running forever so my test never ends.
 I wonder if some of you have experience with KafkaSourceConsumerFn
 class or creating other cross-language transforms that could help.
 Any guidance on this would be appreciated.

 PD. The existing java test DebeziumIOMySqlConnectorIT does not have the
 same problem.

 [1] https://issues.apache.org/jira/browse/BEAM-11838
 [2] https://github.com/benWize/beam/pull/12/files#









 *This email and its contents (including any attachments) are being sent
 toyou on the condition of confidentiality and may be protected by
 legalprivilege. Access to this email by anyone other than the intended
 recipientis unauthorized. If you are not the intended recipient, please
 immediatelynotify the sender by replying to this message and delete the
 materialimmediately from your system. Any further use, dissemination,
 distributionor reproduction of this email is strictly prohibited. Further,
 norepresentation is made with respect to any content contained in this
 email.*
>>>
>>>

-- 
*This email and its contents (including any attachments) are being sent to
you on the condition of confidentiality and may be protected by legal
privilege. Access to this email by anyone other than the intended recipient
is unauthorized. If you are not the intended recipient, please immediately
notify the sender by replying to this message and delete the material
immediately from your system. Any further use, dissemination, distribution
or reproduction of this email is strictly prohibited. Further, no
representation is made with respect to any content contained in this email.*


[DISCUSS] Client SDK/Job Server/Worker Pool Lifecycle Management on Kubernetes

2021-05-14 Thread Ke Wu
Hello All,

I came across this question when I am reading Beam on Flink on Kubernetes 

 and flink-on-k8s-operator 

 and realized that there seems no retry/wait logic built in PortableRunner nor 
ExternalEnvironmentFactory, (correct me if I am wrong) which creates 
implications that:

1. Job Server needs to be ready to accept request before SDK Client could 
submit request.
2. External Worker Pool Service needs to be ready to accept start/stop worker 
request before runner starts to request.

This may bring some challenges on k8s since Flink opt to use multi containers 
pattern when bringing up a beam portable pipeline, in addition, I don’t find 
any special lifecycle management in place to guarantee the order, e.g. External 
Worker Pool Service container to start and ready before the task manager 
container to start making requests. 

I am wondering if I missed anything to guarantee the readiness of the dependent 
service or we are relying on that dependent containers are much lighter weigh 
so it should, in most time, be ready before the other container start to make 
requests. 

Best,
Ke



Re: Extremely Slow DirectRunner

2021-05-14 Thread Evan Galpin
Any further thoughts here? Or tips on profiling Beam DirectRunner?

Thanks,
Evan

On Wed, May 12, 2021 at 6:22 PM Evan Galpin  wrote:

> Ok gotcha. In my tests, all sdk versions 2.25.0 and higher exhibit slow
> behaviour regardless of use_deprecated_reads. Not sure if that points to
> something different then.
>
> Thanks,
> Evan
>
> On Wed, May 12, 2021 at 18:16 Steve Niemitz  wrote:
>
>> I think it was only broken in 2.29.
>>
>> On Wed, May 12, 2021 at 5:53 PM Evan Galpin 
>> wrote:
>>
>>> Ah ok thanks for that. Do you mean use_deprecated_reads is broken
>>> specifically in 2.29.0 (regression) or broken in all versions up to and
>>> including 2.29.0 (ie never worked)?
>>>
>>> Thanks,
>>> Evan
>>>
>>> On Wed, May 12, 2021 at 17:12 Steve Niemitz  wrote:
>>>
 Yeah, sorry my email was confusing.  use_deprecated_reads is broken on
 the DirectRunner in 2.29.

 The behavior you describe is exactly the behavior I ran into as well
 when reading from pubsub with the new read method.  I believe that soon the
 default is being reverted back to the old read method, not using SDFs,
 which will fix your performance issue.

 On Wed, May 12, 2021 at 4:40 PM Boyuan Zhang 
 wrote:

> Hi Evan,
>
> It seems like the slow step is not the read that use_deprecated_read
> targets for. Would you like to share your pipeline code if possible?
>
> On Wed, May 12, 2021 at 1:35 PM Evan Galpin 
> wrote:
>
>> I just tried with v2.29.0 and use_deprecated_read but unfortunately I
>> observed slow behavior again. Is it possible that use_deprecated_read is
>> broken in 2.29.0 as well?
>>
>> Thanks,
>> Evan
>>
>> On Wed, May 12, 2021 at 3:21 PM Steve Niemitz 
>> wrote:
>>
>>> oops sorry I was off by 10...I meant 2.29 not 2.19.
>>>
>>> On Wed, May 12, 2021 at 2:55 PM Evan Galpin 
>>> wrote:
>>>
 Thanks for the link/info. v2.19.0 and v2.21.0 did exhibit the
 "faster" behavior, as did v2.23.0. But that "fast" behavior stopped at
 v2.25.0 (for my use case at least) regardless of use_deprecated_read
 setting.

 Thanks,
 Evan


 On Wed, May 12, 2021 at 2:47 PM Steve Niemitz 
 wrote:

> use_deprecated_read was broken in 2.19 on the direct runner and
> didn't do anything. [1]  I don't think the fix is in 2.20 either, but 
> will
> be in 2.21.
>
> [1] https://github.com/apache/beam/pull/14469
>
> On Wed, May 12, 2021 at 1:41 PM Evan Galpin 
> wrote:
>
>> I forgot to also mention that in all tests I was setting
>> --experiments=use_deprecated_read
>>
>> Thanks,
>> Evan
>>
>> On Wed, May 12, 2021 at 1:39 PM Evan Galpin <
>> evan.gal...@gmail.com> wrote:
>>
>>> Hmm, I think I spoke too soon. I'm still seeing an issue of
>>> overall DirectRunner slowness, not just pubsub. I have a pipeline 
>>> like so:
>>>
>>> Read pubsub  |  extract GCS glob patterns  |  FileIO.matchAll()
>>> |  FileIO.readMatches()  |  Read file contents  |  etc
>>>
>>> I have temporarily set up a transform between each step to log
>>> what's going on and illustrate timing issues.  I ran a series of 
>>> tests
>>> changing only the SDK version each time since I hadn't noticed this
>>> performance issue with 2.19.0 (effectively git-bisect). Before each 
>>> test, I
>>> seeded the pubsub subscription with the exact same contents.
>>>
>>> SDK version 2.25.0 (I had a build issue with 2.24.0 that I
>>> couldn't seem to resolve) and onward show a significant slowdown.
>>>
>>> Here is a snippet of logging from v2.25.0:
>>>
>>> *May 12, 2021 11:11:52 A.M.* com.myOrg.myPipeline.PipelineLeg$1
>>> processElement
>>> INFO: Got file pattern: gs://my-bucket/my-dir/5004728247517184/**
>>> May 12, 2021 11:16:59 A.M.
>>> org.apache.beam.sdk.io.FileIO$MatchAll$MatchFn process
>>> INFO: Matched 2 files for pattern
>>> gs://my-bucket/my-dir/5004728247517184/**
>>> May 12, 2021 11:23:32 A.M. com.myOrg.myPipeline.PipelineLeg$3
>>> processElement
>>> INFO: Got ReadableFile: my-file1.json
>>> May 12, 2021 11:23:32 A.M. com.myOrg.myPipeline.PipelineLeg$3
>>> processElement
>>> INFO: Got ReadableFile: my-file2.json
>>> May 12, 2021 11:24:35 A.M. com.myOrg.myPipeline.PipelineLeg$4
>>> processElement
>>> INFO: Got file contents for document_id my-file1.json
>>> *May 12, 2021 11:24:35 A.M*. com.myOrg.myPipeline.PipelineLeg$4
>>> processElement
>>> INFO: Got file contents for document_id my-file2.json
>>>
>>> Note th

Re: [PROPOSAL] Preparing for Beam 2.30.0 release

2021-05-14 Thread Evan Galpin
Not sure if it's on the radar but there was mention on this Reshuffle PR[1]
of cherry-picking it to 2.30.0

[1] https://github.com/apache/beam/pull/14720

Thanks,
Evan

On Thu, May 13, 2021 at 6:50 PM Heejong Lee  wrote:

> UPDATE:
>
> All precommit and postcommit tests are passed now:
> https://github.com/apache/beam/pull/14711
>
> We only have one open issue for Fix Version 2.30.0:
> https://issues.apache.org/jira/browse/BEAM-12242?jql=project%20%3D%20BEAM%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20%22In%20Implementation%22%2C%20%22Triage%20Needed%22)%20AND%20fixVersion%20%3D%202.30.0
>
> I will start building the RC release after we cherry-pick the last blocker.
>
> On Thu, Apr 29, 2021 at 12:48 AM Heejong Lee  wrote:
>
>> We have 10 open issues for Fix Version 2.30.0:
>> https://issues.apache.org/jira/browse/BEAM-12242?jql=project%20%3D%20BEAM%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20%22In%20Implementation%22%2C%20%22Triage%20Needed%22)%20AND%20fixVersion%20%3D%202.30.0
>>
>> On Thu, Apr 29, 2021 at 12:30 AM Heejong Lee  wrote:
>>
>>> FYI, I just cut the 2.30.0 release branch. From now on, late commits for
>>> 2.30.0 need to be cherry-picked. If you have any late commits, please make
>>> sure that their Jira issues have the correct Fix Version, 2.30.0.
>>>
>>> On Tue, Apr 27, 2021 at 7:52 AM Kenneth Knowles  wrote:
>>>
 SGTM. Thanks!

 On Mon, Apr 26, 2021 at 2:33 PM Heejong Lee  wrote:

>
>
> On Mon, Apr 26, 2021 at 10:24 AM Robert Bradshaw 
> wrote:
>
>> Confirming that the cut date is 4/28/2021 (in two days), right?
>>
>
> Yes, 2.30.0 branch is scheduled to be cut on April 28.
>
>
>>
>> On Wed, Apr 21, 2021 at 4:41 PM Tomo Suzuki 
>> wrote:
>> >
>> > Thank you for the preparation!
>> >
>> > > a few responses that some high priority changes
>> >
>> > Would you be willing to share the items for visibility?
>>
>> There are several PRs in flight (or recently merged) to get
>> portability working well with Dataflow for this release.
>>
>
> We can still cherry-pick them by importance after the branch cut.
>
>
>>
>> >
>> > On Wed, Apr 21, 2021 at 7:21 PM Kenneth Knowles 
>> wrote:
>> > >
>> > > Also the 2.29.0 was re-cut.
>> > >
>> > > Usually a delay in one release should not delay the next release,
>> because each release represents a certain quantity of changes. But in 
>> this
>> case, the actual quantity of changes is affected by the re-cut, too.
>> > >
>> > > On Wed, Apr 21, 2021 at 4:12 PM Heejong Lee 
>> wrote:
>> > >>
>> > >> Update on the 2.30.0 branch cut schedule:
>> > >>
>> > >> I'm thinking of delaying the branch cut a week since I've got a
>> few responses that some high priority changes are still ongoing.
>> > >>
>> > >> The new cut date is April 28.
>> > >>
>> > >>
>> > >> On Tue, Apr 20, 2021 at 6:07 PM Ahmet Altay 
>> wrote:
>> > >>>
>> > >>> +1 and thank you!
>> > >>>
>> > >>> On Tue, Apr 20, 2021 at 4:55 PM Heejong Lee 
>> wrote:
>> > 
>> >  Hi All,
>> > 
>> >  Beam 2.30.0 release is scheduled to be cut on April 21
>> according to the release calendar [1]
>> > 
>> >  I'd like to volunteer myself to be the release manager for
>> this release. I plan on cutting the release branch on the scheduled date.
>> > 
>> >  Any comments or objections ?
>> > 
>> >  Thanks,
>> >  Heejong
>> > 
>> >  [1]
>> https://calendar.google.com/calendar/u/0/embed?src=0p73sl034k80oob7seouani...@group.calendar.google.com&ctz=America/Los_Angeles
>> >
>> >
>> >
>> > --
>> > Regards,
>> > Tomo
>>
>