Re: [DISCUSS][BEAM-10670] Migrating BoundedSource/UnboundedSource to execute as a Splittable DoFn for non-portable Java runners

2020-10-12 Thread Luke Cwik
I have a draft[1] of the blog post ready. Please take a look.

1:
http://doc/1kpn0RxqZaoacUPVSMYhhnfmlo8fGT-p50fEblaFr2HE#heading=h.tbab2n97o3eo

On Mon, Oct 5, 2020 at 4:28 PM Luke Cwik  wrote:

>
>
> On Mon, Oct 5, 2020 at 3:45 PM Kenneth Knowles  wrote:
>
>>
>>
>> On Mon, Oct 5, 2020 at 2:44 PM Luke Cwik  wrote:
>>
>>> For the 2.25 release, the Java Direct, Flink, Jet, Samza, and Twister2
>>> runners will use SDF-powered Read transforms. Users can opt out
>>> with --experiments=use_deprecated_read.
>>>
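
A minimal sketch (for illustration only, not from the thread) of setting that
opt-out experiment programmatically with the standard Java
PipelineOptionsFactory; passing --experiments=use_deprecated_read on the
command line is equivalent.

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    public class OptOutOfSdfRead {
      public static void main(String[] args) {
        // Without this experiment, Read transforms are expanded and executed
        // as Splittable DoFns on the runners listed above.
        PipelineOptions options =
            PipelineOptionsFactory.fromArgs("--experiments=use_deprecated_read")
                .withValidation()
                .create();
        Pipeline pipeline = Pipeline.create(options);
        // ... construct the pipeline as usual ...
        pipeline.run().waitUntilFinish();
      }
    }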
>>
>> Huzzah! In our release notes maybe be clear about the expectations for
>> users:
>>
>> Done in https://github.com/apache/beam/pull/13015
>
>
>>  - semantics are expected to be the same: file bugs for any change in
>> results
>>  - perf may vary: file bugs or write to user@
>>
>>> I was unable to get Spark done for 2.25, as I found out that Spark
>>> streaming doesn't support watermark holds[1]. If someone knows more about
>>> the watermark system in Spark, I could use some guidance here, as I believe
>>> I have a version of unbounded SDF support written for Spark (I get all the
>>> expected output from tests; it's just that watermarks aren't being held
>>> back, so PAssert fails).
>>>
>>
>> Spark's watermarks are not comparable to Beam's. The rule, as I understand
>> it, is that any data later than `max(seen timestamps) - allowedLateness` is
>> dropped. One difference is that dropping is relative to the watermark
>> instead of to expiring windows, as in early versions of Beam. The other
>> difference is that Spark tracks the latest event (some call it a "high
>> water mark" because it is the highest datetime value seen), whereas Beam's
>> watermark is an approximation of the earliest (some call it a "low water
>> mark" because it is a guarantee that it will not dip lower). When I chatted
>> about this with Amit in the early days, it was necessary to implement a
>> Beam-style watermark using Spark state. I think that may still be the case
>> for correct results.
>>
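
To make the contrast concrete, a small illustrative sketch (hypothetical names,
neither Beam nor Spark code) of the two notions described above:

    import java.time.Duration;
    import java.time.Instant;

    class WatermarkContrastSketch {
      // Spark-style "high water mark": track the highest event time seen and
      // drop anything older than max(seen timestamps) - allowedLateness.
      static boolean sparkStyleWouldDrop(
          Instant elementTime, Instant maxSeen, Duration allowedLateness) {
        return elementTime.isBefore(maxSeen.minus(allowedLateness));
      }

      // Beam-style "low water mark": a lower bound on future event times.
      // Watermark holds keep the output watermark from advancing past data
      // that is still in flight, which is what prevents triggers (and PAssert)
      // from firing too early.
      static Instant beamStyleOutputWatermark(
          Instant upstreamWatermark, Instant earliestHold) {
        return upstreamWatermark.isBefore(earliestHold)
            ? upstreamWatermark
            : earliestHold;
      }
    }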
>>
> In the Spark implementation I saw that watermark holds weren't wired up at
> all to control Spark's watermarks, and this was causing triggers to fire too
> early.
>
>
>>> Also, I started a doc[2] to produce an updated blog post, since the
>>> original SplittableDoFn blog post from 2017 is out of date[3]. I was
>>> thinking of making this a new blog post and having the old blog post point
>>> to it. We could also remove the old blog post and/or update it. Any thoughts?
>>>
>>
>> New blog post w/ pointer from the old one.
>>
>>> Finally, I have a clean-up PR[4] that pushes the Read -> primitive Read
>>> expansion into each of the runners instead of having it within the Read
>>> transform in beam-sdks-java-core.
>>>
>>
>> Approved! I did CC a bunch of runner authors already. I think the
>> important thing is that if a default changes, we should be sure everyone is
>> OK with the perf changes and everyone is confident that no incorrect results
>> are produced. The abstractions between sdk-core, runners-core-*, and
>> individual runners are important to me:
>>
>>  - The SDK's job is to produce a portable, un-tweaked pipeline so moving
>> flags out of SDK core (and IOs) ASAP is super important.
>>  - The runner's job is to execute that pipeline, if they can, however
>> they want. If a runner wants to run Read transforms differently/directly
>> that is fine. If a runner is incapable of supporting SDF, then Read is
>> better than nothing. Etc.
>>  - The job of runners-core-* is just to be internal libraries for runner
>> authors to share code; they should not make any decisions about the Beam
>> model, etc.
>>
>> Kenn
>>
>>> 1: https://github.com/apache/beam/pull/12603
>>> 2: http://doc/1kpn0RxqZaoacUPVSMYhhnfmlo8fGT-p50fEblaFr2HE
>>> 3: https://beam.apache.org/blog/splittable-do-fn/
>>> 4: https://github.com/apache/beam/pull/13006
>>>
>>>
>>> On Fri, Aug 28, 2020 at 1:45 AM Maximilian Michels 
>>> wrote:
>>>
 Thanks Luke! I've had a pass.

 -Max

 On 28.08.20 01:22, Luke Cwik wrote:
 > As an update.
 >
 > Direct and Twister2 are done.
 > Samza: is ready for review[1].
 > Flink: is almost ready for review. [2] lays all the groundwork for the
 > migration and [3] finishes the migration (there is a timeout happening
 > in FlinkSubmissionTest that I'm trying to figure out).
 > No further updates on Spark[4] or Jet[5].
 >
 > @Maximilian Michels or @t...@apache.org, can either of you take a look
 > at the Flink PRs?
 > @ke.wu...@icloud.com, since Xinyu delegated to you, can you take
 > another look at the Samza PR?
 >
 > 1: https://github.com/apache/beam/pull/12617
 > 2: https://github.com/apache/beam/pull/12706
 > 3: https://github.com/apache/beam/pull/12708
 > 4: https://github.com/apache/beam/pull/12603
 > 5: https://github.com/apache/beam/pull/12616
 >
 > On Tue, Aug 18, 2020 at 

Updating elasticsearch version to 7.9.2 - problem with HadoopFormatIOElasticTest that uses ES emulator

2020-10-12 Thread Piotr Szuberski
I'm trying to update the Elasticsearch version to 7.9.2, but I've encountered a 
problem with HadoopFormatIOElasticTest, which uses the ES in-memory emulator that 
is no longer supported:
https://stackoverflow.com/questions/51316813/elastic-node-on-local-in-6-2

It's recommended to use testcontainers, as proposed here: 
https://github.com/allegro/embedded-elasticsearch, but that would turn the 
in-memory test into an integration test (which has to be done anyway).

There is also the Elasticsearch test framework with ESSingleNodeTestCase, but it 
causes a jar hell problem between the compile and testCompile dependencies of 
sdks:java:core, and I don't think it's easily solvable. I tried to write my own 
JarHell class in the org.elasticsearch.bootstrap package, but without success.

Is running the precommit test with testcontainers acceptable? It's the easiest 
fix.
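
For reference, a minimal sketch of what a testcontainers-based setup might look
like, assuming the org.testcontainers:elasticsearch module and the official
7.9.2 image (the image name and client wiring are assumptions, not the existing
test code):

    import org.apache.http.HttpHost;
    import org.elasticsearch.client.RestClient;
    import org.elasticsearch.client.RestHighLevelClient;
    import org.testcontainers.elasticsearch.ElasticsearchContainer;

    public class ElasticsearchContainerSketch {
      public static void main(String[] args) throws Exception {
        // Start a real single-node Elasticsearch in Docker instead of the
        // unsupported in-memory emulator.
        try (ElasticsearchContainer elasticsearch =
            new ElasticsearchContainer(
                "docker.elastic.co/elasticsearch/elasticsearch:7.9.2")) {
          elasticsearch.start();
          // Point the high-level REST client at the container.
          try (RestHighLevelClient client =
              new RestHighLevelClient(
                  RestClient.builder(
                      HttpHost.create(elasticsearch.getHttpHostAddress())))) {
            // ... seed test documents and run the HadoopFormatIO assertions ...
          }
        }
      }
    }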

About the integration test:
I'd like to enable the IT in the Java PostCommit, but it makes some assumptions 
about data already written to Elasticsearch, and I can't find anywhere what that 
data should be (probably something like Item_Price0, Item_Price1, etc., but I'm 
not sure).


Update elasticsearch version - problem in hadoop-format

2020-10-12 Thread Piotr Szuberski
I'm trying to update the Elasticsearch dependencies and I'm blocked on the 
hadoop-format HadoopFormatIOElasticTest.

The tests use the Elasticsearch emulator, which is no longer supported, as stated 
here: 
https://discuss.elastic.co/t/in-memory-testing-with-resthighlevelclient/106196/6

There is a test framework with ESSingleNodeTestCase, but it has a jar hell check 
which crashes when compiling with sdks:java:core on both the compile and 
testCompile configurations, and I don't see an easy solution here. I tried to 
write my own JarHell class in the org.elasticsearch.bootstrap package, but 
without success - the test gets into an infinite loop.

The easiest solution would be to use testcontainers, but then the test becomes 
an integration test - would it be acceptable to run such a test in the Java 
Precommit?


I also encountered a problem fixing the HadoopFormatIOElasticIT test - it assumes 
that there is an Elasticsearch instance already running with some data, but it is 
not written anywhere what that data should look like. I suppose it's something 
like Txn_ID0, Txn_ID1, but I didn't manage to get the testHifIOWithElastic test 
to pass.


Beam Dependency Check Report (2020-10-12)

2020-10-12 Thread Apache Jenkins Server

High Priority Dependency Updates Of Beam Python SDK:


  Dependency Name | Current Version | Latest Version | Current Version Release Date | Latest Release Date | JIRA Issue
  chromedriver-binary | 83.0.4103.39.0 | 86.0.4240.22.0 | 2020-07-08 | 2020-09-07 | BEAM-10426
  google-cloud-bigquery | 1.28.0 | 2.1.0 | 2020-10-05 | 2020-10-12 | BEAM-5537
  google-cloud-dlp | 1.0.0 | 2.0.0 | 2020-06-29 | 2020-10-05 | BEAM-10344
  google-cloud-pubsub | 1.7.0 | 2.1.0 | 2020-07-20 | 2020-10-05 | BEAM-5539
  google-cloud-vision | 1.0.0 | 2.0.0 | 2020-03-24 | 2020-10-05 | BEAM-9581
  mock | 2.0.0 | 4.0.2 | 2019-05-20 | 2020-10-05 | BEAM-7369
  mypy-protobuf | 1.18 | 1.23 | 2020-03-24 | 2020-06-29 | BEAM-10346
  nbconvert | 5.6.1 | 6.0.7 | 2020-10-05 | 2020-10-05 | BEAM-11007
  pyarrow | 0.17.1 | 1.0.1 | 2020-07-27 | 2020-08-24 | BEAM-10582
  PyHamcrest | 1.10.1 | 2.0.2 | 2020-01-20 | 2020-07-08 | BEAM-9155
  pytest | 4.6.11 | 6.1.1 | 2020-07-08 | 2020-10-05 | BEAM-8606
  pytest-xdist | 1.34.0 | 2.1.0 | 2020-08-17 | 2020-08-28 | BEAM-10713
  tenacity | 5.1.5 | 6.2.0 | 2019-11-11 | 2020-06-29 | BEAM-8607
High Priority Dependency Updates Of Beam Java SDK:


  Dependency Name | Current Version | Latest Version | Current Version Release Date | Latest Release Date | JIRA Issue
  com.amazonaws:amazon-kinesis-producer | 0.13.1 | 0.14.1 | 2019-07-31 | 2020-07-31 | BEAM-10628
  com.azure:azure-storage-blob | 12.1.0 | 12.9.0-beta.2 | 2019-12-05 | 2020-10-08 | BEAM-10800
  com.datastax.cassandra:cassandra-driver-core | 3.8.0 | 4.0.0 | 2019-10-29 | 2019-03-18 | BEAM-8674
  com.esotericsoftware:kryo | 4.0.2 | 5.0.0-RC9 | 2018-03-20 | 2020-08-14 | BEAM-5809
  com.esotericsoftware.kryo:kryo | 2.21 | 2.24.0 | 2013-02-27 | 2014-05-04 | BEAM-5574
  com.github.ben-manes.versions:com.github.ben-manes.versions.gradle.plugin | 0.29.0 | 0.33.0 | 2020-07-20 | 2020-09-14 | BEAM-6645
  com.google.api.grpc:grpc-google-cloud-pubsub-v1 | 1.85.1 | 1.90.3 | 2020-03-09 | 2020-10-05 | BEAM-8677
  com.google.api.grpc:grpc-google-cloud-pubsublite-v1 | 0.1.6 | 0.4.1 | 2020-05-28 | 2020-09-28 | BEAM-11008
  com.google.api.grpc:grpc-google-common-protos | 1.12.0 | 1.18.1 | 2018-06-29 | 2020-08-11 | BEAM-8633
  com.google.api.grpc:proto-google-cloud-bigquerystorage-v1beta1 | 0.85.1 | 0.105.5 | 2020-01-08 | 2020-10-09 | BEAM-8678
  com.google.api.grpc:proto-google-cloud-bigtable-v2 | 1.9.1 | 1.16.1 | 2020-01-10 | 2020-10-08 | BEAM-8679
  com.google.api.grpc:proto-google-cloud-datastore-v1 | 0.85.0 | 0.88.0 | 2019-12-05 | 2020-09-17 | BEAM-8680
  com.google.api.grpc:proto-google-cloud-pubsub-v1 | 1.85.1 | 1.90.3 | 2020-03-09 | 2020-10-05 | BEAM-8681
  com.google.api.grpc:proto-google-cloud-pubsublite-v1 | 0.1.6 | 0.4.1 | 2020-05-28 | 2020-09-28 | BEAM-11009
  com.google.api.grpc:proto-google-cloud-spanner-admin-database-v1 | 1.59.0 | 2.0.2 | 2020-07-16 | 2020-10-02 | BEAM-8682
  com.google.apis:google-api-services-bigquery | v2-rev20200719-1.30.10 | v2-rev20200925-1.30.10 | 2020-07-26 | 2020-10-01 | BEAM-8684
  com.google.apis:google-api-services-clouddebugger | v2-rev20200501-1.30.10 | v2-rev20200807-1.30.10 | 2020-07-14 | 2020-08-17 | BEAM-8750
  com.google.apis:google-api-services-cloudresourcemanager | v1-rev20200720-1.30.10 | v2-rev20200831-1.30.10 | 2020-07-25 | 2020-09-03 | BEAM-8751
  com.google.apis:google-api-services-dataflow | v1b3-rev20200713-1.30.10 | v1beta3-rev12-1.20.0 | 2020-07-25 | 2015-04-29 | BEAM-8752
  com.google.apis:google-api-services-healthcare | v1beta1-rev20200713-1.30.10 | v1-rev20200917-1.30.10 | 2020-07-24 | 2020-09-23 | BEAM-10349
  com.google.apis:google-api-services-pubsub | v1-rev20200713-1.30.10 | v1-rev20200909-1.30.10 | 2020-07-25 | 2020-09-18 | BEAM-8753
  com.google.apis:google-api-services-storage | v1-rev20200611-1.30.10 | v1-rev20200927-1.30.10 | 2020-07-10 | 2020-10-03 | BEAM-8754
  com.google.auto.service:auto-service | 1.0-rc6 | 1.0-rc7 | 2019-07-16 | |

Re: Requesting contributor permissions for jira tickets

2020-10-12 Thread Alexey Romanenko
Hi Dominik,

I added you to the Contributors list; you can now assign Jira issues to yourself.

Also, please take a look at the Beam Contribution Guide:
https://beam.apache.org/contribute/

Welcome to Beam!

Regards,
Alexey

> On 9 Oct 2020, at 19:29, Dominik Schöneweiß wrote:
> 
> Hi everyone,
> 
> my name is Dominik and I'm working on different Beam projects at my day job.
> I would like to contribute to the code base and wanted to request Jira 
> permissions.
> 
> username: nomnom
> 
> Thanks!