Re: Question about E2E tests for pipelines

2020-12-01 Thread Brian Hulette
KafkaIOIT can also use testcontainers to create a fake Kafka service [1].
In theory we could do something similar to test PubSubIO by using the
provided emulator [2], but currently we only test against the production
pubsub service. There's a useful TestPubsub [3] rule that can be used to
create a test PubSub topic that you can read from or write to and make
assertions against. Jenkins has permissions to use the apache-beam-testing
project to create these topics on GCP.

[1]
https://github.com/apache/beam/blob/master/sdks/java/io/kafka/src/test/java/org/apache/beam/sdk/io/kafka/KafkaIOIT.java#L336
[2] https://cloud.google.com/pubsub/docs/emulator
[3]
https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/TestPubsub.java
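
For illustration, here is a rough, untested sketch of how the two could be
combined in a single test. The container image tag, topic name, and the
String-to-PubsubMessage conversion step are made up for the example; see
[1] and [3] for the real usage:

import java.nio.charset.StandardCharsets;
import java.util.Collections;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
import org.apache.beam.sdk.io.gcp.pubsub.TestPubsub;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.testing.TestPipeline;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.SimpleFunction;
import org.apache.beam.sdk.transforms.Values;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.junit.BeforeClass;
import org.junit.Rule;
import org.junit.Test;
import org.testcontainers.containers.KafkaContainer;
import org.testcontainers.utility.DockerImageName;

public class KafkaToPubsubE2ETest {
  @Rule public transient TestPipeline pipeline = TestPipeline.create();
  @Rule public transient TestPubsub testPubsub = TestPubsub.create();

  private static KafkaContainer kafka;

  @BeforeClass
  public static void startKafka() {
    // Throwaway broker in a container, similar to what KafkaIOIT does [1].
    kafka = new KafkaContainer(DockerImageName.parse("confluentinc/cp-kafka:5.5.2"));
    kafka.start();
  }

  @Test
  public void testKafkaToPubsub() {
    pipeline
        .apply(KafkaIO.<String, String>read()
            .withBootstrapServers(kafka.getBootstrapServers())
            .withTopic("input-topic")
            .withKeyDeserializer(StringDeserializer.class)
            .withValueDeserializer(StringDeserializer.class)
            .withMaxNumRecords(1) // bound the read so the test terminates
            .withoutMetadata())
        .apply(Values.create())
        .apply(MapElements.via(new SimpleFunction<String, PubsubMessage>() {
          @Override
          public PubsubMessage apply(String value) {
            return new PubsubMessage(
                value.getBytes(StandardCharsets.UTF_8), Collections.emptyMap());
          }
        }))
        // TestPubsub has already created the topic; write to it and make
        // assertions against it after the pipeline finishes.
        .apply(PubsubIO.writeMessages().to(testPubsub.topicPath().getPath()));
    pipeline.run().waitUntilFinish();
  }
}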

On Thu, Nov 26, 2020 at 9:37 AM Artur Khanin 
wrote:

> Thank you for the information and links, Alexey! We will try to follow
> this approach.
>
> On 25 Nov 2020, at 21:27, Alexey Romanenko 
> wrote:
>
> For Kafka testing, there is a Kafka IT [1] that runs on Jenkins [2]. It
> leverages a real Kafka cluster that runs on k8s, so you can probably
> follow a similar approach.
>
> At the same time, we fake the Kafka consumer and its output for KafkaIO
> unit tests.
>
> [1]
> https://github.com/apache/beam/blob/master/sdks/java/io/kafka/src/test/java/org/apache/beam/sdk/io/kafka/KafkaIOIT.java
> [2]
> https://github.com/apache/beam/blob/master/.test-infra/jenkins/job_PerformanceTests_KafkaIO_IT.groovy
>
>
> On 25 Nov 2020, at 13:05, Artur Khanin  wrote:
>
> Hi Devs,
>
> We are finalizing this PR  with
> a pipeline that reads from Kafka and writes to Pub/Sub without any
> transformations in between. We would like to implement e2e tests where we
> create and execute a pipeline, but we haven't found much information or
> relevant examples about it. How exactly should we implement such tests?
> Can we somehow mock Kafka and Pub/Sub, or can we set them up using some
> test environment?
>
> Any information and hints will be greatly appreciated!
>
> Thanks,
> Artur Khanin
> Akvelon, Inc
>
>
>
>


Re: Any interest in sharding targets?

2020-12-01 Thread Daniel Collins
> High-level: ensure you have gradle cache enabled so only the first build
is slow. If you encounter nondeterministic or noncached targets upstream of
the module you are editing, that's worth discussing and probably fixing.

I do have caching enabled (I do not rebuild non-gcp targets every time).

> Can you share the exact gradle command?

Not sure of the exact gradle syntax since I'm running from IntelliJ, but I
think it was "gradle build --scan :sdks:java:io:google-cloud-platform" for
the most recent run; I will attach a scan soon.

> from clean (8m)

Sounds like you have a really strong desktop :)

> But most things aren't rebuilt anyhow.

I do wonder how much core being a monolithic target affects presubmit
times. Wouldn't these all have to be rebuilt on every new build, or is
there caching there as well?

But you're right that, primarily, core build times just add latency between
deciding to work on beam and starting work on beam. It's the
google-cloud-platform target whose size impedes my workflow the most.

> That's going to be a separate issue from wanting to build a single part
of the GCP IO package without building the rest of the package

It sounds like you'd be open to splitting up this target? Or am I reading
the rest of your comment incorrectly?


On Tue, Dec 1, 2020 at 10:28 AM Kenneth Knowles  wrote:

> High-level: ensure you have gradle cache enabled so only the first build
> is slow. If you encounter nondeterministic or noncached targets upstream of
> the module you are editing, that's worth discussing and probably fixing.
>
> That's going to be a separate issue from wanting to build a single part of
> the GCP IO package without building the rest of the package. Details and
> questions below.
>
> On Mon, Nov 30, 2020 at 4:36 PM Daniel Collins 
> wrote:
>
>> Hello all,
>>
>> Any time I have the misfortune of creating a new beam branch, building a
>> subtarget (sdks/io/google-cloud-platform/.../pubsublite in my case) takes
>> O(30 mins) on my laptop.
>>
>
> Can you share the exact gradle command?
>
>
>> A lot of the steps seem to block on each other and even the leaf rebuild
>> can take minutes since all the GCP I/O transforms are in one target. A
>> couple of questions for the (hopefully?) gradle experts here:
>>
>> 1) Do you think that sharding these targets would increase parallelism in
>> the underlying build?
>>
>
> I'd start with --scan so you can see some details and share it with others
> easily. I'm not sure if --profile gives even finer-grained telemetry.
>
> To demonstrate, here are two scans of `./gradlew
> :sdks:java:io:google-cloud-platform:compileTestJava`:
>
>  - from clean (8m): https://scans.gradle.com/s/j5jtqywn3uw4o/timeline
>  - after modifying a file in the module (1m):
> https://gradle.com/s/g74hsjddl6x5g/timeline
>
> These are certainly slow, and there are decidedly nonideal bits in the dep
> graph (most of the execution-oriented bits should not be needed to just
> *compile* the tests). But most things aren't rebuilt anyhow.
>
>> 2) Do you think doing so would have any knock-on negative effects, either
>> for compilation time or development speed?
>>
>
> The answer is always "avoid rebuilding" so smaller seems better. I'm not
> totally clear how much is to be gained in this case.
>
> The other answer is -PskipCheckerFramework which will net you a 4x speedup
> in Java compile time, at the cost of you probably having to rewrite your
> code once you un-disable it and discover you've got a bunch of lurking NPEs.
>
> Kenn
>
>
>> 3) Do you think this would be an hours, days, or weeks time investment
>> to do?
>>
>
>
>>
>> The above implicitly comes with "willing to help out O(hours/days), but
>> no gradle knowledge so I would need some guidance".
>>
>> -Dan
>>
>


Tests for compatibility with Avro 1.8 and 1.9

2020-12-01 Thread Piotr Szuberski
I'd like to add tests verifying that Beam is compatible with both Avro 1.8
and 1.9, similar to what has been done for Hadoop and Kafka.

Probably all Avro dependencies would have to be changed from compile to
provided - won't that be problematic for users? Their builds will break
after the update unless they add an explicit Avro dependency. On the other
hand, they'll be able to choose which version they prefer.
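
For example, a user's build would then need to declare Avro explicitly next
to Beam; a minimal Gradle sketch (artifact versions are illustrative):

dependencies {
  implementation "org.apache.beam:beam-sdks-java-core:2.25.0"
  // The user picks the Avro version, e.g. 1.8.2 or 1.9.2:
  implementation "org.apache.avro:avro:1.9.2"
}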

At the moment Beam doesn't work with Avro 1.10, so users will be restricted
to either 1.8 or 1.9.

Does changing the Avro dependencies to provided sound reasonable? Are there
particular modules that should not be changed? Or is there a better
approach?


Re: ElasticsearchIO delete document

2020-12-01 Thread Arif Alili
Hi Jithin,

Updating Beam to 2.25.0 and using "withIsDeleteFn" works as expected.
Thanks a lot for the detailed answer!

Best,
Arif

On Tue, Dec 1, 2020 at 11:30 AM JITHIN SUKUMAR 
wrote:

> Hi Arif Alili,
>
> Support for deleting documents using ElasticsearchIO [1] has been included
> in Apache Beam since version 2.25.0 [2]. You can check out the javadoc [3]
> or some example implementations [4].
>
> References:
> [1]: https://issues.apache.org/jira/browse/BEAM-5757
> [2]: https://beam.apache.org/blog/beam-2.25.0/
> [3]:
> https://beam.apache.org/releases/javadoc/2.25.0/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIO.Write.html#withIsDeleteFn-org.apache.beam.sdk.io.elasticsearch.ElasticsearchIO.Write.BooleanFieldValueExtractFn-
> [4]:
> https://github.com/apache/beam/blob/release-2.25.0/sdks/java/io/elasticsearch-tests/elasticsearch-tests-common/src/test/java/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIOTestCommon.java#L667-L749
>
> Hope that helps!
> Regards,
> Jithin
>
> On Tue, Dec 1, 2020 at 6:18 PM Arif Alili  wrote:
>
>> Hi all,
>>
>> I am writing to Elasticsearch using the Beam (Google Dataflow) class
>> ElasticsearchIO. Creating indexes and writing documents goes well;
>> however, I am struggling to find a way to delete Elasticsearch
>> documents. Looking at this documentation,
>> https://beam.apache.org/releases/javadoc/2.5.0/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIO.html,
>> I see there are only read() and write() methods.
>>
>> Does anyone have suggestions on how to delete Elasticsearch documents
>> using Beam's ElasticsearchIO?
>>
>> Best,
>> Arif
>>
>


Re: Any interest in sharding targets?

2020-12-01 Thread Kenneth Knowles
High-level: ensure you have gradle cache enabled so only the first build is
slow. If you encounter nondeterministic or noncached targets upstream of
the module you are editing, that's worth discussing and probably fixing.

That's going to be a separate issue from wanting to build a single part of
the GCP IO package without building the rest of the package. Details and
questions below.

On Mon, Nov 30, 2020 at 4:36 PM Daniel Collins  wrote:

> Hello all,
>
> Any time I have the misfortune of creating a new beam branch, building a
> subtarget (sdks/io/google-cloud-platform/.../pubsublite in my case) takes
> O(30 mins) on my laptop.
>

Can you share the exact gradle command?


> A lot of the steps seem to block on each other and even the leaf rebuild
> can take minutes since all the GCP I/O transforms are in one target. A
> couple of questions for the (hopefully?) gradle experts here:
>
> 1) Do you think that sharding these targets would increase parallelism in
> the underlying build?
>

I'd start with --scan so you can see some details and share it with others
easily. I'm not sure if --profile gives even finer-grained telemetry.

To demonstrate, here are two scans of `./gradlew
:sdks:java:io:google-cloud-platform:compileTestJava`:

 - from clean (8m): https://scans.gradle.com/s/j5jtqywn3uw4o/timeline
 - after modifying a file in the module (1m):
https://gradle.com/s/g74hsjddl6x5g/timeline

These are certainly slow, and there are decidedly nonideal bits in the dep
graph (most of the execution-oriented bits should not be needed to just
*compile* the tests). But most things aren't rebuilt anyhow.

> 2) Do you think doing so would have any knock-on negative effects, either
> for compilation time or development speed?
>

The answer is always "avoid rebuilding" so smaller seems better. I'm not
totally clear how much is to be gained in this case.

The other answer is -PskipCheckerFramework which will net you a 4x speedup
in Java compile time, at the cost of you probably having to rewrite your
code once you un-disable it and discover you've got a bunch of lurking NPEs.
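
For example, combined with the same compileTestJava target as in the scans
above:

./gradlew -PskipCheckerFramework :sdks:java:io:google-cloud-platform:compileTestJava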

Kenn


> 3) Do you think this would be an hours, days, or weeks time investment
> to do?
>


>
> The above implicitly comes with "willing to help out O(hours/days), but no
> gradle knowledge so I would need some guidance".
>
> -Dan
>


Re: ElasticsearchIO delete document

2020-12-01 Thread JITHIN SUKUMAR
Hi Arif Alili,

Support for deleting documents using ElasticsearchIO [1] has been included
in Apache Beam since version 2.25.0 [2]. You can check out the javadoc [3]
or some example implementations [4].

References:
[1]: https://issues.apache.org/jira/browse/BEAM-5757
[2]: https://beam.apache.org/blog/beam-2.25.0/
[3]:
https://beam.apache.org/releases/javadoc/2.25.0/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIO.Write.html#withIsDeleteFn-org.apache.beam.sdk.io.elasticsearch.ElasticsearchIO.Write.BooleanFieldValueExtractFn-
[4]:
https://github.com/apache/beam/blob/release-2.25.0/sdks/java/io/elasticsearch-tests/elasticsearch-tests-common/src/test/java/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIOTestCommon.java#L667-L749
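
As a rough sketch (the connection settings and the "isDelete" field name
are illustrative, so check the javadoc [3] and the linked tests [4] for
the actual signatures):

// Input: a PCollection<String> of JSON documents, here called jsonDocs.
// Documents whose "isDelete" field is true are deleted from the index;
// the rest are indexed as usual.
jsonDocs.apply(ElasticsearchIO.write()
    .withConnectionConfiguration(
        ElasticsearchIO.ConnectionConfiguration.create(
            new String[] {"http://localhost:9200"}, "my-index", "_doc"))
    .withIdFn(doc -> doc.get("id").asText())
    .withIsDeleteFn(doc -> doc.get("isDelete").asBoolean()));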

Hope that helps!
Regards,
Jithin

On Tue, Dec 1, 2020 at 6:18 PM Arif Alili  wrote:

> Hi all,
>
> I am writing to Elasticsearch using the Beam (Google Dataflow) class
> ElasticsearchIO. Creating indexes and writing documents goes well;
> however, I am struggling to find a way to delete Elasticsearch
> documents. Looking at this documentation,
> https://beam.apache.org/releases/javadoc/2.5.0/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIO.html,
> I see there are only read() and write() methods.
>
> Does anyone have suggestions on how to delete Elasticsearch documents
> using Beam's ElasticsearchIO?
>
> Best,
> Arif
>


ElasticsearchIO delete document

2020-12-01 Thread Arif Alili
Hi all,

I am writing to Elasticsearch using the Beam (Google Dataflow) class
ElasticsearchIO. Creating indexes and writing documents goes well;
however, I am struggling to find a way to delete Elasticsearch documents.
Looking at this documentation,
https://beam.apache.org/releases/javadoc/2.5.0/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIO.html,
I see there are only read() and write() methods.

Does anyone have suggestions on how to delete Elasticsearch documents
using Beam's ElasticsearchIO?

Best,
Arif