Re: Caching issue in BigQueryIO

2020-12-02 Thread Chamikara Jayalath
The state of Dataflow pipelines is not maintained across different runs of a
pipeline. I think here, too, you can add a custom ParDo that stores such
state in an external storage system and retrieves that state when starting
up a fresh pipeline.
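
A minimal sketch of that pattern, with a plain set standing in for the
external store (Redis/Memcached) and the BigQuery call reduced to a
callback. All names here are hypothetical illustration, not Beam API:

```python
# Illustrative sketch only: warm a pipeline-local table cache from an
# external store so a fresh deployment doesn't re-check every table.
# `external_cache` stands in for a Redis/Memcached client; a real DoFn
# would do this in setup() and call the BigQuery API in `create_fn`.

class CachedTableCreator:
    """Consults an external cache before attempting table creation."""

    def __init__(self, external_cache):
        self._external = external_cache  # survives across pipeline runs
        self._local = set()              # per-worker, per-run cache

    def ensure_table(self, table_id, create_fn):
        """Create the table only if neither cache has seen it; return True if created."""
        if table_id in self._local:
            return False
        if table_id in self._external:
            # Table was created by a previous run: warm the local cache
            # without touching the BigQuery API.
            self._local.add(table_id)
            return False
        create_fn(table_id)  # would call the BigQuery API in a real pipeline
        self._external.add(table_id)
        self._local.add(table_id)
        return True
```

A second `CachedTableCreator` sharing the same external cache (a fresh
deployment) then skips creation for tables the first run already handled.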

Thanks,
Cham

On Wed, Dec 2, 2020 at 3:20 PM Vasu Gupta  wrote:

> Hey folks,
>
> While using BigQueryIO to insert into 10k tables, I found that it has an
> issue in its local caching technique for table creation. Tables are first
> searched for in BigQueryIO's local cache, and then BigQueryIO checks whether
> to create the table or not. The main issue arises when inserting into
> thousands of tables: suppose we have 10k tables to insert into in real
> time, and since we deploy a fresh Dataflow pipeline once a week, the local
> cache will be empty and it will take a long time just to rebuild that cache
> for 10k tables, even though those 10k tables were already created in BigQuery.
>
> The solution I would propose is to provide an option for using external
> caching services like Redis/Memcached, so that we don't have to rebuild the
> cache again and again after a fresh deployment of the pipeline.
>


Re: BigQueryIO create dataset

2020-12-02 Thread Chamikara Jayalath
The functionality does not come from BigQueryIO itself; it just exposes the
existing BigQuery CreateDisposition feature [1]. I don't think BigQuery
offers a way to automatically create datasets when writing.
Is it possible to create such datasets from a ParDo in your pipeline that
precedes the BigQueryIO write transform?

Thanks,
Cham

[1]
https://cloud.google.com/bigquery/docs/reference/auditlogs/rest/Shared.Types/CreateDisposition
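
A sketch of that suggestion follows. The client here is a hypothetical
in-memory stand-in (`FakeBigQueryClient`); a real pipeline would use the
google-cloud-bigquery client inside a DoFn placed before the BigQueryIO
write:

```python
# Illustrative sketch only: ensure each element's dataset exists in a step
# that precedes the BigQueryIO write. `dataset_exists`/`create_dataset` are
# assumed method names, not a real BigQuery client API.

def ensure_datasets(client, elements, dataset_for):
    """Yield elements unchanged, creating each element's dataset on first sight."""
    seen = set()
    for element in elements:
        dataset = dataset_for(element)
        if dataset not in seen:
            if not client.dataset_exists(dataset):
                client.create_dataset(dataset)
            seen.add(dataset)  # avoid repeated existence checks
        yield element


class FakeBigQueryClient:
    """In-memory stand-in for a real BigQuery client (illustration only)."""

    def __init__(self):
        self.datasets = set()

    def dataset_exists(self, dataset_id):
        return dataset_id in self.datasets

    def create_dataset(self, dataset_id):
        self.datasets.add(dataset_id)
```

In a real pipeline this logic would live in a DoFn's process method, with
the client created in setup().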

On Wed, Dec 2, 2020 at 3:20 PM Vasu Gupta  wrote:

> Hey folks, why isn't there any capability to create datasets automatically,
> just like tables, in BigQueryIO? At our company we have a dynamic dataset
> architecture, which means that as packets arrive, we need to create new
> datasets and tables on the go. Since BigQueryIO already has the
> functionality to create tables automatically, we were wondering whether
> similar functionality for datasets could be implemented in BigQueryIO.
>


Re: Unit tests vs. Integration Tests

2020-12-02 Thread Brian Hulette
I guess another question I should ask - is :test supposed to only run unit
tests? I've been assuming so since many modules have separate
:integrationTest tasks for *IT tests.

On Wed, Dec 2, 2020 at 4:15 PM Kyle Weaver  wrote:

> > DicomIOTest, FhirIOTest,
> HL7v2IOTest, org.apache.beam.sdk.extensions.ml.*IT
>
> Looking only at the former three tests, I don't see any reason they can't
> mock their API clients, especially since all they expect the server to do
> is send back an error.
>

Fair point, that wouldn't be *too* much trouble. More than just
re-classifying them as integration tests though :)


>
> > This seems like something that is easy to get wrong without some
> automation to help. Could we run the :test targets on Jenkins using the
> sandbox command or docker to block network access?
>
> That's a great idea. Are we planning on integrating the "standardized
> developer build environment" mentioned in the original post into our CI
> somehow?
>

I was thinking it could be good to use it in CI somehow to make sure it
doesn't get out of date, but all I had in mind was running some minimal set
of tasks. Using it in this way would obviously be even better.


>
> On Wed, Dec 2, 2020 at 4:03 PM Andrew Pilloud  wrote:
>
>> We have a large number of tests that run pipelines on the Direct Runner
>> or various local runners, but don't require network access, so I don't
>> think the distinction is clear. I do agree that requiring a remote service
>> falls on the integration test side.
>>
>> This seems like something that is easy to get wrong without some
>> automation to help. Could we run the :test targets on Jenkins using the
>> sandbox command or docker to block network access?
>>
>> On Wed, Dec 2, 2020 at 3:38 PM Brian Hulette  wrote:
>>
>>> Sorry I should've included the list of tests here. So far we've run into:
>>> DicomIOTest, FhirIOTest,
>>> HL7v2IOTest, org.apache.beam.sdk.extensions.ml.*IT
>>>
>>> Note the latter are called IT, but that package's build.gradle has a
>>> line to scoop ITs into the :test task (addressing in [1]).
>>>
>>> All of these tests are actually running pipelines so I think they'd be
>>> difficult to mock.
>>>
>>> [1] https://github.com/apache/beam/pull/13444
>>>
>>> On Wed, Dec 2, 2020 at 3:28 PM Kyle Weaver  wrote:
>>>
 > Should we (do we) require unit tests to be hermetic?

 We should. Unit tests are hermetic by definition. That begs the
 definition of hermetic, but clearly the internet is not.

 > Personally I think these tests should be classified as integration
 tests (renamed to *IT, and run with the :integrationTest task)

 I'm not sure which tests you're talking about, but it may be better to
 make them hermetic through mocking, depending on the intent of the test.

 On Wed, Dec 2, 2020 at 1:22 PM Brian Hulette 
 wrote:

> I've been working with Niels Basjes on a standardized developer build
> environment that can run `./gradlew check` [1]. We've run into issues
> because several Java unit tests (class *Test, run with :test) are not
> hermetic. They fail unless the environment they're running in has access to
> the internet, and is authenticated to GCP with access to certain resources.
> Of course the former isn't typically a blocker, but the latter certainly
> can be.
>
> Personally I think these tests should be classified as integration
> tests (renamed to *IT, and run with the :integrationTest task), but I
> realized I don't know if we have a formal definition for what should be a
> unit test vs an integration test. Should we (do we) require unit tests to
> be hermetic?
>
> Brian
>
> [1] https://github.com/apache/beam/pull/13308
>



Re: Unit tests vs. Integration Tests

2020-12-02 Thread Andrew Pilloud
We have a large number of tests that run pipelines on the Direct Runner or
various local runners, but don't require network access, so I don't think
the distinction is clear. I do agree that requiring a remote service falls
on the integration test side.

This seems like something that is easy to get wrong without some automation
to help. Could we run the :test targets on Jenkins using the sandbox
command or docker to block network access?
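
One lightweight way to approximate such a guard inside Python unit tests,
without docker or a sandbox, is to stub out socket creation for the duration
of a test. This is an illustrative assumption about approach, not existing
Beam tooling:

```python
# Illustrative sketch only: fail fast if a supposedly hermetic test tries to
# open a network connection, by replacing socket.socket for the duration of
# the block.

import socket
from contextlib import contextmanager


@contextmanager
def no_network():
    """Raise inside the block whenever code tries to create a socket."""
    real_socket = socket.socket

    def guard(*args, **kwargs):
        raise RuntimeError("network access attempted in a hermetic test")

    socket.socket = guard
    try:
        yield
    finally:
        # Always restore the real implementation, even if the test fails.
        socket.socket = real_socket
```

A test harness could wrap every `:test`-style test in `no_network()` so that
non-hermetic tests surface immediately rather than only in a locked-down CI
environment.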

On Wed, Dec 2, 2020 at 3:38 PM Brian Hulette  wrote:

> Sorry I should've included the list of tests here. So far we've run into:
> DicomIOTest, FhirIOTest, HL7v2IOTest, org.apache.beam.sdk.extensions.ml.*IT
>
> Note the latter are called IT, but that package's build.gradle has a line
> to scoop ITs into the :test task (addressing in [1]).
>
> All of these tests are actually running pipelines so I think they'd be
> difficult to mock.
>
> [1] https://github.com/apache/beam/pull/13444
>
> On Wed, Dec 2, 2020 at 3:28 PM Kyle Weaver  wrote:
>
>> > Should we (do we) require unit tests to be hermetic?
>>
>> We should. Unit tests are hermetic by definition. That begs the
>> definition of hermetic, but clearly the internet is not.
>>
>> > Personally I think these tests should be classified as integration
>> tests (renamed to *IT, and run with the :integrationTest task)
>>
>> I'm not sure which tests you're talking about, but it may be better to
>> make them hermetic through mocking, depending on the intent of the test.
>>
>> On Wed, Dec 2, 2020 at 1:22 PM Brian Hulette  wrote:
>>
>>> I've been working with Niels Basjes on a standardized developer build
>>> environment that can run `./gradlew check` [1]. We've run into issues
>>> because several Java unit tests (class *Test, run with :test) are not
>>> hermetic. They fail unless the environment they're running in has access to
>>> the internet, and is authenticated to GCP with access to certain resources.
>>> Of course the former isn't typically a blocker, but the latter certainly
>>> can be.
>>>
>>> Personally I think these tests should be classified as integration tests
>>> (renamed to *IT, and run with the :integrationTest task), but I realized I
>>> don't know if we have a formal definition for what should be a unit test vs
>>> an integration test. Should we (do we) require unit tests to be hermetic?
>>>
>>> Brian
>>>
>>> [1] https://github.com/apache/beam/pull/13308
>>>
>>


Re: Create External Transform with WindowFn

2020-12-02 Thread Chamikara Jayalath
Actually, Reshuffle uses a custom non-merging WindowFn (IdentityWindowFn) [1].
Dataflow Runner v2 (which is required for multi-language pipelines on
Dataflow) currently does not support custom windowing functions, I believe.
So getting Reshuffle (and, by extension, connectors such as Snowflake)
working for Dataflow Python as a cross-language transform will require
support for custom WindowFns on Dataflow Runner v2 in addition to
https://issues.apache.org/jira/browse/BEAM-11360, unfortunately. I believe
we are working on the former, but I'm not sure about the exact ETA. The
latter should be done by the end of the quarter.

+Harsh Vardhan  +Robert Bradshaw 

Thanks,
Cham

[1]
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/Reshuffle.java#L79
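
The translation problem discussed in this thread can be sketched as a URN
lookup: an SDK or runner can only reconstruct windowing strategies whose
URNs it knows, and a Java-serialized payload is opaque outside the JVM. The
standard URN strings below are taken from the proto file cited later in this
thread; treat the function itself as an illustration, not Beam code:

```python
# Illustrative sketch only: why beam:window_fn:serialized_java:v1 fails
# cross-language expansion while the standard window fns succeed. URNs are
# from standard_window_fns.proto.

KNOWN_WINDOW_FN_URNS = {
    "beam:window_fn:global_windows:v1",
    "beam:window_fn:fixed_windows:v1",
    "beam:window_fn:sliding_windows:v1",
    "beam:window_fn:session_windows:v1",
}


def can_translate(window_fn_urn):
    """True if a non-Java SDK/runner can interpret this windowing strategy."""
    return window_fn_urn in KNOWN_WINDOW_FN_URNS
```

Under this model, Reshuffle's IdentityWindowFn (carried as a serialized-Java
payload) is exactly the case that falls outside the known set.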

On Mon, Nov 30, 2020 at 10:35 AM Chamikara Jayalath 
wrote:

> Please follow https://issues.apache.org/jira/browse/BEAM-11360 instead.
>
> Thanks,
> Cham
>
> On Mon, Nov 30, 2020 at 10:26 AM Steve Niemitz 
> wrote:
>
>> alright, thank you.  Is BEAM-10507 the jira to watch for any progress on
>> that?
>>
>> On Mon, Nov 30, 2020 at 12:55 PM Boyuan Zhang  wrote:
>>
>>> Hi Steve,
>>>
>>> Unfortunately I don't think there is a workaround before we have the
>>> change that Cham mentions.
>>>
>>> On Mon, Nov 30, 2020 at 8:16 AM Steve Niemitz 
>>> wrote:
>>>
 I'm trying to write an xlang transform that uses Reshuffle internally,
 and ran into this as well.  Is there any workaround to this for now (other
 than removing the reshuffle), or do I just need to wait for what Chamikara
 mentioned?  I noticed the same issue was mentioned in the SnowflakeIO.Read
 PR as well [1].

 https://github.com/apache/beam/pull/12149#discussion_r463710165

 On Wed, Aug 26, 2020 at 10:55 PM Boyuan Zhang 
 wrote:

> That explains a lot. Thanks, Cham!
>
> On Wed, Aug 26, 2020 at 7:44 PM Chamikara Jayalath <
> chamik...@google.com> wrote:
>
>> Due to the proto -> object -> proto conversion we do today, Python
>> needs to parse the full sub-graph from Java. We have hooks for PTransforms
>> and Coders but not for windowing operations. This limitation will go away
>> after we have direct Beam proto to Dataflow proto conversion in place.
>>
>> On Wed, Aug 26, 2020 at 7:03 PM Robert Burke 
>> wrote:
>>
>>> Coders should only be checked over the language boundaries.
>>>
>>> On Wed, Aug 26, 2020, 6:24 PM Boyuan Zhang 
>>> wrote:
>>>
 Thanks Cham!

 I just realized that the beam:window_fn:serialized_java:v1 is
 introduced by the Java Reshuffle.viaRandomKey(). But
 Reshuffle.viaRandomKey() does rewindow into the original window
 strategy (which is GlobalWindows in my case). Is it expected that
 we also check intermediate PCollections rather than only the PCollections
 that cross the language boundary?

 More about my PTransform:
 MyExternalPTransform  -- expands to --
   ParDo() -> Reshuffle.viaRandomKey() -> ParDo() -> WindowInto(FixWindow) -> ParDo() -> output void
                                                                           |
                                                                           -> ParDo() -> output PCollection to Python SDK

 On Tue, Aug 25, 2020 at 6:29 PM Chamikara Jayalath <
 chamik...@google.com> wrote:

> Also it's strange that Java used (beam:window_fn:serialized_java:v1)
> for the URN here instead of "beam:window_fn:fixed_windows:v1" [1]
> which is what is being registered by Python [2]. This seems to be the
> immediate issue. Tracking bug for supporting custom windows is
> https://issues.apache.org/jira/browse/BEAM-10507.
>
> [1]
> https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/standard_window_fns.proto#L55
> [2]
> https://github.com/apache/beam/blob/bd4df94ae10a7e7b0763c1917746d2faf5aeed6c/sdks/python/apache_beam/transforms/window.py#L449
>
> On Tue, Aug 25, 2020 at 6:07 PM Chamikara Jayalath <
> chamik...@google.com> wrote:
>
>> Pipelines that use external WindowingStrategies might be failing
>> during the proto -> object -> proto conversion we do today. This
>> limitation will go away once Dataflow directly starts reading Beam
>> protos. We are working on this now.
>>
>> Thanks,
>> Cham
>>
>> On Tue, Aug 25, 2020 at 5:38 PM Boyuan Zhang 
>> wrote:
>>
>>> Thanks, Robert! I want to add more details on my External
>>> PTransform:
>>>
>>> MyExternalPTransform  -- expand to --  ParDo() ->
>>> WindowInto(FixWindow) -> ParDo() -> output void
>>>

Re: Unit tests vs. Integration Tests

2020-12-02 Thread Brian Hulette
Sorry I should've included the list of tests here. So far we've run into:
DicomIOTest, FhirIOTest, HL7v2IOTest, org.apache.beam.sdk.extensions.ml.*IT

Note the latter are called IT, but that package's build.gradle has a line
to scoop ITs into the :test task (being addressed in [1]).

All of these tests are actually running pipelines so I think they'd be
difficult to mock.

[1] https://github.com/apache/beam/pull/13444

On Wed, Dec 2, 2020 at 3:28 PM Kyle Weaver  wrote:

> > Should we (do we) require unit tests to be hermetic?
>
> We should. Unit tests are hermetic by definition. That begs the definition
> of hermetic, but clearly the internet is not.
>
> > Personally I think these tests should be classified as integration tests
> (renamed to *IT, and run with the :integrationTest task)
>
> I'm not sure which tests you're talking about, but it may be better to
> make them hermetic through mocking, depending on the intent of the test.
>
> On Wed, Dec 2, 2020 at 1:22 PM Brian Hulette  wrote:
>
>> I've been working with Niels Basjes on a standardized developer build
>> environment that can run `./gradlew check` [1]. We've run into issues
>> because several Java unit tests (class *Test, run with :test) are not
>> hermetic. They fail unless the environment they're running in has access to
>> the internet, and is authenticated to GCP with access to certain resources.
>> Of course the former isn't typically a blocker, but the latter certainly
>> can be.
>>
>> Personally I think these tests should be classified as integration tests
>> (renamed to *IT, and run with the :integrationTest task), but I realized I
>> don't know if we have a formal definition for what should be a unit test vs
>> an integration test. Should we (do we) require unit tests to be hermetic?
>>
>> Brian
>>
>> [1] https://github.com/apache/beam/pull/13308
>>
>


Re: Unit tests vs. Integration Tests

2020-12-02 Thread Kyle Weaver
> Should we (do we) require unit tests to be hermetic?

We should. Unit tests are hermetic by definition. That begs the question of
how to define hermetic, but clearly anything that depends on the internet is not.

> Personally I think these tests should be classified as integration tests
(renamed to *IT, and run with the :integrationTest task)

I'm not sure which tests you're talking about, but it may be better to make
them hermetic through mocking, depending on the intent of the test.
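
A sketch of the mocking approach follows. `fetch_resource` and its client
are hypothetical stand-ins for code like the healthcare IO API calls
discussed in this thread; the point is that the test asserts on error
handling without any server:

```python
# Illustrative sketch only: making a test hermetic by mocking the API
# client, so the "server error" path is exercised with no network at all.

from unittest import mock


def fetch_resource(client, resource_id):
    """Return the resource body, or an error record the pipeline can emit."""
    try:
        return client.get(resource_id)
    except RuntimeError as e:
        return {"error": str(e), "resource_id": resource_id}


def test_fetch_resource_reports_errors():
    client = mock.Mock()
    # Simulate the server responding with an error -- no real API needed.
    client.get.side_effect = RuntimeError("404 not found")
    result = fetch_resource(client, "study/123")
    assert result == {"error": "404 not found", "resource_id": "study/123"}
    client.get.assert_called_once_with("study/123")
```

The same pattern applies when the expected server behavior is just "send
back an error": the mock's `side_effect` stands in for the failure mode.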

On Wed, Dec 2, 2020 at 1:22 PM Brian Hulette  wrote:

> I've been working with Niels Basjes on a standardized developer build
> environment that can run `./gradlew check` [1]. We've run into issues
> because several Java unit tests (class *Test, run with :test) are not
> hermetic. They fail unless the environment they're running in has access to
> the internet, and is authenticated to GCP with access to certain resources.
> Of course the former isn't typically a blocker, but the latter certainly
> can be.
>
> Personally I think these tests should be classified as integration tests
> (renamed to *IT, and run with the :integrationTest task), but I realized I
> don't know if we have a formal definition for what should be a unit test vs
> an integration test. Should we (do we) require unit tests to be hermetic?
>
> Brian
>
> [1] https://github.com/apache/beam/pull/13308
>


Caching issue in BigQueryIO

2020-12-02 Thread Vasu Gupta
Hey folks, 

While using BigQueryIO to insert into 10k tables, I found that it has an issue
in its local caching technique for table creation. Tables are first searched
for in BigQueryIO's local cache, and then BigQueryIO checks whether to create
the table or not. The main issue arises when inserting into thousands of
tables: suppose we have 10k tables to insert into in real time, and since we
deploy a fresh Dataflow pipeline once a week, the local cache will be empty
and it will take a long time just to rebuild that cache for 10k tables, even
though those 10k tables were already created in BigQuery.

The solution I would propose is to provide an option for using external
caching services like Redis/Memcached, so that we don't have to rebuild the
cache again and again after a fresh deployment of the pipeline.


BigQueryIO create dataset

2020-12-02 Thread Vasu Gupta
Hey folks, why isn't there any capability to create datasets automatically,
just like tables, in BigQueryIO? At our company we have a dynamic dataset
architecture, which means that as packets arrive, we need to create new
datasets and tables on the go. Since BigQueryIO already has the functionality
to create tables automatically, we were wondering whether similar
functionality for datasets could be implemented in BigQueryIO.


Unit tests vs. Integration Tests

2020-12-02 Thread Brian Hulette
I've been working with Niels Basjes on a standardized developer build
environment that can run `./gradlew check` [1]. We've run into issues
because several Java unit tests (class *Test, run with :test) are not
hermetic. They fail unless the environment they're running in has access to
the internet, and is authenticated to GCP with access to certain resources.
Of course the former isn't typically a blocker, but the latter certainly
can be.

Personally I think these tests should be classified as integration tests
(renamed to *IT, and run with the :integrationTest task), but I realized I
don't know if we have a formal definition for what should be a unit test vs
an integration test. Should we (do we) require unit tests to be hermetic?

Brian

[1] https://github.com/apache/beam/pull/13308


Proposal: Scheduled tasks

2020-12-02 Thread Chad Dombrova
Hi everyone,
Beam's niche is low-latency, high-throughput workloads, but Beam has
incredible promise as an orchestrator of long-running work that gets sent
to a scheduler. We've created a modified version of Beam that allows the
Python SDK worker to outsource tasks to a scheduler, like Kubernetes batch
jobs[1], Argo[2], or Google's own OpenCue[3].

The basic idea is that any element in a stream can be tagged to be executed
outside of the normal SdkWorker as an atomic "task". A task is one
invocation of a stage, composed of one or more DoFns, against a slice
of the data stream, composed of one or more tagged elements. The upshot
is that we're able to slice up the processing of a stream across
potentially *many* workers, with the trade-off being the added overhead of
starting up a worker process for each task.
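
The tagging idea can be sketched as a partition step. This is my own
illustration of the concept, not code from the linked branch or design doc;
the tagging predicate and batch size are hypothetical knobs:

```python
# Illustrative sketch only: split a stream into elements processed inline by
# the SdkWorker and batches of tagged elements handed to an external
# scheduler as atomic "tasks".

def partition_into_tasks(elements, is_tagged, max_task_size):
    """Return (inline_elements, task_batches) for a slice of the stream."""
    inline, tasks, current = [], [], []
    for element in elements:
        if is_tagged(element):
            current.append(element)
            if len(current) == max_task_size:
                tasks.append(current)  # one scheduler job per batch
                current = []
        else:
            inline.append(element)
    if current:
        tasks.append(current)  # flush the final partial batch
    return inline, tasks
```

Each entry in `tasks` would then become one scheduler submission (e.g. one
Kubernetes Job), which is where the per-task worker startup overhead comes in.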

For more info on how we use our modified version of Beam to make visual
effects for feature films, check out the talk[4] I gave at the Beam Summit.

Here's our design doc:
https://docs.google.com/document/d/1GrAvDWwnR1QAmFX7lnNA7I_mQBC2G1V2jE2CZOc6rlw/edit?usp=sharing

And here's the github branch:
https://github.com/LumaPictures/beam/tree/taskworker_public

Looking forward to your feedback!
-chad


[1] https://kubernetes.io/docs/concepts/workloads/controllers/job/
[2] https://argoproj.github.io/
[3] https://cloud.google.com/opencue
[4] https://www.youtube.com/watch?v=gvbQI3I03a8_channel=ApacheBeam


Re: Tests for compatibility with Avro 1.8 and 1.9

2020-12-02 Thread Ismaël Mejía
Changing dependency to provided is a breaking change for
sdks-java-core because all users of Beam would have to explicitly pass
their Avro version once they upgrade their Beam version. That's not so
nice but maybe is the intended behavior.

A softer approach would be to leave it as it is (1.8), document
explicitly that we check upward compatibility with 1.9, and suggest that
users explicitly override the version if required.

I have not followed your work on the compatibility tests but I am
curious what is the issue with Avro 1.10?


On Tue, Dec 1, 2020 at 5:23 PM Piotr Szuberski
 wrote:
>
> I'd like to add tests verifying that Beam is compatible with both Avro 1.8
> and 1.9, similar to what has been done for Hadoop and Kafka.
>
> Probably all Avro dependencies would have to be changed from compile to
> provided - won't that be problematic for users? They would be broken after
> the update unless they add an Avro dependency. On the other hand, they'd be
> able to choose which version they prefer.
>
> At the moment Beam doesn't work with Avro 1.10, so users will be restricted
> to using either 1.8 or 1.9.
>
> Does changing the Avro dependencies to provided sound reasonable? Are there
> particular modules that should not be changed? Or is there a better approach?


Re: Requesting Beam Jira permissions

2020-12-02 Thread Alexey Romanenko
Done.

Welcome to Beam, Jakub!

Alexey

> On 2 Dec 2020, at 11:43, Agnieszka Sell  wrote:
> 
> Hi Folks,
> 
> I'm reaching out to request contributor access to Beam Jira. It's for Jakub
> Sadowski (username: qba123, email: jakub.sadow...@polidea.com) – he's the
> new Front-end developer in the Beam Website Revamp project :)
> 
> Best,
> 
> Agnieszka
> 



Requesting Beam Jira permissions

2020-12-02 Thread Agnieszka Sell
Hi Folks,


I'm reaching out to request contributor access to Beam Jira. It's for Jakub
Sadowski (username: qba123, email: jakub.sadow...@polidea.com) – he's the
new Front-end developer in the Beam Website Revamp project :)

Best,

Agnieszka