Contributor permission for Beam Jira tickets

2021-01-12 Thread Dmytro Kozhevin
Hi all,

My name is Dmytro. I'm a Google engineer working on making interactive Beam
more useful for Google's engineers (which should also hopefully help with
using the Dataflow runner interactively). Can someone please add me to the
Jira contributors, so that I can assign some issues to myself? My Jira handle
is 'dmkozh'.

Thanks!
Dmytro


Re: Planning a freeze on website changes to merge new designs

2021-01-12 Thread Brian Hulette
On Mon, Jan 11, 2021 at 11:00 AM Robert Bradshaw 
wrote:

> On Mon, Jan 11, 2021 at 10:38 AM Brian Hulette 
> wrote:
>
>> I spoke with Gris and Agnieszka about this on Friday. I should probably
>> fill in the background a bit.
>>
>> The strategy we've adopted to review the new designs so far is pretty
>> similar to what Robert proposed, except rather than having a separate
>> directory and merging PRs to the master branch, they've been sending PRs to
>> merge into a separate `website-revamp` branch [1]. I've been keeping
>> `website-revamp` synced to master, and I've been careful about only merging
>> PRs that edit the website style (e.g., CSS and HTML templates) and not the
>> content itself (markdown files), to avoid merge conflicts when we
>> finally bring the website-revamp branch into master.
>>
>
> Ah, that sounds good. For some reason I completely missed that there was a
> separate branch being used here.
>
>
>> (Conflicts in style changes can be easily resolved; conflicts in content
>> are much more difficult to tease apart.)
>>
>> Unfortunately some of the recent PRs make changes to the markdown files
>> as well. I spoke with Gris and Agnieszka about this and they indicated
>> there will likely be more content changes as they edit copy and split up
>> pages.
>>
>> On Friday we discussed a couple different options:
>> 1) Make content changes on the master branch, completely separate from
>> the style changes, or
>> 2) Have a *planned* freeze in website changes to finalize the new design
>>
>> Honestly my preference is for (1), but I'm hesitant to push for it as it
>> puts more burden on the website developers, who'd need to make sure content
>> changes work in two website layouts. (2) on the other hand puts time
>> pressure on the reviewers (myself and Pablo so far).
>>
>
> My preference would be for (1) as well; and in addition presumably the
> content changes would improve the current website as well as the new. There
> is also option (3), which is to allow development to continue on the dev
> branch (rather than a freeze), placing the responsibility of correctly
> recognizing and resolving conflicts on the owners of the website-revamp
> branch.
>

I see myself (and all Beam committers) as the owner of the website-revamp
branch; it's in the apache/beam repo.


>
> It might be worth highlighting an example of a content change that makes
> any of these workflows difficult.
>

The most compelling example is the extensive changes to the contribution
guide here:
https://github.com/apache/beam/pull/13565/files#diff-3f46c575ca6547b8deef533eb8e191507edcf806529f7faecb4a56a246063af6
The PR was already missing the changes made to the contribution guide in
https://github.com/apache/beam/pull/13308. I also just merged master into
website-revamp, and the PR now has a merge conflict with the changes from
https://github.com/apache/beam/pull/13420.


>
>
>> [1] https://github.com/apache/beam/tree/website-revamp
>>
>> On Mon, Jan 11, 2021 at 10:03 AM Robert Bradshaw 
>> wrote:
>>
>>> A site-wide freeze during which there was a huge, rushed code dump was
>>> not the most effective way to manage or review the large website changes
>>> last time, and I don't think it would be a good idea to attempt that again.
>>>
>>> Instead, can we create a parallel directory/site in our repo,
>>> incrementally build/commit/review it in there, and once everyone is happy
>>> with it do a single switch with a small redirection commit (followed by
>>> deleting the old content)? As for incorporating changes that happen during
>>> development, this is what every developer is already doing (on the code
>>> side) and we should take advantage of the revision control systems we use
>>> to make sure nothing is lost.
>>>
>>> - Robert
>>>
>>>
>>>
>>> On Sun, Jan 10, 2021 at 4:07 PM Griselda Cuevas  wrote:
>>>
 Hi folks,

 As you know we've been working on a revamp for the website, and we're
 getting ready to commit the work we've done. In order to minimize the risk
 of losing changes other contributors make during this period, we'd like to
 plan a freeze so we can work on making the revamp commits. A freeze in this
 context would mean giving notice to our dev community not to make any PRs
 or changes to the site during this period.

 I'd like to propose we have a one-week freeze during the last week of
 January or the first week in February.

 What do you think?

 G

>>>


Re: Usability regression using SDF Unbounded Source wrapper + DirectRunner

2021-01-12 Thread Boyuan Zhang
Hi,

I proposed making the runner-issued checkpoint frequency configurable by the
pipeline author here:
https://docs.google.com/document/d/18jNLtTyyApx0N2ytp1ytOMmUPLouj2h08N3-4SyWGgQ/edit?usp=sharing.
I believe it will also be helpful for the performance issue. Please feel
free to drop any comments there : )

On Wed, Jan 6, 2021 at 1:14 AM Jan Lukavský  wrote:

> Sorry for the typo in your name. :-)
>
> On 1/6/21 10:11 AM, Jan Lukavský wrote:
> > Hi Antonie,
> >
> > yes, for instance. I'd just like to rule out the possibility that a single
> > DoFn processing multiple partitions (restrictions) brings some
> > overhead in your case.
> >
> > Jan
> >
> > On 12/31/20 10:36 PM, Antonio Si wrote:
> >> Hi Jan,
> >>
> >> Sorry for the late reply. My topic has 180 partitions. Do you mean
> >> running with
> >> parallelism set to 900?
> >>
> >> Thanks.
> >>
> >> Antonio.
> >>
> >> On 2020/12/23 20:30:34, Jan Lukavský  wrote:
> >>> OK,
> >>>
> >>> could you make an experiment and increase the parallelism to something
> >>> significantly higher than the total number of partitions? Say 5 times
> >>> higher? Would that have an impact on throughput in your case?
> >>>
> >>> Jan
> >>>
> >>> On 12/23/20 7:03 PM, Antonio Si wrote:
>  Hi Jan,
> 
>  The performance data that I reported was run with parallelism = 8.
>  We also ran with parallelism = 15 and we observed similar behaviors
>  although I don't have the exact numbers. I can get you the numbers
>  if needed.
> 
>  Regarding the number of partitions: since we have multiple topics, the
>  number of partitions varies from 180 to 12. The highest TPS topic
>  has 180 partitions, while the lowest TPS topic has 12 partitions.
> 
>  Thanks.
> 
>  Antonio.
> 
>  On 2020/12/23 12:28:42, Jan Lukavský  wrote:
> > Hi Antonio,
> >
> > can you please clarify a few things:
> >
> > a) what parallelism you use for your sources
> >
> > b) how many partitions there are in your topic(s)
> >
> > Thanks,
> >
> > Jan
> >
> > On 12/22/20 10:07 PM, Antonio Si wrote:
> >> Hi Boyuan,
> >>
> >> Let me clarify, I have tried with and without using
> >> --experiments=beam_fn_api,use_sdf_kafka_read option:
> >>
> >> -  with --experiments=use_deprecated_read --fasterCopy=true, I
> >> am able to achieve 13K TPS
> >> -  with --experiments="beam_fn_api,use_sdf_kafka_read"
> >> --fasterCopy=true, I am able to achieve 10K
> >> -  with --fasterCopy=true alone, I am only able to achieve 5K TPS
> >>
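
(A minimal sketch of how flags like these are typically wired into a Java
pipeline entry point; the flag values below are the ones quoted in this
thread, while the class name and pipeline body are placeholders.)

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class KafkaPipelineMain {
  public static void main(String[] args) {
    // Typically invoked with flags such as:
    //   --runner=FlinkRunner
    //   --experiments=beam_fn_api,use_sdf_kafka_read  (or use_deprecated_read)
    //   --fasterCopy=true    (a Flink runner option)
    //   --parallelism=900    (Jan's suggestion: ~5x the total partition count)
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline pipeline = Pipeline.create(options);
    // ... apply KafkaIO.read() and the rest of the pipeline here ...
    pipeline.run().waitUntilFinish();
  }
}
```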
> >> In our test case, we have multiple topics, and the checkpoint interval
> >> is 60s. Some topics have much higher traffic than others. We looked
> >> at the case with the --experiments="beam_fn_api,use_sdf_kafka_read"
> >> --fasterCopy=true options a little. Based on our observation,
> >> each consumer poll() in ReadFromKafkaDoFn.processElement() takes
> >> about 0.8ms. So for topics with high traffic, it will stay in
> >> the loop because every poll() will return some records. Every
> >> poll returns about 200 records, so it takes about 0.8ms for
> >> every 200 records. I am not sure if that is part of the reason
> >> for the performance difference.
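
(A back-of-envelope check on those figures: at roughly 200 records per 0.8 ms
poll, a single reader sustains about 250K records/s, so the poll() overhead
by itself seems unlikely to explain a 10K TPS ceiling.)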
> >>
> >> Thanks.
> >>
> >> Antonio.
> >>
> >> On 2020/12/21 19:03:19, Boyuan Zhang  wrote:
> >>> Hi Antonio,
> >>>
> >>> Thanks for the data point. That's very valuable information!
> >>>
> >>> I didn't use DirectRunner. I am using FlinkRunner.
>  We measured the number of Kafka messages that we can process
>  per second.
>  With Beam v2.26 with --experiments=use_deprecated_read and
>  --fasterCopy=true,
>  we are able to consume 13K messages per second, but with Beam
>  v2.26
>  without the use_deprecated_read option, we are only able to
>  process 10K
>  messages
>  per second for the same pipeline.
> >>> We do have an SDF implementation of Kafka Read instead of using the
> >>> wrapper.
> >>> Would you like to give it a try to see whether it helps improve
> >>> your
> >>> situation? You can use
> >>> --experiments=beam_fn_api,use_sdf_kafka_read to
> >>> switch to the Kafka SDF Read.
> >>>
> >>> On Mon, Dec 21, 2020 at 10:54 AM Boyuan Zhang
> >>>  wrote:
> >>>
>  Hi Jan,
> > it seems that what we would want is to couple the lifecycle of
> > the Reader
> > not with the restriction but with the particular instance of
> > (Un)boundedSource (after being split). That could be done in
> > the processing
> > DoFn, if it contained a cache mapping the instance of the source
> > to the
> > (possibly null - i.e. not yet open) reader. In @NewTracker we
> > could assign
> > (or create) the reader to the tracker, as the tracker is
> > created 

Re: Have the Java examples gradle files broken?

2021-01-12 Thread Reuven Lax
Never mind - I think I found it.

On Tue, Jan 12, 2021 at 12:31 PM Reuven Lax  wrote:

> If I try to run something from the Java examples, e.g. the following:
>
> ./gradlew integrationTest -p examples/java  --tests
> org.apache.beam.examples.cookbook.BigQueryTornadoes ...
>
>
>
> I now get gradle errors that look like this. Does anyone know if something
> changed?
>
>
>
> 2021-01-12T12:28:51.428-0800 [ERROR]
> [org.gradle.internal.buildevents.BuildExceptionReporter] FAILURE: Build
> failed with an exception.
>
> 2021-01-12T12:28:51.428-0800 [ERROR]
> [org.gradle.internal.buildevents.BuildExceptionReporter]
>
> 2021-01-12T12:28:51.428-0800 [ERROR]
> [org.gradle.internal.buildevents.BuildExceptionReporter] * Where:
>
> 2021-01-12T12:28:51.428-0800 [ERROR]
> [org.gradle.internal.buildevents.BuildExceptionReporter] Build file
> '/Users/relax/beam/examples/java/build.gradle' line: 32
>
> 2021-01-12T12:28:51.428-0800 [ERROR]
> [org.gradle.internal.buildevents.BuildExceptionReporter]
>
> 2021-01-12T12:28:51.428-0800 [ERROR]
> [org.gradle.internal.buildevents.BuildExceptionReporter] * What went wrong:
>
> 2021-01-12T12:28:51.428-0800 [ERROR]
> [org.gradle.internal.buildevents.BuildExceptionReporter] A problem occurred
> evaluating project ':examples:java'.
>
> 2021-01-12T12:28:51.428-0800 [ERROR]
> [org.gradle.internal.buildevents.BuildExceptionReporter] > unexpected
> character "
>


Accessing Custom Beam Metrics in Dataproc

2021-01-12 Thread Rion Williams
Hi all,

I'm currently in the process of adding some metrics to an existing pipeline 
that runs on Google Dataproc via Spark and I'm trying to determine how to 
access these metrics and eventually expose them to Stackdriver (to be used 
downstream in Grafana dashboards).

The metrics themselves are fairly simple (a series of counters) and are defined
as follows (they are accessed in DoFns throughout the pipeline):

```
/** Metrics gathered during Event-related transforms */
private object Metrics {
    // This is used to keep track of any dynamically added data sources and their counts
    val totalMessages: Counter =
        BeamMetrics.counter(Events::class.qualifiedName, "messages_total")
}
```

After initially running the pipeline in Dataproc, I wasn't able to see anything
that specifically indicated that the metrics were being exposed at all. I
haven't added any specific configuration to handle this within the pipeline
itself; however, I did notice an interface that I may need to consider
implementing, called MetricsOptions:

```
interface MyPipelineOptions : ... MetricsOptions { ... }
```

So my questions primarily center around:

- Will Metrics be emitted automatically? Or do I need to explicitly implement 
the MetricsOptions interface for the pipeline?
- Does anyone have any experience with handling this (i.e., Pipeline > Metrics >
Stackdriver)? I'd imagine that since this is all self-contained within GCP
(Dataproc + Stackdriver), it wouldn't be too rough to hand that baton off.
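
(One way to at least confirm the counters are being recorded is to query them
from the PipelineResult after a blocking run. Below is a minimal sketch using
Beam's Java metrics API; note that runner support for querying metrics varies,
and the "Events" namespace stands in for whatever Events::class.qualifiedName
resolves to in the snippet above.)

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.metrics.MetricNameFilter;
import org.apache.beam.sdk.metrics.MetricQueryResults;
import org.apache.beam.sdk.metrics.MetricResult;
import org.apache.beam.sdk.metrics.MetricsFilter;

public class MetricsCheck {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create();
    // ... apply the transforms that increment messages_total here ...

    PipelineResult result = pipeline.run();
    result.waitUntilFinish();

    // Query the counter back from the runner by namespace and name.
    MetricQueryResults metrics =
        result.metrics().queryMetrics(
            MetricsFilter.builder()
                .addNameFilter(MetricNameFilter.named("Events", "messages_total"))
                .build());
    for (MetricResult<Long> counter : metrics.getCounters()) {
      System.out.println(counter.getName() + " = " + counter.getAttempted());
    }
  }
}
```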

Any advice / articles / examples would be greatly appreciated!

Thanks,

Rion


Re: Can composite transforms have zero subtransforms?

2021-01-12 Thread Robert Bradshaw
Yes, a PTransform can have no sub-transforms, as long as it only returns its
inputs. Updating the docs would be a good idea.
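
(For concreteness, a minimal sketch of such a no-op composite in the Java SDK;
the class name is a placeholder.)

```java
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.values.PCollection;

// A composite transform that returns its input unchanged: its expansion
// contains zero sub-transforms, which is exactly the case in question.
class Identity<T> extends PTransform<PCollection<T>, PCollection<T>> {
  @Override
  public PCollection<T> expand(PCollection<T> input) {
    return input;
  }
}
```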

On Tue, Jan 12, 2021 at 1:04 PM Brian Hulette  wrote:

> A recent bug with SqlTransform on Dataflow Runner V2 [1] revealed an
> interesting ambiguity in the Beam model: it's not clear if a composite
> transform is allowed to have zero sub-transforms [2]. This may sound like
> an academic concern, but it can happen if a PTransform returns its own
> input, making it a no-op.
>
> I tend to agree with Kenn's comment in the jira that we should allow it.
> If we don't, this puts a burden on SDKs: they would need to either
> a) detect when a PTransform returns one of its inputs and raise an error,
> or
> b) find and replace any such no-ops before generating a portable pipeline
> graph
>
> If there aren't any objections to allowing "empty" composites I'll send a
> PR to clarify this in beam_runner_api.proto
>
> Brian
>
> [1] https://issues.apache.org/jira/browse/BEAM-11614
> [2]
> https://github.com/apache/beam/blob/05c8471b27e03e5611a2a13137c4a785f2d17fc9/model/pipeline/src/main/proto/beam_runner_api.proto#L152-L155
>


Re: [Input Needed] Updating the documentation left navigation (was:Website Revamp update & navigation for documentation page)

2021-01-12 Thread Kenneth Knowles
I see. Yes, I meant that things like  (or some level) do not appear in
the right nav. I see, if content is the next round I will save more
comments. I think that once we have a style, including what levels are
shown where, then it will be time to fix content issues.

Kenn

On Mon, Dec 21, 2020 at 10:46 AM Griselda Cuevas  wrote:

> Are you referring to things missing in the left navigation or in the right
> one?
>
> If things are missing in the left navigation, we will fix them in this round;
> if there are things missing in the right navigation, we won't fix them this
> time around: the right navigation is automatically created from formatting in
> the documentation, displaying headers/sub-headers, and we won't touch the
> actual content this time around.
>
> On Mon, 21 Dec 2020 at 09:40, Kenneth Knowles  wrote:
>
>> I like the change. The left nav is inconsistent about things being
>> re-collapsed when we return to the page anyhow, and I never really
>> understood what part of navigation belonged in which side.
>>
>> And it seems there are still more levels of subheaders than we can
>> display. Is the plan to simply not have those in the nav? (this seems fine
>> with me)
>>
>> Kenn
>>
>> On Mon, Dec 14, 2020 at 7:30 PM Griselda Cuevas  wrote:
>>
>>> Hi dev@ community,
>>>
>>>
>>> I wanted to fork this conversation to make it more visible. We'd like to
>>> get your feedback on the proposed update for the documentation page on the
>>> website.
>>>
>>>
>>> The proposed change is that the left navigation (which is the website
>>> main navigation) stays high level and only shows titles, leaving the
>>> detailed hierarchy in the right navigation, which appears when you click on
>>> a Documentation topic or when you land directly on it.
>>>
>>>
>>> You can see a screenshot of the current and proposed designs in the wiki
>>> repo with the other website designs [1]. The names of the files are:
>>> current_DocsNav for the current design, and proposed_DocsNav for the
>>> proposed design.
>>>
>>>
>>> Please let us know your preference or suggestions through this week.
>>>
>>>
>>> Gris
>>>
>>>
>>> [1]
>>> https://cwiki.apache.org/confluence/display/BEAM/Website+Redesign+Files
>>>
>>


Re: pulllicenses fails while building

2021-01-12 Thread Kyle Weaver
FYI Emily fixed the bug with unintended license pulling. So bugs like the
one Reuven originally reported should be bypassed by default.

On Fri, Jan 8, 2021 at 5:01 PM Kyle Weaver  wrote:

> > $0.02: can we make it a separate target rather than an existing target
> with flags?
>
> Something like :sdks:java:container:java8:dockerWithLicenses? I think that
> would make sense, but it's orthogonal to this bug. Can you file a feature
> request, Kenn?
>
> On Fri, Jan 8, 2021 at 4:50 PM Kenneth Knowles  wrote:
>
>> $0.02: can we make it a separate target rather than an existing target
>> with flags?
>>
>> On Fri, Jan 8, 2021 at 11:06 AM Kyle Weaver  wrote:
>>
>>> ./gradlew :sdks:java:container:build runs the pullLicenses task.
>>> https://issues.apache.org/jira/browse/BEAM-11586
>>>
>>> On Fri, Jan 8, 2021 at 10:59 AM Kyle Weaver  wrote:
>>>
 Which Gradle command are you running, Reuven? And which Gradle task is
 failing?

 > Also, I am not sure why licenses are pulled for the regular development
 case. I thought it was not meant to run by default.

 +1 this seems like a bug. IIRC you are supposed to pass
 "-Pdocker-pull-licenses" to Gradle to enable pulling licenses, but it
 shouldn't be set by default.

 On Thu, Jan 7, 2021 at 7:48 PM Reuven Lax  wrote:

> Unfortunately, setting that env variable didn't change anything.
>
> On Thu, Jan 7, 2021 at 7:09 PM Ahmet Altay  wrote:
>
>> Googled this a bit. Setting this env variable might fix the problem
>> by bypassing the check:
>> export PYTHONHTTPSVERIFY=0
>>
>> Could you try that? If it works we can make that part of the script.
>>
>> Also, I am not sure why licenses are pulled for the regular development
>> case. I thought it was not meant to run by default.
>>
>> /cc +Tyson Hamilton  +Emily Ye
>> 
>>
>> On Thu, Jan 7, 2021 at 5:42 PM Reuven Lax  wrote:
>>
>>> I recently upgraded to Python 3 on my build machine. Now all
>>> attempts to build Beam fail with the following. Anyone know how to fix
>>> this?
>>>
>>> urllib.error.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get
>>> local
>>> issuer certificate (_ssl.c:1123)>
>>>
>>> ERROR:root:Invalid url for paranamer-2.7:
>>> https://raw.githubusercontent.com/paul-hammant/paranamer/master/LICENSE.txt
>>> .
>>>
>>>
>>>
>>>


Can composite transforms have zero subtransforms?

2021-01-12 Thread Brian Hulette
A recent bug with SqlTransform on Dataflow Runner V2 [1] revealed an
interesting ambiguity in the Beam model: it's not clear if a composite
transform is allowed to have zero sub-transforms [2]. This may sound like
an academic concern, but it can happen if a PTransform returns its own
input, making it a no-op.

I tend to agree with Kenn's comment in the jira that we should allow it. If
we don't, this puts a burden on SDKs: they would need to either
a) detect when a PTransform returns one of its inputs and raise an error, or
b) find and replace any such no-ops before generating a portable pipeline
graph

If there aren't any objections to allowing "empty" composites I'll send a
PR to clarify this in beam_runner_api.proto

Brian

[1] https://issues.apache.org/jira/browse/BEAM-11614
[2]
https://github.com/apache/beam/blob/05c8471b27e03e5611a2a13137c4a785f2d17fc9/model/pipeline/src/main/proto/beam_runner_api.proto#L152-L155


Have the Java examples gradle files broken?

2021-01-12 Thread Reuven Lax
If I try to run something from the Java examples, e.g. the following:

./gradlew integrationTest -p examples/java  --tests
org.apache.beam.examples.cookbook.BigQueryTornadoes ...



I now get gradle errors that look like this. Does anyone know if something
changed?



2021-01-12T12:28:51.428-0800 [ERROR]
[org.gradle.internal.buildevents.BuildExceptionReporter] FAILURE: Build
failed with an exception.

2021-01-12T12:28:51.428-0800 [ERROR]
[org.gradle.internal.buildevents.BuildExceptionReporter]

2021-01-12T12:28:51.428-0800 [ERROR]
[org.gradle.internal.buildevents.BuildExceptionReporter] * Where:

2021-01-12T12:28:51.428-0800 [ERROR]
[org.gradle.internal.buildevents.BuildExceptionReporter] Build file
'/Users/relax/beam/examples/java/build.gradle' line: 32

2021-01-12T12:28:51.428-0800 [ERROR]
[org.gradle.internal.buildevents.BuildExceptionReporter]

2021-01-12T12:28:51.428-0800 [ERROR]
[org.gradle.internal.buildevents.BuildExceptionReporter] * What went wrong:

2021-01-12T12:28:51.428-0800 [ERROR]
[org.gradle.internal.buildevents.BuildExceptionReporter] A problem occurred
evaluating project ':examples:java'.

2021-01-12T12:28:51.428-0800 [ERROR]
[org.gradle.internal.buildevents.BuildExceptionReporter] > unexpected
character "


Re: Beam with Confluent Schema Registry and protobuf

2021-01-12 Thread Cristian Constantinescu
Hi Alexey,

The short answer is: I hope the code in my sample project makes its way
back into Beam, but it may take some time. Hopefully, until then, people
in a similar position have a demo of a workaround they can use.

The longer answer is:

The history here is a bit lengthy; let me summarize it for those who
are not aware of it.

1. Beam uses old versions (5.3) of the Confluent libs. Those versions do not
support protobuf or JSON schemas; support for those was added in 5.5.
Confluent is now at version 6.

2. The newer Confluent libs depend on Avro 1.9 (last time I checked), which
conflicts with Beam's dependency on Avro 1.8.

3. Avro 1.9+ has breaking changes from 1.8.

4. There were at least two attempts at updating Beam to use Avro 1.9. One
of those was mine. It included an attempt to move all of Avro outside of
Beam core to its own module, like Beam's protobuf support.

5. Unfortunately, that isn't a viable path forward for the community as it
breaks Beam's backwards compatibility. But it did provide insights on the
spread of Avro inside of Beam.

6. The latest agreement is that some tests need to be added to Beam to check
compatibility with Avro 1.9+.

I wanted to take this task up, but haven't gotten to it yet. Furthermore, I'm
not sure how those tests are going to help with the upgrade of the
Confluent libs as I'm not sure we can have two versions of Avro inside of
Beam. I'll have to check to see what's possible.

Once the Confluent libs are updated, adding support for protobufs and the
Confluent schema registry looks trivial. We just need to decide if we use
the ConfluentSchemaRegistryDeserializerProvider for the task or if we have
a separate class for similar functionality using protobuf.
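
(In the meantime, a hedged sketch of one way to wire Confluent's protobuf
deserializer into KafkaIO without touching Beam itself. Assumptions: Confluent
5.5+ serializer jars on the classpath, which is exactly the dependency clash
described above; a protoc-generated class MyMessage; and placeholder broker,
registry, and topic names. This is illustrative, not necessarily how the
linked demo does it.)

```java
import java.util.HashMap;
import java.util.Map;
import io.confluent.kafka.serializers.protobuf.KafkaProtobufDeserializer;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.extensions.protobuf.ProtoCoder;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ProtobufRegistryRead {
  @SuppressWarnings({"unchecked", "rawtypes"})
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create();

    // Confluent deserializer config: where the schema registry lives, and
    // which generated protobuf class to deserialize values into.
    // MyMessage is a placeholder for a protoc-generated Message subclass.
    Map<String, Object> config = new HashMap<>();
    config.put("schema.registry.url", "http://localhost:8081");
    config.put("specific.protobuf.value.type", MyMessage.class.getName());

    pipeline.apply(
        KafkaIO.<String, MyMessage>read()
            .withBootstrapServers("localhost:9092")
            .withTopic("my-topic")
            .withKeyDeserializer(StringDeserializer.class)
            // Beam cannot infer a coder for the Confluent deserializer,
            // so one is supplied explicitly.
            .withValueDeserializerAndCoder(
                (Class) KafkaProtobufDeserializer.class,
                ProtoCoder.of(MyMessage.class))
            .withConsumerConfigUpdates(config)
            .withoutMetadata());

    pipeline.run();
  }
}
```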



On Tue., Jan. 12, 2021, 09:58 Alexey Romanenko, 
wrote:

> Hi Cristian,
>
> Great!
>
> Would you be interested in adding this functionality into the
> original ConfluentSchemaRegistryDeserializerProvider (along with JSON
> support, maybe) [1]?
> Though, it seems that it will require an update of the Confluent deps to at
> least version 5.5 [2], and it can be blocked by the Beam Avro deps update.
>
> [1] https://issues.apache.org/jira/browse/BEAM-9330
> [2] https://www.confluent.io/blog/introducing-confluent-platform-5-5/
>
> On 9 Jan 2021, at 05:16, Cristian Constantinescu  wrote:
>
> Hi everyone,
>
> Beam currently has a dependency on older versions of the Confluent libs.
> It makes it difficult to use Protobufs with the Confluent Schema Registry
> as ConfluentSchemaRegistryDeserializerProvider only supports Avro.
>
> I put together a very simple project to demo how it can be done without
> touching any files inside of Beam. You can find it here:
> https://github.com/zeidoo/beam-confluent-schema-registry-protobuf
>
> Any comments are welcome, especially if there are better ways of doing
> things.
>
> Cheers!
>
>
>


Re: [Input needed] Capability Matrix Visual Redesign for extended version

2021-01-12 Thread Agnieszka Sell
Hi Kenn,

thank you for your answers!

We'll update the Capability Matrix UI according to your design-related
suggestions. It will be implemented in a way that allows you to rather
easily change its text content, remove/add columns and rows, etc. :)

Kind regards,

Agnieszka

On Wed, Jan 6, 2021 at 6:48 PM Kenneth Knowles  wrote:

> Very good questions. Answers inline.
>
> On Wed, Jan 6, 2021 at 8:16 AM Agnieszka Sell 
> wrote:
>
>> Hi Kenneth,
>>
>> Thank you for your feedback about the Capability Matrix! I have several
>> questions about it:
>>
>> *Feedback: I think we can also remove rows that are not started or not 
>> complete in the Beam Model, and remove the Beam Model column.*
>> Question: If we remove the Beam Model column, the whole point of making it 
>> static and showing the capabilities would be lost. Isn't the point to show 
>> capabilities of Beam vs. other tools?
>>
>>
> To clarify the purpose of the capability matrix: it is not comparing Beam
> vs other tools. It is comparing adapters that run a Beam pipeline on top of
> other tools. For example the "Apache Spark" column describes the
> capabilities of Beam's "SparkRunner", not Spark itself. Maybe we need to
> adjust the wording above the matrix to make this clear.
>
> So the column with the title "What is being computed?" is already a full
> list of the features of the Beam Model. The rows where "Beam Model" has an
> "X" or "~" are just ideas for future work, or features still in progress.
>
> *Feedback: I think Splittable DoFn really just deserves one row for bounded, 
> one for unbounded, and any caveats go in the details.*
>> Question: What would it look like? All this in one matrix or separate?
>>
>>
> I suggest to add it as a row in "What is being computed?" like ParDo,
> GroupByKey, ..., Stateful Processing, Splittable DoFn.
>
>
>>
>> *Feedback: All the windowing rows can be condensed into "Basic windowing 
>> support" and "Merging windowing support" and any runner that can only run a 
>> couple WindowFns can have details in the caveats. At this point any runner 
>> that doesn't do Windowing by invoking a user's WindowFn simply doesn't 
>> really support windowing in the model.*
>> Suggestion: Do we still have a separate matrix for only two(?) rows?
>>
>>
> My opinion may be controversial... I don't care that much about splitting
> What/Where/When/How. Especially it is confusing to use "Where" to talk
> about event time.
>
> Personally, I would just make all the last three tables into a single
> table "Windowing and Triggering" and the rows "Basic windowing support",
> "Merging windowing support", "Configurable triggering", "Allowed lateness",
> "Discarding mode", "Accumulating mode". I would remove Timers from that
> table and rename "Stateful processing" in the table above to "State &
> timers" since these are really one feature taken together.
>
> Many of those decisions are not really part of the redesign, but just
> ideas to save space. If you need more space savings, I can find more... for
> example there is no value to ParDo, GroupByKey, and Flatten being separate,
> really. If you don't have those all implemented, you don't have a Beam
> runner at all, so they will never be different. This could be omitted. Or
> it could be a single "Baseline runner" row to add caveats. For example the
> existing caveats are unnecessary: Spark has a caveat on GroupByKey that is
> really about triggers. Structured streaming has "~" but the details are not
> actually caveats.
>
> Kenn
>
>
>> Kind regards,
>>
>> Agnieszka
>>
>> On Mon, Dec 21, 2020 at 7:49 PM Griselda Cuevas  wrote:
>>
>>> Thanks Kenn, this is super helpful.
>>>
>>>
>>>
>>> On Mon, 21 Dec 2020 at 09:57, Kenneth Knowles  wrote:
>>>
 For the capability matrix, part of the problem is that the rows don't
 all make that much sense, as we've discussed a couple times.

 But assuming we keep the content identical, maybe we could just have
 the collapsed view and make the table selectable where *just* the selected
 cell controls content below? You won't be able to do side-by-side
 comparisons of the full text of things, but you will be able to keep the
 overview and drill in one at a time quickly. Just one idea.

 A couple ways to save space without rearchitecting it:

  - Apache Hadoop MapReduce and JStorm can be removed as they are on
 branches, not released.
  - I think we can also remove rows that are not started or not complete
 in the Beam Model, and remove the Beam Model column.
  - I think Splittable DoFn really just deserves one row for bounded,
 one for unbounded, and any caveats go in the details.
  - All the windowing rows can be condensed into "Basic windowing
 support" and "Merging windowing support" and any runner that can only run a
 couple WindowFns can have details in the caveats. At this point any runner
 that doesn't do Windowing by invoking a user's WindowFn simply doesn't
 really support windowing in the model.

Re: Beam with Confluent Schema Registry and protobuf

2021-01-12 Thread Alexey Romanenko
Hi Cristian,

Great! 

Would you be interested in adding this functionality into the original
ConfluentSchemaRegistryDeserializerProvider (along with JSON support, maybe) [1]?
Though, it seems that it will require an update of the Confluent deps to at
least version 5.5 [2], and it can be blocked by the Beam Avro deps update.

[1] https://issues.apache.org/jira/browse/BEAM-9330
[2] https://www.confluent.io/blog/introducing-confluent-platform-5-5/

> On 9 Jan 2021, at 05:16, Cristian Constantinescu  wrote:
> 
> Hi everyone,
> 
> Beam currently has a dependency on older versions of the Confluent libs. It 
> makes it difficult to use Protobufs with the Confluent Schema Registry as 
> ConfluentSchemaRegistryDeserializerProvider only supports Avro.
> 
> I put together a very simple project to demo how it can be done without 
> touching any files inside of Beam. You can find it here: 
> https://github.com/zeidoo/beam-confluent-schema-registry-protobuf 
> 
> 
> Any comments are welcome, especially if there are better ways of doing 
> things.
> 
> Cheers!