Re: JB's back

2018-11-20 Thread Matthias Baetens
Good to have you back JB!

On Wed, Nov 21, 2018, 06:06 Kenneth Knowles  wrote:

> Yes, welcome back!
>
> On Tue, Nov 20, 2018 at 9:51 PM Ahmet Altay  wrote:
>
>> Welcome back!
>>
>> On Tue, Nov 20, 2018 at 9:11 PM, Jean-Baptiste Onofré 
>> wrote:
>>
>>> Hi guys,
>>>
>>> Sorry to have been quiet recently.
>>>
>>> After some rushed weeks, and having been sick last week, I'm back on Beam.
>>>
>>> With Alexey, Etienne and Ismaël, we are working on the Spark runner, and
>>> I'm also resuming several pieces of work I had put on hold on the IOs.
>>>
>>> Regards
>>> JB
>>> --
>>> Jean-Baptiste Onofré
>>> jbono...@apache.org
>>> http://blog.nanthrax.net
>>> Talend - http://www.talend.com
>>>
>>
>> --


Re: JB's back

2018-11-20 Thread Ahmet Altay
Welcome back!

On Tue, Nov 20, 2018 at 9:11 PM, Jean-Baptiste Onofré 
wrote:

> Hi guys,
>
> Sorry to have been quiet recently.
>
> After some rushed weeks, and having been sick last week, I'm back on Beam.
>
> With Alexey, Etienne and Ismaël, we are working on the Spark runner, and
> I'm also resuming several pieces of work I had put on hold on the IOs.
>
> Regards
> JB
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


JB's back

2018-11-20 Thread Jean-Baptiste Onofré
Hi guys,

Sorry to have been quiet recently.

After some rushed weeks, and having been sick last week, I'm back on Beam.

With Alexey, Etienne and Ismaël, we are working on the Spark runner, and
I'm also resuming several pieces of work I had put on hold on the IOs.

Regards
JB
-- 
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Bay Area Apache Beam Kickoff!

2018-11-20 Thread Jean-Baptiste Onofré
Nice!!

Unfortunately I won't be able to be there. But good luck, and I'm sure it
will be a great meetup!

Regards
JB

On 20/11/2018 02:36, Austin Bennett wrote:
> We have our first meetup scheduled for December 12th in San Francisco.  
> 
> Andrew Pilloud, a software engineer at Google and Beam committer, will
> demo the latest feature in Beam SQL: a standalone SQL shell. The talk will
> cover why SQL is a good fit for streaming data processing, the technical
> details of the Beam SQL engine, and a peek into our future plans.
> 
> Kenn Knowles, a founding PMC Member and incoming PMC Chair for the
> Apache Beam project, as well as computer scientist and engineer at
> Google will share about all things Beam. Where it is, where it's been,
> where it's going.
> 
> More info:
>  https://www.meetup.com/San-Francisco-Apache-Beam/events/256348972/
> 
> For those in/around town (or that can be) come join in the fun!  
> 
> 
> 
> 

-- 
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: [PROPOSAL] Prepare Beam 2.9.0 release

2018-11-20 Thread Jean-Baptiste Onofré
Hi Cham,

It sounds good to me.

I'm resuming some work on the IOs, but nothing is a blocker.

Regards
JB

On 21/11/2018 03:59, Chamikara Jayalath wrote:
> Hi All,
> 
> Looks like there are three blockers in the burndown list but they are
> actively being worked on.
> 
> If there's no objection I'll create the release branch tomorrow morning.
> We can cherry-pick fixes to the blockers before building the first RC
> hopefully on Monday.
> 
> Thanks,
> Cham
> 
> 
> On Sat, Nov 17, 2018 at 10:38 AM Kenneth Knowles  wrote:
> 
> 
> 
> On Fri, Nov 16, 2018, 18:25 Thomas Weise   wrote:
> 
> 
> 
> On Thu, Nov 15, 2018 at 10:53 PM Charles Chen  wrote:
> 
> +1
> 
> Note that we need to temporarily revert
> https://github.com/apache/beam/pull/6683 before the release
> branch cut per the discussion
> at 
> https://lists.apache.org/thread.html/78fe33dc41b04886f5355d66d50359265bfa2985580bb70f79c53545@%3Cdev.beam.apache.org%3E
> 
> 
> Since it is for the release I would prefer this to be done in
> the release branch, and not in master.
> 
> 
> +1 do it (and all remaining burndown) on the release branch.
> Tracking Jira targeting 2.9.0?
> 
> Kenn
> 
> 
> 
> 
> 
> 
> 
>  
> 
> On Thu, Nov 15, 2018 at 9:18 PM Tim
>  wrote:
> 
> Thanks Cham
> +1
> 
> On 16 Nov 2018, at 05:30, Thomas Weise  wrote:
> 
>> +1
>>
>>
>> On Thu, Nov 15, 2018 at 4:34 PM Ahmet Altay
>>  wrote:
>>
>> +1 Thank you.
>>
>> On Thu, Nov 15, 2018 at 4:22 PM, Kenneth Knowles
>>  wrote:
>>
>> SGTM. Thanks for keeping track of the schedule.
>>
>> Kenn
>>
>> On Thu, Nov 15, 2018 at 1:59 PM Chamikara
>> Jayalath  wrote:
>>
>> Hi All,
>>
>> According to the release calendar [1]
>> branch cut date for Beam 2.9.0 release is
>> 11/21/2018. Since previous release branch
>> was cut close to the respective calendar
>> date I'd like to propose cutting release
>> branch for 2.9.0 on 11/21/2018. Next week
>> is Thanksgiving holiday in US and possibly
>> some folks will be out so we can try to
>> produce RC1 on Monday after (11/26/2018).
>> We can attend to current blocker JIRAs [2]
>> in the meantime. 
>>
>> I'd like to volunteer to perform this release.
>>
>> WDYT ?
>>
>> Thanks,
>> Cham
>>
>> [1] 
>> https://calendar.google.com/calendar/embed?src=0p73sl034k80oob7seouanigd0%40group.calendar.google.com&ctz=America%2FLos_Angeles
>> [2] https://s.apache.org/beam-2.9.0-burndown
>>
>>

-- 
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: MetricResult querying design questions

2018-11-20 Thread Kenneth Knowles
(1)-(3) make sense to me; perhaps (2) can be autogenerated by gRPC and
wrapped into nicer APIs as desired. I think if you transliterate from Java
to proto3 then the sketches in the "Querying Metrics" section of
http://s.apache.org/beam-metrics-api have some of the same ideas - what is
left blank is what MetricsFilters would support.

Kenn

On Tue, Nov 20, 2018 at 5:39 PM Alex Amato  wrote:

> I was wondering if we have some design about MetricResult querying, which
> is a queryable object that exists on the PipelineResult.
>
> IMO, the way this should ideally work, is:
>
> (1) The runner would be responsible for querying the metrics, since a
> Runner will have its own metrics aggregation system, which can be queried.
>
> (2) Then APIs to invoke this need to be implemented in every language. We
> need API bindings for every language, but would want them to delegate to
> the Runner.
>
> (3) Further, we would need a way to unify how all of the runners
> communicate metrics, i.e. we should have some semantics stating that
> Metrics/MonitoringInfos with a certain spec should be returned in the
> MetricResult in a specific way (that is to avoid weird issues, like the
> Dataflow Runner modifying some PTransform names, for example).
>
> Ideally, The MetricResult should give you the same metric no matter which
> runner you are using.
>
> I don't think this is the case today. I think it might be more true that
> each SDK needs to figure out which runner it's using and invoke some code to
> query metrics for that runner.
>
> Is there a document somewhere? Or if anyone has implemented this, for one
> of the runners+sdks, would it be possible to give a brief overview of how
> this works?
>
> Thanks,
> Alex
>


Re: [PROPOSAL] Prepare Beam 2.9.0 release

2018-11-20 Thread Chamikara Jayalath
Hi All,

Looks like there are three blockers in the burndown list but they are
actively being worked on.

If there's no objection I'll create the release branch tomorrow morning. We
can cherry-pick fixes to the blockers before building the first RC
hopefully on Monday.

Thanks,
Cham


On Sat, Nov 17, 2018 at 10:38 AM Kenneth Knowles  wrote:

>
>
> On Fri, Nov 16, 2018, 18:25 Thomas Weise  wrote:
>>
>>
>> On Thu, Nov 15, 2018 at 10:53 PM Charles Chen  wrote:
>>
>>> +1
>>>
>>> Note that we need to temporarily revert
>>> https://github.com/apache/beam/pull/6683 before the release branch cut
>>> per the discussion at
>>> https://lists.apache.org/thread.html/78fe33dc41b04886f5355d66d50359265bfa2985580bb70f79c53545@%3Cdev.beam.apache.org%3E
>>>
>>>
>> Since it is for the release I would prefer this to be done in the release
>> branch, and not in master.
>>
>
> +1 do it (and all remaining burndown) on the release branch. Tracking Jira
> targeting 2.9.0?
>
> Kenn
>
>
>
>
>
>>
>>
>>
>>
>>> On Thu, Nov 15, 2018 at 9:18 PM Tim  wrote:
>>>
 Thanks Cham
 +1

 On 16 Nov 2018, at 05:30, Thomas Weise  wrote:

 +1


 On Thu, Nov 15, 2018 at 4:34 PM Ahmet Altay  wrote:

> +1 Thank you.
>
> On Thu, Nov 15, 2018 at 4:22 PM, Kenneth Knowles 
> wrote:
>
>> SGTM. Thanks for keeping track of the schedule.
>>
>> Kenn
>>
>> On Thu, Nov 15, 2018 at 1:59 PM Chamikara Jayalath <
>> chamik...@google.com> wrote:
>>
>>> Hi All,
>>>
>>> According to the release calendar [1] branch cut date for Beam 2.9.0
>>> release is 11/21/2018. Since previous release branch was cut close to 
>>> the
>>> respective calendar date I'd like to propose cutting release branch for
>>> 2.9.0 on 11/21/2018. Next week is Thanksgiving holiday in US and 
>>> possibly
>>> some folks will be out so we can try to produce RC1 on Monday after
>>> (11/26/2018). We can attend to current blocker JIRAs [2] in the 
>>> meantime.
>>>
>>> I'd like to volunteer to perform this release.
>>>
>>> WDYT ?
>>>
>>> Thanks,
>>> Cham
>>>
>>> [1]
>>> https://calendar.google.com/calendar/embed?src=0p73sl034k80oob7seouanigd0%40group.calendar.google.com&ctz=America%2FLos_Angeles
>>> [2] https://s.apache.org/beam-2.9.0-burndown
>>>
>>>
>


Re: [DISCUSS] Reverting commits on green post-commit status

2018-11-20 Thread Robert Bradshaw
Two hours is too quick for people around the world to respond. If
something is obviously wrong, that may be fine, but otherwise we
should give others time to respond.

I think there's another important distinction that impacts urgency and
impact: by definition, Beam's own Precommit/Postcommit tests run at
HEAD, but upstream projects are not so tightly constrained.

Thanks for offering to update the documentation! I'll be happy to
review once the policy settles here.

On Tue, Nov 20, 2018 at 10:15 PM Ruoyun Huang  wrote:
>
> The instructions in the post-commit policies page [1] are helpful, but not 
> clear enough regarding what 'Rollback First' exactly means in the case of 
> breaking a downstream project.  The discussions in this thread make things 
> better defined.  I summarized things as an updating PR [2]; please suggest 
> if I missed anything, or whether we need that update at all.
>
> Also, one idea is we can have a time limit finding alternatives instead of 
> rollbacks. That makes the steps more explicit and actionable, thus less 
> likely anyone gets bad feelings, which then partially solves the concern that 
> Maximilian brought up. Something like: (4) if we can't figure out the root 
> cause within 2 hours, a rollback automatically becomes the best option.  Though 
> that also depends on how much urgency such a situation requires.
>
> [1] 
> https://beam.apache.org/contribute/postcommits-policies-details/index.html#rollback_first
> [2] PR draft:  https://github.com/apache/beam/pull/7095
>
> On Mon, Nov 19, 2018 at 1:49 PM Maximilian Michels  wrote:
>>
>> The way I read Thomas' original email is that it's generally not a nice
>> sign for a contributor if her work gets reverted. We all come from
>> different backgrounds. For some, reverting is just a tool to get the job
>> done, for others it might come across as offensive.
>>
>> I know of communities where reverting is the absolute last resort. Now,
>> Beam needs to find its own way. I think there are definitive advantages
>> to reverting quickly.
>>
>> In the most obvious case, when our tests are broken and a fix is not
>> viable, reverting unblocks other contributors to test their code. I
>> think this has been working fine in the past.
>>
>> In the less obvious case, an external Runner or system is broken due to
>> an update in the master. IMHO this does not warrant an immediate revert
>> on its own. As already mentioned, there should be some justification for
>> a rollback. This is not to make people's life harder but to figure out
>> whether the problem can be solved upstream or downstream, or with a
>> combination of both.
>>
>> I think Thomas wanted to address this latter case. It seems like we're
>> all more or less on the same page. The core problem is more related to
>> communicating reverts in a way that helps contributors to save face and
>> the community to work efficiently.
>>
>> Thanks,
>> Max
>>
>> On 19.11.18 10:51, Robert Bradshaw wrote:
>> > If something breaks Beam's post (or especially pre) commit tests, I
>> > agree that rollback is typically the best option and can be done
>> > quickly. The situation is totally different if it breaks downstream
>> > projects, in which case Kenn's three points are good criteria for determining
>> > if we should rollback, which should not be assumed to be the default 
>> > option.
>> >
>> > I would say the root cause of the problem is insufficient visibility and
>> > testing. If external-to-beam tests (or production jobs) are broken in
>> > such a way that rollback is desired, I would say the onus (maybe not a
>> > hard requirement, but a high bar for exceptions) is on whoever is asking
>> > for the rollback to create and submit an external test that demonstrates
>> > the issue. It is their choice whether this is easier than rolling
>> > forward or otherwise working around the breakage. This seems like the
>> > only long-term sustainable option and should get us out of this bad
>> > situation.
>> >
>> > (As an aside, the bar for rolling back a runner-specific PR that breaks
>> > that runner may be lower, though still not automatic as other changes
>> > may depend on it.)
>> >
>> > - Robert
>> >
>> > On Sat, Nov 17, 2018 at 7:35 PM Kenneth Knowles  wrote:
>> >
>> > Just adapting my PR commentary to this thread:
>> >
>> > Our rollback first policy cannot apply to a change that passes all
>> > of Beam's postcommit tests. It *does* apply to Beam's postcommit
>> > suites for each and every runner; they are equally important in this
>> > regard.
>> >
>> > The purpose of rapid rollback without discussion is foremost to
>> > restore test signal and not to disrupt the work of other
>> > contributors, that is why it is OK to roll back before figuring out
>> > if the change was actually bad. If that isn't at stake, the policy
>> > doesn't make sense to apply.
>> >
>> > But...
>> >
>> >   - We have at least three examples of runners 

MetricResult querying design questions

2018-11-20 Thread Alex Amato
I was wondering if we have some design about MetricResult querying, which
is a queryable object that exists on the PipelineResult.

IMO, the way this should ideally work, is:

(1) The runner would be responsible for querying the metrics, since a
Runner will have its own metrics aggregation system, which can be queried.

(2) Then APIs to invoke this need to be implemented in every language. We
need API bindings for every language, but would want them to delegate to
the Runner.

(3) Further, we would need a way to unify how all of the runners
communicate metrics, i.e. we should have some semantics stating that
Metrics/MonitoringInfos with a certain spec should be returned in the
MetricResult in a specific way (that is to avoid weird issues, like the
Dataflow Runner modifying some PTransform names, for example).

Ideally, The MetricResult should give you the same metric no matter which
runner you are using.

I don't think this is the case today. I think it might be more true that
each SDK needs to figure out which runner it's using and invoke some code to
query metrics for that runner.

Is there a document somewhere? Or if anyone has implemented this, for one
of the runners+sdks, would it be possible to give a brief overview of how
this works?

Thanks,
Alex
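As a thought experiment for point (3) above — this is not Beam's actual API, and every name in it (`MetricResult`, `query`) is hypothetical — a runner-agnostic MetricResult query could be modeled roughly like this:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass(frozen=True)
class MetricResult:
    # Toy model of a queried metric: namespace/name identify the metric,
    # step is the PTransform it was reported from.
    namespace: str
    name: str
    step: str
    attempted: int
    committed: int

def query(results: List[MetricResult],
          name: Optional[str] = None,
          step: Optional[str] = None) -> List[MetricResult]:
    # The unified semantics Alex asks for: the same (name, step) query
    # should return the same metric regardless of which runner produced
    # `results` (i.e. runners must not rewrite PTransform names).
    return [r for r in results
            if (name is None or r.name == name)
            and (step is None or r.step == step)]

results = [
    MetricResult("io", "records_read", "Read", attempted=100, committed=100),
    MetricResult("io", "records_read", "Write", attempted=40, committed=40),
]
print([r.step for r in query(results, name="records_read", step="Read")])
# → ['Read']
```

The point of the sketch is only that the filter logic lives above the runner boundary; each runner would supply the `results` list from its own aggregation system.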


Re: [DISCUSS] Reverting commits on green post-commit status

2018-11-20 Thread Mikhail Gryzykhin
Following this thread, we are discussing how to address failures in
external projects caused by changes to the Beam project.

I believe that rollback-first in case of red pre/postcommit tests is a
valid option, since such failures block the Beam development process.

If a downstream project has a test failure outside of the Beam project,
doing a rollback in Beam can be really frustrating for Beam developers.

I'd support Kenn and Robert on this:
1. It is best to start a dev-list discussion for visibility and for a
decision on further course of action.
2. The interested party is motivated to expose or implement the relevant
test(s) inside Beam's testing suites to prevent a similar issue from
happening again.
This will also simplify the rollback question, since the policy will be
applied due to tests going red. Ideally, adding the test will go together
with starting the discussion on the dev-list and the request for rollback.

--Mikhail

Have feedback?


On Tue, Nov 20, 2018 at 1:15 PM Ruoyun Huang  wrote:

> The instructions in the post-commit policies page [1] are helpful, but not
> clear enough regarding what 'Rollback First' exactly means in the case of
> breaking a downstream project.  The discussions in this thread make things
> better defined.  I summarized things as an updating PR
>  [2]; please suggest if I
> missed anything, or whether we need that update at all.
>
> Also, one idea is we can have a time limit finding alternatives instead of
> rollbacks. That makes the steps more explicit and actionable, thus less
> likely anyone gets bad feelings, which then partially solves the concern
> that Maximilian brought up. Something like: (4) if we can't figure out the
> root cause within 2 hours, a rollback automatically becomes the best
> option.  Though that also depends on how much urgency such a situation
> requires.
>
> [1]
> https://beam.apache.org/contribute/postcommits-policies-details/index.html#rollback_first
> [2] PR draft:  https://github.com/apache/beam/pull/7095
>
> On Mon, Nov 19, 2018 at 1:49 PM Maximilian Michels  wrote:
>
>> The way I read Thomas' original email is that it's generally not a nice
>> sign for a contributor if her work gets reverted. We all come from
>> different backgrounds. For some, reverting is just a tool to get the job
>> done, for others it might come across as offensive.
>>
>> I know of communities where reverting is the absolute last resort. Now,
>> Beam needs to find its own way. I think there are definitive advantages
>> to reverting quickly.
>>
>> In the most obvious case, when our tests are broken and a fix is not
>> viable, reverting unblocks other contributors to test their code. I
>> think this has been working fine in the past.
>>
>> In the less obvious case, an external Runner or system is broken due to
>> an update in the master. IMHO this does not warrant an immediate revert
>> on its own. As already mentioned, there should be some justification for
>> a rollback. This is not to make people's life harder but to figure out
>> whether the problem can be solved upstream or downstream, or with a
>> combination of both.
>>
>> I think Thomas wanted to address this latter case. It seems like we're
>> all more or less on the same page. The core problem is more related to
>> communicating reverts in a way that helps contributors to save face and
>> the community to work efficiently.
>>
>> Thanks,
>> Max
>>
>> On 19.11.18 10:51, Robert Bradshaw wrote:
>> > If something breaks Beam's post (or especially pre) commit tests, I
>> > agree that rollback is typically the best option and can be done
>> > quickly. The situation is totally different if it breaks downstream
>> > projects, in which case Kenn's three points are good criteria for determining
>> > if we should rollback, which should not be assumed to be the default
>> option.
>> >
>> > I would say the root cause of the problem is insufficient visibility
>> and
>> > testing. If external-to-beam tests (or production jobs) are broken in
>> > such a way that rollback is desired, I would say the onus (maybe not a
>> > hard requirement, but a high bar for exceptions) is on whoever is
>> asking
>> > for the rollback to create and submit an external test that
>> demonstrates
>> > the issue. It is their choice whether this is easier than rolling
>> > forward or otherwise working around the breakage. This seems like the
>> > only long-term sustainable option and should get us out of this bad
>> > situation.
>> >
>> > (As an aside, the bar for rolling back a runner-specific PR that breaks
>> > that runner may be lower, though still not automatic as other changes
>> > may depend on it.)
>> >
>> > - Robert
>> >
>> > On Sat, Nov 17, 2018 at 7:35 PM Kenneth Knowles  wrote:
>> >
>> > Just adapting my PR commentary to this thread:
>> >
>> > Our rollback first policy cannot apply to a change that passes all
>> > of Beam's postcommit tests. It *does* apply to Beam's 

Re: [VOTE] Release Vendored gRPC 1.13.1 and Guava 20.0, release candidate #1

2018-11-20 Thread Kenneth Knowles
+1 then.

Thanks for the detailed explanation and links. It will be great to start
using these and gaining experience with the vendored artifacts.

Kenn

On Tue, Nov 20, 2018 at 11:27 AM Lukasz Cwik  wrote:

> I also looked for documentation as to how this information is used but
> couldn't find anything beyond configuring the Maven archive plugin[1].
>
> These seem to be benign since we have been publishing them with
> beam-sdks-java-core since at least the 2.0.0 release[2].
>
> I believe these files appear because by default the Maven shade plugin and
> Gradle shadow plugin merge the contents of META-INF/ across all jar files.
>
> 1: https://maven.apache.org/guides/mini/guide-archive-configuration.html
> 2:
> http://central.maven.org/maven2/org/apache/beam/beam-sdks-java-core/2.0.0/beam-sdks-java-core-2.0.0.jar
>
> On Fri, Nov 16, 2018 at 10:35 AM Kenneth Knowles  wrote:
>
>> I notice in the vendored Guava jar there is:
>>
>> META-INF/maven/com.google.guava/guava/pom.xml
>> META-INF/maven/com.google.guava/guava/pom.properties
>>
>> Are these expected? If not, are they benign? I haven't found any
>> documentation for what these contents actually mean or do.
>>
>> There are many more in the gRPC META-INF/maven folder, but I don't have
>> familiarity with what is expected for that one.
>>
>> Kenn
>>
>> On Fri, Nov 16, 2018 at 10:15 AM Thomas Weise  wrote:
>>
>>> It would be nice to have a build task that allows us to create the source
>>> artifacts locally, if we cannot publish them.
>>>
>>> +1 for the release
>>>
>>>
>>> On Fri, Nov 16, 2018 at 7:48 AM Lukasz Cwik  wrote:
>>>
 I have been relying on IntelliJ's ability to decompile the class
 files; it's not as good as the original source, for sure.

 On Fri, Nov 16, 2018 at 3:26 AM Maximilian Michels 
 wrote:

> +1
>
> We decided not to publish source files for now. The main reason is
> possible legal issues with publishing relocated source code.
>
> On 16.11.18 05:24, Thomas Weise wrote:
> > Thanks for driving this. Did we reach a conclusion regarding
> publishing
> > relocated source artifacts? Debugging would be painful without
> (unless
> > manually installed in the local repo).
> >
> >
> > On Thu, Nov 15, 2018 at 6:05 PM Lukasz Cwik  wrote:
> >
> > Please review and vote on the release candidate #1 for the
> vendored
> > artifacts gRPC 1.13.1 and Guava 20.0:
> > [ ] +1, Approve the release
> > [ ] -1, Do not approve the release (please provide specific
> comments)
> >
> > The creation of these artifacts is the outcome of the discussion
> > about vendoring[1].
> >
> > The complete staging area is available for your review, which
> includes:
> > * all artifacts to be deployed to the Maven Central Repository
> [2],
> > * commit hash "3678d403fcfea6a3994d7b86cfe6db70039087b0" [3],
> > * Java artifacts were built with Gradle 4.10.2 and OpenJDK
> 1.8.0_161
> > * artifacts which are signed with the key with fingerprint
> > EAD5DE293F4A03DD2E77565589E68A56E371CCA2 [4]
> >
> > The vote will be open for at least 72 hours. It is adopted by
> > majority approval, with at least 3 PMC affirmative votes.
> >
> > Thanks,
> > Luke
> >
> > [1]
> >
> https://lists.apache.org/thread.html/4c12db35b40a6d56e170cd6fc8bb0ac4c43a99aa3cb7dbae54176815@%3Cdev.beam.apache.org%3E
> > [2]
> >
> https://repository.apache.org/content/repositories/orgapachebeam-1052
> > [3]
> >
> https://github.com/apache/beam/tree/3678d403fcfea6a3994d7b86cfe6db70039087b0
> > [4] https://dist.apache.org/repos/dist/release/beam/KEYS
> >
>



Re: [DISCUSS] SplittableDoFn Java SDK User Facing API

2018-11-20 Thread Robert Bradshaw
On Tue, Nov 20, 2018 at 7:10 PM Lukasz Cwik  wrote:

> I'll perform the swap for a fraction because as I try to map more of the
> spaces to an arbitrary byte[] I naturally first map the space onto natural
> numbers before mapping to a byte[].
>
> Any preference between these options:
> A:
> // Represents a non-negative decimal number: unscaled_value * 10^(-scale)
> message Decimal {
>   // Represents the unscaled value as a big endian unlimited precision
> non-negative integer.
>   bytes unscaled_value = 1;
>   // Represents the scale
>   uint32 scale = 2;
> }
>
> B:
> // Textual representation of a decimal (i.e. "123.00")
> string decimal = 1;
>
> C:
> // Represents a non-negative decimal number: "integer"."fraction"
> message Decimal {
>   // Represents the integral part of the decimal as a big
> endian unlimited precision non-negative integer.
>   bytes integer = 1;
>   // Represents the fractional part of the decimal represented as a big
> endian unlimited precision non-negative integer.
>   bytes fraction = 2;
> }
>
> A is the most common and seems to be supported by Java (BigDecimal),
> Python (decimal module) and Go (via shopspring/decimal). B is a close
> second since many languages can convert it.
>

Any reason to not just use double? (Do we need arbitrary/fixed precision
for anything?)
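Whatever the answer on doubles, option A maps directly onto the language-native decimal types Lukasz lists. A hedged sketch of the decoding, assuming Python's `decimal` module as the target representation (`decode_decimal` is a name invented for this example):

```python
from decimal import Decimal

def decode_decimal(unscaled_value: bytes, scale: int) -> Decimal:
    # Option A: value = unscaled_value * 10^(-scale), where unscaled_value
    # is a big-endian unlimited-precision non-negative integer.
    unscaled = int.from_bytes(unscaled_value, byteorder="big")
    # scaleb adjusts the exponent exactly, so trailing zeros are preserved.
    return Decimal(unscaled).scaleb(-scale)

# unscaled_value = 12300, scale = 2, i.e. option B's textual "123.00":
print(decode_decimal((12300).to_bytes(2, "big"), 2))  # → 123.00
```

Option B would need only `Decimal("123.00")`, but option A avoids string parsing and length limits on the wire.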


> On Tue, Nov 20, 2018 at 3:09 AM Robert Bradshaw 
> wrote:
>
>> I'm still trying to wrap my head around what is meant by backlog here, as
>> it's different than what I've seen in previous discussions.
>>
>> Generally, the backlog represented a measure of the known but undone part
>> of a restriction. This is useful for a runner to understand in some manner
>> what progress is being made and where remaining work lies, and this is
>> difficult to do if expressed as an opaque byte array, and more so if
>> backlog is local to a restriction rather than an arbitrary quantity that
>> can be compared (and aggregated) across restrictions. Even better if a
>> similar measure can be applied to arbitrary (e.g. completed) restrictions
>> for estimation of a mapping to the time domain. Does using a byte[] here
>> have advantage over using a(n often integral) floating point number?
>>
>
> I like the idea of using an arbitrary precision floating point number
> (like SQL decimal, Java BigDecimal, python decimal) since it solves several
> questions such as to how to aggregate values and most languages have a
> native representation for a decimal type. The issue is about providing a
> mapping for key-range-based sources such as Bigtable/HBase. Imagine you're at
> key 000 and you advance to key  for the restriction [0, 1), what
> fraction of work have you advanced?
>
> The only solution I can provide for the backlog is if I choose a maximum
> precision and clamp the length of the byte[] and then provide each possible
> byte string a number. For example I clamp the length to 3 and give each
> byte string a position:
> index: byte string
> 1: 0
> 2: 00
> 3: 000
> 4: 001
> 5: 01
> 6: 010
> 7: 011
> 8: 1
> 9: 10
> 10: 100
> 11: 101
> 12: 11
> 13: 110
> 14: 111
>
> Since each key is given a value including the "largest key" I can compute
> the distance between two keys.
>

Some other options I was able to come up with are:

(1) Say you represent your key in a keyspace of N characters. Map it to a
keyspace of N+1 characters by copying the string and then terminating each
with the "new" character. Map this to the reals, and every single key is
separated.

(2) Even easier, simply append a fixed non-zero character to the end of
every key before mapping to a fraction. All keys are now separable.

The larger the alphabet, the less skew this introduces.
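A toy sketch of that terminator idea for byte keys — illustrative only, not Beam code, and the base-258 digit scheme is an assumption. Each byte b becomes digit b+1 and an implicit terminator digit 0 ends the key, so prefixes like b"0" and b"00" map to distinct, order-preserving fractions in [0, 1):

```python
def key_to_fraction(key: bytes, base: int = 258) -> float:
    # Extend the 256-symbol byte alphabet with a terminator so prefixes
    # separate: byte b becomes digit b+1, the key ends with terminator
    # digit 0, and the digit string is read as the base-258 fraction
    # 0.d1 d2 d3 ...
    frac, weight = 0.0, 1.0 / base
    for b in key:
        frac += (b + 1) * weight
        weight /= base
    return frac  # the trailing terminator digit 0 contributes nothing

# The byte strings from the clamped-length-3 table earlier in the thread:
keys = [b"0", b"00", b"000", b"001", b"01", b"1",
        b"10", b"100", b"101", b"11", b"110", b"111"]
fracs = [key_to_fraction(k) for k in keys]
# Distinct keys get distinct fractions, and byte order is preserved.
assert fracs == sorted(fracs) and len(set(fracs)) == len(keys)
```

With such a mapping, the distance between two keys is just a difference of fractions, which answers the "what fraction of work have you advanced" question without clamping to a fixed precision.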


> I have thought about increasing the precision as I find significantly
> larger keys but don't know how this will impact scaling decisions in
> runners.
>
>
>> I'm also a bit unclear on why it's desirable to pass this backlog back to
>> the SDF when trying to split restrictions. Here it seems much more natural
>> (for both the runner and SDK) to simply pass a floating point value in [0,
>> 1) for the proportion of work that should be split, rather than manipulate
>> the given backlog in to try to approximate this. (There's some ambiguity
>> here of whether multiple splits should be returned if less than half should
>> be retained.)
>>
>
> Returning the backlog using the same space as the SDF will prevent skew in
> what is returned since the SDF may make progress in the meantime. For
> example you have 100mb to process and you ask for 40% of the work and the
> SDK has processed 10mb in the meantime which means you'll get 40% of 90mb =
> 36mb back instead of 40mb.
>

I actually think this is an advantage. If I ask for 50% of your backlog, I
get 50% of the remaining work, regardless of what has been processed so
far, not some uneven distribution (or, worse, end up in a degenerate state
where the worker has processed too much to satisfy the request at all).
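The arithmetic of the fraction-of-remaining interpretation, as a tiny illustrative sketch (the function name and mb-based units are invented for the example):

```python
def split_remaining(total_mb, processed_mb, fraction):
    # Splitting relative to the *remaining* backlog: the runner asks for a
    # fraction, and the SDK applies it to whatever work is actually left at
    # the moment the request arrives, so the request can always be satisfied.
    remaining = total_mb - processed_mb
    return fraction * remaining

# The thread's example: 100mb total, 10mb processed before the split lands;
# asking for 40% of the remaining work yields 40% of 90mb.
print(split_remaining(100, 10, 0.40))  # → 36.0
```

Under the absolute interpretation the runner would expect 40mb back; under the relative one it gets 36mb, but never a degenerate answer even if the worker has raced ahead.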

I also believe that 

Re: [VOTE] Release Vendored gRPC 1.13.1 and Guava 20.0, release candidate #1

2018-11-20 Thread Lukasz Cwik
I also looked for documentation as to how this information is used but
couldn't find anything beyond configuring the Maven archive plugin[1].

These seem to be benign since we have been publishing them with
beam-sdks-java-core since at least the 2.0.0 release[2].

I believe these files appear because by default the Maven shade plugin and
Gradle shadow plugin merge the contents of META-INF/ across all jar files.

1: https://maven.apache.org/guides/mini/guide-archive-configuration.html
2:
http://central.maven.org/maven2/org/apache/beam/beam-sdks-java-core/2.0.0/beam-sdks-java-core-2.0.0.jar
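A self-contained toy model of that META-INF merging behavior, using Python's zipfile to stand in for the shade/shadow plugins (all entry names here are made up for the illustration; this is a sketch of the mechanism, not of either plugin's real code):

```python
import io
import zipfile

def shade(jars):
    # A shaded/shadow jar is assembled by copying entries from every input
    # jar, so each dependency's META-INF/maven/** entries survive into the
    # merged artifact unless explicitly excluded.
    out = io.BytesIO()
    with zipfile.ZipFile(out, "w") as merged:
        seen = set()
        for jar in jars:
            with zipfile.ZipFile(jar) as z:
                for name in z.namelist():
                    if name not in seen:  # first-wins on duplicate entries
                        merged.writestr(name, z.read(name))
                        seen.add(name)
    return out

def fake_jar(entries):
    # Build an in-memory stand-in for a dependency jar.
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as z:
        for name, data in entries.items():
            z.writestr(name, data)
    return buf

core = fake_jar({"org/apache/beam/Foo.class": b"...",
                 "META-INF/maven/org.apache.beam/core/pom.xml": b"<project/>"})
guava = fake_jar({"com/google/common/Bar.class": b"...",
                  "META-INF/maven/com.google.guava/guava/pom.xml": b"<project/>"})

with zipfile.ZipFile(shade([core, guava])) as z:
    print([n for n in z.namelist() if n.startswith("META-INF/maven/")])
# → ['META-INF/maven/org.apache.beam/core/pom.xml',
#    'META-INF/maven/com.google.guava/guava/pom.xml']
```

This is why a vendored Guava jar carries `META-INF/maven/com.google.guava/guava/pom.xml` even though the classes were relocated: the descriptor files are copied as opaque entries.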

On Fri, Nov 16, 2018 at 10:35 AM Kenneth Knowles  wrote:

> I notice in the vendored Guava jar there is:
>
> META-INF/maven/com.google.guava/guava/pom.xml
> META-INF/maven/com.google.guava/guava/pom.properties
>
> Are these expected? If not, are they benign? I haven't found any
> documentation for what these contents actually mean or do.
>
> There are many more in the gRPC META-INF/maven folder, but I don't have
> familiarity with what is expected for that one.
>
> Kenn
>
> On Fri, Nov 16, 2018 at 10:15 AM Thomas Weise  wrote:
>
>> It would be nice to have a build task that allows us to create the source
>> artifacts locally, if we cannot publish them.
>>
>> +1 for the release
>>
>>
>> On Fri, Nov 16, 2018 at 7:48 AM Lukasz Cwik  wrote:
>>
>>> I have been relying on IntelliJ's ability to decompile the class
>>> files; it's not as good as the original source, for sure.
>>>
>>> On Fri, Nov 16, 2018 at 3:26 AM Maximilian Michels 
>>> wrote:
>>>
 +1

 We decided not to publish source files for now. The main reason is
 possible legal issues with publishing relocated source code.

 On 16.11.18 05:24, Thomas Weise wrote:
 > Thanks for driving this. Did we reach a conclusion regarding
 publishing
 > relocated source artifacts? Debugging would be painful without
 (unless
 > manually installed in the local repo).
 >
 >
 > On Thu, Nov 15, 2018 at 6:05 PM Lukasz Cwik  wrote:
 >
 > Please review and vote on the release candidate #1 for the
 vendored
 > artifacts gRPC 1.13.1 and Guava 20.0:
 > [ ] +1, Approve the release
 > [ ] -1, Do not approve the release (please provide specific
 comments)
 >
 > The creation of these artifacts are the outcome of the discussion
 > about vendoring[1].
 >
 > The complete staging area is available for your review, which
 includes:
 > * all artifacts to be deployed to the Maven Central Repository
 [2],
 > * commit hash "3678d403fcfea6a3994d7b86cfe6db70039087b0" [3],
 > * Java artifacts were built with Gradle 4.10.2 and OpenJDK
 1.8.0_161
 > * artifacts which are signed with the key with fingerprint
 > EAD5DE293F4A03DD2E77565589E68A56E371CCA2 [4]
 >
 > The vote will be open for at least 72 hours. It is adopted by
 > majority approval, with at least 3 PMC affirmative votes.
 >
 > Thanks,
 > Luke
 >
 > [1]
 >
 https://lists.apache.org/thread.html/4c12db35b40a6d56e170cd6fc8bb0ac4c43a99aa3cb7dbae54176815@%3Cdev.beam.apache.org%3E
 > [2]
 >
 https://repository.apache.org/content/repositories/orgapachebeam-1052
 > [3]
 >
 https://github.com/apache/beam/tree/3678d403fcfea6a3994d7b86cfe6db70039087b0
 > [4] https://dist.apache.org/repos/dist/release/beam/KEYS
 >

>>>


Re: [DISCUSS] SplittableDoFn Java SDK User Facing API

2018-11-20 Thread Lukasz Cwik
Ismael, I looked at the API around ByteKeyRangeTracker and
OffsetRangeTracker and figured out that the classes are named as such because
they are trackers for the OffsetRange and ByteKeyRange classes. Some
options are to:
1) Copy the ByteKeyRange and call it ByteKeyRestriction and similarly copy
OffsetRange and call it OffsetRestriction. This would allow us to name the
trackers ByteKeyRestrictionTracker and OffsetRestrictionTracker. Note that
we can't rename because that would be a backwards incompatible change for
existing users of ByteKeyRange/OffsetRange. This would allow us to add
methods relevant to SDF and remove methods that aren't needed.
2) Rename ByteKeyRangeTracker to ByteKeyRangeRestrictionTracker and
OffsetRangeTracker to OffsetRangeRestrictionTracker. Not really liking this
option.
3) Leave things as they are.

What do you think?

Robert, I opened PR 7094 with the bundle finalization API changes.


On Tue, Nov 20, 2018 at 10:09 AM Lukasz Cwik  wrote:

> I'll perform the swap for a fraction because as I try to map more of the
> spaces to an arbitrary byte[] I naturally first map the space onto natural
> numbers before mapping to a byte[].
>
> Any preference between these options:
> A:
> // Represents a non-negative decimal number: unscaled_value * 10^(-scale)
> message Decimal {
>   // Represents the unscaled value as a big endian unlimited precision
> non-negative integer.
>   bytes unscaled_value = 1;
>   // Represents the scale
>   uint32 scale = 2;
> }
>
> B:
> // Textual representation of a decimal (e.g. "123.00")
> string decimal = 1;
>
> C:
> // Represents a non-negative decimal number: "integer"."fraction"
> message Decimal {
>   // Represents the integral part of the decimal as a big endian
> unlimited precision non-negative integer.
>   bytes integer = 1;
>   // Represents the fractional part of the decimal represented as a big
> endian unlimited precision non-negative integer.
>   bytes fraction = 2;
> }
>
> A is the most common and seems to be supported by Java (BigDecimal),
> Python (decimal module) and Go (via shopspring/decimal). B is a close
> second since many languages can convert it.
>
> On Tue, Nov 20, 2018 at 3:09 AM Robert Bradshaw 
> wrote:
>
>> I'm still trying to wrap my head around what is meant by backlog here, as
>> it's different than what I've seen in previous discussions.
>>
>> Generally, the backlog represented a measure of the known but undone part
>> of a restriction. This is useful for a runner to understand in some manner
>> what progress is being made and where remaining work lies, and this is
>> difficult to do if expressed as an opaque byte array, and more so if
>> backlog is local to a restriction rather than an arbitrary quantity that
>> can be compared (and aggregated) across restrictions. Even better if a
>> similar measure can be applied to arbitrary (e.g. completed) restrictions
>> for estimation of a mapping to the time domain. Does using a byte[] here
>> have an advantage over using a(n often integral) floating point number?
>>
>
> I like the idea of using an arbitrary precision floating point number
> (like SQL decimal, Java BigDecimal, python decimal) since it solves several
> questions such as how to aggregate values, and most languages have a
> native representation for a decimal type. The issue is about providing a
> mapping for key range based sources such as Bigtable/HBase. Imagine you're at
> key 000 and you advance to key  for the restriction [0, 1), what
> fraction of work have you advanced?
>
> The only solution I can provide for the backlog is if I choose a maximum
> precision and clamp the length of the byte[] and then provide each possible
> byte string a number. For example I clamp the length to 3 and give each
> byte string a position:
> index: byte string
> 1: 0
> 2: 00
> 3: 000
> 4: 001
> 5: 01
> 6: 010
> 7: 011
> 8: 1
> 9: 10
> 10: 100
> 11: 101
> 12: 11
> 13: 110
> 14: 111
>
> Since each key is given a value including the "largest key" I can compute
> the distance between two keys.
>
> I have thought about increasing the precision as I find significantly
> larger keys but don't know how this will impact scaling decisions in
> runners.
>
>
>> I'm also a bit unclear on why it's desirable to pass this backlog back to
>> the SDF when trying to split restrictions. Here it seems much more natural
>> (for both the runner and SDK) to simply pass a floating point value in [0,
>> 1) for the proportion of work that should be split, rather than manipulate
>> the given backlog to try to approximate this. (There's some ambiguity
>> here of whether multiple splits should be returned if less than half should
>> be retained.)
>>
>
> Returning the backlog using the same space as the SDF will prevent skew in
> what is returned since the SDF may make progress in the meantime. For
> example you have 100mb to process and you ask for 40% of the work and the
> SDK has processed 10mb in the meantime which means you'll get 40% of 90mb =
> 36mb back instead of 40mb. I also believe that the backlog should subdivide
> the space so that a request for 20mb from a backlog of 100mb should
> subdivide the space into 5 segments.

Re: [Testing] Splitting pre-commits from post-commit test targets

2018-11-20 Thread Rui Wang
Useful and really practical idea!

-Rui

On Tue, Nov 20, 2018 at 10:37 AM Ruoyun Huang  wrote:

> +1 Great improvement!  Thanks Scott!
>
> On Tue, Nov 20, 2018 at 10:33 AM Pablo Estrada  wrote:
>
>> I think this is a great idea, and a good improvement. Thanks Scott!
>> -P.
>>
>> On Tue, Nov 20, 2018 at 10:09 AM Scott Wegner  wrote:
>>
>>> I wanted to give a heads-up to a small optimization that I hope to make
>>> to our Jenkins test targets. Currently our post-commit test jobs also
>>> redundantly run pre-commit tests. I'd like to remove redundant execution to
>>> get a faster post-commit test signal. See:
>>> https://github.com/apache/beam/pull/7073
>>>
>>> In Jenkins we run pre-commits separately from post-commits, and in all
>>> cases when a language post-commit suite runs, the pre-commit is also run in
>>> a separate job (in PR, on merge, cron schedule). So, it makes sense to
>>> separate the targets. This will free up resources and give a faster signal
>>> on post-commit suites as they are doing less work.
>>>
>>> From a quick test, this shaves 27 mins off of Python post-commits, 10
>>> mins from Java, and ~1 minute from Go.
>>>
>>> The only negative impact I could imagine is if during local development
>>> you were running `./gradlew :langPostCommit` as a shortcut to run all
>>> tests. Now, in order to also run tests from pre-commit, you'd need to
>>> specify it separately: `./gradlew :langPreCommit :langPostCommit`
>>>
>>> Got feedback? tinyurl.com/swegner-feedback
>>>
>>
>
> --
> 
> Ruoyun  Huang
>
>


Re: [Testing] Splitting pre-commits from post-commit test targets

2018-11-20 Thread Ruoyun Huang
+1 Great improvement!  Thanks Scott!

On Tue, Nov 20, 2018 at 10:33 AM Pablo Estrada  wrote:

> I think this is a great idea, and a good improvement. Thanks Scott!
> -P.
>
> On Tue, Nov 20, 2018 at 10:09 AM Scott Wegner  wrote:
>
>> I wanted to give a heads-up to a small optimization that I hope to make
>> to our Jenkins test targets. Currently our post-commit test jobs also
>> redundantly run pre-commit tests. I'd like to remove redundant execution to
>> get a faster post-commit test signal. See:
>> https://github.com/apache/beam/pull/7073
>>
>> In Jenkins we run pre-commits separately from post-commits, and in all
>> cases when a language post-commit suite runs, the pre-commit is also run in
>> a separate job (in PR, on merge, cron schedule). So, it makes sense to
>> separate the targets. This will free up resources and give a faster signal
>> on post-commit suites as they are doing less work.
>>
>> From a quick test, this shaves 27 mins off of Python post-commits, 10
>> mins from Java, and ~1 minute from Go.
>>
>> The only negative impact I could imagine is if during local development
>> you were running `./gradlew :langPostCommit` as a shortcut to run all
>> tests. Now, in order to also run tests from pre-commit, you'd need to
>> specify it separately: `./gradlew :langPreCommit :langPostCommit`
>>
>> Got feedback? tinyurl.com/swegner-feedback
>>
>

-- 

Ruoyun  Huang


Re: [Testing] Splitting pre-commits from post-commit test targets

2018-11-20 Thread Pablo Estrada
I think this is a great idea, and a good improvement. Thanks Scott!
-P.

On Tue, Nov 20, 2018 at 10:09 AM Scott Wegner  wrote:

> I wanted to give a heads-up to a small optimization that I hope to make to
> our Jenkins test targets. Currently our post-commit test jobs also
> redundantly run pre-commit tests. I'd like to remove redundant execution to
> get a faster post-commit test signal. See:
> https://github.com/apache/beam/pull/7073
>
> In Jenkins we run pre-commits separately from post-commits, and in all
> cases when a language post-commit suite runs, the pre-commit is also run in
> a separate job (in PR, on merge, cron schedule). So, it makes sense to
> separate the targets. This will free up resources and give a faster signal
> on post-commit suites as they are doing less work.
>
> From a quick test, this shaves 27 mins off of Python post-commits, 10 mins
> from Java, and ~1 minute from Go.
>
> The only negative impact I could imagine is if during local development
> you were running `./gradlew :langPostCommit` as a shortcut to run all
> tests. Now, in order to also run tests from pre-commit, you'd need to
> specify it separately: `./gradlew :langPreCommit :langPostCommit`
>
> Got feedback? tinyurl.com/swegner-feedback
>


Re: [DISCUSS] SplittableDoFn Java SDK User Facing API

2018-11-20 Thread Lukasz Cwik
I'll perform the swap for a fraction because as I try to map more of the
spaces to an arbitrary byte[] I naturally first map the space onto natural
numbers before mapping to a byte[].

Any preference between these options:
A:
// Represents a non-negative decimal number: unscaled_value * 10^(-scale)
message Decimal {
  // Represents the unscaled value as a big endian unlimited precision
non-negative integer.
  bytes unscaled_value = 1;
  // Represents the scale
  uint32 scale = 2;
}

B:
// Textual representation of a decimal (e.g. "123.00")
string decimal = 1;

C:
// Represents a non-negative decimal number: "integer"."fraction"
message Decimal {
// Represents the integral part of the decimal as a big endian unlimited
precision non-negative integer.
  bytes integer = 1;
  // Represents the fractional part of the decimal represented as a big
endian unlimited precision non-negative integer.
  bytes fraction = 2;
}

A is the most common and seems to be supported by Java (BigDecimal), Python
(decimal module) and Go (via shopspring/decimal). B is a close second since
many languages can convert it.
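As a concrete sketch of option A: the (unscaled_value, scale) pair maps directly onto Java's BigDecimal. The class and method names below are illustrative only, not a proposed API; only the field semantics follow the proto message above.

```java
import java.math.BigDecimal;
import java.math.BigInteger;

// Round-trips a non-negative decimal through option A's representation:
// value == unscaled_value * 10^(-scale), unscaled_value in big endian.
public class DecimalEncoding {
  // What would populate the proto's `unscaled_value` bytes field.
  static byte[] toUnscaledBytes(BigDecimal d) {
    // toByteArray() is big-endian; for non-negative values the extra
    // sign byte (if any) is harmless for the round trip.
    return d.unscaledValue().toByteArray();
  }

  // Rebuilds the decimal from (unscaled_value, scale).
  static BigDecimal fromParts(byte[] unscaledValue, int scale) {
    return new BigDecimal(new BigInteger(unscaledValue), scale);
  }

  public static void main(String[] args) {
    BigDecimal original = new BigDecimal("123.00"); // unscaled 12300, scale 2
    BigDecimal roundTripped = fromParts(toUnscaledBytes(original), original.scale());
    System.out.println(roundTripped); // 123.00
  }
}
```

Python's decimal module and Go's shopspring/decimal expose the same (coefficient, exponent) decomposition, which is what makes option A portable.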

On Tue, Nov 20, 2018 at 3:09 AM Robert Bradshaw  wrote:

> I'm still trying to wrap my head around what is meant by backlog here, as
> it's different than what I've seen in previous discussions.
>
> Generally, the backlog represented a measure of the known but undone part
> of a restriction. This is useful for a runner to understand in some manner
> what progress is being made and where remaining work lies, and this is
> difficult to do if expressed as an opaque byte array, and more so if
> backlog is local to a restriction rather than an arbitrary quantity that
> can be compared (and aggregated) across restrictions. Even better if a
> similar measure can be applied to arbitrary (e.g. completed) restrictions
> for estimation of a mapping to the time domain. Does using a byte[] here
> have an advantage over using a(n often integral) floating point number?
>

I like the idea of using an arbitrary precision floating point number (like
SQL decimal, Java BigDecimal, python decimal) since it solves several
questions such as how to aggregate values, and most languages have a
native representation for a decimal type. The issue is about providing a
mapping for key range based sources such as Bigtable/HBase. Imagine you're at
key 000 and you advance to key  for the restriction [0, 1), what
fraction of work have you advanced?

The only solution I can provide for the backlog is if I choose a maximum
precision and clamp the length of the byte[] and then provide each possible
byte string a number. For example I clamp the length to 3 and give each
byte string a position:
index: byte string
1: 0
2: 00
3: 000
4: 001
5: 01
6: 010
7: 011
8: 1
9: 10
10: 100
11: 101
12: 11
13: 110
14: 111

Since each key is given a value including the "largest key" I can compute
the distance between two keys.
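The clamped-length enumeration above can be sketched as follows. A two-symbol alphabet with max length 3 reproduces the 14-entry table; a real key space would use all 256 byte values, so this is illustrative only and no names here are Beam API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Enumerates every non-empty string over `alphabet` up to length `maxLen`
// in lexicographic order (a prefix sorts before its extensions), so each
// key gets a comparable 1-based position and distance(a, b) is well defined.
public class ClampedKeySpace {
  static List<String> enumerate(String alphabet, int maxLen) {
    List<String> out = new ArrayList<>();
    collect(alphabet, maxLen, "", out);
    Collections.sort(out);
    return out;
  }

  private static void collect(String alphabet, int maxLen, String prefix, List<String> out) {
    if (!prefix.isEmpty()) out.add(prefix);
    if (prefix.length() == maxLen) return;
    for (char c : alphabet.toCharArray()) collect(alphabet, maxLen, prefix + c, out);
  }

  // 1-based index, matching the table in the message above.
  static int position(List<String> keys, String key) {
    return keys.indexOf(key) + 1;
  }

  public static void main(String[] args) {
    List<String> keys = enumerate("01", 3);
    System.out.println(keys.size());           // 14
    System.out.println(position(keys, "1"));   // 8
    System.out.println(position(keys, "111")); // 14
  }
}
```

The distance between two keys is then just the difference of their positions.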

I have thought about increasing the precision as I find significantly
larger keys but don't know how this will impact scaling decisions in
runners.


> I'm also a bit unclear on why it's desirable to pass this backlog back to
> the SDF when trying to split restrictions. Here it seems much more natural
> (for both the runner and SDK) to simply pass a floating point value in [0,
> 1) for the proportion of work that should be split, rather than manipulate
> the given backlog to try to approximate this. (There's some ambiguity
> here of whether multiple splits should be returned if less than half should
> be retained.)
>

Returning the backlog using the same space as the SDF will prevent skew in
what is returned since the SDF may make progress in the meantime. For
example you have 100mb to process and you ask for 40% of the work and the
SDK has processed 10mb in the meantime which means you'll get 40% of 90mb =
36mb back instead of 40mb. I also believe that the backlog should subdivide
the space so that a request for 20mb from a backlog of 100mb should
subdivide the space into 5 segments.
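The arithmetic in this paragraph, as a sketch (class and method names are illustrative, not Beam API):

```java
// A split request is applied to the backlog as currently measured by the
// SDK, so work already processed is excluded from what is returned.
public class SplitArithmetic {
  // Backlog handed back after asking to keep `fraction` of the remaining work.
  static long keptBacklog(long remainingMb, double fraction) {
    return Math.round(remainingMb * fraction);
  }

  // Number of equal segments when subdividing by a requested backlog size.
  static long segments(long backlogMb, long requestMb) {
    return backlogMb / requestMb;
  }

  public static void main(String[] args) {
    // 100mb total, 10mb already processed, 40% requested: 40% of 90mb.
    System.out.println(keptBacklog(90, 0.40)); // 36, not 40
    System.out.println(segments(100, 20));     // 5
  }
}
```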


> Having a polymorphically-interpreted bytes[] backlog seems to add a lot of
> complexity that would need to be justified.
>

Each source needs to polymorphically interpret every generic representation
such as an integral fraction onto its space. There is a natural mapping
for remaining bytes in a file and also for number of messages on a queue
but not as clean for key range based sources as shown above.


> It seems there's consensus for the finalization protocol; perhaps that can
> be checked in as a separate PR? (Also, the idea of having to have a final
> key is not ideal, e.g. it means that for lexicographic range sources the
> empty key has different meanings depending on whether it's a start or end
> key, but I think we can provide a mark-as-done method in a future
> compatible way if this becomes too burdensome.)
>

I'll pull out the finalization to another PR.

I spoke with the Bigtable folks and they said that using "" as the start
and 

[Testing] Splitting pre-commits from post-commit test targets

2018-11-20 Thread Scott Wegner
I wanted to give a heads-up to a small optimization that I hope to make to
our Jenkins test targets. Currently our post-commit test jobs also
redundantly run pre-commit tests. I'd like to remove redundant execution to
get a faster post-commit test signal. See:
https://github.com/apache/beam/pull/7073

In Jenkins we run pre-commits separately from post-commits, and in all
cases when a language post-commit suite runs, the pre-commit is also run in
a separate job (in PR, on merge, cron schedule). So, it makes sense to
separate the targets. This will free up resources and give a faster signal
on post-commit suites as they are doing less work.

From a quick test, this shaves 27 mins off of Python post-commits, 10 mins
from Java, and ~1 minute from Go.

The only negative impact I could imagine is if during local development you
were running `./gradlew :langPostCommit` as a shortcut to run all tests.
Now, in order to also run tests from pre-commit, you'd need to specify it
separately: `./gradlew :langPreCommit :langPostCommit`

Got feedback? tinyurl.com/swegner-feedback


Re: [BEAM-6077] FlinkRunner: Make UnboundedSource state re-scale friendly

2018-11-20 Thread Maximilian Michels

Hi Jozef,

I responded on JIRA today before I saw your mail here.

The splitting of the UnboundedSource is performed during translation of 
the Beam pipeline. I think it would be feasible to use Flink's maximum 
parallelism instead of the configured parallelism. That would make it 
possible to increase the parallelism at a later point in time.


Another option would be to split the sources again when scaling up; I'm 
not sure whether that would work for all sources. Scaling down should be 
easy because the wrapper supports reading from multiple sources.


Cheers,
Max

On 20.11.18 11:38, Jozef Vilcek wrote:
I want to reach out for opinions on what would be the best way to 
proceed with https://issues.apache.org/jira/browse/BEAM-6077


The problem is that when a FlinkRunner job is being restored from 
checkpoint, it needs to resurrect the source and its readers given the 
checkpoint state. The state element is represented by 
`UnboundedSource.CheckpointMark`, which does not carry much information. 
Within a CheckpointMark there might already be stored state per key, e.g. 
in the case of Kafka it is a list of PartitionMarks, each holding a 
partition_id and offset.


An UnboundedSource can create a reader per single CheckpointMark, and a 
reader can produce a single CheckpointMark from its state. Now at rescale, 
the number of CheckpointMarks retrieved from state does not correspond to 
the actual parallelism. A merge or flatten needs to be invoked over the 
list of marks read from state. The question is where such logic and 
knowledge should live.


It feels similar to UnboundedSource.split(parallelism, pipelineOptions) 
and also maybe related somehow to SplittableDoFn logic. Not sure.


My questions are:
1. Is there a way to achieve such splitting / merging of checkpoint marks 
with the current SDK?

2. If not, and it makes sense to add it, where would it best go? The Source?
3. Is there some other approach that a Beam rookie like me does not see?

Best,
Jozef


Re: E-mail Organization

2018-11-20 Thread Robert Bradshaw
I was about to suggest tags in subject lines as well. Easier to see in
email listings than anything in the body.

On Mon, Nov 19, 2018 at 7:22 PM Lukasz Cwik  wrote:

> Putting the tags in the subject line is in line with the style of what we
> currently do using [DISCUSS], [VOTE], [BEAM-YYY] so I like that. (I forgot
> that you can edit subject after the fact, thanks for pointing that out.)
>
> On Mon, Nov 19, 2018 at 10:04 AM Kenneth Knowles  wrote:
>
>> The traditional thing to do is something like [SQL] or [Portability] in
>> the subject. Cannot be added after the email has been sent, but your email
>> client can probably do that part, right? If another user changes the
>> subject line to add more tags in their reply, I think things like gmail
>> will actually keep the thread intact and it will show up in searches too.
>> Lots of lightweight options that don't require us to invent anything.
>>
>> Kenn
>>
>> On Mon, Nov 19, 2018 at 9:21 AM Suneel Marthi  wrote:
>>
>>> Kafka uses KIPs
>>> https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals
>>>
>>> Flink uses FLIPs
>>> https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals
>>>
>>> So Beam - BIPs 
>>>
>>> On Mon, Nov 19, 2018 at 10:48 PM Lukasz Cwik  wrote:
>>>
 dev@beam.apache.org gets a lot of e-mail. I was wondering how other
 Apache projects help their contributors focus on design/project discussions
 (such as SQL, SplittableDoFn, Portability, Samza, Flink, Testing, ...)?

 I'm looking for a solution that allows people to tag a discussion with
 multiple topics, and that allows tags to be added after the e-mail has been sent
 as the discussion may cross multiple topics such as testing and SQL.

 I was initially thinking that we could embed tags like "topic:sql" in
 the message body, and if something was part of multiple topics it would be
 "topic:sql topic:testing" to make it easy

 What do you think?

>>>


Re: [DISCUSS] SplittableDoFn Java SDK User Facing API

2018-11-20 Thread Robert Bradshaw
I'm still trying to wrap my head around what is meant by backlog here, as
it's different than what I've seen in previous discussions.

Generally, the backlog represented a measure of the known but undone part
of a restriction. This is useful for a runner to understand in some manner
what progress is being made and where remaining work lies, and this is
difficult to do if expressed as an opaque byte array, and more so if
backlog is local to a restriction rather than an arbitrary quantity that
can be compared (and aggregated) across restrictions. Even better if a
similar measure can be applied to arbitrary (e.g. completed) restrictions
for estimation of a mapping to the time domain. Does using a byte[] here
have an advantage over using a(n often integral) floating point number?

I'm also a bit unclear on why it's desirable to pass this backlog back to
the SDF when trying to split restrictions. Here it seems much more natural
(for both the runner and SDK) to simply pass a floating point value in [0,
1) for the proportion of work that should be split, rather than manipulate
the given backlog in to try to approximate this. (There's some ambiguity
here of whether multiple splits should be returned if less than half should
be retained.)

Having a polymorphically-interpreted bytes[] backlog seems to add a lot of
complexity that would need to be justified.

It seems there's consensus for the finalization protocol; perhaps that can
be checked in as a separate PR? (Also, the idea of having to have a final
key is not ideal, e.g. it means that for lexicographic range sources the
empty key has different meanings depending on whether it's a start or end
key, but I think we can provide a mark-as-done method in a future
compatible way if this becomes too burdensome.)

On Tue, Nov 20, 2018 at 1:22 AM Lukasz Cwik  wrote:

> I also addressed a bunch of PR comments which clarified the
> contract/expectations as described in my previous e-mail and the
> splitting/backlog reporting/bundle finalization docs.
>
> On Mon, Nov 19, 2018 at 3:19 PM Lukasz Cwik  wrote:
>
>>
>>
>> On Mon, Nov 19, 2018 at 3:06 PM Lukasz Cwik  wrote:
>>
>>> Sorry for the late reply.
>>>
>>> On Thu, Nov 15, 2018 at 8:53 AM Ismaël Mejía  wrote:
>>>
 Some late comments, and my apologies in advance if some questions look
 silly, but the last documents contained a lot of info that I have not yet
 fully digested.

 I have some questions about the ‘new’ Backlog concept following a
 quick look at the PR
 https://github.com/apache/beam/pull/6969/files

 1. Is the Backlog a specific concept for each IO? Or in other words:
 ByteKeyRestrictionTracker can be used by HBase and Bigtable, but I am
 assuming from what I could understand that the Backlog implementation
 will be data store specific. Is this the case? Or can it be in some
 cases generalized (for example for Filesystems)?

>>>
>>> The backlog is tied heavily to the restriction tracker implementation,
>>> any data store using the same restriction tracker will provide the same
>>> backlog computation. For example, if HBase/Bigtable use the
>>> ByteKeyRestrictionTracker then they will use the same backlog calculation.
>>> Note that an implementation could subclass a restriction tracker if the
>>> data store could provide additional information. For example, the default
>>> backlog for a ByteKeyRestrictionTracker over [startKey, endKey) is
>>> distance(currentKey, lastKey) where distance is represented as byte array
>>> subtraction (which can be wildly inaccurate as the density of data is not
>>> well reflected) but if HBase/Bigtable could provide the number of bytes
>>> from current key to last key, a better representation could be provided.
>>>
>>> Other common examples of backlogs would be:
>>> * files: backlog = length of file - current byte offset
>>> * message queues: backlog = number of outstanding messages
>>>
>>>
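A rough sketch of two of the backlog computations named above: remaining bytes for a file, and the big-endian "byte array subtraction" distance for a ByteKeyRestrictionTracker. All names here are illustrative, not Beam API.

```java
import java.math.BigInteger;

// Default backlog sketches. The byte-key distance pads both keys to a
// common width and subtracts them as big-endian unsigned integers, which
// (as noted above) can be wildly inaccurate when data density is skewed.
public class BacklogSketch {
  // files: backlog = length of file - current byte offset
  static long backlogBytes(long fileLength, long currentByteOffset) {
    return Math.max(0, fileLength - currentByteOffset);
  }

  static BigInteger asUnsigned(byte[] key, int width) {
    byte[] padded = new byte[width];
    System.arraycopy(key, 0, padded, 0, key.length); // right-pad with zeros
    return new BigInteger(1, padded);                // sign-magnitude, non-negative
  }

  // byte keys: backlog ~= distance(currentKey, lastKey)
  static BigInteger distance(byte[] currentKey, byte[] lastKey) {
    int width = Math.max(currentKey.length, lastKey.length);
    return asUnsigned(lastKey, width).subtract(asUnsigned(currentKey, width));
  }

  public static void main(String[] args) {
    System.out.println(backlogBytes(100, 10)); // 90
    // {0x10} pads to 0x1000; 0x4000 - 0x1000 = 12288
    System.out.println(distance(new byte[]{0x10}, new byte[]{0x40, 0x00}));
  }
}
```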

 2. Since the backlog is a byte[] this means that it is up to the user
 to give it a meaning depending on the situation, is this correct? Also
 since splitRestriction has now the Backlog as an argument, what do we
 expect the person that implements this method in a DoFn to do ideally
 with it? Maybe a more concrete example of how things fit for
 File/Offset or HBase/Bigtable/ByteKey would be helpful (maybe also for
 the BundleFinalizer concept too).

>>>
>>> Yes, the restriction tracker/restriction/SplittableDoFn must give the
>>> byte[] a meaning. This can have any meaning, but we would like the
>>> backlog byte[] representation to be lexicographically comparable (when
>>> viewed in big endian format, with prefixes smaller than their extensions,
>>> e.g. 001 is smaller than 0010) and preferably linear. Note that all
>>> restriction trackers of the same type should use the same "space" so that
>>> backlogs are comparable across multiple restriction tracker instances.
>>>
>>> The backlog when provided to 

[BEAM-6077] FlinkRunner: Make UnboundedSource state re-scale friendly

2018-11-20 Thread Jozef Vilcek
I want to reach out for opinions on what would be the best way to proceed
with https://issues.apache.org/jira/browse/BEAM-6077

The problem is that when a FlinkRunner job is being restored from
checkpoint, it needs to resurrect the source and its readers given the
checkpoint state. The state element is represented by
`UnboundedSource.CheckpointMark`, which does not carry much information.
Within a CheckpointMark there might already be stored state per key, e.g. in
the case of Kafka it is a list of PartitionMarks, each holding a partition_id
and offset.

An UnboundedSource can create a reader per single CheckpointMark, and a
reader can produce a single CheckpointMark from its state. Now at rescale, the
number of CheckpointMarks retrieved from state does not correspond to the
actual parallelism. A merge or flatten needs to be invoked over the list of
marks read from state. The question is where such logic and knowledge should
live.

It feels similar to UnboundedSource.split(parallelism, pipelineOptions) and
also maybe related somehow to SplittableDoFn logic. Not sure.

My questions are:
1. Is there a way to achieve such splitting / merging of checkpoint marks
with the current SDK?
2. If not, and it makes sense to add it, where would it best go? The Source?
3. Is there some other approach that a Beam rookie like me does not see?

Best,
Jozef