Still seeing these wheel build emails on my fork

2020-08-18 Thread Alex Amato
I asked about this a few weeks ago and rebased from master as was suggested,
but I am still seeing these emails. I am guessing this is wasting our
resources somehow? :(


On Tue, Aug 18, 2020 at 7:28 PM Alex Amato  wrote:

> Run failed for master (010adc5)
>
> Repository: ajamato/beam
> Workflow: Build python wheels
> Duration: 8 minutes and 20.0 seconds
> Finished: 2020-08-19 02:28:43 UTC
>
> View results 
> Jobs:
>
>    - build_source: succeeded (0 annotations)
>    - Build wheels on ubuntu-latest: cancelled (2 annotations)
>    - Build wheels on macos-latest: failed (1 annotation)
>    - Build wheels on windows-latest: cancelled (2 annotations)
>    - Prepare GCS: skipped (0 annotations)
>    - Upload source to GCS bucket: skipped (0 annotations)
>    - Tag repo nightly: skipped (0 annotations)
>    - Upload wheels to GCS bucket: skipped (0 annotations)
>    - List files on Google Cloud Storage Bucket: skipped (0 annotations)
>
> —
> You are receiving this because this workflow ran on your branch.
> Manage your GitHub Actions notifications here.
>


Re: BEAM-9045 Implement an Ignite runner using Apache Ignite compute grid

2020-08-18 Thread Saikat Maitra
Hi Val,

Thank you for your response. I like the idea of a reactive, event-based
processing engine for fault tolerance. As you mentioned, it will be up to
the underlying system to manage job execution and offer fault tolerance,
and we will need to build this into the Ignite compute execution model.

I looked into the Flink and Samza runners, and they both offer fault
tolerance using a checkpointing mechanism.

Flink - Fault-tolerance with *exactly-once* processing guarantees [1]
Samza - Fault-tolerance with support for incremental checkpointing of state
instead of full snapshots. This enables Samza to scale to applications with
very large state. [2]

I will look further into how we can implement checkpointing [3] for Ignite
compute jobs when running a Beam pipeline.

[1] https://beam.apache.org/documentation/runners/flink/
[2] https://beam.apache.org/documentation/runners/samza/
[3] https://apacheignite.readme.io/docs/checkpointing

Regards,
Saikat
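[Editor's note: as a rough sketch of how the checkpointing API from [3]
could apply here, an Ignite compute job can persist its progress into the
task session and resume from it after failover. Everything below is
illustrative; the bundle numbering and helper methods are invented, and the
enclosing task class would need @ComputeTaskSessionFullSupport.]

import org.apache.ignite.compute.ComputeJobAdapter;
import org.apache.ignite.compute.ComputeTaskSession;
import org.apache.ignite.resources.TaskSessionResource;

// Hypothetical job that processes "bundles" of a translated Beam pipeline,
// checkpointing after each one so a failed-over copy can resume.
public class CheckpointedBundleJob extends ComputeJobAdapter {

  @TaskSessionResource
  private ComputeTaskSession session;

  @Override
  public Object execute() {
    // Resume from the last checkpoint if this job was restarted on another node.
    Integer lastDone = session.loadCheckpoint("lastProcessedBundle");
    int start = lastDone == null ? 0 : lastDone + 1;
    for (int i = start; i < totalBundles(); i++) {
      processBundle(i);
      // Persist progress so completed bundles are not redone after failover.
      session.saveCheckpoint("lastProcessedBundle", i);
    }
    return null;
  }

  private int totalBundles() { return 0; }   // placeholder
  private void processBundle(int bundle) {}  // placeholder
}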





On Tue, Aug 18, 2020 at 7:29 PM Valentin Kulichenko <
valentin.kuliche...@gmail.com> wrote:

> Hi Saikat,
>
> Thanks for clarifying. Is there a Beam component that monitors the state,
> or is this up to the application? If something fails, will the application
> have to retry the whole pipeline?
>
> My concern is that Ignite compute actually provides very limited
> guarantees, especially for the async execution. There are some failover
> mechanisms, but overall it's up to the application to track the state and
> retry. Moreover, if the application fails, all jobs it has submitted are
> canceled.
>
> I'm thinking that Ignite should have a reactive event-based processing
> engine. The basic idea is this:
> - an application submits an event into the cluster
> - the event is persisted in Ignite to be eventually processed
> - a processed event may result in some new events that are submitted in
> a similar fashion
>
> Ignite will provide the at-least-once guarantee (or even exactly-once
> under certain assumptions) for all the event handlers, so a user can create
> a whole chain by submitting a single event, and they don't have to worry
> about failures - it's up to Ignite to handle them.
>
> It seems to me that it might be beneficial for the Beam runner to have
> such an engine under the hood. What do you think?
>
> -Val
>
> On Mon, Aug 17, 2020 at 6:00 PM Saikat Maitra 
> wrote:
>
>> Hi,
>>
>> Luke - Thank you for sharing the details for the portability layer for
>> Flink, Samza and Spark. I will look into them and will reach out if I have
>> any questions.
>>
>> Val - Thank you for your response. Yes, I am planning to run the Beam
>> pipeline using the Ignite compute engine in async mode. Here is sample
>> code for the run method.
>>
>> IgnitePipelineResult pipelineResult =
>>     new IgnitePipelineResult(job, metricsAccumulator);
>> // Ignite's (deprecated) async mode: run() submits the work, future()
>> // returns the handle to it. Here 'job' is assumed to be the translated
>> // pipeline as an IgniteRunnable.
>> IgniteCompute asyncCompute = ignite.compute().withAsync();
>> asyncCompute.run(job);
>> ComputeTaskFuture<Object> computeTaskFuture = asyncCompute.future();
>> computeTaskFuture.listen(f -> {
>>   pipelineResult.freeze(f);
>>   metricsAccumulator.destroy();
>>   ignite.close();
>> });
>> pipelineResult.setComputeFuture(computeTaskFuture);
>>
>> return pipelineResult;
>>
>>
>> My understanding is that for failover scenarios we will need to map the
>> job state from Ignite's known states to Beam job states, as is done, for
>> example, in JetPipelineResult:
>>
>> https://github.com/apache/beam/blob/master/runners/jet/src/main/java/org/apache/beam/runners/jet/JetPipelineResult.java#L68-L90
>>
>> Regards,
>> Saikat
>>
>>
>>
>>
>>
>>
>> On Mon, Aug 17, 2020 at 2:27 PM Valentin Kulichenko <
>> valentin.kuliche...@gmail.com> wrote:
>>
>> > Hi Saikat,
>> >
>> > This sounds very interesting - I've been thinking about how Ignite
>> compute
>> > engine could be enhanced, and integration with Apache Beam is one of the
>> > options I have in mind. Can you please describe how you plan to
>> implement
>> > this? Will it run on top of the Ignite Compute Grid? How are you going
>> to
>> > handle the failover, especially in the case of async pipeline execution?
>> >
>> > -Val
>> >
>> > On Sat, Aug 15, 2020 at 12:50 PM Saikat Maitra
>> > wrote:
>> >
>> > > Hi,
>> > >
>> > > I have been working on implementing the Apache Ignite Runner to run
>> > > Apache Beam pipelines. I have created IgniteRunner and
>> > > IgnitePipelineOptions. I have implemented the normalize pipeline
>> > > method and am currently working on the run method implementation for
>> > > Pipeline and IgnitePipelineTranslator.
>> > >
>> > > Jira : https://issues.apache.org/jira/browse/BEAM-9045
>> > >
>> > > PR : https://github.com/apache/beam/pull/12593
>> > >
>> > > Please review and feel free to share any feedback or questions.
>> > >
>> > > Regards,
>> > > Saikat
>> > >
>> >
>>
>


Re: BEAM-9045 Implement an Ignite runner using Apache Ignite compute grid

2020-08-18 Thread Valentin Kulichenko
Hi Saikat,

Thanks for clarifying. Is there a Beam component that monitors the state,
or is this up to the application? If something fails, will the application
have to retry the whole pipeline?

My concern is that Ignite compute actually provides very limited
guarantees, especially for the async execution. There are some failover
mechanisms, but overall it's up to the application to track the state and
retry. Moreover, if the application fails, all jobs it has submitted are
canceled.

I'm thinking that Ignite should have a reactive event-based processing
engine. The basic idea is this:
- an application submits an event into the cluster
- the event is persisted in Ignite to be eventually processed
- a processed event may result in some new events that are submitted in a
similar fashion

Ignite will provide the at-least-once guarantee (or even exactly-once under
certain assumptions) for all the event handlers, so a user can create a
whole chain by submitting a single event, and they don't have to worry
about failures - it's up to Ignite to handle them.
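[Editor's note: to make the idea concrete, here is a purely hypothetical
shape such an API could take; none of these interfaces exist in Ignite
today, and all names are invented.]

import java.util.List;

// Hypothetical reactive event-processing API sketch.
public interface EventEngine {
  // Durably persists the event before returning; processing happens later.
  void submit(Event event);

  // The engine retries a handler until it succeeds (at-least-once), and any
  // events it returns are submitted the same way, forming a chain.
  void registerHandler(String eventType, EventHandler handler);
}

interface EventHandler {
  List<Event> handle(Event event);
}

class Event {
  final String type;
  final byte[] payload;

  Event(String type, byte[] payload) {
    this.type = type;
    this.payload = payload;
  }
}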

It seems to me that it might be beneficial for the Beam runner to have such
an engine under the hood. What do you think?

-Val

On Mon, Aug 17, 2020 at 6:00 PM Saikat Maitra 
wrote:

> Hi,
>
> Luke - Thank you for sharing the details for the portability layer for
> Flink, Samza and Spark. I will look into them and will reach out if I have
> any questions.
>
> Val - Thank you for your response. Yes, I am planning to run the Beam
> pipeline using the Ignite compute engine in async mode. Here is sample
> code for the run method.
>
> IgnitePipelineResult pipelineResult =
>     new IgnitePipelineResult(job, metricsAccumulator);
> // Ignite's (deprecated) async mode: run() submits the work, future()
> // returns the handle to it. Here 'job' is assumed to be the translated
> // pipeline as an IgniteRunnable.
> IgniteCompute asyncCompute = ignite.compute().withAsync();
> asyncCompute.run(job);
> ComputeTaskFuture<Object> computeTaskFuture = asyncCompute.future();
> computeTaskFuture.listen(f -> {
>   pipelineResult.freeze(f);
>   metricsAccumulator.destroy();
>   ignite.close();
> });
> pipelineResult.setComputeFuture(computeTaskFuture);
>
> return pipelineResult;
>
>
> My understanding is that for failover scenarios we will need to map the
> job state from Ignite's known states to Beam job states, as is done, for
> example, in JetPipelineResult:
>
> https://github.com/apache/beam/blob/master/runners/jet/src/main/java/org/apache/beam/runners/jet/JetPipelineResult.java#L68-L90
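[Editor's note: a hedged sketch of such a mapping on the Ignite side, by
analogy with the Jet code linked above. The method and its placement are
illustrative; only the IgniteFuture and PipelineResult.State APIs are real.]

import org.apache.beam.sdk.PipelineResult;
import org.apache.ignite.lang.IgniteFuture;

public class IgniteStateMapping {
  // Illustrative translation of an Ignite compute future's outcome into a
  // Beam job state, analogous to JetPipelineResult.getState().
  static PipelineResult.State toBeamState(IgniteFuture<?> future) {
    if (future == null) {
      return PipelineResult.State.UNKNOWN;
    }
    if (!future.isDone()) {
      return PipelineResult.State.RUNNING;
    }
    if (future.isCancelled()) {
      return PipelineResult.State.CANCELLED;
    }
    try {
      future.get(); // rethrows the job's failure, if any
      return PipelineResult.State.DONE;
    } catch (RuntimeException e) {
      return PipelineResult.State.FAILED;
    }
  }
}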
>
> Regards,
> Saikat
>
>
>
>
>
>
> On Mon, Aug 17, 2020 at 2:27 PM Valentin Kulichenko <
> valentin.kuliche...@gmail.com> wrote:
>
> > Hi Saikat,
> >
> > This sounds very interesting - I've been thinking about how Ignite
> compute
> > engine could be enhanced, and integration with Apache Beam is one of the
> > options I have in mind. Can you please describe how you plan to implement
> > this? Will it run on top of the Ignite Compute Grid? How are you going to
> > handle the failover, especially in the case of async pipeline execution?
> >
> > -Val
> >
> > On Sat, Aug 15, 2020 at 12:50 PM Saikat Maitra 
> > wrote:
> >
> > > Hi,
> > >
> > > I have been working on implementing the Apache Ignite Runner to run
> > > Apache Beam pipelines. I have created IgniteRunner and
> > > IgnitePipelineOptions. I have implemented the normalize pipeline
> > > method and am currently working on the run method implementation for
> > > Pipeline and IgnitePipelineTranslator.
> > >
> > > Jira : https://issues.apache.org/jira/browse/BEAM-9045
> > >
> > > PR : https://github.com/apache/beam/pull/12593
> > >
> > > Please review and feel free to share any feedback or questions.
> > >
> > > Regards,
> > > Saikat
> > >
> >
>


Re: python AfterCount() trigger behavior not aligned with java

2020-08-18 Thread Leiyi Zhang
Thank you Robert!

On Mon, Aug 17, 2020 at 5:52 PM Robert Bradshaw  wrote:

> Correct, everything is per-key. To allow triggering after n events you
> would have to give them all the same key. (Note that this would
> potentially introduce a bottleneck, as they would all be shuffled to
> the same machine.)
>
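[Editor's note: a minimal Java sketch of that workaround, with an assumed
PCollection<Long> input; forcing one constant key funnels every element
through a single worker, which is the bottleneck mentioned above.]

import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.WithKeys;
import org.apache.beam.sdk.transforms.windowing.AfterPane;
import org.apache.beam.sdk.transforms.windowing.GlobalWindows;
import org.apache.beam.sdk.transforms.windowing.Repeatedly;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

// Emits batches whenever at least 3 elements have arrived, across all keys.
static PCollection<KV<String, Iterable<Long>>> batchAcrossKeys(PCollection<Long> input) {
  return input
      .apply(WithKeys.of("all")) // one constant key for the whole collection
      .apply(
          Window.<KV<String, Long>>into(new GlobalWindows())
              .triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(3)))
              .withAllowedLateness(Duration.ZERO)
              .discardingFiredPanes())
      .apply(GroupByKey.create());
}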
> On Mon, Aug 17, 2020 at 4:01 PM Leiyi Zhang  wrote:
> >
> > Thank you for explaining the details Robert!
> >
> > I do have one more question: is there a way in Beam that allows
> > triggering after n events have arrived at the GBK step from previous
> > parts of a pipeline? Because the AfterCount() trigger for GBK is
> > per-key, it cannot be used here, right?
> >
> > On Mon, Aug 17, 2020 at 3:22 PM Robert Bradshaw 
> wrote:
> >>
> >> Yeah, this is another subtlety. There's a notion of "window garbage
> >> collection" that's distinct from "window closing." Garbage collection
> >> happens regardless of whether the trigger was set iff the window is
> >> non-empty when the watermark + allowed lateness exceeds the end of
> >> window. (Well, there's a flag to control this behavior.) This has
> >> changed over time, and I think you've uncovered a bug in that the two
> >> SDKs are not consistent here.
> >>
> >> On Mon, Aug 17, 2020 at 2:29 PM Leiyi Zhang  wrote:
> >> >
> >> > is it just for the Python SDK or for both the Python and Java SDKs?
> >> > It seems like the Java SDK will output a result even if there are
> >> > fewer than 3 elements per key.
> >> >
> >> > On Mon, Aug 17, 2020 at 2:20 PM Robert Bradshaw 
> wrote:
> >> >>
> >> >> Yes, GBK is non-deterministic in the face of triggers as well. All
> >> >> triggers are per-key, evaluated independently for each key, so it'd
> >> >> be "do I have at least 3 results for this key."
> >> >>
> >> >> On Mon, Aug 17, 2020 at 2:08 PM Leiyi Zhang 
> wrote:
> >> >> >
> >> >> > Thank you very much for the reply,
> >> >> > Is the result of GBK non-deterministic as well? I.e., between "do
> >> >> > I have at least 3 results PER KEY" vs. "do I have at least 3
> >> >> > incoming events before I trigger GBK"?
> >> >> >
> >> >> > On Mon, Aug 17, 2020 at 1:39 PM Robert Bradshaw <
> rober...@google.com> wrote:
> >> >> >>
> >> >> >> Triggers in Beam are non-deterministic; both behaviors are
> >> >> >> acceptable (especially for batch mode). In practice, production
> >> >> >> runners evaluate triggers (e.g. in this case "do I have at least
> >> >> >> two elements") whenever a new batch of data comes in (for the
> >> >> >> Python direct runner, in batch mode, all the data comes in at
> >> >> >> once). To have more control over this you can use TestPipeline,
> >> >> >> which will attempt to fire triggers as eagerly as possible.
> >> >> >>
> >> >> >> On Mon, Aug 17, 2020 at 1:16 PM Leiyi Zhang 
> wrote:
> >> >> >> >
> >> >> >> > for GBK with AfterCount(3), the Java SDK results in this and
> >> >> >> > the Python SDK results in this
> >> >> >> >
> >> >> >> > for global count with AfterCount(2), the Java SDK results in
> >> >> >> > 2 2 2 2 and the Python SDK results in 8.
> >> >> >> >
> >> >> >> > On Mon, Aug 17, 2020 at 12:40 PM Robert Bradshaw <
> rober...@google.com> wrote:
> >> >> >> >>
> >> >> >> >> What results do you get in each?
> >> >> >> >>
> >> >> >> >> On Mon, Aug 17, 2020 at 11:55 AM Leiyi Zhang <
> lei...@google.com> wrote:
> >> >> >> >> >
> >> >> >> >> > Hi everyone!
> >> >> >> >> > I noticed that the behavior of the AfterCount() trigger
> >> >> >> >> > seems to be different between the Python SDK and the Java
> >> >> >> >> > one, so I created a few tests to show the difference. In
> >> >> >> >> > general, I think the Python SDK buffers on results instead
> >> >> >> >> > of input elements.
> >> >> >> >> >
> >> >> >> >> > What do you guys think?
> >> >> >> >> >
> >> >> >> >> > and here are the tests. I ran them in batch mode.
> >> >> >> >> >
> >> >> >> >> > Sincerely,
> >> >> >> >> > Leiyi
>


Re: [DISCUSS][BEAM-10670] Migrating BoundedSource/UnboundedSource to execute as a Splittable DoFn for non-portable Java runners

2020-08-18 Thread Pulasthi Supun Wickramasinghe
Hi Luke

Will take a look at this as soon as possible and get back to you.

Best Regards,
Pulasthi

On Tue, Aug 18, 2020 at 2:30 PM Luke Cwik  wrote:

> I have made some good progress here and have gotten to the following state
> for non-portable runners:
>
> DirectRunner[1]: Merged. Supports Read.Bounded and Read.Unbounded.
> Twister2[2]: Ready for review. Supports Read.Bounded; the current runner
> doesn't support unbounded pipelines.
> Spark[3]: WIP. Supports Read.Bounded, Nexmark suite passes. Not certain
> about level of unbounded pipeline support coverage since Spark uses its own
> tiny suite of tests to get unbounded pipeline coverage instead of the
> validates runner set.
> Jet[4]: WIP. Supports Read.Bounded. Read.Unbounded definitely needs
> additional work.
> Samza[5]: WIP. Supports Read.Bounded. Not certain about level of unbounded
> pipeline support coverage since Samza uses its own tiny suite of tests to
> get unbounded pipeline coverage instead of the validates runner set.
> Flink: Unstarted.
>
> @Pulasthi Supun Wickramasinghe  , can you help me
> with the Twister2 PR[2]?
> @Ismaël Mejía , is PR[3] the expected level of support
> for unbounded pipelines and hence ready for review?
> @Jozsef Bartok , can you help me out to get support
> for unbounded splittable DoFn's into Jet[4]?
> @Xinyu Liu , is PR[5] the expected level of
> support for unbounded pipelines and hence ready for review?
>
> 1: https://github.com/apache/beam/pull/12519
> 2: https://github.com/apache/beam/pull/12594
> 3: https://github.com/apache/beam/pull/12603
> 4: https://github.com/apache/beam/pull/12616
> 5: https://github.com/apache/beam/pull/12617
>
> On Tue, Aug 11, 2020 at 10:55 AM Luke Cwik  wrote:
>
>> There shouldn't be any changes required, since the wrapper will smoothly
>> transition the execution to be run as an SDF. New IOs should strongly
>> prefer SDF, since it should be simpler to write and will be more
>> flexible, but they can still use the "*Source"-based APIs. Eventually
>> we'll deprecate those APIs, but we will never stop supporting them; they
>> should all be migrated to use SDF, and if there is another major Beam
>> version, we'll finally be able to remove them.
>>
>> On Tue, Aug 11, 2020 at 8:40 AM Alexey Romanenko <
>> aromanenko@gmail.com> wrote:
>>
>>> Hi Luke,
>>>
>>> Great to hear about such progress on this!
>>>
>>> Talking about opt-out for all runners in the future, will it require any
>>> code changes for current “*Source”-based IOs, or should the wrappers
>>> make this transition completely smooth?
>>> Do we need to require that new IOs be created only based on SDF, or
>>> again, should the wrappers help avoid this?
>>>
>>> On 10 Aug 2020, at 22:59, Luke Cwik  wrote:
>>>
>>> In the past couple of months wrappers[1, 2] have been added to the Beam
>>> Java SDK which can execute BoundedSource and UnboundedSource as Splittable
>>> DoFns. These have been opt-out for portable pipelines (e.g. Dataflow runner
>>> v2, XLang pipelines on Flink/Spark) and opt-in using an experiment for all
>>> other pipelines.
>>>
>>> I would like to start making the non-portable pipelines, starting with
>>> the DirectRunner[3], opt-out as well, with the plan that eventually all
>>> runners will only execute splittable DoFns and the
>>> BoundedSource/UnboundedSource-specific execution logic will be removed
>>> from the runners.
>>>
>>> Users will be able to opt in any pipeline using the experiment
>>> 'use_sdf_read' and opt out with the experiment 'use_deprecated_read'. (For
>>> portable pipelines these experiments were 'beam_fn_api' and
>>> 'beam_fn_api_use_deprecated_read' respectively and I have added these two
>>> additional aliases to make the experience less confusing).
>>>
>>> 1:
>>> https://github.com/apache/beam/blob/af1ce643d8fde5352d4519a558de4a2dfd24721d/sdks/java/core/src/main/java/org/apache/beam/sdk/io/Read.java#L275
>>> 2:
>>> https://github.com/apache/beam/blob/af1ce643d8fde5352d4519a558de4a2dfd24721d/sdks/java/core/src/main/java/org/apache/beam/sdk/io/Read.java#L449
>>> 3: https://github.com/apache/beam/pull/12519
>>>
>>>
>>>

-- 
Pulasthi S. Wickramasinghe
PhD Candidate  | Research Assistant
School of Informatics and Computing | Digital Science Center
Indiana University, Bloomington
cell: 224-386-9035


Re: Percentile metrics in Beam

2020-08-18 Thread Luke Cwik
getPMForCDF[1] seems to return a CDF and you can choose the split points
(b0, b1, b2, ...).

1:
https://github.com/stanford-futuredata/msketch/blob/cf4e49e860761f48ebdeb00f650ce997c46073e2/javamsketch/quantilebench/src/main/java/yahoo/DoublesPmfCdfImpl.java#L16
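[Editor's note: for reference, the state such a sketch aggregates is just
element-wise-mergeable sums plus min/max, which is what lets it ride on
existing metric types. A simplified sketch follows; the choice of K and the
handling of non-positive values for the log moments are glossed over.]

// Simplified moment-sketch accumulator; merging is element-wise, so each
// field could be backed by an existing Beam sum/min/max metric.
public class MomentState {
  static final int K = 10; // number of power sums to keep (accuracy knob)

  long count;
  double min = Double.POSITIVE_INFINITY;
  double max = Double.NEGATIVE_INFINITY;
  double[] powerSums = new double[K + 1]; // powerSums[i] = SUM(x^i)
  double[] logSums = new double[K + 1];   // logSums[i] = SUM(log(x)^i), x > 0

  void add(double x) {
    count++;
    min = Math.min(min, x);
    max = Math.max(max, x);
    double p = 1, lp = 1;
    double lx = x > 0 ? Math.log(x) : 0; // non-positive values glossed over
    for (int i = 0; i <= K; i++) {
      powerSums[i] += p;
      logSums[i] += lp;
      p *= x;
      lp *= lx;
    }
  }

  void merge(MomentState other) {
    count += other.count;
    min = Math.min(min, other.min);
    max = Math.max(max, other.max);
    for (int i = 0; i <= K; i++) {
      powerSums[i] += other.powerSums[i];
      logSums[i] += other.logSums[i];
    }
  }
}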

On Tue, Aug 18, 2020 at 11:20 AM Alex Amato  wrote:

> I'm a bit confused. Are you sure that it is possible to derive the CDF
> using the moment variables?
>
> The linked implementation on GitHub seems not to use a derived CDF
> equation, but instead uses some sampling technique (which I can't fully
> grasp yet) to estimate how many elements are in each bucket.
>
> linearTimeIncrementHistogramCounters
>
> https://github.com/stanford-futuredata/msketch/blob/cf4e49e860761f48ebdeb00f650ce997c46073e2/javamsketch/quantilebench/src/main/java/yahoo/DoublesPmfCdfImpl.java#L117
>
> Calls into .get() to do some sort of sampling
>
> https://github.com/stanford-futuredata/msketch/blob/cf4e49e860761f48ebdeb00f650ce997c46073e2/javamsketch/quantilebench/src/main/java/yahoo/DirectDoublesSketchAccessor.java#L29
>
>
>
> On Tue, Aug 18, 2020 at 9:52 AM Ke Wu  wrote:
>
>> Hi Alex,
>>
>> It is great to know you are working on the metrics. Do you have any
>> concern if we add a Histogram-type metric in the Samza runner itself for
>> now, so we can start using it before a generic histogram metric can be
>> introduced in the Metrics class?
>>
>> Best,
>> Ke
>>
>> On Aug 18, 2020, at 12:57 AM, Gleb Kanterov  wrote:
>>
>> Hi Alex,
>>
>> I'm not sure about restoring the histogram, because the use-case I had in
>> the past used percentiles. As I understand it, you can approximate the
>> histogram if you know the percentiles and the total count. E.g. 5% of
>> values fall into the [P95, +INF) bucket, another 5% into [P90, P95), etc.
>> I don't understand the paper well enough to say how it's going to work if
>> the given bucket boundaries happen to include a small number of values. I
>> guess it's a similar kind of trade-off to when we need to choose
>> boundaries if we want to get percentiles from histogram buckets. I
>> primarily see the moment sketch as a method intended to approximate
>> percentiles, not histogram buckets.
>>
>> /Gleb
>>
>> On Tue, Aug 18, 2020 at 2:13 AM Alex Amato  wrote:
>>
>>> Hi Gleb, and Luke
>>>
>>> I was reading through the paper, blog and github you linked to. One
>>> thing I can't figure out is if it's possible to use the Moment Sketch to
>>> restore an original histogram.
>>> Given bucket boundaries: b0, b1, b2, b3, ...
>>> Can we obtain the counts for the number of values inserted into each of
>>> the ranges: [-INF, B0), … [Bi, Bi+1), …
>>> (This is a requirement I need.)
>>>
>>> Not to be confused with the percentile/threshold-based queries discussed
>>> in the blog.
>>>
>>> Luke, were you suggesting collecting both and sending both over the Fn
>>> API wire? I.e. collecting both
>>>
>>>- the variables to represent the Histogram as suggested in
>>>https://s.apache.org/beam-histogram-metrics, and
>>>- in addition, the moment sketch variables.
>>>
>>> I believe that would be feasible, as we would still retain the Histogram
>>> data. I don't think we can restore the Histograms with just the Sketch, if
>>> that was the suggestion. Please let me know if I misunderstood.
>>>
>>> If that's correct, I can write up the benefits and drawbacks I see for
>>> both approaches.
>>>
>>>
>>> On Mon, Aug 17, 2020 at 9:23 AM Luke Cwik  wrote:
>>>
 That is an interesting suggestion to change to use a sketch.

 I believe having one metric URN that represents all this information
 grouped together would make sense instead of attempting to aggregate
 several metrics together. The underlying implementation of using
 sum/count/max/min would stay the same but we would want a single object
 that abstracts this complexity away for users as well.

 On Mon, Aug 17, 2020 at 3:42 AM Gleb Kanterov  wrote:

> I didn't see the proposal by Alex before today. I want to add a few more
> cents from my side.
>
> There is a paper, Moment-based quantile sketches for efficient high
> cardinality aggregation queries [1]; the TL;DR is that for some N (around
> 10-20 depending on accuracy) we need to collect SUM(log^N(X)) ... log(X),
> COUNT(X), SUM(X), SUM(X^2)... SUM(X^N), MAX(X), MIN(X). Given the
> aggregated numbers, it uses a solver for Chebyshev polynomials to get the
> quantile number, and there is already a Java implementation of it on
> GitHub [2].
>
> This way we can express quantiles using existing metric types in Beam;
> that can already be done without SDK or runner changes. It can fit nicely
> into existing runners and can be abstracted over if needed. I think this
> is also one of the best implementations: it has a < 1% error rate for 200
> bytes of storage, and is quite efficient to compute. Did we 

Re: [DISCUSS][BEAM-10670] Migrating BoundedSource/UnboundedSource to execute as a Splittable DoFn for non-portable Java runners

2020-08-18 Thread Luke Cwik
I have made some good progress here and have gotten to the following state
for non-portable runners:

DirectRunner[1]: Merged. Supports Read.Bounded and Read.Unbounded.
Twister2[2]: Ready for review. Supports Read.Bounded; the current runner
doesn't support unbounded pipelines.
Spark[3]: WIP. Supports Read.Bounded, Nexmark suite passes. Not certain
about level of unbounded pipeline support coverage since Spark uses its own
tiny suite of tests to get unbounded pipeline coverage instead of the
validates runner set.
Jet[4]: WIP. Supports Read.Bounded. Read.Unbounded definitely needs
additional work.
Samza[5]: WIP. Supports Read.Bounded. Not certain about level of unbounded
pipeline support coverage since Samza uses its own tiny suite of tests to
get unbounded pipeline coverage instead of the validates runner set.
Flink: Unstarted.

@Pulasthi Supun Wickramasinghe  , can you help me
with the Twister2 PR[2]?
@Ismaël Mejía , is PR[3] the expected level of support
for unbounded pipelines and hence ready for review?
@Jozsef Bartok , can you help me out to get support
for unbounded splittable DoFn's into Jet[4]?
@Xinyu Liu , is PR[5] the expected level of support
for unbounded pipelines and hence ready for review?

1: https://github.com/apache/beam/pull/12519
2: https://github.com/apache/beam/pull/12594
3: https://github.com/apache/beam/pull/12603
4: https://github.com/apache/beam/pull/12616
5: https://github.com/apache/beam/pull/12617

On Tue, Aug 11, 2020 at 10:55 AM Luke Cwik  wrote:

> There shouldn't be any changes required, since the wrapper will smoothly
> transition the execution to be run as an SDF. New IOs should strongly
> prefer SDF, since it should be simpler to write and will be more
> flexible, but they can still use the "*Source"-based APIs. Eventually
> we'll deprecate those APIs, but we will never stop supporting them; they
> should all be migrated to use SDF, and if there is another major Beam
> version, we'll finally be able to remove them.
>
> On Tue, Aug 11, 2020 at 8:40 AM Alexey Romanenko 
> wrote:
>
>> Hi Luke,
>>
>> Great to hear about such progress on this!
>>
>> Talking about opt-out for all runners in the future, will it require any
>> code changes for current “*Source”-based IOs, or should the wrappers make
>> this transition completely smooth?
>> Do we need to require that new IOs be created only based on SDF, or
>> again, should the wrappers help avoid this?
>>
>> On 10 Aug 2020, at 22:59, Luke Cwik  wrote:
>>
>> In the past couple of months wrappers[1, 2] have been added to the Beam
>> Java SDK which can execute BoundedSource and UnboundedSource as Splittable
>> DoFns. These have been opt-out for portable pipelines (e.g. Dataflow runner
>> v2, XLang pipelines on Flink/Spark) and opt-in using an experiment for all
>> other pipelines.
>>
>> I would like to start making the non-portable pipelines, starting with
>> the DirectRunner[3], opt-out as well, with the plan that eventually all
>> runners will only execute splittable DoFns and the
>> BoundedSource/UnboundedSource-specific execution logic will be removed
>> from the runners.
>>
>> Users will be able to opt in any pipeline using the experiment
>> 'use_sdf_read' and opt out with the experiment 'use_deprecated_read'. (For
>> portable pipelines these experiments were 'beam_fn_api' and
>> 'beam_fn_api_use_deprecated_read' respectively and I have added these two
>> additional aliases to make the experience less confusing).
>>
>> 1:
>> https://github.com/apache/beam/blob/af1ce643d8fde5352d4519a558de4a2dfd24721d/sdks/java/core/src/main/java/org/apache/beam/sdk/io/Read.java#L275
>> 2:
>> https://github.com/apache/beam/blob/af1ce643d8fde5352d4519a558de4a2dfd24721d/sdks/java/core/src/main/java/org/apache/beam/sdk/io/Read.java#L449
>> 3: https://github.com/apache/beam/pull/12519
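[Editor's note: a minimal sketch of toggling these experiments from Java
pipeline options; the flag names come from the message above, and the rest
is ordinary PipelineOptionsFactory usage.]

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class ReadExperimentExample {
  public static void main(String[] args) {
    // Opt in to executing Read transforms as splittable DoFns; swap in
    // "--experiments=use_deprecated_read" to keep the legacy source path.
    PipelineOptions options =
        PipelineOptionsFactory.fromArgs("--experiments=use_sdf_read").create();
    Pipeline pipeline = Pipeline.create(options);
    // ... construct the pipeline as usual ...
    pipeline.run().waitUntilFinish();
  }
}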
>>
>>
>>


Re: Percentile metrics in Beam

2020-08-18 Thread Alex Amato
I'm a bit confused. Are you sure that it is possible to derive the CDF
using the moment variables?

The linked implementation on GitHub seems not to use a derived CDF
equation, but instead uses some sampling technique (which I can't fully
grasp yet) to estimate how many elements are in each bucket.

linearTimeIncrementHistogramCounters
https://github.com/stanford-futuredata/msketch/blob/cf4e49e860761f48ebdeb00f650ce997c46073e2/javamsketch/quantilebench/src/main/java/yahoo/DoublesPmfCdfImpl.java#L117

Calls into .get() to do some sort of sampling
https://github.com/stanford-futuredata/msketch/blob/cf4e49e860761f48ebdeb00f650ce997c46073e2/javamsketch/quantilebench/src/main/java/yahoo/DirectDoublesSketchAccessor.java#L29



On Tue, Aug 18, 2020 at 9:52 AM Ke Wu  wrote:

> Hi Alex,
>
> It is great to know you are working on the metrics. Do you have any
> concern if we add a Histogram-type metric in the Samza runner itself for
> now, so we can start using it before a generic histogram metric can be
> introduced in the Metrics class?
>
> Best,
> Ke
>
> On Aug 18, 2020, at 12:57 AM, Gleb Kanterov  wrote:
>
> Hi Alex,
>
> I'm not sure about restoring the histogram, because the use-case I had in
> the past used percentiles. As I understand it, you can approximate the
> histogram if you know the percentiles and the total count. E.g. 5% of
> values fall into the [P95, +INF) bucket, another 5% into [P90, P95), etc.
> I don't understand the paper well enough to say how it's going to work if
> the given bucket boundaries happen to include a small number of values. I
> guess it's a similar kind of trade-off to when we need to choose
> boundaries if we want to get percentiles from histogram buckets. I
> primarily see the moment sketch as a method intended to approximate
> percentiles, not histogram buckets.
>
> /Gleb
>
> On Tue, Aug 18, 2020 at 2:13 AM Alex Amato  wrote:
>
>> Hi Gleb, and Luke
>>
>> I was reading through the paper, blog and github you linked to. One thing
>> I can't figure out is if it's possible to use the Moment Sketch to restore
>> an original histogram.
>> Given bucket boundaries: b0, b1, b2, b3, ...
>> Can we obtain the counts for the number of values inserted into each of
>> the ranges: [-INF, B0), … [Bi, Bi+1), …
>> (This is a requirement I need.)
>>
>> Not to be confused with the percentile/threshold-based queries discussed
>> in the blog.
>>
>> Luke, were you suggesting collecting both and sending both over the Fn
>> API wire? I.e. collecting both
>>
>>- the variables to represent the Histogram as suggested in
>>https://s.apache.org/beam-histogram-metrics, and
>>- in addition, the moment sketch variables.
>>
>> I believe that would be feasible, as we would still retain the Histogram
>> data. I don't think we can restore the Histograms with just the Sketch, if
>> that was the suggestion. Please let me know if I misunderstood.
>>
>> If that's correct, I can write up the benefits and drawbacks I see for
>> both approaches.
>>
>>
>> On Mon, Aug 17, 2020 at 9:23 AM Luke Cwik  wrote:
>>
>>> That is an interesting suggestion to change to use a sketch.
>>>
>>> I believe having one metric URN that represents all this information
>>> grouped together would make sense instead of attempting to aggregate
>>> several metrics together. The underlying implementation of using
>>> sum/count/max/min would stay the same but we would want a single object
>>> that abstracts this complexity away for users as well.
>>>
>>> On Mon, Aug 17, 2020 at 3:42 AM Gleb Kanterov  wrote:
>>>
 I didn't see the proposal by Alex before today. I want to add a few more
 cents from my side.

 There is a paper, Moment-based quantile sketches for efficient high
 cardinality aggregation queries [1]; the TL;DR is that for some N (around
 10-20 depending on accuracy) we need to collect SUM(log^N(X)) ... log(X),
 COUNT(X), SUM(X), SUM(X^2)... SUM(X^N), MAX(X), MIN(X). Given the
 aggregated numbers, it uses a solver for Chebyshev polynomials to get the
 quantile number, and there is already a Java implementation of it on
 GitHub [2].

 This way we can express quantiles using existing metric types in Beam;
 that can already be done without SDK or runner changes. It can fit nicely
 into existing runners and can be abstracted over if needed. I think this
 is also one of the best implementations: it has a < 1% error rate for 200
 bytes of storage, and is quite efficient to compute. Did we consider using
 that?

 [1]:
 https://blog.acolyer.org/2018/10/31/moment-based-quantile-sketches-for-efficient-high-cardinality-aggregation-queries/
 [2]: https://github.com/stanford-futuredata/msketch

 On Sat, Aug 15, 2020 at 6:15 AM Alex Amato  wrote:

> The distinction here is that even though these metrics come from user
> space, we still gave them specific URNs, which imply they have a specific

Re: Percentile metrics in Beam

2020-08-18 Thread Ke Wu
Hi Alex,

It is great to know you are working on the metrics. Do you have any concern
if we add a Histogram-type metric in the Samza runner itself for now, so we
can start using it before a generic histogram metric can be introduced in
the Metrics class?

Best,
Ke

> On Aug 18, 2020, at 12:57 AM, Gleb Kanterov  wrote:
> 
> Hi Alex,
> 
> I'm not sure about restoring the histogram, because the use-case I had in
> the past used percentiles. As I understand it, you can approximate the
> histogram if you know the percentiles and the total count. E.g. 5% of
> values fall into the [P95, +INF) bucket, another 5% into [P90, P95), etc.
> I don't understand the paper well enough to say how it's going to work if
> the given bucket boundaries happen to include a small number of values. I
> guess it's a similar kind of trade-off to when we need to choose
> boundaries if we want to get percentiles from histogram buckets. I
> primarily see the moment sketch as a method intended to approximate
> percentiles, not histogram buckets.
> 
> /Gleb
> 
> On Tue, Aug 18, 2020 at 2:13 AM Alex Amato  > wrote:
> Hi Gleb, and Luke
> 
> I was reading through the paper, blog and github you linked to. One thing I 
> can't figure out is if it's possible to use the Moment Sketch to restore an 
> original histogram.
> Given bucket boundaries: b0, b1, b2, b3, ... 
> Can we obtain the counts for the number of values inserted into each of
> the ranges: [-INF, B0), … [Bi, Bi+1), …
> (This is a requirement I need.)
> 
> Not to be confused with the percentile/threshold-based queries discussed
> in the blog.
> 
> Luke, were you suggesting collecting both and sending both over the Fn API
> wire? I.e. collecting both
> - the variables to represent the Histogram as suggested in
> https://s.apache.org/beam-histogram-metrics, and
> - in addition, the moment sketch variables.
> I believe that would be feasible, as we would still retain the Histogram 
> data. I don't think we can restore the Histograms with just the Sketch, if 
> that was the suggestion. Please let me know if I misunderstood.
> 
> If that's correct, I can write up the benefits and drawbacks I see for both 
> approaches.
> 
> 
> On Mon, Aug 17, 2020 at 9:23 AM Luke Cwik  > wrote:
> That is an interesting suggestion to change to use a sketch.
> 
> I believe having one metric URN that represents all this information grouped 
> together would make sense instead of attempting to aggregate several metrics 
> together. The underlying implementation of using sum/count/max/min would stay 
> the same but we would want a single object that abstracts this complexity 
> away for users as well.
> 
> On Mon, Aug 17, 2020 at 3:42 AM Gleb Kanterov  > wrote:
> I didn't see the proposal by Alex before today. I want to add a few more
> cents from my side.
> 
> There is a paper, Moment-based quantile sketches for efficient high
> cardinality aggregation queries [1]; the TL;DR is that for some N (around
> 10-20 depending on accuracy) we need to collect SUM(log^N(X)) ... log(X),
> COUNT(X), SUM(X), SUM(X^2)... SUM(X^N), MAX(X), MIN(X). Given the
> aggregated numbers, it uses a solver for Chebyshev polynomials to get the
> quantile number, and there is already a Java implementation of it on
> GitHub [2].
> 
> This way we can express quantiles using existing metric types in Beam;
> that can already be done without SDK or runner changes. It can fit nicely
> into existing runners and can be abstracted over if needed. I think this
> is also one of the best implementations: it has a < 1% error rate for 200
> bytes of storage, and is quite efficient to compute. Did we consider using
> that?
> 
> [1]: 
> https://blog.acolyer.org/2018/10/31/moment-based-quantile-sketches-for-efficient-high-cardinality-aggregation-queries/
>  
> 
> [2]: https://github.com/stanford-futuredata/msketch 
> 
> On Sat, Aug 15, 2020 at 6:15 AM Alex Amato  > wrote:
> The distinction here is that even though these metrics come from user space, 
> we still gave them specific URNs, which imply they have a specific format, 
> with specific labels, etc.
> 
> That is, we won't be packaging them into a USER_HISTOGRAM URN. That URN
> would have fewer expectations for its format. Today the USER_COUNTER just
> expects labels like (TRANSFORM, NAME, NAMESPACE).
> 
> We didn't decide on making a private API, but rather an API available to
> user code for populating metrics with specific labels and specific URNs.
> The same API could pretty much be used for a user USER_HISTOGRAM, with a
> default URN chosen.
> That's how I see it in my head at the moment.
> 
> 
> On Fri, Aug 14, 2020 

Re: Percentile metrics in Beam

2020-08-18 Thread Gleb Kanterov
Hi Alex,

I'm not sure about restoring the histogram, because the use-case I had in
the past used percentiles. As I understand it, you can approximate the
histogram if you know the percentiles and the total count. E.g. 5% of
values fall into the [P95, +INF) bucket, another 5% into [P90, P95), etc.
I don't understand the paper well enough to say how it's going to work if
the given bucket boundaries happen to include a small number of values. I
guess it's a similar kind of trade-off to when we need to choose boundaries
if we want to get percentiles from histogram buckets. I primarily see the
moment sketch as a method intended to approximate percentiles, not
histogram buckets.

/Gleb
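[Editor's note: a concrete, made-up illustration of the approximation Gleb
describes; all numbers below are invented.]

public class PercentileBuckets {
  public static void main(String[] args) {
    double[] ranks = {0.50, 0.90, 0.95, 0.99};   // percentiles we have estimates for
    double[] values = {12.0, 45.0, 80.0, 200.0}; // assumed P50/P90/P95/P99 values
    long totalCount = 10_000;
    for (int i = 0; i + 1 < ranks.length; i++) {
      // The rank mass between consecutive percentiles approximates the bucket
      // count, e.g. ~5% of values (~500 here) fall in [P90, P95) = [45.0, 80.0).
      double approxCount = (ranks[i + 1] - ranks[i]) * totalCount;
      System.out.printf(
          "[%.1f, %.1f): ~%.0f values%n", values[i], values[i + 1], approxCount);
    }
  }
}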

On Tue, Aug 18, 2020 at 2:13 AM Alex Amato  wrote:

> Hi Gleb, and Luke
>
> I was reading through the paper, blog and github you linked to. One thing
> I can't figure out is if it's possible to use the Moment Sketch to restore
> an original histogram.
> Given bucket boundaries: b0, b1, b2, b3, ...
> Can we obtain the counts for the number of values inserted into each of
> the ranges: [-INF, B0), … [Bi, Bi+1), …
> (This is a requirement I need.)
>
> Not to be confused with the percentile/threshold-based queries discussed
> in the blog.
>
> Luke, were you suggesting collecting both and sending both over the Fn API
> wire? I.e. collecting both
>
>- the variables to represent the Histogram as suggested in
>https://s.apache.org/beam-histogram-metrics, and
>- in addition, the moment sketch variables.
>
> I believe that would be feasible, as we would still retain the Histogram
> data. I don't think we can restore the Histograms with just the Sketch, if
> that was the suggestion. Please let me know if I misunderstood.
>
> If that's correct, I can write up the benefits and drawbacks I see for
> both approaches.
>
>
> On Mon, Aug 17, 2020 at 9:23 AM Luke Cwik  wrote:
>
>> That is an interesting suggestion to change to use a sketch.
>>
>> I believe having one metric URN that represents all this information
>> grouped together would make sense instead of attempting to aggregate
>> several metrics together. The underlying implementation of using
>> sum/count/max/min would stay the same but we would want a single object
>> that abstracts this complexity away for users as well.
>>
>> On Mon, Aug 17, 2020 at 3:42 AM Gleb Kanterov  wrote:
>>
>>> I didn't see the proposal by Alex before today. I want to add a few more
>>> cents from my side.
>>>
>>> There is a paper, Moment-based quantile sketches for efficient high
>>> cardinality aggregation queries [1]; the TL;DR is that for some N (around
>>> 10-20 depending on accuracy) we need to collect SUM(log^N(X)) ... log(X),
>>> COUNT(X), SUM(X), SUM(X^2)... SUM(X^N), MAX(X), MIN(X). Given the
>>> aggregated numbers, it uses a solver for Chebyshev polynomials to get the
>>> quantile number, and there is already a Java implementation of it on
>>> GitHub [2].
>>>
>>> This way we can express quantiles using existing metric types in Beam;
>>> that can already be done without SDK or runner changes. It can fit nicely
>>> into existing runners and can be abstracted over if needed. I think this
>>> is also one of the best implementations: it has a < 1% error rate for 200
>>> bytes of storage, and is quite efficient to compute. Did we consider
>>> using that?
>>>
>>> [1]:
>>> https://blog.acolyer.org/2018/10/31/moment-based-quantile-sketches-for-efficient-high-cardinality-aggregation-queries/
>>> [2]: https://github.com/stanford-futuredata/msketch
>>>
>>> On Sat, Aug 15, 2020 at 6:15 AM Alex Amato  wrote:
>>>
 The distinction here is that even though these metrics come from user
 space, we still gave them specific URNs, which imply they have a specific
 format, with specific labels, etc.

 That is, we won't be packaging them into a USER_HISTOGRAM URN. That URN
 would have fewer expectations for its format. Today the USER_COUNTER just
 expects labels like (TRANSFORM, NAME, NAMESPACE).

 We didn't decide on making a private API, but rather an API available to
 user code for populating metrics with specific labels and specific URNs.
 The same API could pretty much be used for a user USER_HISTOGRAM, with a
 default URN chosen.
 That's how I see it in my head at the moment.


 On Fri, Aug 14, 2020 at 8:52 PM Robert Bradshaw 
 wrote:

> On Fri, Aug 14, 2020 at 7:35 PM Alex Amato  wrote:
> >
> > I am only tackling the specific metrics covered in (for the python
> SDK first, then Java). To collect latency of IO API RPCS, and store it in 
> a
> histogram.
> > https://s.apache.org/beam-gcp-debuggability
> >
> > User histogram metrics are unfunded, as far as I know. But you
> should be able to extend what I do for that project to the user metric use
> case. I agree, it won't be much more work to support that. I designed the
> histogram with the 

Re: Percentile metrics in Beam

2020-08-18 Thread Luke Cwik
You can use a cumulative distribution function over the sketch at b0, b1,
b2, b3, ... which will tell you the probability that any given value is <=
X. You multiply that probability against the total count (which is also
recorded as part of the sketch) to get an estimate for the number of values
<= X. If you do this for b0, b1, b2, b3, ... you'll then be able to compute
an estimate for the number of values in each bucket. Note that you're
restoring estimates for how large each bucket is and as such there is some
error.
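[Editor's note: a small sketch of that computation, assuming the Apache
DataSketches quantiles API (the linked msketch benchmark wraps the same
library under its older com.yahoo packaging); the boundaries and values
below are made up.]

import org.apache.datasketches.quantiles.DoublesSketch;
import org.apache.datasketches.quantiles.UpdateDoublesSketch;

public class BucketEstimates {
  public static void main(String[] args) {
    UpdateDoublesSketch sketch = DoublesSketch.builder().build();
    for (double v : new double[] {1, 2, 5, 7, 11, 13, 42}) {
      sketch.update(v);
    }
    // Split points b0, b1, b2; getCDF returns the estimated fraction of
    // values below each split point, plus a trailing 1.0.
    double[] splits = {5.0, 10.0, 20.0};
    double[] cdf = sketch.getCDF(splits);
    long n = sketch.getN();
    double prev = 0.0;
    for (int i = 0; i < cdf.length; i++) {
      // Estimated count per bucket; it carries the sketch's rank error.
      System.out.printf("bucket %d: ~%.1f values%n", i, (cdf[i] - prev) * n);
      prev = cdf[i];
    }
  }
}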

I was suggesting collecting one or the other, not both. Even if we get this
wrong, we can always swap to the other method by defining a new metric URN
and changing the underlying implementation and what is sent but keeping the
user facing API the same.

On Mon, Aug 17, 2020 at 5:13 PM Alex Amato  wrote:

> Hi Gleb, and Luke
>
> I was reading through the paper, blog and github you linked to. One thing
> I can't figure out is if it's possible to use the Moment Sketch to restore
> an original histogram.
> Given bucket boundaries: b0, b1, b2, b3, ...
> Can we obtain the counts for the number of values inserted into each of
> the ranges: [-INF, B0), … [Bi, Bi+1), …
> (This is a requirement I need.)
>
> Not to be confused with the percentile/threshold-based queries discussed
> in the blog.
>
> Luke, were you suggesting collecting both and sending both over the Fn API
> wire? I.e. collecting both
>
>- the variables to represent the Histogram as suggested in
>https://s.apache.org/beam-histogram-metrics, and
>- in addition, the moment sketch variables.
>
> I believe that would be feasible, as we would still retain the Histogram
> data. I don't think we can restore the Histograms with just the Sketch, if
> that was the suggestion. Please let me know if I misunderstood.
>
> If that's correct, I can write up the benefits and drawbacks I see for
> both approaches.
>
>
> On Mon, Aug 17, 2020 at 9:23 AM Luke Cwik  wrote:
>
>> That is an interesting suggestion to change to use a sketch.
>>
>> I believe having one metric URN that represents all this information
>> grouped together would make sense instead of attempting to aggregate
>> several metrics together. The underlying implementation of using
>> sum/count/max/min would stay the same but we would want a single object
>> that abstracts this complexity away for users as well.
>>
>> On Mon, Aug 17, 2020 at 3:42 AM Gleb Kanterov  wrote:
>>
>>> I didn't see the proposal by Alex before today. I want to add a few more
>>> cents from my side.
>>>
>>> There is a paper, Moment-based quantile sketches for efficient high
>>> cardinality aggregation queries [1]; the TL;DR is that for some N (around
>>> 10-20 depending on accuracy) we need to collect SUM(log^N(X)) ... log(X),
>>> COUNT(X), SUM(X), SUM(X^2)... SUM(X^N), MAX(X), MIN(X). Given the
>>> aggregated numbers, it uses a solver for Chebyshev polynomials to get the
>>> quantile number, and there is already a Java implementation of it on
>>> GitHub [2].
>>>
>>> This way we can express quantiles using existing metric types in Beam;
>>> that can already be done without SDK or runner changes. It can fit nicely
>>> into existing runners and can be abstracted over if needed. I think this
>>> is also one of the best implementations: it has a < 1% error rate for 200
>>> bytes of storage, and is quite efficient to compute. Did we consider
>>> using that?
>>>
>>> [1]:
>>> https://blog.acolyer.org/2018/10/31/moment-based-quantile-sketches-for-efficient-high-cardinality-aggregation-queries/
>>> [2]: https://github.com/stanford-futuredata/msketch
>>>
>>> On Sat, Aug 15, 2020 at 6:15 AM Alex Amato  wrote:
>>>
 The distinction here is that even though these metrics come from user
 space, we still gave them specific URNs, which imply they have a specific
 format, with specific labels, etc.

 That is, we won't be packaging them into a USER_HISTOGRAM URN. That URN
 would have fewer expectations for its format. Today the USER_COUNTER just
 expects labels like (TRANSFORM, NAME, NAMESPACE).

 We didn't decide on making a private API, but rather an API available to
 user code for populating metrics with specific labels and specific URNs.
 The same API could pretty much be used for a user USER_HISTOGRAM, with a
 default URN chosen.
 That's how I see it in my head at the moment.


 On Fri, Aug 14, 2020 at 8:52 PM Robert Bradshaw 
 wrote:

> On Fri, Aug 14, 2020 at 7:35 PM Alex Amato  wrote:
> >
> > I am only tackling the specific metrics covered in (for the python
> SDK first, then Java). To collect latency of IO API RPCS, and store it in 
> a
> histogram.
> > https://s.apache.org/beam-gcp-debuggability
> >
> > User histogram metrics are unfunded, as far as I know. But you
> should be able to extend what I do for that project to 

Re: Welcome Sruthi Sree Kumar - Season of Docs tech writer

2020-08-18 Thread Sruthi Sree Kumar
Thank you. Looking forward to working with the Beam community. :)

On Mon, Aug 17, 2020 at 11:57 PM Pablo Estrada  wrote:

> Welcome Sruthi! : )
>
> On Mon, Aug 17, 2020 at 2:41 PM Gris Cuevas  wrote:
>
>> Welcome Sruthi!
>>
>> On 2020/08/17 20:56:40, Aizhamal Nurmamat kyzy 
>> wrote:
>> > Hi all,
>> >
>> >
>> > Apache Beam was selected as one of the finalists for the Season of Docs
>> > program and we have been assigned 1 technical writer [1]!
>> >
>> >
>> > And it is my pleasure to welcome Sruthi Sree Kumar to the Beam
>> community,
>> > who will be working with us on improving the runner comparison page and
>> > capability matrix [2].
>> >
>> >
>> > Documentation development opens officially on September 14 and runs
>> > through November 30. Until then, Sruthi will be getting to know the
>> > community and contribution processes better, and refining the goals of
>> > their technical writing project while working directly with their
>> > assigned mentor, Pablo Estrada.
>> >
>> > Welcome, Sruthi! Really looking forward to working with you!
>> >
>> >
>> > [1] https://developers.google.com/season-of-docs/docs/participants/
>> >
>> > [2]
>> https://cwiki.apache.org/confluence/display/BEAM/Google+Season+of+Docs
>> >
>>
>

-- 
Regards,

Sruthi
LinkedIn 


Re: [BEAM-10292] change proposal to DefaultFilenamePolicy.ParamsCoder

2020-08-18 Thread David Janíček
I looked into the possibility of fixing the underlying filesystem, and it
turns out that only the local filesystem couldn't handle decoding
correctly; HDFS and some other filesystems, e.g. S3, already have a check
for that. So I added a similar check to the local filesystem too. The
implementation is in the same pull request:
https://github.com/apache/beam/pull/12050.
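[Editor's note: for context, a tiny sketch of the round trip where the
directory bit can get lost on the local filesystem; the path is
illustrative.]

import org.apache.beam.sdk.io.FileBasedSink;
import org.apache.beam.sdk.io.FileSystems;
import org.apache.beam.sdk.io.fs.ResourceId;

public class DirectoryBitRoundTrip {
  public static void main(String[] args) {
    // Explicitly a directory when created...
    ResourceId dir = FileSystems.matchNewResource("/tmp/out", /* isDirectory= */ true);
    // ...but after a string round trip the filesystem has to guess, which
    // is where ParamsCoder could lose the bit before the added check.
    ResourceId roundTripped =
        FileBasedSink.convertToFileResourceIfPossible(dir.toString());
    System.out.println(dir.isDirectory() + " vs " + roundTripped.isDirectory());
  }
}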

Can you take a look at it, please?

Thanks,
David

út 11. 8. 2020 v 19:39 odesílatel Luke Cwik  napsal:

> The filesystem "fixes" all amount to removing the "isDirectory" boolean
> bit and encoding whether something is a directory in the string part of
> the resource specification, which also turns out to be backwards
> incompatible (just in a different way).
>
> Removing the "directory" bit would be great and that would allow us to use
> strings instead of resource ids but would require filesystems to perform
> the mapping from some standard path specification to their internal
> representation.
>
> On Wed, Aug 5, 2020 at 9:26 PM Chamikara Jayalath 
> wrote:
>
>> So, based on the comments in the PR, the underlying issue seems to be
>> 'FileBasedSink.convertToFileResourceIfPossible(stringCoder.decode(inStream));'
>> not returning the correct result, right?
>> If so, I think the correct fix might be your proposal (2): try to fix the
>> underlying filesystem to do a better job of file/dir matching.
>>
>> This is a bug we probably have to fix anyways for the local filesystem
>> and/or HDFS and this will also give us a solution that does not break
>> update compatibility.
>>
>> Thanks,
>> Cham
>>
>> On Wed, Aug 5, 2020 at 3:41 PM Luke Cwik  wrote:
>>
>>> Cham, that was one of the options I had mentioned on the PR. The
>>> difference here is that this is a bug fix, and existing users could be
>>> broken unknowingly, so it might be worthwhile to take that breaking
>>> change (and possibly provide users a way to perform an upgrade using the
>>> old implementation).
>>>
>>>
>>> On Wed, Aug 5, 2020 at 3:33 PM Chamikara Jayalath 
>>> wrote:
>>>
 This might break the update compatibility for Dataflow streaming
 pipelines. +Reuven Lax   +Lukasz Cwik
 

 In other cases, to preserve update compatibility, we introduced a user
 option that changes the coder only when the user explicitly asks for an
 updated feature that requires the new coder. For example,
 https://github.com/apache/beam/commit/304882caa89afe24150062b959ee915c79e72ab3

 Thanks,
 Cham


 On Mon, Aug 3, 2020 at 10:00 AM David Janíček 
 wrote:

> Hello everyone,
>
> I've reported an issue,
> https://issues.apache.org/jira/browse/BEAM-10292, which is about
> broken DefaultFilenamePolicy.ParamsCoder behavior.
> DefaultFilenamePolicy.ParamsCoder loses the information about whether
> DefaultFilenamePolicy.Params's baseFilename resource is a file or a
> directory on some filesystems, at least on the local FS and HDFS.
>
> After discussion with @dmvk and @lukecwik, we have agreed that the
> best solution could be to take the breaking change and use ResourceIdCoder
> for encoding/decoding DefaultFilenamePolicy.Params's baseFilename, this 
> way
> the file/directory information is preserved.
> The solution is implemented in pull request
> https://github.com/apache/beam/pull/12050.
>
> I'd like to ask if there is a consensus on this breaking change. Is
> everyone OK with this?
> Thanks in advance for answers.
>
> Best regards,
> David
>


-- 
Best regards,
David Janíček