Re: [PROPOSAL] Preparing for Beam 2.24.0 release

2020-08-27 Thread Eugene Kirpichov
Hi Daniel,

This is super helpful, thank you for the update!

On Thu, Aug 27, 2020 at 2:09 PM Daniel Oliveira 
wrote:

> Hey Eugene,
>
> That Jira is a bit misleading, it's still tracking a root cause, but a
> workaround was submitted so it's no longer blocking the release. I'll
> remove the release tag from it to avoid that confusion.
>
> I've been trying to get a release candidate out since last Thursday, but
> between several bugs I ran into and other time-sensitive work that delayed
> me, it's taken a while. I've been getting help from some previous release
> managers thankfully, or it probably would've taken even longer. Anyway,
> following the release guide
> <https://beam.apache.org/contribute/release-guide/#7-build-a-release-candidate>,
> I just finished step 7 last night after working around the last bug that
> was blocking me, and I'm continuing from there today, so hopefully I'll be
> able to have the release candidate ready before the week is up.
>
> Hope the update is helpful,
> Daniel Oliveira
>
> On Thu, Aug 27, 2020 at 12:56 PM Eugene Kirpichov 
> wrote:
>
>> Hi!
>>
>> Just wondering how the progress on 2.24 has been?
>> I see the version in JIRA
>> https://issues.apache.org/jira/projects/BEAM/versions/12347146 is
>> blocked only by https://issues.apache.org/jira/browse/BEAM-10663 which
>> hasn't seen much action in the last week. Is there something specific
>> people can help with?
>>
>> Thanks!
>>
>> On 2020/08/12 01:59:20, Daniel Oliveira  wrote:
>> > I'd like to send out a last minute reminder to fill out CHANGES.md>
>> > <https://github.com/apache/beam/blob/master/CHANGES.md> with any
>> major>
>> > changes that are going to be in 2.24.0. If you need a quick review for>
>> > that, just add me as a reviewer to your PR (GitHub username is
>> "youngoli").>
>> > I'll keep an eye out for those until around 5 PM.>
>> >
>> > On another note, I need some help with setup from the release guide
>> > <https://beam.apache.org/contribute/release-guide/#one-time-setup-instructions>:
>> > 1. I need someone to add me as a maintainer of the apache-beam package
>> on>
>> > PyPI. Username: danoliveira>
>> > 2. Someone might need to create a new version in JIRA
>> > <https://beam.apache.org/contribute/release-guide/#create-a-new-version-in-jira>.
>> > I'm not sure about this one because 2.25.0 already exists, I don't know
>> if>
>> > 2.26.0 needs to be created or if that's for the next release.>
>> >
>> > On Mon, Aug 10, 2020 at 8:27 PM Daniel Oliveira >
>> > wrote:>
>> >
>> > > Hi everyone,>
>> > >>
>> > > It seems like there's no objections, so I'm preparing to cut the
>> release>
>> > > on Wednesday.>
>> > >>
>> > > As a reminder, if you have any release-blocking issues, please have a
>> JIRA>
>> > > and set "Fix version" to 2.24.0. For non-blocking issues, please set
>> "Fix>
>> > > version" only once the issue is actually resolved, otherwise it makes
>> it>
>> > > more difficult to differentiate release-blocking issues from
>> non-blocking.>
>> > >>
>> > > Thanks,>
>> > > Daniel Oliveira>
>> > >>
>> > > On Thu, Aug 6, 2020 at 4:53 PM Rui Wang  wrote:>
>> > >>
>> > >> Awesome!>
>> > >>>
>> > >>>
>> > >> -Rui>
>> > >>>
>> > >> On Thu, Aug 6, 2020 at 4:14 PM Ahmet Altay 
>> wrote:>
>> > >>>
>> > >>> +1 - Thank you Daniel!!>
>> > >>>>
>> > >>> On Wed, Jul 29, 2020 at 4:30 PM Daniel Oliveira >
>>
>> > >>> wrote:>
>> > >>>>
>> > >>>> > You probably meant 2.24.0.>
>> > >>>>>
>> > >>>> Thanks, yes I did. Mark "Fix Version/s" as "2.24.0" everyone. :)>
>> > >>>>>
>> > >>>> On Wed, Jul 29, 2020 at 4:14 PM Valentyn Tymofieiev <>
>> > >>>> valen...@google.com> wrote:>
>> > >>>>>
>> > >>>>> +1, Thanks Daniel!>
>> > >>>>>>
>> > >>>>> On Wed, Jul 29, 2020 at 4:04 PM Daniel Oliveira <>
>> > >>>>> danolive...@google.com> wrote:>
>> > >>>>>>
>> > >>>>>> Hi everyone,>
>> > >>>>>>>
>> > >>>>>> The next Beam release branch (2.24.0) is scheduled to be cut on>
>> > >>>>>> August 12 according to the release calendar [1].>
>> > >>>>>>>
>> > >>>>>> I'd like to volunteer to handle this release. Following the lead
>> of>
>> > >>>>>> previous release managers, I plan on cutting the branch on that
>> date and>
>> > >>>>>> cherrypicking in release-blocking fixes afterwards. So
>> unresolved release>
>> > >>>>>> blocking JIRA issues should have their "Fix Version/s" marked as
>> "2.23.0".>
>> > >>>>>>>
>> > >>>>> You probably meant 2.24.0 [1].>
>> > >>>>>>
>> > >>>>>>
>> > >>>>>> Any comments or objections?>
>> > >>>>>>>
>> > >>>>>> Thanks,>
>> > >>>>>> Daniel Oliveira>
>> > >>>>>>>
>> > >>>>>> [1]>
>> > >>>>>>
>> https://calendar.google.com/calendar/embed?src=0p73sl034k80oob7seouanigd0%40group.calendar.google.com>
>>
>> > >>>>>>>
>> > >>>>> [1]
>> https://issues.apache.org/jira/projects/BEAM/versions/12347146>
>> > >>>>>>
>> > >>>>>
>> >
>>
>

-- 
Eugene Kirpichov
http://www.linkedin.com/in/eugenekirpichov


Re: [PROPOSAL] Preparing for Beam 2.24.0 release

2020-08-27 Thread Eugene Kirpichov
Hi!

Just wondering how the progress on 2.24 has been?
I see the version in JIRA
https://issues.apache.org/jira/projects/BEAM/versions/12347146 is blocked
only by https://issues.apache.org/jira/browse/BEAM-10663 which hasn't seen
much action in the last week. Is there something specific people can help
with?

Thanks!

On 2020/08/12 01:59:20, Daniel Oliveira  wrote:
> I'd like to send out a last minute reminder to fill out CHANGES.md>
>  with any major>
> changes that are going to be in 2.24.0. If you need a quick review for>
> that, just add me as a reviewer to your PR (GitHub username is
"youngoli").>
> I'll keep an eye out for those until around 5 PM.>
>
> On another note, I need some help with setup from the release guide
> <https://beam.apache.org/contribute/release-guide/#one-time-setup-instructions>:
> 1. I need someone to add me as a maintainer of the apache-beam package
on>
> PyPI. Username: danoliveira>
> 2. Someone might need to create a new version in JIRA
> <https://beam.apache.org/contribute/release-guide/#create-a-new-version-in-jira>.
> I'm not sure about this one because 2.25.0 already exists, I don't know
if>
> 2.26.0 needs to be created or if that's for the next release.>
>
> On Mon, Aug 10, 2020 at 8:27 PM Daniel Oliveira >
> wrote:>
>
> > Hi everyone,>
> >>
> > It seems like there's no objections, so I'm preparing to cut the
release>
> > on Wednesday.>
> >>
> > As a reminder, if you have any release-blocking issues, please have a
JIRA>
> > and set "Fix version" to 2.24.0. For non-blocking issues, please set
"Fix>
> > version" only once the issue is actually resolved, otherwise it makes
it>
> > more difficult to differentiate release-blocking issues from
non-blocking.>
> >>
> > Thanks,>
> > Daniel Oliveira>
> >>
> > On Thu, Aug 6, 2020 at 4:53 PM Rui Wang  wrote:>
> >>
> >> Awesome!>
> >>>
> >>>
> >> -Rui>
> >>>
> >> On Thu, Aug 6, 2020 at 4:14 PM Ahmet Altay  wrote:>
> >>>
> >>> +1 - Thank you Daniel!!>
> 
> >>> On Wed, Jul 29, 2020 at 4:30 PM Daniel Oliveira >
> >>> wrote:>
> 
>  > You probably meant 2.24.0.>
> >
>  Thanks, yes I did. Mark "Fix Version/s" as "2.24.0" everyone. :)>
> >
>  On Wed, Jul 29, 2020 at 4:14 PM Valentyn Tymofieiev <>
>  valen...@google.com> wrote:>
> >
> > +1, Thanks Daniel!>
> >>
> > On Wed, Jul 29, 2020 at 4:04 PM Daniel Oliveira <>
> > danolive...@google.com> wrote:>
> >>
> >> Hi everyone,>
> >>>
> >> The next Beam release branch (2.24.0) is scheduled to be cut on>
> >> August 12 according to the release calendar [1].>
> >>>
> >> I'd like to volunteer to handle this release. Following the lead
of>
> >> previous release managers, I plan on cutting the branch on that
date and>
> >> cherrypicking in release-blocking fixes afterwards. So unresolved
release>
> >> blocking JIRA issues should have their "Fix Version/s" marked as
"2.23.0".>
> >>>
> > You probably meant 2.24.0 [1].>
> >>
> >>
> >> Any comments or objections?>
> >>>
> >> Thanks,>
> >> Daniel Oliveira>
> >>>
> >> [1]>
> >>
https://calendar.google.com/calendar/embed?src=0p73sl034k80oob7seouanigd0%40group.calendar.google.com>

> >>>
> > [1] https://issues.apache.org/jira/projects/BEAM/versions/12347146>
> >>
> >
>


Re: [DISCUSS] ReadAll pattern and consistent use in IO connectors

2020-07-01 Thread Eugene Kirpichov
Kenn - I don't mean an enum of common closures, I mean expressing closures
in a restricted sub-language such as the language of SQL expressions. That
would only work if there is a portable way to interpret SQL expressions,
but if there isn't, maybe there should be - for the sake of, well,
expressing closures portably. Of course these would be closures that only
work with rows - but that seems powerful enough for many if not most
purposes.

For example, maybe the Java example:

 PCollection<Transaction> transactions = ...;
 transactions.apply(FileIO.<TransactionType, Transaction>writeDynamic()
     .by(Transaction::getType)
     .via(tx -> tx.getType().toFields(tx),  // Convert the data to be written to CSVSink
          type -> new CSVSink(type.getFieldNames()))
     .to(".../path/to/")
     .withNaming(type -> defaultNaming(type + "-transactions", ".csv")));

could be written in Python as:

transactions | fileio.write_dynamic(
  by="it.type",  # "it" is implicitly available in these SQL expressions as
the same thing as the Java lambda argument
  format="it.fields",
  sink="CSV_SINK(it.type.field_names)",  # A bunch of preset sinks
supported in every language?
  to=".../path/to/",
  naming="DEFAULT_NAMING(CONCAT(it, '-transactions'), '.csv')")

Again, to be clear, I'm not suggesting to block what Ismael is proposing on
getting this done - getting this done wouldn't be a short term effort, but
seems potentially really nice.


On Wed, Jul 1, 2020 at 3:19 PM Robert Burke  wrote:

> From the Go side of the table, the Go language doesn't provide a mechanism
> to serialize or access closure data, which means DoFns can't be functional
> closures. This combined with the move to have the "Structural DoFns" be
> serialized using Beam Schemas, has the net result that if Go transforms are
> used for Cross Language, they will be configurable with a Schema of the
> configuration data.
>
> Of course, this just means that each language will probably provide
> whichever mechanisms it likes for use of its cross-language transforms.
>
> On Tue, 30 Jun 2020 at 16:07, Kenneth Knowles  wrote:
>
>> I don't think an enum of most common closures will work. The input types
>> are typically generics that are made concrete by the caller who also
>> provides the closures. I think Luke's (2) is the same idea as my "Java
>> still assembles it [using opaque Python closures/transforms]". It seems
>> like an approach to (3). Passing over actual code could address some cases,
>> but libraries become the issue.
>>
>> I think it is fair to say that "WriteAll" style would involve entering
>> unexplored territory.
>>
>> On the main topic, I think Brian has a pretty strong point and his
>> example of type conversion lambdas is a good example. I did a quick survey
>> and every other property I could find does seem like it fits on the Read,
>> and most IOs have a few of these closures for example also extracting
>> timestamps. So maybe just a resolution convention of putting them on the
>> ReadAll and that taking precedence. Then you would be deserializing a Read
>> transform with insta-crash methods or some such?
>>
>> Kenn
>>
>> On Tue, Jun 30, 2020 at 10:24 AM Eugene Kirpichov 
>> wrote:
>>
>>> Yeah, mainly I just feel like dynamic reads and dynamic writes (and
>>> perhaps not-yet-invented similar transforms of other kinds) are tightly
>>> related - they are either very similar, or are duals of each other - so
>>> they should use the same approach. If they are using different approaches,
>>> it is a sign that either one of them is being done wrong or that we are
>>> running into a fundamental limitation of Beam (e.g. difficulty of encoding
>>> closures compared to encoding elements).
>>>
>>> But I agree with Luke that we shouldn't give up on closures. Especially
>>> with the work that has been done on schemas and SQL, I see no reason why we
>>> couldn't express closures in a portable restricted sub-language. If we can
>>> express SQL, we can express many or most use cases of dynamic reads/writes
>>> - I don't mean that we should actually use SQL (though we *could* -
>>> e.g. SQL scalar expressions seem powerful enough to express the closures
>>> appearing in most use cases of FileIO.writeDynamic), I just mean that SQL
>>> is an existence proof.
>>>
>>> (I don't want to rock the boat too much, just thought I'd chime in as
>>> this topic is dear to my heart)
>>>
>>> On Tue, Jun 30, 2020 at 9:59 AM Luke Cwik  wrote:
>>>
>>>> Kenn, I'm not too worried about closures since:
>>>

Re: [DISCUSS] ReadAll pattern and consistent use in IO connectors

2020-06-30 Thread Eugene Kirpichov
Yeah, mainly I just feel like dynamic reads and dynamic writes (and perhaps
not-yet-invented similar transforms of other kinds) are tightly related -
they are either very similar, or are duals of each other - so they should
use the same approach. If they are using different approaches, it is a sign
that either one of them is being done wrong or that we are running into a
fundamental limitation of Beam (e.g. difficulty of encoding closures
compared to encoding elements).

But I agree with Luke that we shouldn't give up on closures. Especially
with the work that has been done on schemas and SQL, I see no reason why we
couldn't express closures in a portable restricted sub-language. If we can
express SQL, we can express many or most use cases of dynamic reads/writes
- I don't mean that we should actually use SQL (though we *could* - e.g.
SQL scalar expressions seem powerful enough to express the closures
appearing in most use cases of FileIO.writeDynamic), I just mean that SQL
is an existence proof.

(I don't want to rock the boat too much, just thought I'd chime in as this
topic is dear to my heart)

On Tue, Jun 30, 2020 at 9:59 AM Luke Cwik  wrote:

> Kenn, I'm not too worried about closures since:
> 1) the expansion service for a transform could have a well set of defined
> closures by name that are returned as serialized objects that don't need to
> be interpretable by the caller
> 2) the language could store serialized functions of another language as
> constants
> 3) generic XLang function support will eventually be needed
> but I do agree that closures do make things difficult to express vs data
> which is why primarily why we should prefer data over closures when
> possible and use closures when expressing it with data would be too
> cumbersome.
>
> Brian, so far the cases that have been migrated have shown that the source
> descriptor and the Read transform are almost the same (some parameters that
> only impact pipeline construction such as coders differ).
>
> On Mon, Jun 29, 2020 at 2:33 PM Brian Hulette  wrote:
>
>> Sorry for jumping into this late and casting a vote against the
>> consensus... but I think I'd prefer standardizing on a pattern like
>> PCollection<KafkaSourceDescriptor> rather than PCollection<Read>. That
>> approach clearly separates the parameters that are allowed to vary across a
>> ReadAll (the ones defined in KafkaSourceDescriptor) from the parameters
>> that should be constant (other parameters in the Read object, like
>> SerializedFunctions for type conversions, parameters for different
>> operating modes, etc...). I think it's helpful to think of the parameters
>> that are allowed to vary as some "location descriptor", but I imagine IO
>> authors may want other parameters to vary across a ReadAll as well.
>>
>> To me it seems safer to let an IO author "opt-in" to a parameter being
>> dynamic at execution time.
>>
>> Brian
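
For illustration, a descriptor in that style could be sketched roughly as below
(hypothetical names and fields; this is not Beam's actual KafkaSourceDescriptor):

import java.io.Serializable;

import org.apache.beam.sdk.schemas.JavaFieldSchema;
import org.apache.beam.sdk.schemas.annotations.DefaultSchema;

// Only the parameters that are allowed to vary per element live here; everything
// else (coders, type-conversion lambdas, operating modes) stays on the Read/ReadAll.
// Registering a schema lets the descriptor cross language boundaries as a Row.
@DefaultSchema(JavaFieldSchema.class)
public class SourceDescriptorSketch implements Serializable {
  public String topic;
  public Integer partition;
  public Long startOffset;  // null could mean "use whatever the transform's default is"
}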
>>
>> On Mon, Jun 29, 2020 at 9:26 AM Eugene Kirpichov 
>> wrote:
>>
>>> I'd like to raise one more time the question of consistency between
>>> dynamic reads and dynamic writes, per my email at the beginning of the
>>> thread.
>>> If the community prefers ReadAll to read from Read, then should
>>> dynamicWrite's write to Write?
>>>
>>> On Mon, Jun 29, 2020 at 8:57 AM Boyuan Zhang  wrote:
>>>
>>>> It seems like most of us agree on the idea that ReadAll should read
>>>> from Read. I'm going to update the Kafka ReadAll with the same pattern.
>>>> Thanks for all your help!
>>>>
>>>> On Fri, Jun 26, 2020 at 12:12 PM Chamikara Jayalath <
>>>> chamik...@google.com> wrote:
>>>>
>>>>>
>>>>>
>>>>> On Fri, Jun 26, 2020 at 11:49 AM Luke Cwik  wrote:
>>>>>
>>>>>> I would also like to suggest that transforms that implement ReadAll
>>>>>> via Read should also provide methods like:
>>>>>>
>>>>>> // Uses the specified values if unspecified in the input element from
>>>>>> the PCollection.
>>>>>> withDefaults(Read read);
>>>>>> // Uses the specified values regardless of what the input element
>>>>>> from the PCollection specifies.
>>>>>> withOverrides(Read read);
>>>>>>
>>>>>> and only adds methods that are required at construction time (e.g.
>>>>>> coders). This way the majority of documentation sits on the Read 
>>>>>> transform.
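
A hypothetical sketch of what withDefaults(Read) could do inside such a ReadAll
expansion (the getters/setters are placeholders, not an existing IO's API):

// For each input element's Read, fill in anything it leaves unset from the defaults.
private static Read withDefaultsApplied(Read fromElement, Read defaults) {
  Read merged = fromElement;
  if (merged.getQuery() == null) {
    merged = merged.withQuery(defaults.getQuery());
  }
  if (merged.getRowMapper() == null) {
    merged = merged.withRowMapper(defaults.getRowMapper());
  }
  return merged;
}

withOverrides(Read) would be the mirror image: the supplied values always win over
whatever the element specifies.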
>>>>>>
>>>>>
>>>>> +0 from me. Sounds like benefits outweigh the drawbacks here and some

Re: [DISCUSS] ReadAll pattern and consistent use in IO connectors

2020-06-29 Thread Eugene Kirpichov
ata type that is schema-aware as the input of ReadAll.
>>>>>>> >>
>>>>>>> >> On Wed, Jun 24, 2020 at 7:42 PM Boyuan Zhang 
>>>>>>> wrote:
>>>>>>> >>>
>>>>>>> >>> Thanks for the summary, Cham!
>>>>>>> >>>
>>>>>>> >>> I think we can go with (2) and (4): use the data type that is
>>>>>>> schema-aware as the input of ReadAll.
>>>>>>> >>>
>>>>>>> >>> Converting Read into ReadAll helps us to stick with SDF-like IO.
>>>>>>> But only having  (3) is not enough to solve the problem of using 
>>>>>>> ReadAll in
>>>>>>> x-lang case.
>>>>>>> >>>
>>>>>>> >>> The key point of ReadAll is that the input type of ReadAll
>>>>>>> should be able to cross language boundaries and have compatibilities of
>>>>>>> updating/downgrading. After investigating some possibilities(pure java 
>>>>>>> pojo
>>>>>>> with custom coder, protobuf, row/schema) in Kafka usage, we find that
>>>>>>> row/schema fits our needs most. Here comes (4). I believe that using 
>>>>>>> Read
>>>>>>> as input of ReadAll makes sense in some cases, but I also think not all 
>>>>>>> IOs
>>>>>>> have the same need. I would treat Read as a special type as long as the
>>>>>>> Read is schema-aware.
>>>>>>> >>>
>>>>>>> >>> On Wed, Jun 24, 2020 at 6:34 PM Chamikara Jayalath <
>>>>>>> chamik...@google.com> wrote:
>>>>>>> >>>>
>>>>>>> >>>> I see. So it seems like there are three options discussed so
>>>>>>> far when it comes to defining source descriptors for ReadAll type 
>>>>>>> transforms
>>>>>>> >>>>
>>>>>>> >>>> (1) Use Read PTransform as the element type of the input
>>>>>>> PCollection
>>>>>>> >>>> (2) Use a POJO that describes the source as the data element of
>>>>>>> the input PCollection
>>>>>>> >>>> (3) Provide a converter as a function to the Read transform
>>>>>>> which essentially will convert it to a ReadAll (what Eugene mentioned)
>>>>>>> >>>>
>>>>>>> >>>> I feel like (3) is more suitable for a related set of source
>>>>>>> descriptions such as files.
>>>>>>> >>>> (1) will allow most code-reuse but seems like will make it hard
>>>>>>> to use the ReadAll transform as a cross-language transform and will 
>>>>>>> break
>>>>>>> the separation of construction time and runtime constructs
>>>>>>> >>>> (2) could result to less code reuse if not careful but will
>>>>>>> make the transform easier to be used as a cross-language transform 
>>>>>>> without
>>>>>>> additional modifications
>>>>>>> >>>>
>>>>>>> >>>> Also, with SDF, we can create ReadAll-like transforms that are
>>>>>>> more efficient. So we might be able to just define all sources in that
>>>>>>> format and make Read transforms just an easy to use composite built on 
>>>>>>> top
>>>>>>> of that (by adding a preceding Create transform).
>>>>>>> >>>>
>>>>>>> >>>> Thanks,
>>>>>>> >>>> Cham
>>>>>>> >>>>
>>>>>>> >>>> On Wed, Jun 24, 2020 at 11:10 AM Luke Cwik 
>>>>>>> wrote:
>>>>>>> >>>>>
>>>>>>> >>>>> I believe we do require PTransforms to be serializable since
>>>>>>> anonymous DoFns typically capture the enclosing PTransform.
>>>>>>> >>>>>
>>>>>>> >>>>> On Wed, Jun 24, 2020 at 10:52 AM Chamikara Jayalath <
>>>>>>> chamik...@google.com> wrote:
>>>>>>> >>>>>>
>>>>>>> >>>>

Re: [DISCUSS] ReadAll pattern and consistent use in IO connectors

2020-06-24 Thread Eugene Kirpichov
Hi Ismael,

Thanks for taking this on. Have you considered an approach similar (or
dual) to FileIO.write(), where we in a sense also have to configure a
dynamic number different IO transforms of the same type (file writes)?

E.g. how in this example we configure many aspects of many file writes:

transactions.apply(FileIO.<TransactionType, Transaction>writeDynamic()
    .by(Transaction::getType)
    .via(tx -> tx.getType().toFields(tx),  // Convert the data to be written to CSVSink
         type -> new CSVSink(type.getFieldNames()))
    .to(".../path/to/")
    .withNaming(type -> defaultNaming(type + "-transactions", ".csv")));

we could do something similar for many JdbcIO reads:

PCollection<Bar> bars;  // user-specific type from which all the read parameters can be inferred
PCollection<Moo> moos = bars.apply(JdbcIO.readAll()
  .fromQuery(bar -> ...compute query for this bar...)
  .withMapper((bar, resultSet) -> new Moo(...))
  .withBatchSize(bar -> ...compute batch size for this bar...)
  ...etc);


On Wed, Jun 24, 2020 at 6:53 AM Ismaël Mejía  wrote:

> Hello,
>
> (my excuses for the long email but this requires context)
>
> As part of the move from Source based IOs to DoFn based ones. One pattern
> emerged due to the composable nature of DoFn. The idea is to have a
> different
> kind of composable reads where we take a PCollection of different sorts of
> intermediate specifications e.g. tables, queries, etc, for example:
>
> JdbcIO:
> ReadAll<ParameterT, OutputT> extends
>     PTransform<PCollection<ParameterT>, PCollection<OutputT>>
>
> RedisIO:
> ReadAll extends PTransform<PCollection<String>, PCollection<KV<String, String>>>
>
> HBaseIO:
> ReadAll extends PTransform<PCollection<HBaseQuery>, PCollection<Result>>
>
> These patterns enabled richer use cases like doing multiple queries in the
> same
> Pipeline, querying based on key patterns or querying from multiple tables
> at the
> same time but came with some maintenance issues:
>
> - We ended up needing to add to the ReadAll transforms the parameters for
>   missing information so we ended up with lots of duplicated with methods
> and
>   error-prone code from the Read transforms into the ReadAll transforms.
>
> - When you require new parameters you have to expand the input parameters
> of the
>   intermediary specification into something that resembles the full `Read`
>   definition for example imagine you want to read from multiple tables or
>   servers as part of the same pipeline but this was not in the intermediate
>   specification you end up adding those extra methods (duplicating more
> code)
>   just to get close to being like the full Read spec.
>
> - If new parameters are added to the Read method we end up adding them
>   systematically to the ReadAll transform too so they are taken into
> account.
>
> Due to these issues I recently did a change to test a new approach that is
> simpler, more complete and maintainable. The code became:
>
> HBaseIO:
> ReadAll extends PTransform<PCollection<Read>, PCollection<Result>>
>
> With this approach users gain benefits of improvements on parameters of
> normal
> Read because they count with the full Read parameters. But of course there
> are
> some minor caveats:
>
> 1. You need to push some information into normal Reads for example
>partition boundaries information or Restriction information (in the SDF
>case).  Notice that this consistent approach of ReadAll produces a
> simple
>pattern that ends up being almost reusable between IOs (e.g. the
> non-SDF
>case):
>
>   public static class ReadAll extends PTransform<PCollection<Read>,
>       PCollection<Result>> {
>     @Override
>     public PCollection<Result> expand(PCollection<Read> input) {
>       return input
>           .apply("Split", ParDo.of(new SplitFn()))
>           .apply("Reshuffle", Reshuffle.viaRandomKey())
>           .apply("Read", ParDo.of(new ReadFn()));
>     }
>   }
>
> 2. If you are using Generic types for the results ReadAll you must have the
>Coders used in its definition and require consistent types from the data
>sources, in practice this means we need to add extra withCoder
> method(s) on
>ReadAll but not the full specs.
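
For illustration, the SplitFn/ReadFn pair from point 1 above could be sketched as
follows, assuming a Read spec that can enumerate its own sub-reads and execute
itself (the actual DoFns differ per IO):

import org.apache.beam.sdk.transforms.DoFn;

class SplitFn extends DoFn<Read, Read> {
  @ProcessElement
  public void process(@Element Read read, OutputReceiver<Read> out) {
    // Assumed: the Read spec can enumerate smaller sub-reads (partitions, key ranges, ...).
    for (Read partial : read.split()) {
      out.output(partial);
    }
  }
}

class ReadFn extends DoFn<Read, Result> {
  @ProcessElement
  public void process(@Element Read read, OutputReceiver<Result> out) throws Exception {
    // Assumed: the Read spec can execute itself and emit its results.
    for (Result result : read.execute()) {
      out.output(result);
    }
  }
}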
>
>
> At the moment HBaseIO and SolrIO already follow this ReadAll pattern.
> RedisIO
> and CassandraIO have already WIP PRs to do so. So I wanted to bring this
> subject
> to the mailing list to see your opinions, and if you see any sort of
> issues that
> we might be missing with this idea.
>
> Also I would like to see if we have consensus to start using consistently
> the
> terminology of ReadAll transforms based on Read and the readAll() method
> for new
> IOs (at this point probably outdoing this in the only remaining
> inconsistent
> place in JdbcIO might not be a good idea but apart of this we should be
> ok).
>
> I mention this because the recent PR on KafkaIO based on SDF is doing
> something
> similar to the old pattern but being called ReadAll and maybe it is worth
> to be
> consistent for the benefit of users.
>
> Regards,
> Ismaël
>


Re: Per Element File Output Without writeDynamic

2019-12-03 Thread Eugene Kirpichov
Hi Christopher,

Thanks for clarifying. Then can you just preprocess the PCollection with a
custom FlatMapElements that converts each Document into one or more smaller
documents, small enough to be written into individual files? Then pair it
with a unique key and follow by FileIO.writeDynamic().by(the unique
key).withNumShards(1) to produce 1 file per document.
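
A rough sketch of that suggestion is below; Document, getUniqueId(),
splitIfTooLarge() and ThriftSink are placeholders for the user's own types rather
than existing Beam or Thrift APIs:

PCollection<Document> documents = ...;
documents
    .apply("SplitLargeDocs",
        FlatMapElements.into(TypeDescriptor.of(Document.class))
            .via((Document doc) -> splitIfTooLarge(doc)))  // emits 1..n smaller documents
    .apply("OneFilePerDoc",
        FileIO.<String, Document>writeDynamic()
            .by(doc -> doc.getUniqueId())                  // unique key per document
            .via(doc -> doc.serializeToThrift(),           // convert to the sink's element type
                 key -> new ThriftSink())
            .to(".../path/to/")
            .withNaming(key -> defaultNaming(key, ".thrift"))
            .withNumShards(1));                            // one shard (file) per unique key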

On Tue, Dec 3, 2019 at 7:55 AM Christopher Larsen <
christopher.lar...@quantiphi.com> wrote:

> Hi Eugene,
>
> Yes I think you've got it correct. In our use case we need to write each
> Document in the PCollection to a separate file as multiple Documents in a
> file will cause compilation errors and/or incorrect code to be generated by
> the Thrift compiler.
>
> Additionally there are some Documents that are so large that we would want
> them to be split.
>
> On Mon, Dec 2, 2019 at 9:45 PM Eugene Kirpichov  wrote:
>
>> Hi Christopher,
>>
>> So, you have a PCollection<Document>, and you're writing it to files.
>> FileIO.write/writeDynamic will write several Document's to each file -
>> however, in your use case some of the individual Document's are so large
>> that you want instead each of those large documents to be split into
>> several files.
>>
>> Before we continue, could you confirm whether my understanding is correct?
>>
>> Thanks.
>>
>> On Mon, Dec 2, 2019 at 7:08 PM Christopher Larsen <
>> christopher.lar...@quantiphi.com> wrote:
>>
>>> Ideally each element (document) will be written to a .thrift file so
>>> that it can be compiled without further manipulation.
>>>
>>> But in the case of an extremely large file I think it would be nice to
>>> split into smaller files. As far as splitting points go I think it could be
>>> split at a point in the list of definitions. Thoughts?
>>>
>>> On Mon, Dec 2, 2019 at 4:02 PM Reuven Lax  wrote:
>>>
>>>> What do you mean by shard the output file? Can it be split at any byte
>>>> location, or only at specific points?
>>>>
>>>> On Mon, Dec 2, 2019 at 2:05 PM Christopher Larsen <
>>>> christopher.lar...@quantiphi.com> wrote:
>>>>
>>>>> Hi Reuven,
>>>>>
>>>>> We would like to write each element to one file but still allow the
>>>>> runner to shard the output file which could yield more than one output 
>>>>> file
>>>>> per element.
>>>>>
>>>>> On Mon, Dec 2, 2019 at 11:55 AM Reuven Lax  wrote:
>>>>>
>>>>>> I'm not sure I completely understand the question. Are you saying
>>>>>> that you want each element to write to only one file, guaranteeing that 
>>>>>> two
>>>>>> elements are never written to the same file?
>>>>>>
>>>>>> On Mon, Dec 2, 2019 at 11:53 AM Christopher Larsen <
>>>>>> christopher.lar...@quantiphi.com> wrote:
>>>>>>
>>>>>>> Hi All,
>>>>>>>
>>>>>>> TL/DR: can you extend FileIO.sink to write one or more file per
>>>>>>> element instead of one or more elements per file?
>>>>>>>
>>>>>>> In working with Thrift files we have found that since a .thrift file
>>>>>>> needs to be compiled to generate code the order of the contents of the 
>>>>>>> file
>>>>>>> are important (ie, the namespace and includes elements need to come 
>>>>>>> before
>>>>>>> definitions are defined).
>>>>>>>
>>>>>>> The issue that we are facing is that by implementing
>>>>>>> FileIO.sink we cannot determine how many Document objects are
>>>>>>> written to a file since this is determined by the runner. This can 
>>>>>>> result
>>>>>>> in more than one Document being written to a file which will cause
>>>>>>> compilation errors.
>>>>>>>
>>>>>>> We know that this can be controlled by writeDynamic but since we
>>>>>>> believe the default behavior for the connector should be to output a
>>>>>>> Document to one or more files (depending on sharding) we were wondering 
>>>>>>> how
>>>>>>> to best accomplish this.
>>>>>>>
>>>>>>> Best,
>>>>>>> Chris
>>>>>>>

Re: Per Element File Output Without writeDynamic

2019-12-02 Thread Eugene Kirpichov
Hi Christopher,

So, you have a PCollection<Document>, and you're writing it to files.
FileIO.write/writeDynamic will write several Document's to each file -
however, in your use case some of the individual Document's are so large
that you want instead each of those large documents to be split into
several files.

Before we continue, could you confirm whether my understanding is correct?

Thanks.

On Mon, Dec 2, 2019 at 7:08 PM Christopher Larsen <
christopher.lar...@quantiphi.com> wrote:

> Ideally each element (document) will be written to a .thrift file so that
> it can be compiled without further manipulation.
>
> But in the case of an extremely large file I think it would be nice to
> split into smaller files. As far as splitting points go I think it could be
> split at a point in the list of definitions. Thoughts?
>
> On Mon, Dec 2, 2019 at 4:02 PM Reuven Lax  wrote:
>
>> What do you mean by shard the output file? Can it be split at any byte
>> location, or only at specific points?
>>
>> On Mon, Dec 2, 2019 at 2:05 PM Christopher Larsen <
>> christopher.lar...@quantiphi.com> wrote:
>>
>>> Hi Reuven,
>>>
>>> We would like to write each element to one file but still allow the
>>> runner to shard the output file which could yield more than one output file
>>> per element.
>>>
>>> On Mon, Dec 2, 2019 at 11:55 AM Reuven Lax  wrote:
>>>
 I'm not sure I completely understand the question. Are you saying that
 you want each element to write to only one file, guaranteeing that two
 elements are never written to the same file?

 On Mon, Dec 2, 2019 at 11:53 AM Christopher Larsen <
 christopher.lar...@quantiphi.com> wrote:

> Hi All,
>
> TL/DR: can you extend FileIO.sink to write one or more file per
> element instead of one or more elements per file?
>
> In working with Thrift files we have found that since a .thrift file
> needs to be compiled to generate code the order of the contents of the 
> file
> are important (ie, the namespace and includes elements need to come before
> definitions are defined).
>
> The issue that we are facing is that by implementing
> FileIO.sink we cannot determine how many Document objects are
> written to a file since this is determined by the runner. This can result
> in more than one Document being written to a file which will cause
> compilation errors.
>
> We know that this can be controlled by writeDynamic but since we
> believe the default behavior for the connector should be to output a
> Document to one or more files (depending on sharding) we were wondering 
> how
> to best accomplish this.
>
> Best,
> Chris
>

>> --
> *Regards,*
>
> ___
>
> *Chris Larsen*
>
> Data Engineer | Quantiphi Inc. | US and India
>
> http://www.quantiphi.com | Analytics is in our DNA
>
> USA: +1 760 504 8477
> 
>
>
>


Re: [VOTE] Beam Mascot animal choice: vote for as many as you want

2019-11-20 Thread Eugene Kirpichov
[ ] Beaver
[ ] Hedgehog
[X] Lemur
[X] Owl
[ ] Salmon
[ ] Trout
[ ] Robot dinosaur
[X] Firefly
[X] Cuttlefish
[ ] Dumbo Octopus
[ ] Angler fish

On Wed, Nov 20, 2019 at 9:47 AM Valentyn Tymofieiev 
wrote:

> [ ] Beaver
> [X] Hedgehog
> [ ] Lemur
> [ ] Owl
> [ ] Salmon
> [ ] Trout
> [ ] Robot dinosaur
> [X] Firefly
> [ ] Cuttlefish
> [ ] Dumbo Octopus
> [ ] Angler fish
>
> On Tue, Nov 19, 2019 at 6:44 PM Kenneth Knowles  wrote:
>
>> Please cast your votes of approval [1] for animals you would support as
>> Beam mascot. The animal with the most approval will be identified as the
>> favorite.
>>
>> *** Vote for as many as you like, using this checklist as a template 
>>
>> [ ] Beaver
>> [ ] Hedgehog
>> [ ] Lemur
>> [ ] Owl
>> [ ] Salmon
>> [ ] Trout
>> [ ] Robot dinosaur
>> [ ] Firefly
>> [ ] Cuttlefish
>> [ ] Dumbo Octopus
>> [ ] Angler fish
>>
>> This vote will remain open for at least 72 hours.
>>
>> Kenn
>>
>> [1] See https://en.wikipedia.org/wiki/Approval_voting#Description and
>> https://www.electionscience.org/library/approval-voting/
>>
>


Re: RabbitMQ and CheckpointMark feasibility

2019-11-14 Thread Eugene Kirpichov
f this as a series of micro-bundles, where
micro-bundles are delimited by checkpoint marks, and each micro-bundle is a
runner-side transaction which either commits or discards the results of
processing all messages in this micro-bundle. After a micro-bundle [M1, M2)
commits, the runner calls M1.finalizeCheckpointMark() and persists M2 as
the new restore point in case of failure.

> So, tl;dr: I cannot find any means of maintaining a persistent connection
> to the server for finalizing checkpoints that is safe across runners. If
> there's a guarantee all of the shards are on the same JVM instance, I could
> rely on global, static collections/instances as a workaround, but if other
> runners might serialize this across the wire, I'm stumped. The only
> workable situation I can think of right now is to proactively acknowledge
> messages as they are received and effectively no-op in finalizeCheckpoint.
> This is very different, semantically, and can lead to dropped messages if a
> pipeline doesn't finish processing the given message.
>
> Any help would be much appreciated.
>
If I'm misunderstanding something above, could you describe in detail a
scenario that leads to message loss, or (less severe) to more-than-once
durable processing of the same message?


> Thanks,
> -Danny
> On 11/7/19 10:27 PM, Eugene Kirpichov wrote:
>
> Hi Daniel,
>
> This is probably insufficiently well documented. The CheckpointMark is
> used for two purposes:
> 1) To persistently store some notion of how much of the stream has been
> consumed, so that if something fails we can tell the underlying streaming
> system where to start reading when we re-create the reader. This is why
> CheckpointMark is Serializable. E.g. this makes sense for Kafka.
> 2) To do acks - to let the underlying streaming system know that the Beam
> pipeline will never need data up to this CheckpointMark. Acking does not
> require serializability - runners call ack() on the same in-memory instance
> of CheckpointMark that was produced by the reader. E.g. this makes sense
> for RabbitMq or Pubsub.
>
> In practice, these two capabilities tend to be mutually exclusive: some
> streaming systems can provide a serializable CheckpointMark, some can do
> acks, some can do neither - but very few (or none) can do both, and it's
> debatable whether it even makes sense for a system to provide both
> capabilities: usually acking is an implicit form of streaming-system-side
> checkpointing, i.e. when you re-create the reader you don't actually need
> to carry over any information from an old CheckpointMark - the necessary
> state (which records should be delivered) is maintained on the streaming
> system side.
>
> These two are lumped together into one API simply because that was the
> best design option we came up with (not for lack of trying, but suggestions
> very much welcome - AFAIK nobody is happy with it).
>
> RabbitMQ is under #2 - it can't do serializable checkpoint marks, but it
> can do acks. So you can simply ignore the non-serializability.
>
> On Thu, Nov 7, 2019 at 12:07 PM Daniel Robert 
> wrote:
>
>> (Background: I recently upgraded RabbitMqIO from the 4.x to 5.x library.
>> As part of this I switched to a pull-based API rather than the
>> previously-used push-based. This has caused some nebulous problems so
>> put up a correction PR that I think needs some eyes fairly quickly as
>> I'd consider master to be broken for rabbitmq right now. The PR keeps
>> the upgrade but reverts to the same push-based implementation as in 4.x:
>> https://github.com/apache/beam/pull/9977 )
>>
>> Regardless, in trying to get the pull-based API to work, I'm finding the
>> interactions between rabbitmq and beam with CheckpointMark to be
>> fundamentally impossible to implement so I'm hoping for some input here.
>>
>> CheckointMark itself must be Serializable, presumably this means it gets
>> shuffled around between nodes. However 'Channel', the tunnel through
>> which it communicates with Rabbit to ack messages and finalize the
>> checkpoint, is non-Serializable. Like most other CheckpointMark
>> implementations, Channel is 'transient'. When a new CheckpointMark is
>> instantiated, it's given a Channel. If an existing one is supplied to
>> the Reader's constructor (part of the 'startReader()' interface), the
>> channel is overwritten.
>>
>> *However*, Rabbit does not support 'ack'ing messages on a channel other
>> than the one that consumed them in the first place. Attempting to do so
>> results in a '406 (PRECONDITION-FAILED) - unknown delivery tag'. (See
>>
>> https://www.grzegorowski.com/rabbitmq-406-channel-closed-precondition-failed
>> ).
>>
>> Truthfully, I don't really understand how the current implementation is
>> working; it seems like a happy accident. But I'm curious if someone
>> could help me debug and implement how to bridge the
>> re-usable/serializable CheckpointMark requirement in Beam with this
>> limitation of Rabbit.
>>
>> Thanks,
>> -Daniel Robert
>>
>>


Re: RabbitMQ and CheckpointMark feasibility

2019-11-08 Thread Eugene Kirpichov
On Fri, Nov 8, 2019 at 5:57 AM Daniel Robert  wrote:

> Thanks Euguene and Reuven.
>
> In response to Eugene, I'd like to confirm I have this correct: In the
> rabbit-style use case of "stream-system-side checkpointing", it is safe
> (and arguably the correct behavior) to ignore the supplied CheckpointMark
> argument in `createReader(options, checkpointmark)` and in the constructor
> for the and instead always instantiate a new CheckpointMark during
> construction. Is that correct?
>
Yes, this is correct.


> In response to Reuven: noted, however I was mostly using serialization in
> the general sense. That is, there does not seem to be any means of
> deserializing a RabbitMqCheckpointMark such that it can continue to provide
> value to a runner. Whether it's java serialization, avro, or any other
> Coder, the 'channel' itself cannot "come along for the ride", which leaves
> the rest of the internal state mostly unusable except for perhaps some
> historical, immutable use case.
>
> -Danny
> On 11/8/19 2:01 AM, Reuven Lax wrote:
>
> Just to clarify one thing: CheckpointMark does not need to be Java
> Seralizable. All that's needed is do return a Coder for the CheckpointMark
> in getCheckpointMarkCoder.
>
> On Thu, Nov 7, 2019 at 7:29 PM Eugene Kirpichov  wrote:
>
>> Hi Daniel,
>>
>> This is probably insufficiently well documented. The CheckpointMark is
>> used for two purposes:
>> 1) To persistently store some notion of how much of the stream has been
>> consumed, so that if something fails we can tell the underlying streaming
>> system where to start reading when we re-create the reader. This is why
>> CheckpointMark is Serializable. E.g. this makes sense for Kafka.
>> 2) To do acks - to let the underlying streaming system know that the Beam
>> pipeline will never need data up to this CheckpointMark. Acking does not
>> require serializability - runners call ack() on the same in-memory instance
>> of CheckpointMark that was produced by the reader. E.g. this makes sense
>> for RabbitMq or Pubsub.
>>
>> In practice, these two capabilities tend to be mutually exclusive: some
>> streaming systems can provide a serializable CheckpointMark, some can do
>> acks, some can do neither - but very few (or none) can do both, and it's
>> debatable whether it even makes sense for a system to provide both
>> capabilities: usually acking is an implicit form of streaming-system-side
>> checkpointing, i.e. when you re-create the reader you don't actually need
>> to carry over any information from an old CheckpointMark - the necessary
>> state (which records should be delivered) is maintained on the streaming
>> system side.
>>
>> These two are lumped together into one API simply because that was the
>> best design option we came up with (not for lack of trying, but suggestions
>> very much welcome - AFAIK nobody is happy with it).
>>
>> RabbitMQ is under #2 - it can't do serializable checkpoint marks, but it
>> can do acks. So you can simply ignore the non-serializability.
>>
>> On Thu, Nov 7, 2019 at 12:07 PM Daniel Robert 
>> wrote:
>>
>>> (Background: I recently upgraded RabbitMqIO from the 4.x to 5.x library.
>>> As part of this I switched to a pull-based API rather than the
>>> previously-used push-based. This has caused some nebulous problems so
>>> put up a correction PR that I think needs some eyes fairly quickly as
>>> I'd consider master to be broken for rabbitmq right now. The PR keeps
>>> the upgrade but reverts to the same push-based implementation as in 4.x:
>>> https://github.com/apache/beam/pull/9977 )
>>>
>>> Regardless, in trying to get the pull-based API to work, I'm finding the
>>> interactions between rabbitmq and beam with CheckpointMark to be
>>> fundamentally impossible to implement so I'm hoping for some input here.
>>>
>>> CheckointMark itself must be Serializable, presumably this means it gets
>>> shuffled around between nodes. However 'Channel', the tunnel through
>>> which it communicates with Rabbit to ack messages and finalize the
>>> checkpoint, is non-Serializable. Like most other CheckpointMark
>>> implementations, Channel is 'transient'. When a new CheckpointMark is
>>> instantiated, it's given a Channel. If an existing one is supplied to
>>> the Reader's constructor (part of the 'startReader()' interface), the
>>> channel is overwritten.
>>>
>>> *However*, Rabbit does not support 'ack'ing messages on a channel other
>>> than the one that consumed them in the first place. Attempting to do so
>>> results in a '406 (PRECONDITION-FAILED) - unknown delivery tag'. (See
>>>
>>> https://www.grzegorowski.com/rabbitmq-406-channel-closed-precondition-failed
>>> ).
>>>
>>> Truthfully, I don't really understand how the current implementation is
>>> working; it seems like a happy accident. But I'm curious if someone
>>> could help me debug and implement how to bridge the
>>> re-usable/serializable CheckpointMark requirement in Beam with this
>>> limitation of Rabbit.
>>>
>>> Thanks,
>>> -Daniel Robert
>>>
>>>


Re: RabbitMQ and CheckpointMark feasibility

2019-11-07 Thread Eugene Kirpichov
Hi Daniel,

This is probably insufficiently well documented. The CheckpointMark is used
for two purposes:
1) To persistently store some notion of how much of the stream has been
consumed, so that if something fails we can tell the underlying streaming
system where to start reading when we re-create the reader. This is why
CheckpointMark is Serializable. E.g. this makes sense for Kafka.
2) To do acks - to let the underlying streaming system know that the Beam
pipeline will never need data up to this CheckpointMark. Acking does not
require serializability - runners call ack() on the same in-memory instance
of CheckpointMark that was produced by the reader. E.g. this makes sense
for RabbitMq or Pubsub.

In practice, these two capabilities tend to be mutually exclusive: some
streaming systems can provide a serializable CheckpointMark, some can do
acks, some can do neither - but very few (or none) can do both, and it's
debatable whether it even makes sense for a system to provide both
capabilities: usually acking is an implicit form of streaming-system-side
checkpointing, i.e. when you re-create the reader you don't actually need
to carry over any information from an old CheckpointMark - the necessary
state (which records should be delivered) is maintained on the streaming
system side.

These two are lumped together into one API simply because that was the best
design option we came up with (not for lack of trying, but suggestions very
much welcome - AFAIK nobody is happy with it).

RabbitMQ is under #2 - it can't do serializable checkpoint marks, but it
can do acks. So you can simply ignore the non-serializability.
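
For illustration, an ack-only CheckpointMark along these lines could look roughly
like the sketch below (simplified, not the actual RabbitMqIO code; a real source
would still return some coder from getCheckpointMarkCoder even though the
serialized form carries no useful state):

import java.io.IOException;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

import com.rabbitmq.client.Channel;
import org.apache.beam.sdk.io.UnboundedSource;

// Acking only works on the live, in-memory instance that still holds the channel
// it was created with; after (de)serialization the channel is simply gone.
class AckOnlyCheckpointMark implements UnboundedSource.CheckpointMark, Serializable {
  private final List<Long> deliveryTags = new ArrayList<>();
  private transient Channel channel;  // never serialized; null after deserialization

  AckOnlyCheckpointMark(Channel channel) {
    this.channel = channel;
  }

  void add(long deliveryTag) {
    deliveryTags.add(deliveryTag);
  }

  @Override
  public void finalizeCheckpoint() throws IOException {
    // Invoked by the runner on the same in-memory instance the reader produced.
    if (channel != null) {
      for (long tag : deliveryTags) {
        channel.basicAck(tag, false);
      }
    }
    deliveryTags.clear();
  }
}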

On Thu, Nov 7, 2019 at 12:07 PM Daniel Robert  wrote:

> (Background: I recently upgraded RabbitMqIO from the 4.x to 5.x library.
> As part of this I switched to a pull-based API rather than the
> previously-used push-based. This has caused some nebulous problems so
> put up a correction PR that I think needs some eyes fairly quickly as
> I'd consider master to be broken for rabbitmq right now. The PR keeps
> the upgrade but reverts to the same push-based implementation as in 4.x:
> https://github.com/apache/beam/pull/9977 )
>
> Regardless, in trying to get the pull-based API to work, I'm finding the
> interactions between rabbitmq and beam with CheckpointMark to be
> fundamentally impossible to implement so I'm hoping for some input here.
>
> CheckointMark itself must be Serializable, presumably this means it gets
> shuffled around between nodes. However 'Channel', the tunnel through
> which it communicates with Rabbit to ack messages and finalize the
> checkpoint, is non-Serializable. Like most other CheckpointMark
> implementations, Channel is 'transient'. When a new CheckpointMark is
> instantiated, it's given a Channel. If an existing one is supplied to
> the Reader's constructor (part of the 'startReader()' interface), the
> channel is overwritten.
>
> *However*, Rabbit does not support 'ack'ing messages on a channel other
> than the one that consumed them in the first place. Attempting to do so
> results in a '406 (PRECONDITION-FAILED) - unknown delivery tag'. (See
>
> https://www.grzegorowski.com/rabbitmq-406-channel-closed-precondition-failed
> ).
>
> Truthfully, I don't really understand how the current implementation is
> working; it seems like a happy accident. But I'm curious if someone
> could help me debug and implement how to bridge the
> re-usable/serializable CheckpointMark requirement in Beam with this
> limitation of Rabbit.
>
> Thanks,
> -Daniel Robert
>
>


Re: [Discuss] Beam mascot

2019-11-04 Thread Eugene Kirpichov
Feels like "Beam" would go well with an animal that has glowing bright eyes
(with beams of light shooting out of them), such as a lemur [1] or an owl.

[1] https://www.cnn.com/travel/article/madagascar-lemurs/index.html

On Mon, Nov 4, 2019 at 7:33 PM Kenneth Knowles  wrote:

> Yes! Let's have a mascot!
>
> Direct connections often have duplicates. For example in the log
> processing space, there is https://www.linkedin.com/in/hooverbeaver
>
> I like a flying squirrel, but Flink already is a squirrel.
>
> Hedgehog? I could not find any source of confusion for this one.
>
> Kenn
>
>
> On Mon, Nov 4, 2019 at 6:02 PM Robert Burke  wrote:
>
>> As both a Canadian, and the resident fan of a programming language with a
>> rodent mascot, I endorse this mascot.
>>
>> On Mon, Nov 4, 2019, 4:11 PM David Cavazos  wrote:
>>
>>> I like it, a beaver could be a cute mascot :)
>>>
>>> On Mon, Nov 4, 2019 at 3:33 PM Aizhamal Nurmamat kyzy <
>>> aizha...@apache.org> wrote:
>>>
 Hi everybody,

 I think the idea of creating a Beam mascot has been brought up a couple
 times here in the past, but I would like us to go through with it this time
 if we are all in agreement:)

 We can brainstorm in this thread what the mascot should be given Beam’s
 characteristics and principles. What do you all think?

 For example, I am proposing a beaver as a mascot, because:
 1. Beavers build dams out of logs for streams
 2. The name is close to Beam
 3. And with the right imagination, you can make a really cute beaver :D
 https://imgur.com/gallery/RLo05M9

 WDYT? If you don’t like the beaver, what are the other options that you
 think could be appropriate? I would like to invite others to propose ideas
 and get the discussions going.

 Thanks,
 Aizhamal

>>>


Re: RabbitMqIO issues and open PRs

2019-10-31 Thread Eugene Kirpichov
Regarding review latency, FWIW I'm not super active on Beam these days but
I'll be happy to review 1-2 PRs for RabbitMqIO (I'm @jkff).

On Thu, Oct 31, 2019 at 8:47 PM Kenneth Knowles  wrote:

> Yes, thanks for emailing! We very much value sharing your intentions with
> the community. For small changes or fixes, you can just open a PR. For
> larger changes that could use feedback from the community (versus just the
> code reviewer) this list is the right place to go. If it is truly complex,
> a short document is helpful. I've flipped through your PRs and I do not
> think a design doc is needed.
>
> We have a continuing issue with review latency. It is always good to ping
> the dev list or proposed reviewers. Sorry for this & thanks for such
> substantial contribution!
>
> Kenn
>
> On Thu, Oct 31, 2019 at 1:51 PM Reuven Lax  wrote:
>
>> I think we would be happy to see improvements and contributions to this
>> component. Emailing this list is definitely the right first step - it gives
>> anyone with knowledge of the RabbitMqIO component a chance to weigh in. You
>> don't necessarily have to talk to component authors before submitting a PR,
>> however it is recommended; that way you're more likely to have an initial
>> PR with the correct approach, instead of having to iterate multiple times
>> on the PR.
>>
>> Reuven
>>
>> On Thu, Oct 31, 2019 at 1:38 PM Daniel Robert 
>> wrote:
>>
>>> I'm pretty new to the Beam ecosystem, so apologies if this is not the
>>> right forum for this.
>>>
>>> My team has been learning and starting to use Beam for the past few
>>> months and have run into myriad problems with the RabbitIO connector for
>>> java, aspects of which seem perhaps fundamentally broken or incorrect in
>>> the released implementation. I've tracked our significant issues down
>>> and opened tickets and PRs for them. I'm not certain what the typical
>>> response time is, but given the severity of the issues (as I perceive
>>> them), I'd like to escalate some of these PRs and try to get some fixes
>>> into the next Beam release.
>>>
>>> I originally opened BEAM-8390 (https://github.com/apache/beam/pull/9782)
>>>
>>> as it was impossible to set the 'useCorrelationId' property (implying
>>> this functionality was also untested). Since then, I've found and PR'd
>>> the following, which are awaiting feedback/approval:
>>>
>>> 1. Watermarks not advancing
>>>
>>> Ticket/PR: BEAM-8347 - https://github.com/apache/beam/pull/9820
>>>
>>> Impact: under low message volumes, the watermark never advances and
>>> windows can never 'on time' fire.
>>>
>>> Notes: The RabbitMq UnboundedSource uses 'oldest known time' as a
>>> watermark when other similar sources (and documentation on watermarking)
>>> state for monotonically increasing timestamps (the case with a queue) it
>>> should be the most recent time. I have a few open questions about
>>> testing and implementation details in the PR but it should work as-is.
>>>
>>> 2. Exchanges are always declared, which fail if a pre-existing exchange
>>> differs
>>>
>>> Ticket/PR: BEAM-8513 - https://github.com/apache/beam/pull/9937
>>>
>>> Impact: It is impossible to utilize an existing, durable exchange.
>>>
>>> Notes: I'm hooking Beam up to an existing topic exchange (an 'event
>>> bus') that is 'durable'. RabbitMqIO current implementation will always
>>> attempt to declare the exchange, and does so as non-durable, which
>>> causes rabbit to fail the declaration. (Interestingly qpid does not fail
>>> in this scenario.) The PR allows the caller to disable declaring the
>>> exchange, similar to `withQueueDeclare` for declaring a queue.
>>>
>>> This PR also calls out a lot of the documentation that seems misleading;
>>> implying that one either interacts with queues *or* exchanges when that
>>> is not how AMQP fundamentally operates. The implementation of the
>>> RabbitMqIO connector before this PR seems like it probably works with
>>> the default exchange and maybe a fanout exchange, but not a topic
>>> exchange.
>>>
>>> 3. Library versions
>>>
>>> Tickets/PR: BEAM-7434, BEAM-5895, and BEAM-5894 -
>>> https://github.com/apache/beam/pull/9900
>>>
>>> Impact: The rabbitmq amqp client for java released the 5.x line in
>>> September of 2017. Some automated tickets were open to upgrade, plus a
>>> manual ticket to drop the use of the deprecated QueueConsumer API.
>>>
>>> Notes: The upgrade was relatively simple, but I implemented it using a
>>> pull-based API rather than push-based (Consumer) which may warrant some
>>> discussion. I'm used to discussing this type of thing over PRs but I'm
>>> happy to do whatever the community prefers.
>>>
>>> ---
>>>
>>> Numbers 1 and 2 above are 'dealbreaker' issues for my team. They
>>> effectively make rabbitmq unusable as an unbounded source, forcing
>>> developers to fork and modify the code. Number 3 is much less
>>> significant and I've put it here more for 'good, clean living' than an
>>> urgent issue.
>>>
>>> Aside from the open issues, 

Re: JdbcIO read needs to fit in memory

2019-10-24 Thread Eugene Kirpichov
Sorry, I just realized I've made a mistake. BoundedSource in some runners
may not have the same "fits in memory" limitation as DoFn's, so in that
sense you're right - if it was done as a BoundedSource, perhaps it would
work better in your case, even if it didn't read things in parallel.

On Thu, Oct 24, 2019 at 8:17 AM Eugene Kirpichov 
wrote:

> Hi Josef,
>
> JdbcIO per se does not require the result set to fit in memory. The issues
> come from the limitations of the context in which it runs:
> - It indeed uses a DoFn to emit results; a DoFn is in general allowed to
> emit an unbounded number of results that doesn't necessarily have to fit in
> memory, but some runners may have this requirement (e.g. Spark probably
> does, Dataflow doesn't, not sure about the others)
> - JdbcIO uses a database cursor provided by the underlying JDBC driver to
> read through the results. Again, depending on the particular JDBC driver,
> the cursor may or may not be able to stream the results without storing all
> of them in memory.
> - The biggest issue, though, is that there's no way to automatically split
> the execution of a JDBC query into several sub-queries whose results
> together are equivalent to the result of the original query. Because of
> this, it is not possible to implement JdbcIO in a way that it would
> *automatically* avoid scanning through the entire result set, because
> scanning through the entire result set sequentially is the only way JDBC
> drivers (and most databases) allow you to access query results. Even if we
> chose to use BoundedSource, we wouldn't be able to implement the split()
> method.
>
> If you need to read query results in parallel, or to circumvent memory
> limitations of a particular runner or JDBC driver, you can use
> JdbcIO.readAll(), and parameterize your query such that passing all the
> parameter values together adds up to the original query you wanted. Most
> likely it would be something like transforming "SELECT * FROM TABLE" to a
> family of queries "SELECT * FROM TABLE WHERE MY_PRIMARY_KEY BETWEEN ? AND
> ?" and passing primary key ranges adding up to the full range of the
> table's keys.
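>
> For concreteness, a minimal sketch of that readAll() approach (the driver,
> connection string, row type and key-range source below are placeholders,
> not from the original pipeline; assumes the usual JdbcIO/KV/SerializableCoder
> imports):
>
> PCollection<KV<Long, Long>> keyRanges = ...;  // precomputed (lo, hi) primary key ranges
>
> PCollection<MyRow> rows =
>     keyRanges.apply(
>         JdbcIO.<KV<Long, Long>, MyRow>readAll()
>             .withDataSourceConfiguration(
>                 JdbcIO.DataSourceConfiguration.create(
>                     "org.postgresql.Driver", "jdbc:postgresql://host/db"))
>             .withQuery("SELECT * FROM TABLE WHERE MY_PRIMARY_KEY BETWEEN ? AND ?")
>             .withParameterSetter(
>                 (range, statement) -> {
>                   statement.setLong(1, range.getKey());
>                   statement.setLong(2, range.getValue());
>                 })
>             // MyRow is a placeholder POJO with a matching constructor.
>             .withRowMapper(resultSet -> new MyRow(resultSet.getLong(1), resultSet.getString(2)))
>             .withCoder(SerializableCoder.of(MyRow.class)));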
>
> Note that whether this performs better will also depend on the database
> - e.g. if the database is already bottlenecked, then reading from it in
> parallel will not make things faster.
>
> On Thu, Oct 24, 2019 at 7:26 AM Jean-Baptiste Onofré 
> wrote:
>
>> Hi
>>
>> JdbcIO is basically a DoFn, so it could load everything on a single executor
>> (there's no obvious way to split).
>>
>> Is that what you mean?
>>
>> Regards
>> JB
>>
>> On 24 Oct 2019 at 15:26, Jozef Vilcek  wrote:
>>
>> Hi,
>>
>> I am in need of reading a big-ish data set via JdbcIO. This forced me to
>> bump up memory for my executor (right now using SparkRunner). It seems that
>> JdbcIO has a requirement to fit all data in memory, as it is using a DoFn to
>> unfold the query into a list of elements.
>>
>> BoundedSource would not face the need to fit the result in memory, but JdbcIO
>> is using a DoFn. Also, in a recent discussion [1] it was suggested that
>> BoundedSource should not be used as it is obsolete.
>>
>> Has anyone faced this issue? What would be the best way to solve it? If
>> the DoFn should be kept, then I can only think of splitting the query into
>> ranges and trying to find the most fitting number of rows to read at once.
>>
>> I appreciate any thoughts.
>>
>> [1]
>> https://lists.apache.org/list.html?dev@beam.apache.org:lte=1M:Reading%20from%20RDB%2C%20ParDo%20or%20BoundedSource
>>
>>
>>


Re: Reading from RDB, ParDo or BoundedSource

2019-09-27 Thread Eugene Kirpichov
I'm actually very surprised that to this day nobody has written a Python
connector for the Python Database API, analogous to JdbcIO.
Do we maybe have a way to use JdbcIO from Python via the cross-language
connectors stuff?

On Fri, Sep 27, 2019 at 4:28 PM Lucas Magalhães <
lucas.magalh...@paralelocs.com.br> wrote:

> Hi guys.
>
> Sorry, I forgot to mention that: I'm using the Python SDK. It seems that
> the Java SDK is more mature, but I have no skill in that language.
>
> I'm trying to extract data from Postgres (Cloud SQL), make some
> aggregations and save the results into BigQuery.
>
> On Fri, Sep 27, 2019 at 19:21, Pablo Estrada 
> wrote:
>
>> Hi Lucas!
>> Can you share more information about your use case? Java has JdbcIO.
>> Maybe that's all you need? Or perhaps you're using Python SDK?
>> Best
>> -P.
>>
>> On Fri, Sep 27, 2019 at 3:08 PM Eugene Kirpichov 
>> wrote:
>>
>>> Hi Lucas,
>>> Any reason why you can't use JdbcIO?
>>> You almost certainly should *not* use BoundedSource, nor Splittable DoFn
>>> for this. BoundedSource is obsolete in favor of assembling your connector
>>> from regular transforms and/or using an SDF, and SDF is an extremely
>>> advanced feature whose primary audience is Beam SDK authors.
>>>
>>> On Fri, Sep 27, 2019 at 2:52 PM Lucas Magalhães <
>>> lucas.magalh...@paralelocs.com.br> wrote:
>>>
>>>> Hi guys.
>>>>
>>>> I'm new to Apache Beam and I would like some help to understand some
>>>> behaviours.
>>>>
>>>> 1. Is there some performance issue when I'm reading data from a
>>>> relational database using a ParDo instead of a BoundedSource?
>>>>
>>>> 2. If I'm going to implement a BoundedSource, how does Beam manage the
>>>> connection? Do I need to open and close it in every method, like split,
>>>> read, estimate size and so on?
>>>>
>>>> 3. I read something about Splittable DoFn but I didn't find instructions
>>>> about how to implement it. Does anyone have something about it?
>>>>
>>>> Thanks
>>>>
>>>>
>>>>
>>>>


Re: Reading from RDB, ParDo or BoundedSource

2019-09-27 Thread Eugene Kirpichov
Hi Lucas,
Any reason why you can't use JdbcIO?
You almost certainly should *not* use BoundedSource, nor Splittable DoFn
for this. BoundedSource is obsolete in favor of assembling your connector
from regular transforms and/or using an SDF, and SDF is an extremely
advanced feature whose primary audience is Beam SDK authors.

On Fri, Sep 27, 2019 at 2:52 PM Lucas Magalhães <
lucas.magalh...@paralelocs.com.br> wrote:

> Hi guys.
>
> I'm new to Apache Beam and I would like some help to understand some
> behaviours.
>
> 1. Is there some performance issue when I'm reading data from a relational
> database using a ParDo instead of a BoundedSource?
>
> 2. If I'm going to implement a BoundedSource, how does Beam manage the
> connection? Do I need to open and close it in every method, like split,
> read, estimate size and so on?
>
> 3. I read something about Splittable DoFn but I didn't find instructions
> about how to implement it. Does anyone have something about it?
>
> Thanks
>
>
>
>


Re: Collecting feedback for Beam usage

2019-09-24 Thread Eugene Kirpichov
Creating a central place for collecting Beam usage sounds compelling, but
we'd have to be careful about several aspects:
- It goes without saying that this can never be on-by-default, even for a
tiny fraction of pipelines.
- For further privacy protection, including the user's PipelineOptions is
probably out of the question too: people might be including very sensitive
data in their PipelineOptions (such as database passwords) and we wouldn't
want to end up storing that data even due to a user's mistake. The only
data that can be stored is data that Beam developers can guarantee is never
sensitive, or data intentionally authored by a human for the purpose of
reporting it to us (e.g. a hand-typed feedback message).
- If it requires the user manually clicking the link, then it will not
collect data about automated invocations of any pipelines, whereas likely
almost all practical invocations are automated - the difference between
COUNT(DISTINCT) and COUNT(*), as far as pipelines go.
- Moreover, many practical invocations likely go through an intermediate
library / product, such as scio or talend. There'd need to be a story for
library developers to offer this capability to their users.
- The condition "was feedback reported for this pipeline", regardless of
whether it is reported manually (by clicking the link) or automatically (by
explicitly enabling some flag), heavily biases the sample - people are
unlikely to click the link if the pipeline works fine (and almost all
production pipelines work fine, otherwise they wouldn't be in production),
and I don't know what considerations would prompt somebody to enable the
flag for a periodic production pipeline. Meaning, the collected data likely
can not be reliably used for any aggregation/counting, except for picking
out interesting individual examples for case studies.
- Measures should be taken to ensure that people don't accidentally enable
it in their quick-running direct runner unit tests, causing lots of traffic.
- I would not dismiss the possibility of spam and attacks.

I'd recommend to start by listing the questions we're hoping to answer
using the collected feedback, and then judging whether the proposed method
indeed allows answering them while respecting the users' privacy.

On Tue, Sep 24, 2019 at 1:49 PM Lukasz Cwik  wrote:

> One of the options could be to just display the URL and not to phone home.
> I would like it so that users can integrate this into their deployment
> solution so we get regular stats instead of only when a user decides to run
> a pipeline manually.
>
> On Tue, Sep 24, 2019 at 11:13 AM Robert Bradshaw 
> wrote:
>
>> I think the goal is to lower the barrier of entry. Displaying a URL to
>> click on while waiting for your pipeline to start up, that contains
>> all the data explicitly visible, is about as easy as it gets.
>> Remembering to run a new (probably not as authentic) pipeline with
>> that flag is less so.
>>
>> On Tue, Sep 24, 2019 at 11:04 AM Mikhail Gryzykhin 
>> wrote:
>> >
>> > I'm with Luke on this. We can add a set of flags to send home stats and
>> crash dumps if user agrees. If we keep code isolated, it will be easy
>> enough for user to check what is being sent.
>> >
>> > One more heavy-weight option is to also allow user configure and
>> persist what information he is ok with sharing.
>> >
>> > --Mikhail
>> >
>> >
>> > On Tue, Sep 24, 2019 at 10:02 AM Lukasz Cwik  wrote:
>> >>
>> >> Why not add a flag to the SDK that would do the phone home when
>> specified?
>> >>
>> >> From a support perspective it would be useful to know:
>> >> * SDK version
>> >> * Runner
>> >> * SDK provided PTransforms that are used
>> >> * Features like user state/timers/side inputs/splittable dofns/...
>> >> * Graph complexity (# nodes, # branches, ...)
>> >> * Pipeline failed or succeeded
>> >>
>> >> On Mon, Sep 23, 2019 at 3:18 PM Robert Bradshaw 
>> wrote:
>> >>>
>> >>> On Mon, Sep 23, 2019 at 3:08 PM Brian Hulette 
>> wrote:
>> >>> >
>> >>> > Would people actually click on that link though? I think Kyle has a
>> point that in practice users would only find and click on that link when
>> they're having some kind of issue, especially if the link has "feedback" in
>> it.
>> >>>
>> >>> I think the idea is that we would make the link very light-weight,
>> >>> kind of like a survey (but even easier as it's pre-populated).
>> >>> Basically an opt-in phone-home. If we don't collect any personal data
>> >>> (not even IP/geo, just (say) version + runner, all visible in the
>> >>> URL), no need to guard/anonymize (and this may be sufficient--I don't
>> >>> think we have to worry about spammers and ballot stuffers given the
>> >>> target audience). If we can catch people while they wait for their
>> >>> pipeline to start up (and/or complete), this is a great time to get
>> >>> some feedback.
>> >>>
>> >>> > I agree usage data would be really valuable, but I'm not sure that
>> this approach would get us good data. Is there a way to get download

Re: [VOTE] Release 2.14.0, release candidate #1

2019-07-31 Thread Eugene Kirpichov
I would recommend that the known issue notice about this source at least be
strongly worded - this source in the current state should be marked "DO NOT
USE" - it will produce data loss in *most* production use cases. That still
leaves the risk that people will use it anyway; it's up to the folks driving
the release to decide whether it's worth cutting a new candidate for the sake
of just temporarily removing this source, or whether a notice in the release
notes is sufficient.

On Wed, Jul 31, 2019 at 2:59 PM Ahmet Altay  wrote:

> Since the python mongodb source is new in this release (not a regression)
> and experimental, I agree with adding a known issues notice to the release
> notes instead of starting a RC2 only for this issue.
>
> On Wed, Jul 31, 2019 at 2:47 PM Chamikara Jayalath 
> wrote:
>
>> FYI we found a critical issue with the Python MongoDB source that is
>> included with this release:
>> https://issues.apache.org/jira/browse/BEAM-7866
>> I suggest we include a clear notice in the release about this issue if
>> the release vote has already been finalized or make this a blocker if we
>> are going for a RC2.
>>
>> Thanks,
>> Cham
>>
>> On Wed, Jul 31, 2019 at 2:31 AM Robert Bradshaw 
>> wrote:
>>
>>> On Wed, Jul 31, 2019 at 11:22 AM Valentyn Tymofieiev <
>>> valen...@google.com> wrote:
>>>
 I have checked Portable Wordcount example on Flink and Spark on Python
 2 and Python 3.

 To do so, I had to checkout Beam from git repo, since using the source
 distribution does not include gradlew, and gradelw_orig did not work for
 me. Commands I ran:

 git checkout tags/v2.14.0-RC1
 ./gradlew :sdks:python:container:py3:docker
 ./gradlew :runners:flink:1.5:job-server:runShadow# Use  ./gradlew
 :runners:spark:job-server:runShadow for Spark
 ./gradlew :sdks:python:test-suites:portable:py35:portableWordCountBatch
  -PjobEndpoint=localhost:8099 -PenvironmentType=LOOPBACK
 cat /tmp/py-wordcount-direct* # to verify results.

 Loopback scenarios worked, however DOCKER scenarios did not. Opened
 several Jiras to follow up:

 https://issues.apache.org/jira/browse/BEAM-7857
 https://issues.apache.org/jira/browse/BEAM-7858
 https://issues.apache.org/jira/browse/BEAM-7859
 

>>>
>>> I commented on the bugs, and I think this is due to trying to use Docker
>>> mode with local files (a known issue).
>>>
>>>
 The gradle targets that were required to run these tests are not
 present in 2.13.0 branch, so I don't consider it a regression and still
 cast +1.

>>>
>>> Agreed.
>>>
>>>
 On Wed, Jul 31, 2019 at 11:31 AM Ismaël Mejía 
 wrote:

> Oops, Robert pointed out to me that I have probably not counted correctly.
> There were indeed already 3 PMC +1 votes. Pablo, Robert and Ahmet.
> Please excuse me for the extra noise.
>
> On Wed, Jul 31, 2019 at 9:46 AM Ismaël Mejía 
> wrote:
> >
> > To complete the release we need to have at least three +1 binding
> > votes (votes from PMC members) as stated in [1]. So far we have only
> > 2.
> >
> > Thomas (and the others): the blog post PR is now open [2]; please help
> > us add missing features or highlight the ones you consider
> > important in the PR comments.
> >
> > Here is the missing +1 (binding). Validated SHAs+signatures,
> > beam-samples and one internal company project with the new jars.
> > Compared source file vs tagged git repo. Everything looks ok.
> >
> > [1] https://www.apache.org/foundation/voting.html#ReleaseVotes
> > [2] https://github.com/apache/beam/pull/9201/files
> >
> > On Wed, Jul 31, 2019 at 6:27 AM Anton Kedin 
> wrote:
> > >
> > > Ran various postcommits, validates runners, and nexmark against
> the release branch. All looks good so far.
> > >
> > > Will take another look at the docs/blog and the nexmark numbers
> tomorrow, but if nothing comes up I will close the vote tomorrow
> (Wednesday) by 6pm PST (= Thursday 01:00am UTC) since it's over 72hours
> since the vote has started and we have a number of +1s including PMC
> members and no -1s.
> > >
> > > Regards,
> > > Anton
> > >
> > > On Tue, Jul 30, 2019 at 8:13 PM Valentyn Tymofieiev <
> valen...@google.com> wrote:
> > >>
> > >> I also ran unit tests for Python 3.7 and they passed as well.
> Cython tests for python3.7 require  `apt-get install python3.7-dev`.
> > >>
> > >> On Wed, Jul 31, 2019 at 3:16 AM Pablo Estrada 
> wrote:
> > >>>
> > >>> +1
> > >>>
> > >>> I installed from source, and ran unit tests for Python in 2.7,
> 3.5, 3.6.
> > >>>
> > >>> Also ran a number of integration tests on Py 3.5 on Dataflow and
> DirectRunner.
> > >>> Best
> > >>> -P.
> > >>>
> > >>> On Tue, Jul 30, 2019 at 11:09 AM Hannah 

Re: Sort Merge Bucket - Action Items

2019-07-25 Thread Eugene Kirpichov
>> >>> So I spent one afternoon trying some ideas for reusing the last few
>> >>> transforms in WriteFiles.
>> >>>
>> >>> WriteShardsIntoTempFilesFn extends DoFn,
>> Iterable>, FileResult>
>> >>> => GatherResults extends PTransform,
>> PCollection>>
>> >>> => FinalizeTempFileBundles extends
>> PTransform>>,
>> WriteFilesResult>
>> >>>
>> >>> I replaced FileResult with KV
>> so I can use pre-compute SMB destination file names for the transforms.
>> >>> I'm also thinking of parameterizing ShardedKey for SMB's
>> bucket/shard to reuse WriteShardsIntoTempFilesFn. These transforms are
>> private and easy to change/pull out.
>> >>>
>> >>> OTOH they are somewhat coupled with the package private
>> {Avro,Text,TFRecord}Sink and their WriteOperation impl (where the bulk of
>> temp file handing logic lives). Might be hard to decouple either modifying
>> existing code or creating new transforms, unless if we re-write most of
>> FileBasedSink from scratch.
>> >>>
>> >>> Let me know if I'm on the wrong track.
>> >>>
>> >>> WIP Branch https://github.com/spotify/beam/tree/neville/write-files
>> >>>
>> >>> On Tue, Jul 23, 2019 at 4:22 PM Chamikara Jayalath <
>> chamik...@google.com> wrote:
>> >>>>
>> >>>>
>> >>>>
>> >>>> On Mon, Jul 22, 2019 at 1:41 PM Robert Bradshaw 
>> wrote:
>> >>>>>
>> >>>>> On Mon, Jul 22, 2019 at 7:39 PM Eugene Kirpichov <
>> kirpic...@google.com> wrote:
>> >>>>> >
>> >>>>> > On Mon, Jul 22, 2019 at 7:49 AM Robert Bradshaw <
>> rober...@google.com> wrote:
>> >>>>> >>
>> >>>>> >> On Mon, Jul 22, 2019 at 4:04 PM Neville Li <
>> neville@gmail.com> wrote:
>> >>>>> >> >
>> >>>>> >> > Thanks Robert. Agree with the FileIO point. I'll look into it
>> and see what needs to be done.
>> >>>>> >> >
>> >>>>> >> > Eugene pointed out that we shouldn't build on
>> FileBased{Source,Sink}. So for writes I'll probably build on top of
>> WriteFiles.
>> >>>>> >>
>> >>>>> >> Meaning it could be parameterized by FileIO.Sink, right?
>> >>>>> >>
>> >>>>> >>
>> https://github.com/apache/beam/blob/release-2.13.0/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileIO.java#L779
>> >>>>> >
>> >>>>> > Yeah if possible, parameterize FileIO.Sink.
>> >>>>> > I would recommend against building on top of WriteFiles either.
>> FileIO being implemented on top of WriteFiles was supposed to be a
>> temporary measure - the longer-term plan was to rewrite it from scratch
>> (albeit with a similar structure) and throw away WriteFiles.
>> >>>>> > If possible, I would recommend to pursue this path: if there are
>> parts of WriteFiles you want to reuse, I would recommend to implement them
>> as new transforms, not at all tied to FileBasedSink (but ok if tied to
>> FileIO.Sink), with the goal in mind that FileIO could be rewritten on top
>> of these new transforms, or maybe parts of WriteFiles could be swapped out
>> for them incrementally.
>> >>>>>
>> >>>>> Thanks for the feedback. There's a lot that was done, but looking at
>> >>>>> the code it feels like there's a lot that was not yet done either,
>> and
>> >>>>> the longer-term plan wasn't clear (though perhaps I'm just not
>> finding
>> >>>>> the right docs).
>> >>>>
>> >>>>
>> >>>> I'm also a bit unfamiliar with original plans for WriteFiles and for
>> updating source interfaces, but I prefer not significantly modifying
>> existing IO transforms to suite the SMB use-case. If there are existing
>> pieces of code that can be easily re-used that is fine, but existing
>> sources/sinks are designed to perform a PCollection -> file transformation
>> and vice versa with (usually) runner determined sharding. Things specific
>> to SMB such as sharding restrictions, writing metadata to a separate file,
>> reading multiple files from the same abstraction, does not sound like
>> features that should be included in our usual fil

Re: Sort Merge Bucket - Action Items

2019-07-22 Thread Eugene Kirpichov
On Mon, Jul 22, 2019 at 7:49 AM Robert Bradshaw  wrote:

> On Mon, Jul 22, 2019 at 4:04 PM Neville Li  wrote:
> >
> > Thanks Robert. Agree with the FileIO point. I'll look into it and see
> what needs to be done.
> >
> > Eugene pointed out that we shouldn't build on FileBased{Source,Sink}. So
> for writes I'll probably build on top of WriteFiles.
>
> Meaning it could be parameterized by FileIO.Sink, right?
>
>
> https://github.com/apache/beam/blob/release-2.13.0/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileIO.java#L779

Yeah if possible, parameterize FileIO.Sink.
I would recommend against building on top of WriteFiles either. FileIO
being implemented on top of WriteFiles was supposed to be a temporary
measure - the longer-term plan was to rewrite it from scratch (albeit with
a similar structure) and throw away WriteFiles.
If possible, I would recommend to pursue this path: if there are parts of
WriteFiles you want to reuse, I would recommend to implement them as new
transforms, not at all tied to FileBasedSink (but ok if tied to
FileIO.Sink), with the goal in mind that FileIO could be rewritten on top
of these new transforms, or maybe parts of WriteFiles could be swapped out
for them incrementally.
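
To make that concrete, here is a rough sketch of the kind of standalone
transform I mean - tied to FileIO.Sink but not to FileBasedSink. The class
name, the use of ResourceId as the destination key, and the lack of temp-file
handling are all simplifications for illustration, not a worked-out design
(assumes the usual Beam DoFn/FileIO/FileSystems imports):

static class WriteBucketFn<T> extends DoFn<KV<ResourceId, Iterable<T>>, ResourceId> {
  private final FileIO.Sink<T> sink;  // e.g. TextIO.sink() or AvroIO.sink(schema)

  WriteBucketFn(FileIO.Sink<T> sink) {
    this.sink = sink;
  }

  @ProcessElement
  public void process(ProcessContext c) throws IOException {
    ResourceId destination = c.element().getKey();
    // Real code would write to a temp file first and rename on success,
    // the way WriteFiles does; this just writes directly.
    try (WritableByteChannel channel = FileSystems.create(destination, MimeTypes.BINARY)) {
      sink.open(channel);
      for (T record : c.element().getValue()) {
        sink.write(record);
      }
      sink.flush();
    }
    c.output(destination);
  }
}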


>
>
> > Read might be a bigger change w.r.t. collocating ordered elements across
> files within a bucket and TBH I'm not even sure where to start.
>
> Yeah, here we need an interface that gives us ReadableFile ->
> Iterable. There are existing PTransform,
> PCollection> but such an interface is insufficient to extract
> ordered records per shard. It seems the only concrete implementations
> are based on FileBasedSource, which we'd like to avoid, but there's no
> alternative. An SDF, if exposed, would likely be overkill and
> cumbersome to call (given the reflection machinery involved in
> invoking DoFns).
>
Seems easiest to just define a new regular Java interface for this.
Could be either, indeed, ReadableFile -> Iterable, or something
analogous, e.g. (ReadableFile, OutputReceiver) -> void. Depends on how
much control over iteration you need.
And yes, DoFns, including SDFs, are not designed to be used as Java
interfaces per se. If you need DoFn machinery in this interface (e.g. side
inputs), use Contextful - s.apache.org/context-fn.
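
For example, the new interface could be as simple as the following
(hypothetical names, just to illustrate the shape; assumes the usual
FileIO/java.io imports):

public interface ReadFileFn<T> extends Serializable {
  // Sequentially read one file and return its records in order.
  Iterable<T> read(FileIO.ReadableFile file) throws IOException;

  // Alternative shape if more control over emission is needed:
  // void read(FileIO.ReadableFile file, DoFn.OutputReceiver<T> receiver) throws IOException;
}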


>
> > I'll file separate PRs for core changes needed for discussion. WDYT?
>
> Sounds good.
>
> > On Mon, Jul 22, 2019 at 4:20 AM Robert Bradshaw 
> wrote:
> >>
> >> On Fri, Jul 19, 2019 at 5:16 PM Neville Li 
> wrote:
> >> >
> >> > Forking this thread to discuss action items regarding the change. We
> can keep technical discussion in the original thread.
> >> >
> >> > Background: our SMB POC showed promising performance & cost saving
> improvements and we'd like to adopt it for production soon (by EOY). We
> want to contribute it to Beam so it's better generalized and maintained. We
> also want to avoid divergence between our internal version and the PR while
> it's in progress, specifically any breaking change in the produced SMB data.
> >>
> >> All good goals.
> >>
> >> > To achieve that I'd like to propose a few action items.
> >> >
> >> > 1. Reach a consensus about bucket and shard strategy, key handling,
> bucket file and metadata format, etc., anything that affect produced SMB
> data.
> >> > 2. Revise the existing PR according to #1
> >> > 3. Reduce duplicate file IO logic by reusing FileIO.Sink,
> Compression, etc., but keep the existing file level abstraction
> >> > 4. (Optional) Merge code into extensions::smb but mark clearly as
> @experimental
> >> > 5. Incorporate ideas from the discussion, e.g. ShardingFn,
> GroupByKeyAndSortValues, FileIO generalization, key URN, etc.
> >> >
> >> > #1-4 gives us something usable in the short term, while #1 guarantees
> that production data produced today are usable when #5 lands on master. #4
> also gives early adopters a chance to give feedback.
> >> > Due to the scope of #5, it might take much longer and a couple of big
> PRs to achieve, which we can keep iterating on.
> >> >
> >> > What are your thoughts on this?
> >>
> >> I would like to see some resolution on the FileIO abstractions before
> >> merging into experimental. (We have a FileBasedSink that would mostly
> >> already work, so it's a matter of coming up with an analogous Source
> >> interface.) Specifically I would not want to merge a set of per file
> >> type smb IOs without a path forward to this or the determination that
> >> it's not possible/desirable.
>


Re: pubsub -> IO

2019-07-17 Thread Eugene Kirpichov
I think full-blown SDF is not needed for this - someone just needs to
implement a MongoDbIO.readAll() variant, using a composite transform. The
regular pattern for this sort of thing will do (ParDo split, reshuffle,
ParDo read).
Whether it's worth replacing MongoDbIO.read() with a redirect to readAll()
is another matter - size estimation, available only in BoundedSource for
now, may or may not be important.
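
The shape of such a readAll() variant would be roughly the following
(ReadRequest and the two DoFns are made-up names for illustration, not
existing MongoDbIO classes):

static class ReadAllViaParDo<T> extends PTransform<PCollection<ReadRequest>, PCollection<T>> {
  @Override
  public PCollection<T> expand(PCollection<ReadRequest> requests) {
    return requests
        // Split each request (collection + time range) into smaller sub-ranges.
        .apply("Split", ParDo.of(new SplitIntoSubRangesFn()))
        // Break fusion so sub-ranges get redistributed across workers.
        .apply("Reshuffle", Reshuffle.viaRandomKey())
        // Read each sub-range with the regular MongoDB client.
        .apply("Read", ParDo.of(new ReadOneSubRangeFn<T>()));
  }
}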

On Wed, Jul 17, 2019 at 2:39 AM Ryan Skraba  wrote:

> Hello!  To clarify, you want to do something like this?
>
> PubSubIO.read() -> extract mongodb collection and range ->
> MongoDbIO.read(collection, range) -> ...
>
> If I'm not mistaken, it isn't possible with the implementation of
> MongoDbIO (based on BoundedSource interface, requiring the collection to be
> specified once at pipeline construction time).
>
> BUT -- this is a good candidate for an improvement in composability, and
> the ongoing work to prefer the SDF for these types of use cases.   Maybe
> raise a JIRA for an improvement?
>
> All my best, Ryan
>
>
> On Wed, Jul 17, 2019 at 9:35 AM Chaim Turkel  wrote:
>
>> any ideas?
>>
>> On Mon, Jul 15, 2019 at 11:04 PM Rui Wang  wrote:
>> >
>> > +u...@beam.apache.org
>> >
>> >
>> > -Rui
>> >
>> > On Mon, Jul 15, 2019 at 6:55 AM Chaim Turkel  wrote:
>> >>
>> >> Hi,
>> >>   I am looking to write a pipeline that read from a mongo collection.
>> >>   I would like to listen to a pubsub that will have a object that will
>> >> tell me which collection and which time frame.
>> >>   Is there a way to do this?
>> >>
>> >> Chaim
>> >>
>>
>


Re: Discussion/Proposal: support Sort Merge Bucket joins in Beam

2019-07-16 Thread Eugene Kirpichov
I'd like to reiterate the request to not build anything on top of
FileBasedSource/Reader.
If the design requires having some interface for representing a function
from a filename to a stream of records, better introduce a new interface
for that.
If it requires interoperability with other IOs that read files, better
change them to use the new interface.

On Tue, Jul 16, 2019 at 9:08 AM Chamikara Jayalath 
wrote:

> Thanks this clarifies a lot.
>
> For writer, I think it's great if you can utilize existing FileIO.Sink
> implementations even if you have to reimplement some of the logic (for
> example compression, temp file handling) that is already implemented in
> Beam FileIO/WriteFiles transforms in your SMB sink transform.
>
> For reader, you are right that there's no FileIO.Read. What we have are
> various implementations of FileBasedSource/FileBasedReader classes that are
> currently intentionally hidden since Beam IO transforms are expected to be
> the intended public interface for users. If you can expose and re-use these
> classes with slight modifications (keeping backwards compatibility) I'm OK
> with it. Otherwise you'll have to write your own reader implementations.
>
> In general, seems like SMB has very strong requirements related to
> sharding/hot-key management that are not easily achievable by implementing
> SMB source/sink as a composite transform that utilizes existing source/sink
> transforms. This forces you to implement this logic in your own DoFns and
> existing Beam primitives are not easily re-usable in this context.
>
> Thanks,
> Cham
>
> On Tue, Jul 16, 2019 at 8:26 AM Neville Li  wrote:
>
>> A little clarification of the IO requirement and my understanding of the
>> current state of IO.
>>
>> tl;dr: not sure if there're reusable bits for the reader. It's possible
>> to reuse some for the writer but with heavy refactoring.
>>
>> *Reader*
>>
>>- For each bucket (containing the same key partition, sorted) across
>>multiple input data sets, we stream records from bucket files and merge
>>sort.
>>- We open the files in a DoFn, and emit KV where the
>>CGBKR encapsulates Iterable from each input.
>>- Basically we need a simple API like ResourceId -> Iterator, i.e.
>>sequential read, no block/offset/split requirement.
>>- FileBasedSource.FileBasedReader seems the closest fit but they're
>>nested & decoupled.
>>- There's no FileIO.Read, only a ReadMatches[1], which can be used
>>with ReadAllViaFileBasedSource. But that's not the granularity we need,
>>since we lose ordering of the input records, and can't merge 2+ sources.
>>
>> *Writer*
>>
>>- We get a `PCollection>` after bucket and
>>and sort, where Iterable is the records sorted by key and BucketShardId
>>is used to produce filename, e.g. bucket-1-shard-2.avro.
>>- We write each Iterable to a temp file and move to final
>>destination when done. Both should ideally reuse existing code.
>>- Looks like FileIO.Sink (and impls in AvroIO, TextIO, TFRecordIO)
>>supports record writing into a WritableByteChannel, but some logic like
>>compression is handled in FileIO through ViaFileBasedSink which extends
>>FileBasedSink.
>>- FileIO uses WriteFiles[3] to shard and write of PCollection.
>>Again we lose ordering of the output records or custom file naming scheme.
>>However, WriteShardsIntoTempFilesFn[4] and FinalizeTempFileBundles[5] in
>>WriteFiles seem closest to our need but would have to be split out and
>>generalized.
>>
>> *Note on reader block/offset/split requirement*
>>
>>- Because of the merge sort, we can't split or offset seek a bucket
>>file. Because without persisting the offset index of a key group 
>> somewhere,
>>we can't efficiently skip to a key group without exhausting the previous
>>ones. Furthermore we need to merge sort and align keys from multiple
>>sources, which may not have the same key distribution. It might be 
>> possible
>>to binary search for matching keys but that's extra complication. IMO the
>>reader work distribution is better solved by better bucket/shard strategy
>>in upstream writer.
>>
>> *References*
>>
>>1. ReadMatches extends PTransform,
>>PCollection>
>>    2. ReadAllViaFileBasedSource extends
>>PTransform, PCollection>
>>3. WriteFiles extends
>>PTransform, WriteFilesResult>
>>4. WriteShardsIntoTempFilesFn extends DoFn,
>>Iterable>, FileResult>
>>5. Finali

Re: Discussion/Proposal: support Sort Merge Bucket joins in Beam

2019-07-15 Thread Eugene Kirpichov
Quick note: I didn't look through the document, but please do not build on
either FileBasedSink or FileBasedReader. They are both remnants of the old,
non-composable IO world; and in fact much of the composable IO work emerged
from frustration with their limitations and recognizing that many other IOs
were suffering from the same limitations.
Instead of FileBasedSink, build on FileIO.write; instead of
FileBasedReader, build on FileIO.read.

On Mon, Jul 15, 2019 at 9:01 AM Gleb Kanterov  wrote:

> I share the same concern with Robert regarding re-implementing parts of
> IO. At the same time, in the past, I worked on internal libraries that try
> to re-use code from existing IO, and it's hardly possible because it feels
> like it wasn't designed for re-use. There are a lot of classes that are
> nested (non-static) or non-public. I can understand why they were made
> non-public, it's a hard abstraction to design well and keep compatibility.
> As Neville mentioned, decoupling readers and writers would not only benefit
> this proposal but any other use case that has to deal with a
> low-level API such as the FileSystem API, which is hardly possible today
> without copy-pasting.
>
>
>
>
>
> On Mon, Jul 15, 2019 at 5:05 PM Neville Li  wrote:
>
>> Re: avoiding mirroring IO functionality, what about:
>>
>> - Decouple the nested FileBasedSink.Writer and
>> FileBasedSource.FileBasedReader, make them top level and remove references
>> to parent classes.
>> - Simplify the interfaces, while maintaining support for block/offset
>> read & sequential write.
>> - As a bonus, the refactored IO classes can be used standalone in case
>> when the user wants to perform custom IO in a DoFn, i.e. a
>> PTransform, PCollection>>. Today
>> this requires a lot of copy-pasted Avro boilerplate.
>> - For compatibility, we can delegate to the new classes from the old ones
>> and remove them in the next breaking release.
>>
>> Re: WriteFiles logic, I'm not sure about generalizing it, but what about
>> splitting the part handling writing temp files into a new
>> PTransform>>,
>> PCollection>>? That splits the bucket-shard
>> logic from actual file IO.
>>
>> On Mon, Jul 15, 2019 at 10:27 AM Robert Bradshaw 
>> wrote:
>>
>>> I agree that generalizing the existing FileIO may not be the right
>>> path forward, and I'd only make their innards public with great care.
>>> (Would this be used like like
>>> SmbSink(MyFileIO.sink(parameters).getWriter[Factory]())?) SMB is a bit
>>> unique that the source and sink are much more coupled than other
>>> sources and sinks (which happen to be completely independent, if
>>> complementary implementations, whereas SMB attempts to be a kind of
>>> pipe where one half is instanciated in each pipeline).
>>>
>>> In short, an SMB source/sink that is parameterized by an arbitrary,
>>> existing IO would be ideal (but possibly not feasible (per existing
>>> prioritizations)), or an SMB source/sink that works as a pair. What
>>> I'd like to avoid is a set of parallel SMB IO classes that (partially,
>>> and incompletely) mirror the existing IO ones (from an API
>>> perspective--how much implementation it makes sense to share is an
>>> orthogonal issue that I'm sure can be worked out.)
>>>
>>> On Mon, Jul 15, 2019 at 4:18 PM Neville Li 
>>> wrote:
>>> >
>>> > Hi Robert,
>>> >
>>> > I agree, it'd be nice to reuse FileIO logic of different file types.
>>> But given the current code structure of FileIO & scope of the change, I
>>> feel it's better left for future refactor PRs.
>>> >
>>> > Some thoughts:
>>> > - SMB file operation is simple single file sequential reads/writes,
>>> which already exists as Writer & FileBasedReader but are private inner
>>> classes, and have references to the parent Sink/Source instance.
>>> > - The readers also have extra offset/split logic but that can be
>>> worked around.
>>> > - It'll be nice to not duplicate temp->destination file logic but
>>> again WriteFiles is assuming a single integer shard key, so it'll take some
>>> refactoring to reuse it.
>>> >
>>> > All of these can be done in backwards compatible way. OTOH
>>> generalizing the existing components too much (esp. WriteFiles, which is
>>> already complex) might lead to two logic paths, one specialized for the SMB
>>> case. It might be easier to decouple some of them for better reuse. But
>>> again I feel it's a separate discussion.
>>> >
>>> > On Mon, Jul 15, 2019 at 9:45 AM Claire McGinty <
>>> claire.d.mcgi...@gmail.com> wrote:
>>> >>
>>> >> Thanks Robert!
>>> >>
>>> >> We'd definitely like to be able to re-use existing I/O
>>> components--for example the Writer>> OutputT>/FileBasedReader (since they operate on a
>>> WritableByteChannel/ReadableByteChannel, which is the level of granularity
>>> we need) but the Writers, at least, seem to be mostly private-access. Do
>>> you foresee them being made public at any point?
>>> >>
>>> >> - Claire
>>> >>
>>> >> On Mon, Jul 15, 2019 at 9:31 AM Robert Bradshaw 
>>> wrote:
>>> >>>
>>> 

Re: pipeline status tracing

2019-06-24 Thread Eugene Kirpichov
I see. This was discussed in this thread [1] - however, it looks like nobody
has picked it up yet.

[1]
https://lists.apache.org/thread.html/235b1f683ef0eee66934fdf0908b3b85ed4e7d3627cbcfcf12b27de4@%3Cuser.beam.apache.org%3E

On Sun, Jun 23, 2019 at 4:06 AM Chaim Turkel  wrote:

> i am writing to bigquery
>
> On Fri, Jun 21, 2019 at 7:12 PM Alexey Romanenko
>  wrote:
> >
> > I see that similar questions have come up quite often recently, so it
> would probably make sense to add this “Wait.on()” pattern to the
> corresponding website documentation page [1].
> >
> > [1]
> https://beam.apache.org/documentation/patterns/file-processing-patterns/
> >
> > On 20 Jun 2019, at 23:00, Eugene Kirpichov  wrote:
> >
> > If you're writing to files, you can already do this: FileIO.write()
> returns WriteFilesResult that you can use in combination with Wait.on() and
> JdbcIO.write() to write something to a database afterwards.
> > Something like:
> >
> > PCollection<..> writeResult = data.apply(FileIO.write()...);
> >
> Create.of("dummy").apply(Wait.on(writeResult)).apply(JdbcIO.write(...write
> to database...))
> >
> >
> >
> > On Wed, Jun 19, 2019 at 11:49 PM Chaim Turkel  wrote:
> >>
> >> is it something you can do, or point me in the right direction?
> >> chaim
> >>
> >> On Wed, Jun 19, 2019 at 8:36 PM Kenneth Knowles 
> wrote:
> >> >
> >> > This sounds like a fairly simple and useful change to the IO to
> output information about where it has written. It seems generally useful to
> always output something like this, instead of PDone.
> >> >
> >> > Kenn
> >> >
> >> > On Wed, Jun 19, 2019 at 5:42 AM Chaim Turkel 
> wrote:
> >> >>
> >> >> Hi,
> >> >>   I would like to write a status at the end of my pipeline. I had
> >> >> written about this in the past, and wanted to know if there are any
> >> >> new ideas.
> >> >> The problem is that the end of the pipeline returns PDone, and I can't do
> >> >> anything with this.
> >> >> So my scenario is: after I export data from Mongo to Google Storage, I
> >> >> want to write to a DB that the job was done, with some extra
> >> >> information.
> >> >>
> >> >> Chaim Turkel
> >> >>
>


Re: pipeline status tracing

2019-06-20 Thread Eugene Kirpichov
If you're writing to files, you can already do this: FileIO.write() returns
WriteFilesResult that you can use in combination with Wait.on() and
JdbcIO.write() to write something to a database afterwards.
Something like:

PCollection<..> writeResult = data.apply(FileIO.write()...);
Create.of("dummy").apply(Wait.on(writeResult)).apply(JdbcIO.write(...write
to database...))
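
Spelled out a little more (a sketch; the output path, driver, table and
statement below are placeholders):

PCollection<String> data = ...;

WriteFilesResult<Void> writeResult =
    data.apply(FileIO.<String>write().via(TextIO.sink()).to("gs://my-bucket/output/"));

p.apply(Create.of("dummy"))
 .apply(Wait.on(writeResult.getPerDestinationOutputFilenames()))
 .apply(JdbcIO.<String>write()
     .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
         "org.postgresql.Driver", "jdbc:postgresql://host/db"))
     .withStatement("INSERT INTO job_status (marker) VALUES (?)")
     .withPreparedStatementSetter((element, statement) -> statement.setString(1, element)));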



On Wed, Jun 19, 2019 at 11:49 PM Chaim Turkel  wrote:

> is it something you can do, or point me in the right direction?
> chaim
>
> On Wed, Jun 19, 2019 at 8:36 PM Kenneth Knowles  wrote:
> >
> > This sounds like a fairly simple and useful change to the IO to output
> information about where it has written. It seems generally useful to always
> output something like this, instead of PDone.
> >
> > Kenn
> >
> > On Wed, Jun 19, 2019 at 5:42 AM Chaim Turkel  wrote:
> >>
> >> Hi,
> >>   I would like to write a status at the end of my pipeline. I had
> >> written about this in the past, and wanted to know if there are any
> >> new ideas.
> >> The problem is that the end of the pipeline returns PDone, and I can't do
> >> anything with this.
> >> So my scenario is: after I export data from Mongo to Google Storage, I
> >> want to write to a DB that the job was done, with some extra
> >> information.
> >>
> >> Chaim Turkel
> >>
>


Re: I'm thinking about new features, what do you think?

2019-06-07 Thread Eugene Kirpichov
It looks like you want to take a PCollection of lists of items of the same
type (but not necessarily of the same length - in your example you pad them
to the same length but that's unnecessary), induce an undirected graph on
them where there's an edge between XS and YS if they have an element in
common*, and compute connected components on that graph.

*you can make the problem somewhat simpler if for each list you also add
nodes for its individual elements + edges from element to list. Then the
total size of the graph increases only linearly, and the connected
components are the same, but the number of edges is no longer quadratic in
the worst case.
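
For illustration, the element-to-list edge construction above could look
roughly like this (assuming String elements; the list id scheme is arbitrary):

PCollection<List<String>> lists = ...;

// One edge per (element, list) pair: the element node links every list that
// contains it, so lists sharing an element end up in one connected component.
PCollection<KV<String, String>> elementToListEdges =
    lists.apply(ParDo.of(new DoFn<List<String>, KV<String, String>>() {
      @ProcessElement
      public void process(ProcessContext c) {
        String listId = "list-" + UUID.randomUUID();  // any unique id for this list
        for (String element : c.element()) {
          c.output(KV.of(element, listId));
        }
      }
    }));

// Grouping by element shows which lists are joined through that element;
// computing the full components would still need an iterative algorithm.
PCollection<KV<String, Iterable<String>>> listsSharingAnElement =
    elementToListEdges.apply(GroupByKey.create());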

This looks like a really specialized use case (I've never seen a
sufficiently similar problem in my career) so contributing it to the Beam
SDK might not be the best way to go, unless more people chime in that
they'd find it useful.

Unfortunately it is also likely not possible to implement in a scalable way
using Beam primitives, because Beam does not yet support iterative
computations, and computing connected components provably requires at least
O(log N) iterations. It is also easy to prove that your original problem
can not be solved faster: any connected components problem can be reduced
to yours by creating a PCollection with 1 element per edge, where the
element is {source, target}.

If you can elaborate why you think you need this algorithm, the community
might help you find a different way to accomplish the original task.

On Fri, Jun 7, 2019 at 12:14 PM Jan Lukavský  wrote:

> Hi,
>
> that sounds interesting, but it seems to be computationally intensive
> and might not scale well, if I understand it correctly. It looks
> like it needs a transitive closure, am I right?
>
>   Jan
>
> On 6/7/19 11:17 AM, i.am.moai wrote:
> > Hello everyone, nice to meet you
> >
> > I am Naoki Hyu (日宇尚記), a developer living in Tokyo. I often use Scala and
> > Python as my favorite languages.
> >
> > I have no experience with OSS development, but as I use DataFlow at
> > work, I want to contribute to the development of Beam.
> >
> > In fact, there is a feature I want to develop, and now I have the
> > source code on my local PC.
> >
> > The feature I want to create is an extension of GroupBy to multiple
> > keys, which enables more complex grouping.
> >
> > https://issues.apache.org/jira/browse/BEAM-7358
> >
> > Everyone, could you give me an opinion on this intent?
> >
>


Re: Wait on JdbcIO write completion

2019-02-20 Thread Eugene Kirpichov
Hi Jonathan,

Wait.on() requires a PCollection - it is not possible to change it to wait
on PDone because all PDone's in the pipeline are the same so it's not clear
what exactly you'd be waiting on.

To use the Wait transform with JdbcIO.write(), you would need to change
https://github.com/apache/beam/blob/master/sdks/java/io/jdbc/src/main/java/org/apache/beam/sdk/io/jdbc/JdbcIO.java#L761-L762
to
simply "return input.apply(ParDo.of(...))" and propagate that into the type
signature. Then you'd get a waitable PCollection.

This is a very simple, but backwards-incompatible change. Up to the Beam
community whether/when people would want to make it.

It's also possible to make a slightly larger but compatible change, where
JdbcIO.write() would stay as is, but you could write e.g.
"JdbcIO.write().withResults()" which would be a new transform that *does*
return results and is waitable. A similar approach is taken in
TextIO.write().withOutputFilenames().
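
Under that (not yet existing) option, usage would look roughly like this -
withResults() is only the proposed name, and the configuration below is just
a placeholder:

PCollection<KV<String, String>> users = ...;  // userId -> userUuid

PCollection<Void> firstWriteDone =
    users.apply("Write users",
        JdbcIO.<KV<String, String>>write()
            .withDataSourceConfiguration(dataSourceConfiguration)
            .withStatement("INSERT INTO users (user_id, user_uuid) VALUES (?, ?)")
            .withPreparedStatementSetter((kv, statement) -> {
              statement.setString(1, kv.getKey());
              statement.setString(2, kv.getValue());
            })
            .withResults());  // hypothetical: returns a PCollection instead of PDone

// The second write only starts once the first one has completed.
users.apply(Wait.on(firstWriteDone))
     .apply("Write visits",
         JdbcIO.<KV<String, String>>write()
             .withDataSourceConfiguration(dataSourceConfiguration)
             .withStatement("INSERT INTO visits (user_uuid, first_visit) VALUES (?, now())")
             .withPreparedStatementSetter((kv, statement) -> statement.setString(1, kv.getValue())));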

On Wed, Feb 20, 2019 at 4:58 AM Jonathan Perron 
wrote:

> Hello folks,
>
> I am meeting a special case where I need to wait for a JdbcIO.write()
> operation to be complete to start a second one.
>
> In the details, I have a PCollection> which is used
> to fill two different SQL statement. It is used in a first
> JdbcIO.write() operation to store anonymized user in a table (userId
> with an associated userUuid generated with UUID.randomUUID()). These two
> parameters have a unique constraint, meaning that a userId cannot have
> multiple userUuid. Unfortunately, on several runs of my pipeline, the
> UUID will be different, meaning that I need to query this table at some
> point, or to use what I describe in the following.
>
> I am planning to fill a second table with this userUuid with a couple of
> others information such as the time of first visit. To limit I/O and as
> I got a lot of information in my PCollection, I want to use it once more
> with a different SQL statement, where the userUuid is read from the
> first table using a SELECT statement. This cannot work if the first
> JdbcIO.write() operation is not complete.
>
> I saw that the Java SDK provides a Wait.on() PTransform, but it is
> unfortunately only compatible with a PCollection, and not a PDone such as
> the one output by the JdbcIO operation. Could my issue be solved by
> extending Wait.on(), or should I go with another solution? If so,
> how could I implement it?
>
> Many thanks for your input !
>
> Jonathan
>
>


Re: [DISCUSS] Should File based IOs implement readAll() or just readFiles()

2019-01-30 Thread Eugene Kirpichov
TextIO.read() and AvroIO.read() indeed perform better than match() +
readMatches() + readFiles(), due to DWR - so for these two in particular I
would not recommend such a refactoring.
However, new file-based IOs that do not support DWR should only provide
readFiles(). Those that do, should provide read() and readFiles(). When SDF
supports DWR, then readFiles() will be enough in all cases.
In general there's no need for readAll() for new file-based IOs - it is
always equivalent to matchAll() + readMatches() + readFiles() including
performance-wise. It was included in TextIO/AvroIO before readFiles() was a
thing.
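
Concretely, for TextIO the two spellings being compared are:

PCollection<String> filepatterns = ...;

// readAll(): predates readFiles() and is equivalent to the expansion below.
PCollection<String> lines1 = filepatterns.apply(TextIO.readAll());

// The composable spelling built from smaller transforms.
PCollection<String> lines2 =
    filepatterns
        .apply(FileIO.matchAll())
        .apply(FileIO.readMatches())
        .apply(TextIO.readFiles());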

On Wed, Jan 30, 2019 at 2:41 PM Chamikara Jayalath 
wrote:

> On Wed, Jan 30, 2019 at 2:37 PM Chamikara Jayalath 
> wrote:
>
>>
>>
>> On Wed, Jan 30, 2019 at 2:33 PM Ismaël Mejía  wrote:
>>
>>> Oops, slight typo: in the first line of the previous email I meant read
>>> instead of readAll.
>>>
>>> On Wed, Jan 30, 2019 at 11:32 PM Ismaël Mejía  wrote:
>>> >
>>> > Reuven is right for the example, readAll at this moment may be faster
>>> > and also supports Dynamic Work Rebalancing (DWR), but the performance
>>> > of the other approach may (and must) be improved to be equal, once the
>>> > internal implementation of TextIO.read moves to a SDF version instead
>>> > of the FileBasedSource one, and once that runners support DWR through
>>> > SDF. Of course all of this is future work. Probably Eugene can
>>> > eventually chime in to give more details in practical performance in
>>> > his tests in Dataflow.
>>> >
>>> > Really interesting topic, but I want to bring back the discussion to
>>> > the subject of the thread. I think there is some confusion after
>>> > Jeff's example which should have been:
>>> >
>>> >   return input
>>> >   .apply(TextIO.readAll());
>>> >
>>> > to:
>>> >
>>> >   return input
>>> >   .apply(FileIO.match().filepattern(fileSpec))
>>> >   .apply(FileIO.readMatches())
>>> >   .apply(TextIO.readFiles());
>>> >
>>> > This is the question we are addressing, do we need a readAll transform
>>> > that replaces the 3 steps or no?
>>>
>>
>> Ismaël, I'm not quite sure how these two are equal. readFiles() transform
>> returns a PCollection of ReadableFile objects. Users are expected to read
>> these files in a subsequent ParDo and produce a PCollection of proper type.
>> FooIO.ReadAll() transforms on the other hand are tailored to each IO
>> connector and return a PCollection of objects of type that are supported to
>> be returned by that IO connector.
>>
>
> I assume you meant FileIO.readFiles()  here. Or did you mean
> TextIO.readFiles() ? If so that seems very similar to TextIO.readAll().
>
>>
>>
>>
>>> >
>>> > On Wed, Jan 30, 2019 at 9:03 PM Robert Bradshaw 
>>> wrote:
>>> > >
>>> > > Yes, this is precisely the goal of SDF.
>>> > >
>>> > >
>>> > > On Wed, Jan 30, 2019 at 8:41 PM Kenneth Knowles 
>>> wrote:
>>> > > >
>>> > > > So is the latter is intended for splittable DoFn but not yet using
>>> it? The promise of SDF is precisely this composability, isn't it?
>>> > > >
>>> > > > Kenn
>>> > > >
>>> > > > On Wed, Jan 30, 2019 at 10:16 AM Jeff Klukas 
>>> wrote:
>>> > > >>
>>> > > >> Reuven - Is TextIO.read().from() a more complex case than the
>>> topic Ismaël is bringing up in this thread? I'm surprised to hear that the
>>> two examples have different performance characteristics.
>>> > > >>
>>> > > >> Reading through the implementation, I guess the fundamental
>>> difference is whether a given configuration expands to TextIO.ReadAll or to
>>> io.Read. AFAICT, that detail and the subsequent performance impact is not
>>> documented.
>>> > > >>
>>> > > >> If the above is correct, perhaps it's an argument for IOs to
>>> provide higher-level methods in cases where they can optimize performance
>>> compared to what a user might naively put together.
>>> > > >>
>>> > > >> On Wed, Jan 30, 2019 at 12:35 PM Reuven Lax 
>>> wrote:
>>> > > >>>
>>> > > >>> Jeff, what you did here is not simply a refactoring. These two
>>> are quite different, and will likely have different performance
>>> characteristics.
>>> > > >>>
>>> > > >>> The first evaluates the wildcard, and allows the runner to pick
>>> appropriate bundling. Bundles might contain multiple files (if they are
>>> small), and the runner can split the files as appropriate. In the case of
>>> the Dataflow runner, these bundles can be further split dynamically.
>>> > > >>>
>>> > > >>> The second chops up the files inside the PTransform, and
>>> processes each chunk in a ParDo. TextIO.readFiles currently chops up each
>>> file into 64mb chunks (hardcoded), and then processes each chunk in a ParDo.
>>> > > >>>
>>> > > >>> Reuven
>>> > > >>>
>>> > > >>>
>>> > > >>> On Wed, Jan 30, 2019 at 9:18 AM Jeff Klukas 
>>> wrote:
>>> > > 
>>> > >  I would prefer we move towards option [2]. I just tried the
>>> following refactor in my own code from:
>>> > > 
>>> > >    return input
>>> > >  

Re: FileIOTest.testMatchWatchForNewFiles flakey in java presubmit

2019-01-22 Thread Eugene Kirpichov
Yeah the "List expected" is constructed
from Files.getLastModifiedTime() calls before the files are actually
modified, the code is basically unconditionally broken rather than merely
flaky.

There's several easy options:
1) Use PAssert.that().satisfies() instead of .contains(), and use
assertThat().contains() inside that, with the list constructed at time the
assertion is applied rather than declared.
2) Implement a Matcher that ignores last modified time and use
that

Jeff - your option #3 is unfortunately also race-prone, because the code
may match the files after they have been written but before
setLastModifiedTime was called.
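
For option 1, the shape would be roughly the following (just a sketch, not the
actual test code; the PCollection name and file names are placeholders,
lastModifiedMillis is simply not compared, and the usual Hamcrest/JUnit static
imports are assumed):

PAssert.that(matchMetadata)
    .satisfies(
        (Iterable<MatchResult.Metadata> actual) -> {
          // The expected values are computed here, i.e. when the assertion runs
          // after the writer thread has finished, rather than at pipeline
          // construction time.
          List<String> actualFilenames = new ArrayList<>();
          for (MatchResult.Metadata metadata : actual) {
            actualFilenames.add(metadata.resourceId().getFilename());
          }
          assertThat(actualFilenames, containsInAnyOrder("first", "second", "third"));
          return null;
        });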

On Tue, Jan 22, 2019 at 5:08 PM Jeff Klukas  wrote:

> Another option:
>
> #3 Have the writer thread call Files.setLastModifiedTime explicitly after
> each File.write. Then the lastModifiedMillis can be a stable value for each
> file and we can use those same static values in our expected result. I
> think that would also eliminate the race condition.
>
> On Tue, Jan 22, 2019 at 7:48 PM Alex Amato  wrote:
>
>> Thanks Udi, is there a good example for either of these?
>> #1 - seems like you have to rewrite your assertion logic without the
>> PAssert? Is there some way to capture the pipeline output and iterate over
>> it? The pattern I have seen for this in the past also has thread safety
>> issues (Using a DoFn at the end of the pipeline to add the output to a
>> collection is not safe since the collection can be executed concurrently)
>> #2 - Would BigqueryMatcher be a good example for this? which is used in
>> BigQueryTornadoesIT.java Or is there another example you would suggest
>> looking at for reference?
>>
>>- I guess to this you need to implement the SerializableMatcher
>>interface and use the matcher as an option in the pipeline options.
>>
>>
>> On Tue, Jan 22, 2019 at 4:28 PM Udi Meiri  wrote:
>>
>>> Some options:
>>> - You could wait to assert until after p.waitForFinish().
>>> - You could PAssert using SerializableMatcher and allow any
>>> lastModifiedTime.
>>>
>>> On Tue, Jan 22, 2019 at 3:56 PM Alex Amato  wrote:
>>>
 +Jeff, Eugene,

 Hi Jeff and Eugene,

 I've noticed that Jeff's PR
 
  introduced
 a race condition in this test, but it's not clear exactly how to add Jeff's
 test check in a thread-safe way. I believe this to be the source of the
 flakiness. Do you have any suggestions, Eugene (since you authored this
 test)?

 I added some details to this JIRA issue explaining in full
 https://jira.apache.org/jira/browse/BEAM-6491?filter=-2


 On Tue, Jan 22, 2019 at 3:34 PM Alex Amato  wrote:

> I've seen this fail in a few different PRs for different contributors,
> and it's causing some issues during the presubmit process. This is a
> multithreaded test with a lot of sleeps, so it looks a bit suspicious as
> the source of the problem.
>
> https://builds.apache.org/job/beam_PreCommit_Java_Commit/3688/testReport/org.apache.beam.sdk.io/FileIOTest/testMatchWatchForNewFiles/
>
> I filed a JIRA for this issue:
> https://jira.apache.org/jira/browse/BEAM-6491?filter=-2
>
>
>


Re: SplittableDoFn

2018-10-02 Thread Eugene Kirpichov
Very cool, thanks Alex!

On Tue, Oct 2, 2018 at 2:19 PM Alex Van Boxel  wrote:

> Don't want to crash the tech discussion here, but... I just gave a session
> at the Beam Summit about Splittable DoFns from a user's perspective (from
> things I could gather from the documentation and experimentation). Here is
> the slide deck; maybe it could be useful:
> https://docs.google.com/presentation/d/1dSc6oKh5pZItQPB_QiUyEoLT2TebMnj-pmdGipkVFPk/edit?usp=sharing
>  (quite
> proud of the animations though ;-)
>
>  _/
>
> _/ Alex Van Boxel
>
>
> On Thu, Sep 27, 2018 at 12:04 AM Lukasz Cwik  wrote:
>
>> Reuven, just inside the restriction tracker itself which is scoped per
>> executing SplittableDoFn. A user could incorrectly write the
>> synchronization since they are currently responsible for writing it though.
>>
>> On Wed, Sep 26, 2018 at 2:51 PM Reuven Lax  wrote:
>>
>>> is synchronization over an entire work item, or just inside restriction
>>> tracker? my concern is that some runners (especially streaming runners)
>>> might have hundreds or thousands of parallel work items being processed for
>>> the same SDF (for different keys), and I'm afraid of creating
>>> lock-contention bottlenecks.
>>>
>>> On Fri, Sep 21, 2018 at 3:42 PM Lukasz Cwik  wrote:
>>>
 The synchronization is related to Java thread safety since there is
 likely to be concurrent access needed to a restriction tracker to properly
 handle accessing the backlog and splitting concurrently from when the users
 DoFn is executing and updating the restriction tracker. This is similar to
 the Java thread safety needed in BoundedSource and UnboundedSource for
 fraction consumed, backlog bytes, and splitting.

 On Fri, Sep 21, 2018 at 2:38 PM Reuven Lax  wrote:

> Can you give details on what the synchronization is per? Is it per
> key, or global to each worker?
>
> On Fri, Sep 21, 2018 at 2:10 PM Lukasz Cwik  wrote:
>
>> As I was looking at the SplittableDoFn API while working towards
>> making a proposal for how the backlog/splitting API could look, I found
>> some sharp edges that could be improved.
>>
>> I noticed that:
>> 1) We require users to write thread-safe code, which is something that
>> we haven't asked of users when writing a DoFn.
>> 2) We have "internal" methods within the RestrictionTracker that are not
>> meant to be used by the runner.
>>
>> I can fix these issues by giving the user a forwarding restriction
>> tracker[1] that provides an appropriate level of synchronization as 
>> needed
>> and also provides the necessary observation hooks to see when a claim
>> failed or succeeded.
>>
>> This requires a change to our experimental API since we need to pass
>> a RestrictionTracker to the @ProcessElement method instead of a sub-type 
>> of
>> RestrictionTracker.
>> @ProcessElement
>> public void processElement(ProcessContext context, OffsetRangeTracker tracker) { ... }
>>
>> becomes:
>>
>> @ProcessElement
>> public void processElement(ProcessContext context, RestrictionTracker tracker) { ... }
>>
>> This provides an additional benefit that it prevents users from
>> working around the RestrictionTracker APIs and potentially making
>> underlying changes to the tracker outside of the tryClaim call.
>>
>> Full implementation is available within this PR[2] and was wondering
>> what people thought.
>>
>> 1:
>> https://github.com/apache/beam/pull/6467/files#diff-ed95abb6bc30a9ed07faef5c3fea93f0R72
>> 2: https://github.com/apache/beam/pull/6467
>>
>>
>> On Mon, Sep 17, 2018 at 12:45 PM Lukasz Cwik 
>> wrote:
>>
>>> The changes to the API have not been proposed yet. So far it has all
>>> been about what is the representation and why.
>>>
>>> For splitting, the current idea has been about using the backlog as
>>> a way of telling the SplittableDoFn where to split, so it would be in 
>>> terms
>>> of whatever the SDK decided to report.
>>> The runner always chooses a number for backlog that is relative to
>>> the SDK's reported backlog. It would be up to the SDK to round/clamp the
>>> number given by the Runner to represent something meaningful for itself.
>>> For example, if the backlog that the SDK was reporting was bytes
>>> remaining in a file, such as 500, then the Runner could provide some value
>>> like 212.2, which the SDK would then round to 212.
>>> If the backlog that the SDK was reporting was 57 Pub/Sub messages, then
>>> the Runner could provide a value like 300, which would mean to read 57
>>> values and then another 243 as part of the current restriction.
>>>
>>> I believe that BoundedSource/UnboundedSource will have wrappers
>>> added that provide a basic SplittableDoFn implementation so existing IOs
>>> should be migrated over without API changes.
>>>
>>> 

Re: Modular IO presentation at Apachecon

2018-09-27 Thread Eugene Kirpichov
Thanks Ismael and everyone else! Unfortunately I do not believe that this
session was recorded on video :(
Juan - yes, this is some of the important future work, and I think it's not
hard to add to many connectors; contributions would be welcome.
In terms of a "per-key" Wait transform, yeah, that definitely needs to be
figured out too. The presentation considers only the non-per-key case but I
think it should not be hard to add a per-key one. If you need to do
something directly with the results, you can use Combine.perKey().

On Thu, Sep 27, 2018 at 10:10 AM Pablo Estrada  wrote:

> I'll take this chance to plug in my little directory of Beam
> tools/materials: https://github.com/pabloem/awesome-beam
>
> Please feel free to send PRs : )
>
>
> On Wed, Sep 26, 2018 at 10:29 PM Ankur Goenka  wrote:
>
>> Thanks for sharing. Great slides, and looking forward to the recorded
>> session.
>>
>> Do we have a central location where we link all the Beam presentations
>> for discoverability?
>>
>> On Wed, Sep 26, 2018 at 9:35 PM Thomas Weise  wrote:
>>
>>> Thanks for sharing. I'm looking forward to seeing the recording of the talk
>>> (hopefully!).
>>>
>>> This will be very helpful for Beam users. IO is still typically the
>>> unexpectedly hard and time-consuming part of authoring pipelines.
>>>
>>>
>>> On Wed, Sep 26, 2018 at 2:48 PM Alan Myrvold 
>>> wrote:
>>>
 Thanks for the slides.
 Really enjoyed the talk in person, especially the concept that IO is a
 transformation and that a source or sink is not special, as well as the
 splittable DoFn explanation.

 On Wed, Sep 26, 2018 at 2:17 PM Ismaël Mejía  wrote:

> Hello, today Eugene and I did a talk about modular APIs for IO
> at ApacheCon. This talk introduces some common patterns that we have
> found while creating IO connectors and also presents recent ideas like
> dynamic destinations and sequential writes, among others, using FileIO as a
> use case.
>
> In case you guys want to take a look, here is a copy of the slides; we
> will probably add this to the IO authoring documentation too.
>
> https://s.apache.org/beam-modular-io-talk
>



Re: Beam Schemas: current status

2018-08-29 Thread Eugene Kirpichov
Wow, this is really coming together, congratulations and thanks for the
great work!

On Wed, Aug 29, 2018 at 1:40 AM Reuven Lax  wrote:

> I wanted to send a quick note to the community about the current status of
> schema-aware PCollections in Beam. As some might remember we had a good
> discussion last year about the design of these schemas, involving many
> folks from different parts of the community. I sent a summary earlier this
> year explaining how schemas have been integrated into the DoFn framework.
> Much has happened since then, and here are some of the highlights.
>
> First, I want to emphasize that all the schema-aware classes are currently
> marked @Experimental. Nothing is set in stone yet, so if you have questions
> about any decisions made, please start a discussion!
>
> SQL
>
> The first big milestone for schemas was porting all of BeamSQL to use the
> framework, which was done in pr/5956. This was a lot of work, exposed many
> bugs in the schema implementation, but now provides great evidence that
> schemas work!
>
> Schema inference
>
> Beam can automatically infer schemas from Java POJOs (objects with public
> fields) or JavaBean objects (objects with getter/setter methods). Often you
> can do this by simply annotating the class. For example:
>
> @DefaultSchema(JavaFieldSchema.class)
>
> public class UserEvent {
>
>  public String userId;
>
>  public LatLong location;
>
>  public String countryCode;
>
>  public long transactionCost;
>
>  public double transactionDuration;
>
>  public List<String> traceMessages;
>
> };
>
> @DefaultSchema(JavaFieldSchema.class)
>
> public class LatLong {
>
>  public double latitude;
>
>  public double longitude;
>
> }
>
> Beam will automatically infer schemas for these classes! So if you have a
> PCollection<UserEvent>, it will automatically get the following schema:
>
> UserEvent:
>
>  userId: STRING
>
>  location: ROW(LatLong)
>
>  countryCode: STRING
>
>  transactionCost: INT64
>
>  transactionDuration: DOUBLE
>
>  traceMessages: ARRAY[STRING]
>
>
> LatLong:
>
>  latitude: DOUBLE
>
>  longitude: DOUBLE
>
> Now it’s not always possible to annotate the class like this (you may not
> own the class definition), so you can also explicitly register the class
> using Pipeline.getSchemaRegistry().registerPOJO(), and the same for
> JavaBeans (see the sketch after this message).
>
> Coders
>
> Beam has a built-in coder for any schema-aware PCollection, largely
> removing the need for users to care about coders. We generate low-level
> bytecode (using ByteBuddy) to implement the coder for each schema, so these
> coders are quite performant. This provides a better default coder for Java
> POJO objects as well. In the past users were recommended to use AvroCoder
> for pojos, which many have found inefficient. Now there’s a more-efficient
> solution.
>
> Utility Transforms
>
> Schemas are already useful for implementers of extensions such as SQL, but
> the goal was to use them to make Beam itself easier to use. To this end,
> I’ve been implementing a library of transforms that allow for easy
> manipulation of schema PCollections. So far Filter and Select are merged,
> Group is about to go out for review (it needs some more javadoc and unit
> tests), and Join is being developed but doesn’t yet have a final interface.
>
> Filter
>
> Given a PCollection<UserEvent>, I want to keep only those events in an area
> of southern Manhattan. Well, this is easy!
>
> PCollection<UserEvent> manhattanEvents = allEvents.apply(Filter
>
>  .whereFieldName("latitude", lat -> lat < 40.720 && lat > 40.699)
>
>  .whereFieldName("longitude", lng -> lng < -73.969 && lng > -74.747));
>
> Schemas along with lambdas allows us to write this transform
> declaratively. The Filter transform also allows you to register filter
> functions that operate on multiple fields at the same time.
>
> Select
>
> Let’s say that I don’t need all the fields in a row. For instance, I’m
> only interested in the userId and traceMessages, and don’t care about the
> location. In that case I can write the following:
>
> PCollection selected = allEvents.apply(Select.fieldNames(“userId”, “
> traceMessages”));
>
>
> BTW, Beam also keeps track of which fields are accessed by a transform. In
> the future we can automatically insert Selects in front of subgraphs to
> drop fields that are not referenced in that subgraph.
>
> Group
>
> Group is one of the more advanced transforms. In its most basic form, it
> provides a convenient way to group by key:
>
> PCollection> byUserAndCountry =
>
>allEvents.apply(Group.byFieldNames(“userId”, “countryCode”));
>
> Notice how much more concise this is than using GroupByKey directly!
>
> The Group transform really starts to shine however when you start
> specifying aggregations. You can aggregate any field (or fields) and build
> up an output schema based on these aggregations. For example:
>
> PCollection> aggregated = allEvents.apply(
>
>Group.byFieldNames(“userId”, “countryCode”)
>
>.aggregateField("cost", Sum.ofLongs(), "total_cost")
>
>
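
To make the schema-registration path mentioned in that message concrete, here is a minimal sketch, assuming a pipeline `p`, existing pipeline options, and the LatLong class from the example; the exact SchemaRegistry method shapes may differ from this illustration:

Pipeline p = Pipeline.create(options);
// Register a schema for a class whose source we do not own, instead of
// annotating it with @DefaultSchema.
p.getSchemaRegistry().registerPOJO(LatLong.class);
// From here on, PCollections of LatLong are schema-aware and can be used
// with Filter, Select, Group, etc.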

Re: Let's start getting rid of BoundedSource

2018-07-17 Thread Eugene Kirpichov
On Tue, Jul 17, 2018 at 2:49 AM Etienne Chauchot 
wrote:

> Hi Eugene
>
> On Monday, July 16, 2018 at 07:52 -0700, Eugene Kirpichov wrote:
>
> Hi Etienne - thanks for catching this; indeed, I somehow missed that
> actually several runners do this same thing - it seemed to me as something
> that can be done in user code (because it involves combining estimated size
> + split in pretty much the same way),
>
>
> When you say "user code", you mean IO writter code by opposition to runner
> code right ?
>
Correct: "user code" is what happens in the SDK or the user pipeline.


>
>
> but I'm not so sure: even though many runners have a "desired parallelism"
> option or alike, it's not all of them, so we can't use such an option
> universally.
>
>
> Agree, cannot be universal
>
>
> Maybe then the right thing to do is to:
> - Use bounded SDFs for these
> - Change SDF @SplitRestriction API to take a desired number of splits as a
> parameter, and introduce an API @EstimateOutputSizeBytes(element) valid
> only on bounded SDFs
>
> Agree with the idea, but EstimateOutputSize must return the size of the
> dataset, not of an element.
>
Please recall that the element here is e.g. a filename, or the name of a
BigTable table, or something like that - i.e. the element describes the
dataset, and the restriction describes what part of the dataset to process.

If e.g. we have a PCollection of filenames and apply a ReadTextFn
SDF to it, and want the runner to know the total size of all files - the
runner could insert some transforms to apply EstimateOutputSize to each
element and Sum.globally() them.
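
To illustrate the proposal, a hypothetical sketch of what such a bounded SDF could look like - note that these annotations and parameters are the ones being proposed in this thread, not an existing Beam API, and splitEvenly() is an illustrative helper:

// Hypothetical sketch of the proposed API; names are illustrative only.
class ReadTextFn extends DoFn<String, String> {   // element = a filename

  // Proposed: lets the runner estimate the size of the dataset behind one element.
  @EstimateOutputSizeBytes
  public long estimateOutputSizeBytes(String filename) throws IOException {
    return FileSystems.matchSingleFileSpec(filename).sizeBytes();
  }

  // Proposed: split the restriction honoring a runner-provided desired number of splits.
  @SplitRestriction
  public void splitRestriction(String filename, OffsetRange range,
      int desiredNumSplits, OutputReceiver<OffsetRange> out) {
    for (OffsetRange part : splitEvenly(range, desiredNumSplits)) {  // splitEvenly: illustrative helper
      out.output(part);
    }
  }

  @ProcessElement
  public void process(ProcessContext c, OffsetRangeTracker tracker) {
    // read and emit the claimed byte range of c.element()
  }
}

A runner could then apply the size-estimation method to each element and Sum.globally() the results, as described above, before deciding how many splits to ask for.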


> On some runners, each worker is given a fixed amount of heap. Thus, it is
> important that a runner can evaluate the size of the whole dataset to
> determine the size of each split (to fit in the workers' memory) and thus
> tell the bounded SDF the number of desired splits.
>
> - Add some plumbing to the standard bounded SDF expansion so that
> different runners can compute that parameter differently, the two standard
> ways being "split into given number of splits" or "split based on the
> sub-linear formula of estimated size".
>
> I think this would work, though this is somewhat more work than I
> anticipated. Any alternative ideas?
>
> +1 It will be very similar for an IO developer (@EstimateOutputSizeBytes
> will be similar to source.getEstimatedSizeBytes(),
> and @SplitRestriction(desiredSplits) similar to
> source.split(desiredBundleSize))
>
Yeah I'm not sure this is actually a good thing that these APIs end up so
similar to the old ones - I was hoping we could come up with something
better - but seems like there's no viable alternative at this point :)


>
> Etienne
>
>
> On Mon, Jul 16, 2018 at 3:07 AM Etienne Chauchot 
> wrote:
>
> Hi,
> thanks Eugene for analyzing and sharing that.
> I have one comment inline
>
> Etienne
>
> On Sunday, July 15, 2018 at 14:20 -0700, Eugene Kirpichov wrote:
>
> Hey beamers,
>
> I've always wondered whether the BoundedSource implementations in the Beam
> SDK are worth their complexity, or whether they rather could be converted
> to the much easier to code ParDo style, which is also more modular and
> allows you to very easily implement readAll().
>
> There's a handful: file-based sources, BigQuery, Bigtable, HBase,
> Elasticsearch, MongoDB, Solr and a couple more.
>
> Curiously enough, BoundedSource vs. ParDo matters *only* on Dataflow,
> because AFAICT Dataflow is the only runner that cares about the things that
> BoundedSource can do and ParDo can't:
> - size estimation (used to choose an initial number of workers) [ok, Flink
> calls the function to return statistics, but doesn't seem to do anything
> else with it]
>
> => Spark uses size estimation to set desired bundle size with something
> like desiredBundleSize = estimatedSize / nbOfWorkersConfigured (partitions)
> See
> https://github.com/apache/beam/blob/a5634128d194161aebc8d03229fdaa1066cf7739/runners/spark/src/main/java/org/apache/beam/runners/spark/io/SourceRDD.java#L101
>
>
> - splitting into bundles of given size (Dataflow chooses the number of
> bundles to create based on a simple formula that's not entirely unlike
> K*sqrt(size))
> - liquid sharding (splitAtFraction())
>
> If Dataflow didn't exist, there'd be no reason at all to use
> BoundedSource. So the question "which ones can be converted to ParDo" is
> really "which ones are used on Dataflow in ways that make these functions
> matter". Previously, my conservative assumption was that the answer is "all
> of them", but turns out this is not so.
>
> Liquid sharding always matters; if the source is li

Re: BiqQueryIO.write and Wait.on

2018-07-17 Thread Eugene Kirpichov
Hmm, I think this approach has some complications:
- Using JobStatus makes it tied to using BigQuery batch load jobs, but the
return type ought to be the same regardless of which method of writing is
used (including potential future BigQuery APIs - they are evolving), or how
many BigQuery load jobs are involved in writing a given window (it can be
multiple).
- Returning a success/failure indicator makes it prone to users ignoring
the failure: the default behavior should be that, if the pipeline succeeds,
that means all data was successfully written - if users want different
error handling, e.g. a deadletter queue, they should have to specify it
explicitly.

I would recommend returning a PCollection of a type that's invariant to
which load method is used (streaming writes, load jobs, multiple load jobs,
etc.). If it's unclear what type that should be, you could introduce an
empty type, e.g. "class BigQueryWriteResult {}", just for the sake of
signaling success, and later add something to it.
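
For context, the end-user shape once such a signal collection exists would be roughly the following sketch - rows, tableSpec, and downstreamInput are assumed to exist, and getSuccessfulWrites() plus ReadWhatWasWrittenFn are hypothetical names used only for illustration (the real accessor is exactly what is being discussed here):

WriteResult result = rows.apply(BigQueryIO.writeTableRows()
    .to(tableSpec)
    .withMethod(BigQueryIO.Write.Method.FILE_LOADS));

// Run the downstream step only after the corresponding window has been written.
downstreamInput
    .apply(Wait.on(result.getSuccessfulWrites()))   // hypothetical accessor
    .apply(ParDo.of(new ReadWhatWasWrittenFn()));   // illustrative DoFn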

On Tue, Jul 17, 2018 at 12:30 AM Carlos Alonso  wrote:

> All good so far. I've been a bit sidetracked, but more or less I have the
> idea of using the JobStatus as part of the collection so that not only the
> completion is signaled, but also the result (success/failure) can be
> accessed. How does that sound?
>
> Regards
>
> On Tue, Jul 17, 2018 at 3:07 AM Eugene Kirpichov 
> wrote:
>
>> Hi Carlos,
>>
>> Any updates / roadblocks you hit?
>>
>>
>> On Tue, Jul 3, 2018 at 7:13 AM Eugene Kirpichov 
>> wrote:
>>
>>> Awesome!! Thanks for the heads up, very exciting, this is going to make
>>> a lot of people happy :)
>>>
>>> On Tue, Jul 3, 2018, 3:40 AM Carlos Alonso  wrote:
>>>
>>>> + dev@beam.apache.org
>>>>
>>>> Just a quick email to let you know that I'm starting developing this.
>>>>
>>>> On Fri, Apr 20, 2018 at 10:30 PM Eugene Kirpichov 
>>>> wrote:
>>>>
>>>>> Hi Carlos,
>>>>>
>>>>> Thank you for expressing interest in taking this on! Let me give you a
>>>>> few pointers to start, and I'll be happy to help everywhere along the way.
>>>>>
>>>>> Basically we want BigQueryIO.write() to return something (e.g. a
>>>>> PCollection) that can be used as input to Wait.on().
>>>>> Currently it returns a WriteResult, which only contains a
>>>>> PCollection of failed inserts - that one can not be used
>>>>> directly, instead we should add another component to WriteResult that
>>>>> represents the result of successfully writing some data.
>>>>>
>>>>> Given that BQIO supports dynamic destination writes, I think it makes
>>>>> sense for that to be a PCollection> so that in 
>>>>> theory
>>>>> we could sequence different destinations independently (currently 
>>>>> Wait.on()
>>>>> does not provide such a feature, but it could); and it will require
>>>>> changing WriteResult to be WriteResult. As for what the 
>>>>> "???"
>>>>> might be - it is something that represents the result of successfully
>>>>> writing a window of data. I think it can even be Void, or "?" (wildcard
>>>>> type) for now, until we figure out something better.
>>>>>
>>>>> Implementing this would require roughly the following work:
>>>>> - Add this PCollection> to WriteResult
>>>>> - Modify the BatchLoads transform to provide it on both codepaths:
>>>>> expandTriggered() and expandUntriggered()
>>>>> ...- expandTriggered() itself writes via 2 codepaths: single-partition
>>>>> and multi-partition. Both need to be handled - we need to get a
>>>>> PCollection> from each of them, and Flatten these two
>>>>> PCollections together to get the final result. The single-partition
>>>>> codepath (writeSinglePartition) under the hood already uses WriteTables
>>>>> that returns a KV so it's directly usable. The
>>>>> multi-partition codepath ends in WriteRenameTriggered - unfortunately, 
>>>>> this
>>>>> codepath drops DestinationT along the way and will need to be refactored a
>>>>> bit to keep it until the end.
>>>>> ...- expandUntriggered() should be treated the same way.
>>>>> - Modify the StreamingWriteTables transform to provide it
>>>>> ...- Here also, the challenge is to propagate the DestinationT type
>>>>> all

Re: BiqQueryIO.write and Wait.on

2018-07-16 Thread Eugene Kirpichov
Hi Carlos,

Any updates / roadblocks you hit?

On Tue, Jul 3, 2018 at 7:13 AM Eugene Kirpichov 
wrote:

> Awesome!! Thanks for the heads up, very exciting, this is going to make a
> lot of people happy :)
>
> On Tue, Jul 3, 2018, 3:40 AM Carlos Alonso  wrote:
>
>> + dev@beam.apache.org
>>
>> Just a quick email to let you know that I'm starting developing this.
>>
>> On Fri, Apr 20, 2018 at 10:30 PM Eugene Kirpichov 
>> wrote:
>>
>>> Hi Carlos,
>>>
>>> Thank you for expressing interest in taking this on! Let me give you a
>>> few pointers to start, and I'll be happy to help everywhere along the way.
>>>
>>> Basically we want BigQueryIO.write() to return something (e.g. a
>>> PCollection) that can be used as input to Wait.on().
>>> Currently it returns a WriteResult, which only contains a
>>> PCollection of failed inserts - that one can not be used
>>> directly, instead we should add another component to WriteResult that
>>> represents the result of successfully writing some data.
>>>
>>> Given that BQIO supports dynamic destination writes, I think it makes
>>> sense for that to be a PCollection> so that in theory
>>> we could sequence different destinations independently (currently Wait.on()
>>> does not provide such a feature, but it could); and it will require
>>> changing WriteResult to be WriteResult. As for what the "???"
>>> might be - it is something that represents the result of successfully
>>> writing a window of data. I think it can even be Void, or "?" (wildcard
>>> type) for now, until we figure out something better.
>>>
>>> Implementing this would require roughly the following work:
>>> - Add this PCollection> to WriteResult
>>> - Modify the BatchLoads transform to provide it on both codepaths:
>>> expandTriggered() and expandUntriggered()
>>> ...- expandTriggered() itself writes via 2 codepaths: single-partition
>>> and multi-partition. Both need to be handled - we need to get a
>>> PCollection> from each of them, and Flatten these two
>>> PCollections together to get the final result. The single-partition
>>> codepath (writeSinglePartition) under the hood already uses WriteTables
>>> that returns a KV so it's directly usable. The
>>> multi-partition codepath ends in WriteRenameTriggered - unfortunately, this
>>> codepath drops DestinationT along the way and will need to be refactored a
>>> bit to keep it until the end.
>>> ...- expandUntriggered() should be treated the same way.
>>> - Modify the StreamingWriteTables transform to provide it
>>> ...- Here also, the challenge is to propagate the DestinationT type all
>>> the way until the end of StreamingWriteTables - it will need to be
>>> refactored. After such a refactoring, returning a KV
>>> should be easy.
>>>
>>> Another challenge with all of this is backwards compatibility in terms
>>> of API and pipeline update.
>>> Pipeline update is much less of a concern for the BatchLoads codepath,
>>> because it's typically used in batch-mode pipelines that don't get updated.
>>> I would recommend to start with this, perhaps even with only the
>>> untriggered codepath (it is much more commonly used) - that will pave the
>>> way for future work.
>>>
>>> Hope this helps, please ask more if something is unclear!
>>>
>>> On Fri, Apr 20, 2018 at 12:48 AM Carlos Alonso 
>>> wrote:
>>>
>>>> Hey Eugene!!
>>>>
>>>> I’d gladly take a stab on it although I’m not sure how much available
>>>> time I might have to put into but... yeah, let’s try it.
>>>>
>>>> Where should I begin? Is there a Jira issue or shall I file one?
>>>>
>>>> Thanks!
>>>> On Thu, 12 Apr 2018 at 00:41, Eugene Kirpichov 
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Yes, you're both right - BigQueryIO.write() is currently not
>>>>> implemented in a way that it can be used with Wait.on(). It would 
>>>>> certainly
>>>>> be a welcome contribution to change this - many people expressed interest
>>>>> in specifically waiting for BigQuery writes. Is any of you interested in
>>>>> helping out?
>>>>>
>>>>> Thanks.
>>>>>
>>>>> On Fri, Apr 6, 2018 at 12:36 AM Carlos Alonso 
>>>>> wrote:
>>>>>
>>

Re: CODEOWNERS for apache/beam repo

2018-07-16 Thread Eugene Kirpichov
We did not, but I think we should. So far, in 100% of the PRs I've
authored, the default functionality of CODEOWNERS did the wrong thing and I
had to fix something up manually.

On Mon, Jul 16, 2018 at 3:42 PM Andrew Pilloud  wrote:

> This sounds like a good plan. Did we want to rename the CODEOWNERS file to
> disable github's mass adding of reviewers while we figure this out?
>
> Andrew
>
> On Mon, Jul 16, 2018 at 10:20 AM Jean-Baptiste Onofré 
> wrote:
>
>> +1
>>
>> On Jul 16, 2018, at 19:17, Holden Karau wrote:
>>>
>>> Ok if no one objects I'll create the INFRA ticket after OSCON and we can
>>> test it for a week and decide if it helps or hinders.
>>>
>>> On Mon, Jul 16, 2018, 7:12 PM Jean-Baptiste Onofré < j...@nanthrax.net>
>>> wrote:
>>>
>>>> Agree to test it for a week.
>>>>
>>>> Regards
>>>> JB
>>>> On Jul 16, 2018, at 18:59, Holden Karau < holden.ka...@gmail.com>
>>>> wrote:
>>>>>
>>>>> Would folks be OK with me asking infra to turn on blame-based
>>>>> suggestions for Beam and trying it out for a week?
>>>>>
>>>>> On Mon, Jul 16, 2018, 6:53 PM Rafael Fernandez < rfern...@google.com>
>>>>> wrote:
>>>>>
>>>>>> +1 using blame -- nifty :)
>>>>>>
>>>>>> On Mon, Jul 16, 2018 at 2:31 AM Huygaa Batsaikhan < bat...@google.com>
>>>>>> wrote:
>>>>>>
>>>>>>> +1. This is great.
>>>>>>>
>>>>>>> On Sat, Jul 14, 2018 at 7:44 AM Udi Meiri < eh...@google.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Mention bot looks cool, as it tries to guess the reviewer using
>>>>>>>> blame.
>>>>>>>> I've written a quick and dirty script that uses only CODEOWNERS.
>>>>>>>>
>>>>>>>> Its output looks like:
>>>>>>>> $ python suggest_reviewers.py --pr 5940
>>>>>>>> INFO:root:Selected reviewer @lukecwik for:
>>>>>>>> /runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/PTransformMatchers.java
>>>>>>>> (path_pattern: /runners/core-construction-java*)
>>>>>>>> INFO:root:Selected reviewer @lukecwik for:
>>>>>>>> /runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/SplittableParDoNaiveBounded.java
>>>>>>>> (path_pattern: /runners/core-construction-java*)
>>>>>>>> INFO:root:Selected reviewer @echauchot for:
>>>>>>>> /runners/core-java/src/main/java/org/apache/beam/runners/core/SplittableParDoViaKeyedWorkItems.java
>>>>>>>> (path_pattern: /runners/core-java*)
>>>>>>>> INFO:root:Selected reviewer @lukecwik for:
>>>>>>>> /runners/flink/build.gradle (path_pattern: */build.gradle*)
>>>>>>>> INFO:root:Selected reviewer @lukecwik for:
>>>>>>>> /runners/flink/src/main/java/org/apache/beam/runners/flink/FlinkTransformOverrides.java
>>>>>>>> (path_pattern: *.java)
>>>>>>>> INFO:root:Selected reviewer @pabloem for:
>>>>>>>> /runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowRunner.java
>>>>>>>> (path_pattern: /runners/google-cloud-dataflow-java*)
>>>>>>>> INFO:root:Selected reviewer @lukecwik for:
>>>>>>>> /sdks/java/core/src/test/java/org/apache/beam/sdk/transforms/SplittableDoFnTest.java
>>>>>>>> (path_pattern: /sdks/java/core*)
>>>>>>>> Suggested reviewers: @echauchot, @lukecwik, @pabloem
>>>>>>>>
>>>>>>>> Script is in: https://github.com/apache/beam/pull/5951
>>>>>>>>
>>>>>>>>
>>>>>>>> What does the community think? Do you prefer blame-based or
>>>>>>>> rules-based reviewer suggestions?
>>>>>>>>
>>>>>>>> On Fri, Jul 13, 2018 at 11:13 AM Holden Karau <
>>>>>>>> hol...@pigscanfly.ca> wrote:
>>>>>>>>
>>>>>>>>> I'm looking at something similar in the Spark project, and while
>>>>>>>>> it's now archived by FB it s

Re: Let's start getting rid of BoundedSource

2018-07-16 Thread Eugene Kirpichov
Hey all,

The PR https://github.com/apache/beam/pull/5940 was merged, and now all
runners at "master" support bounded-per-element SDFs!
Thanks +Ismaël Mejía  for the reviews.
I have updated the Capability Matrix as well:
https://beam.apache.org/documentation/runners/capability-matrix/


On Mon, Jul 16, 2018 at 7:56 AM Jean-Baptiste Onofré 
wrote:

> Hi guys,
>
> I think it's the purpose of SDF to simplify BoundedSource-like writing.
>
> I agree that extended @SplitRestriction is a good approach.
>
> Regards
> JB
>
> On 16/07/2018 16:52, Eugene Kirpichov wrote:
> > Hi Etienne - thanks for catching this; indeed, I somehow missed that
> > actually several runners do this same thing - it seemed to me as
> > something that can be done in user code (because it involves combining
> > estimated size + split in pretty much the same way), but I'm not so
> > sure: even though many runners have a "desired parallelism" option or
> > alike, it's not all of them, so we can't use such an option universally.
> >
> > Maybe then the right thing to do is to:
> > - Use bounded SDFs for these
> > - Change SDF @SplitRestriction API to take a desired number of splits as
> > a parameter, and introduce an API @EstimateOutputSizeBytes(element)
> > valid only on bounded SDFs
> > - Add some plumbing to the standard bounded SDF expansion so that
> > different runners can compute that parameter differently, the two
> > standard ways being "split into given number of splits" or "split based
> > on the sub-linear formula of estimated size".
> >
> > I think this would work, though this is somewhat more work than I
> > anticipated. Any alternative ideas?
> >
> > On Mon, Jul 16, 2018 at 3:07 AM Etienne Chauchot wrote:
> >
> > Hi,
> > thanks Eugene for analyzing and sharing that.
> > I have one comment inline
> >
> > Etienne
> >
> > On Sunday, July 15, 2018 at 14:20 -0700, Eugene Kirpichov wrote:
> >> Hey beamers,
> >>
> >> I've always wondered whether the BoundedSource implementations in
> >> the Beam SDK are worth their complexity, or whether they rather
> >> could be converted to the much easier to code ParDo style, which
> >> is also more modular and allows you to very easily implement
> >> readAll().
> >>
> >> There's a handful: file-based sources, BigQuery, Bigtable, HBase,
> >> Elasticsearch, MongoDB, Solr and a couple more.
> >>
> >> Curiously enough, BoundedSource vs. ParDo matters *only* on
> >> Dataflow, because AFAICT Dataflow is the only runner that cares
> >> about the things that BoundedSource can do and ParDo can't:
> >> - size estimation (used to choose an initial number of workers)
> >> [ok, Flink calls the function to return statistics, but doesn't
> >> seem to do anything else with it]
> > => Spark uses size estimation to set desired bundle size with
> > something like desiredBundleSize = estimatedSize /
> > nbOfWorkersConfigured (partitions)
> > See
> >
> https://github.com/apache/beam/blob/a5634128d194161aebc8d03229fdaa1066cf7739/runners/spark/src/main/java/org/apache/beam/runners/spark/io/SourceRDD.java#L101
> >
> >
> >> - splitting into bundles of given size (Dataflow chooses the
> >> number of bundles to create based on a simple formula that's not
> >> entirely unlike K*sqrt(size))
> >> - liquid sharding (splitAtFraction())
> >>
> >> If Dataflow didn't exist, there'd be no reason at all to use
> >> BoundedSource. So the question "which ones can be converted to
> >> ParDo" is really "which ones are used on Dataflow in ways that
> >> make these functions matter". Previously, my conservative
> >> assumption was that the answer is "all of them", but turns out
> >> this is not so.
> >>
> >> Liquid sharding always matters; if the source is liquid-shardable,
> >> for now we have to keep it a source (until SDF gains liquid
> >> sharding - which should happen in a quarter or two I think).
> >>
> >> Choosing number of bundles to split into is easily done in SDK
> >> code, see https://github.com/apache/beam/pull/5886 for example;
> >> DatastoreIO does something similar.
> >>
> >> The remaining thing to analyze is, when does initial scaling
> >> matt

An update on Eugene

2018-07-16 Thread Eugene Kirpichov
Hi beamers,

After 5.5 years working on data processing systems at Google, several of
these years working on Dataflow and Beam, I am moving on to do something
new (also at Google) in the area of programming models for machine
learning. Anybody who worked with me closely knows how much I love building
programming models, so I could not pass up on the opportunity to build a
new one - I expect to have a lot of fun there!

On the new team we very much plan to make things open-source when the time
is right, and make use of Beam, just as TensorFlow does - so I will stay in
touch with the community, and I expect that we will still work together on
some things. However, Beam will no longer be the main focus of my work.

I've made the decision a couple months ago and have spent the time since
then getting things into a good state and handing over the community
efforts in which I have played a particularly active role - they are in
very capable hands:
- Robert Bradshaw and Ankur Goenka on Google side are taking charge of
Portable Runners (e.g. the Portable Flink runner).
- Luke Cwik will be in charge of the future of Splittable DoFn. Ismael
Mejia has also been involved in the effort and actively helping, and I
believe he continues to do so.
- The Beam IO ecosystem in general is in very good shape (perhaps the best
in the industry) and does not need a lot of constant direction; and it has
a great community (thanks JB, Ismael, Etienne and many others!) - however,
on Google side, Chamikara Jayalath will take it over.

It was a great pleasure working with you all. My last day formally on Beam
will be this coming Friday, then I'll take a couple weeks of vacation and
jump right in on the new team.

Of course, if my involvement in something is necessary, I'm still available
on all the same channels as always (email, Slack, Hangouts) - but, in
general, please contact the folks mentioned above instead of me about the
respective matters from now on.

Thanks!


Re: Let's start getting rid of BoundedSource

2018-07-16 Thread Eugene Kirpichov
Hi Etienne - thanks for catching this; indeed, I somehow missed that
actually several runners do this same thing - it seemed to me as something
that can be done in user code (because it involves combining estimated size
+ split in pretty much the same way), but I'm not so sure: even though many
runners have a "desired parallelism" option or alike, it's not all of them,
so we can't use such an option universally.

Maybe then the right thing to do is to:
- Use bounded SDFs for these
- Change SDF @SplitRestriction API to take a desired number of splits as a
parameter, and introduce an API @EstimateOutputSizeBytes(element) valid
only on bounded SDFs
- Add some plumbing to the standard bounded SDF expansion so that different
runners can compute that parameter differently, the two standard ways being
"split into given number of splits" or "split based on the sub-linear
formula of estimated size".

I think this would work, though this is somewhat more work than I
anticipated. Any alternative ideas?

On Mon, Jul 16, 2018 at 3:07 AM Etienne Chauchot 
wrote:

> Hi,
> thanks Eugene for analyzing and sharing that.
> I have one comment inline
>
> Etienne
>
> On Sunday, July 15, 2018 at 14:20 -0700, Eugene Kirpichov wrote:
>
> Hey beamers,
>
> I've always wondered whether the BoundedSource implementations in the Beam
> SDK are worth their complexity, or whether they rather could be converted
> to the much easier to code ParDo style, which is also more modular and
> allows you to very easily implement readAll().
>
> There's a handful: file-based sources, BigQuery, Bigtable, HBase,
> Elasticsearch, MongoDB, Solr and a couple more.
>
> Curiously enough, BoundedSource vs. ParDo matters *only* on Dataflow,
> because AFAICT Dataflow is the only runner that cares about the things that
> BoundedSource can do and ParDo can't:
> - size estimation (used to choose an initial number of workers) [ok, Flink
> calls the function to return statistics, but doesn't seem to do anything
> else with it]
>
> => Spark uses size estimation to set desired bundle size with something
> like desiredBundleSize = estimatedSize / nbOfWorkersConfigured (partitions)
> See
> https://github.com/apache/beam/blob/a5634128d194161aebc8d03229fdaa1066cf7739/runners/spark/src/main/java/org/apache/beam/runners/spark/io/SourceRDD.java#L101
>
>
> - splitting into bundles of given size (Dataflow chooses the number of
> bundles to create based on a simple formula that's not entirely unlike
> K*sqrt(size))
> - liquid sharding (splitAtFraction())
>
> If Dataflow didn't exist, there'd be no reason at all to use
> BoundedSource. So the question "which ones can be converted to ParDo" is
> really "which ones are used on Dataflow in ways that make these functions
> matter". Previously, my conservative assumption was that the answer is "all
> of them", but turns out this is not so.
>
> Liquid sharding always matters; if the source is liquid-shardable, for now
> we have to keep it a source (until SDF gains liquid sharding - which should
> happen in a quarter or two I think).
>
> Choosing number of bundles to split into is easily done in SDK code, see
> https://github.com/apache/beam/pull/5886 for example; DatastoreIO does
> something similar.
>
> The remaining thing to analyze is, when does initial scaling matter. So as
> a member of the Dataflow team, I analyzed statistics of production Dataflow
> jobs in the past month. I can not share my queries nor the data, because
> they are proprietary to Google - so I am sharing just the general
> methodology and conclusions, because they matter to the Beam community. I
> looked at a few criteria, such as:
> - The job should be not too short and not too long: if it's too short then
> scaling couldn't have kicked in much at all; if it's too long then dynamic
> autoscaling would have been sufficient.
> - The job should use, at peak, at least a handful of workers (otherwise
> means it wasn't used in settings where much scaling happened)
> After a couple more rounds of narrowing-down, with some hand-checking that
> the results and criteria so far make sense, I ended up with nothing - no
> jobs that would have suffered a serious performance regression if their
> BoundedSource had not supported initial size estimation [of course, except
> for the liquid-shardable ones].
>
> Based on this, I would like to propose to convert the following
> BoundedSource-based IOs to ParDo-based, and while we're at it, probably
> also add readAll() versions (not necessarily in exactly the same PR):
> - ElasticsearchIO
> - SolrIO
> - MongoDbIO
> - MongoDbGridFSIO
> - CassandraIO
> - HCatalogIO
> - HadoopInputFormatIO
> - UnboundedToBoundedSourceAdapter (alr

Let's start getting rid of BoundedSource

2018-07-15 Thread Eugene Kirpichov
Hey beamers,

I've always wondered whether the BoundedSource implementations in the Beam
SDK are worth their complexity, or whether they rather could be converted
to the much easier to code ParDo style, which is also more modular and
allows you to very easily implement readAll().

There's a handful: file-based sources, BigQuery, Bigtable, HBase,
Elasticsearch, MongoDB, Solr and a couple more.

Curiously enough, BoundedSource vs. ParDo matters *only* on Dataflow,
because AFAICT Dataflow is the only runner that cares about the things that
BoundedSource can do and ParDo can't:
- size estimation (used to choose an initial number of workers) [ok, Flink
calls the function to return statistics, but doesn't seem to do anything
else with it]
- splitting into bundles of given size (Dataflow chooses the number of
bundles to create based on a simple formula that's not entirely unlike
K*sqrt(size))
- liquid sharding (splitAtFraction())

If Dataflow didn't exist, there'd be no reason at all to use BoundedSource.
So the question "which ones can be converted to ParDo" is really "which
ones are used on Dataflow in ways that make these functions matter".
Previously, my conservative assumption was that the answer is "all of
them", but turns out this is not so.

Liquid sharding always matters; if the source is liquid-shardable, for now
we have to keep it a source (until SDF gains liquid sharding - which should
happen in a quarter or two I think).

Choosing number of bundles to split into is easily done in SDK code, see
https://github.com/apache/beam/pull/5886 for example; DatastoreIO does
something similar.

The remaining thing to analyze is, when does initial scaling matter. So as
a member of the Dataflow team, I analyzed statistics of production Dataflow
jobs in the past month. I can not share my queries nor the data, because
they are proprietary to Google - so I am sharing just the general
methodology and conclusions, because they matter to the Beam community. I
looked at a few criteria, such as:
- The job should be not too short and not too long: if it's too short then
scaling couldn't have kicked in much at all; if it's too long then dynamic
autoscaling would have been sufficient.
- The job should use, at peak, at least a handful of workers (otherwise it
means the job wasn't used in settings where much scaling happened).
After a couple more rounds of narrowing-down, with some hand-checking that
the results and criteria so far make sense, I ended up with nothing - no
jobs that would have suffered a serious performance regression if their
BoundedSource had not supported initial size estimation [of course, except
for the liquid-shardable ones].

Based on this, I would like to propose to convert the following
BoundedSource-based IOs to ParDo-based, and while we're at it, probably
also add readAll() versions (not necessarily in exactly the same PR):
- ElasticsearchIO
- SolrIO
- MongoDbIO
- MongoDbGridFSIO
- CassandraIO
- HCatalogIO
- HadoopInputFormatIO
- UnboundedToBoundedSourceAdapter (already have a PR in progress for this
one)
These would not translate to a single ParDo - rather, they'd translate to
ParDo(estimate size and split according to the formula), Reshuffle,
ParDo(read data) - or possibly to a bounded SDF doing roughly the same
(luckily after https://github.com/apache/beam/pull/5940 all runners at
master will support bounded SDF so this is safe compatibility-wise). Pretty
much like DatastoreIO does.
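
As a rough illustration of that expansion - names like Read, Result, estimateSizeBytes, desiredNumSplits, splitIntoShards, and readShard are made up for this sketch, and the real per-IO code would differ:

PCollection<Read> readRequests = ...;  // e.g. one Read spec per query/table

PCollection<Result> results = readRequests
    .apply("EstimateAndSplit", ParDo.of(new DoFn<Read, Read>() {
      @ProcessElement
      public void process(ProcessContext c) {
        Read spec = c.element();
        long estimatedSize = estimateSizeBytes(spec);      // ask the backing service
        int numSplits = desiredNumSplits(estimatedSize);   // e.g. a K*sqrt(size)-style formula
        for (Read shard : splitIntoShards(spec, numSplits)) {
          c.output(shard);
        }
      }
    }))
    .apply(Reshuffle.viaRandomKey())                        // redistribute shards across workers
    .apply("Read", ParDo.of(new DoFn<Read, Result>() {
      @ProcessElement
      public void process(ProcessContext c) {
        for (Result record : readShard(c.element())) {      // actually read one shard
          c.output(record);
        }
      }
    }));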

I would like to also propose to change the IO authoring guide
https://beam.apache.org/documentation/io/authoring-overview/#when-to-implement-using-the-source-api
to
basically say "Never implement a new BoundedSource unless you can support
liquid sharding". And add a utility for computing a desired number of
splits.

There might be some more details here to iron out, but I wanted to check
with the community that this overall makes sense.

Thanks.


Re: CODEOWNERS for apache/beam repo

2018-07-13 Thread Eugene Kirpichov
Sounds reasonable for now, thanks!
It's unfortunate that Github's CODEOWNERS feature appears to be effectively
unusable for Beam but I'd hope that Github might pay attention and fix
things if we submit feedback, with us being one of the most active Apache
projects - did anyone do this yet / planning to?

On Fri, Jul 13, 2018 at 10:23 AM Udi Meiri  wrote:

> While I like the idea of having a CODEOWNERS file, the Github
> implementation is lacking:
> 1. Reviewers are automatically assigned at each push.
> 2. Reviewer assignment can be excessive (e.g. 5 reviewers in Eugene's PR
> 5940).
> 3. Non-committers aren't assigned as reviewers.
> 4. Non-committers can't change the list of reviewers.
>
> I propose renaming the file to disable the auto-reviewer assignment
> feature.
> In its place I'll add a script that suggests reviewers.
>
> On Fri, Jul 13, 2018 at 9:09 AM Udi Meiri  wrote:
>
>> Hi Etienne,
>>
>> Yes you could be as precise as you want. The paths I listed are just
>> suggestions. :)
>>
>>
>> On Fri, Jul 13, 2018 at 1:12 AM Jean-Baptiste Onofré 
>> wrote:
>>
>>> Hi,
>>>
>>> I think it's already do-able just providing the expected path.
>>>
>>> It's a good idea especially for the core.
>>>
>>> Regards
>>> JB
>>>
>>> On 13/07/2018 09:51, Etienne Chauchot wrote:
>>> > Hi Udi,
>>> >
>>> > I also have a question, related to what Eugene asked : I see that the
>>> > code paths are the ones of the modules. Can we be more precise than
>>> that
>>> > to assign reviewers ? As an example, I added myself to runner/core
>>> > because I wanted to take a look at the PRs related to
>>> > runner/core/metrics but I'm getting assigned to all runner-core PRs.
>>> Can
>>> > we specify paths like
>>> > runners/core-java/src/main/java/org/apache/beam/runners/core/metrics ?
>>> > I know it is a bit too precise so a bit risky, but in that particular
>>> > case, I doubt that the path will change.
>>> >
>>> > Etienne
>>> >
>>> > Le jeudi 12 juillet 2018 à 16:49 -0700, Eugene Kirpichov a écrit :
>>> >> Hi Udi,
>>> >>
>>> >> I see that the PR was merged - thanks! However it seems to have some
>>> >> unintended effects.
>>> >>
>>> >> On my PR https://github.com/apache/beam/pull/5940 , I assigned a
>>> >> reviewer manually, but the moment I pushed a new commit, it
>>> >> auto-assigned a lot of other people to it, and I had to remove them.
>>> >> This seems like a big inconvenience to me, is there a way to disable
>>> this?
>>> >>
>>> >> Thanks.
>>> >>
>>> >> On Thu, Jul 12, 2018 at 2:53 PM Udi Meiri >> >> <mailto:eh...@google.com>> wrote:
>>> >>> :/ That makes it a little less useful.
>>> >>>
>>> >>> On Thu, Jul 12, 2018 at 11:14 AM Tim Robertson
>>> >>> mailto:timrobertson...@gmail.com>>
>>> wrote:
>>> >>>> Hi Udi
>>> >>>>
>>> >>>> I asked the GH helpdesk and they confirmed that only people with
>>> >>>> write access will actually be automatically chosen.
>>> >>>>
>>> >>>> I don't expect it should stop us using it, but we should be aware
>>> >>>> that there are non-committers also willing to review.
>>> >>>>
>>> >>>> Thanks,
>>> >>>> Tim
>>> >>>>
>>> >>>> On Thu, Jul 12, 2018 at 7:24 PM, Mikhail Gryzykhin
>>> >>>> mailto:mig...@google.com>> wrote:
>>> >>>>> Idea looks good in general.
>>> >>>>>
>>> >>>>> Did you look into ways to keep this file up-to-date? For example we
>>> >>>>> can run monthly job to see if owner was active during this period.
>>> >>>>>
>>> >>>>> --Mikhail
>>> >>>>>
>>> >>>>> Have feedback <http://go/migryz-feedback>?
>>> >>>>>
>>> >>>>>
>>> >>>>> On Thu, Jul 12, 2018 at 9:56 AM Udi Meiri >> >>>>> <mailto:eh...@google.com>> wrote:
>>> >>>>>> Thanks all!
>>> >>>>>> I'll try to get the file merg

Re: CODEOWNERS for apache/beam repo

2018-07-12 Thread Eugene Kirpichov
Hi Udi,

I see that the PR was merged - thanks! However it seems to have some
unintended effects.

On my PR https://github.com/apache/beam/pull/5940 , I assigned a reviewer
manually, but the moment I pushed a new commit, it auto-assigned a lot of
other people to it, and I had to remove them. This seems like a big
inconvenience to me, is there a way to disable this?

Thanks.

On Thu, Jul 12, 2018 at 2:53 PM Udi Meiri  wrote:

> :/ That makes it a little less useful.
>
> On Thu, Jul 12, 2018 at 11:14 AM Tim Robertson 
> wrote:
>
>> Hi Udi
>>
>> I asked the GH helpdesk and they confirmed that only people with write
>> access will actually be automatically chosen.
>>
>> I don't expect it should stop us using it, but we should be aware that
>> there are non-committers also willing to review.
>>
>> Thanks,
>> Tim
>>
>> On Thu, Jul 12, 2018 at 7:24 PM, Mikhail Gryzykhin 
>> wrote:
>>
>>> Idea looks good in general.
>>>
>>> Did you look into ways to keep this file up-to-date? For example we can
>>> run monthly job to see if owner was active during this period.
>>>
>>> --Mikhail
>>>
>>> Have feedback ?
>>>
>>>
>>> On Thu, Jul 12, 2018 at 9:56 AM Udi Meiri  wrote:
>>>
 Thanks all!
 I'll try to get the file merged today and see how it works out.
 Please surface any issues, such as with auto-assignment, here or in
 JIRA.

 On Thu, Jul 12, 2018 at 2:12 AM Etienne Chauchot 
 wrote:

> Hi,
>
> I added myself as a reviewer for some modules.
>
> Etienne
>
> On Monday, July 9, 2018 at 17:06 -0700, Udi Meiri wrote:
>
> Hi everyone,
>
> I'm proposing to add auto-reviewer-assignment using Github's
> CODEOWNERS mechanism.
> Initial version is here: *https://github.com/apache/beam/pull/5909/files
> *
>
> I need help from the community in determining owners for each
> component.
> Feel free to directly edit the PR (if you have permission) or add a
> comment.
>
>
> Background
> The idea is to:
> 1. Document good review candidates for each component.
> 2. Help choose reviewers using the auto-assignment mechanism. The
> suggestion is in no way binding.
>
>
>
>>


Building the Java SDK container with Jib?

2018-07-09 Thread Eugene Kirpichov
Hi,

Apparently a new tool has come out that lets you build Java containers
cheaply, without even having Docker installed:
https://cloudplatform.googleblog.com/2018/07/introducing-jib-build-java-docker-images-better.html


Anyone interested in giving it a shot, to have faster turnaround when
making changes to the Java SDK harness?


Re: Performance issue in Beam 2.4 onwards

2018-07-09 Thread Eugene Kirpichov
Hi -

If I remember correctly, the reason for this change was to ensure that the
state is encodable at all. Prior to the change, there had been situations
where the coder specified on a state cell was buggy, absent, or set
incorrectly (due to some issue in coder inference), but the direct runner did
not detect this because it never tried to encode the state cells - this
would have blown up in any distributed runner.

I think it should be possible to relax this and clone only values being
added to the state, rather than cloning the whole state on copy(). I don't
have time to work on this change myself, but I can review a PR if someone
else does.
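
A sketch of that relaxation, using a value-state cell as an example - this is illustrative only, the real InMemoryStateInternals code is in the commits linked below and is shaped somewhat differently:

// Clone once, when the value enters the state cell...
@Override
public void write(T input) {
  this.value = uncheckedClone(coder, input);  // still catches broken/missing coders
}

// ...so that copy() can go back to a cheap reference assignment.
@Override
public InMemoryValue<T> copy() {
  InMemoryValue<T> that = new InMemoryValue<>(coder);
  that.value = this.value;
  return that;
}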

On Mon, Jul 9, 2018 at 8:28 AM Jean-Baptiste Onofré  wrote:

> Hi Vojta,
>
> I fully agree, that's why it makes sense to wait Eugene's feedback.
>
> I remember we had some performance regression on the direct runner
> identified thanks to Nexmark, but it has been addressed by reverting a
> change.
>
> Good catch anyway !
>
> Regards
> JB
>
> On 09/07/2018 17:20, Vojtech Janota wrote:
> > Hi Reuven,
> >
> > I'm not really complaining about DirectRunner. In fact it seems to me as
> > if what previously was considered as part of the "expensive extra
> > checks" done by the DirectRunner is now done within the
> > beam-runners-core-java library. Considering that all objects involved
> > are immutable (in our case at least) and simple assignment is
> > sufficient, the serialization-deserialization really seems as unwanted
> > and hugely expensive correctness check. If there was a problem with
> > identity copy, wasn't DirectRunner supposed to reveal it?
> >
> > Regards,
> > Vojta
> >
> > On Mon, Jul 9, 2018 at 4:46 PM Reuven Lax wrote:
> >
> > Hi Vojita,
> >
> > One problem is that the DirectRunner is designed for testing, not
> > for performance. The DirectRunner currently does many
> > purposely-inefficient things, the point of which is to better expose
> > potential bugs in tests. For example, the DirectRunner will randomly
> > shuffle the order of PCollections to ensure that your code does not
> > rely on ordering.  All of this adds cost, because the current runner
> > is designed for testing. There have been requests in the past for an
> > "optimized" local runner, however we don't currently have such a
> thing.
> >
> > In this case, using coders to clone values is more correct. In a
> > distributed environment using encode/decode is the only way to copy
> > values, and the DirectRunner is trying to ensure that your code is
> > correct in a distributed environment.
> >
> > Reuven
> >
> > On Mon, Jul 9, 2018 at 7:22 AM Vojtech Janota wrote:
> >
> > Hi,
> >
> > We are using Apache Beam in our project for some time now. Since
> > our datasets are of modest size, we have so far used
> > DirectRunner as the computation easily fits onto a single
> > machine. Recently we upgraded Beam from 2.2 to 2.4 and found out
> > that performance of our pipelines drastically deteriorated.
> > Pipelines that took ~3 minutes with 2.2 do not finish within
> > hours now. We tried to isolate the change that causes the
> > slowdown and came to the commits into the
> > "InMemoryStateInternals" class:
> >
> > * https://github.com/apache/beam/commit/32a427c
> > 
> > * https://github.com/apache/beam/commit/8151d82
> > 
> >
> > In a nutshell, where previously the copy() method simply assigned:
> >
> >   that.value = this.value
> >
> > There is now a coder encode/decode combo hidden behind:
> >
> >   that.value = uncheckedClone(coder, this.value)
> >
> > Can somebody explain the purpose of this change? Is it meant as
> > an additional "enforcement" point, similar to DirectRunner's
> > enforceImmutability and enforceEncodability? Or is it something
> > that is genuinely needed to provide correct behaviour of the
> > pipeline?
> >
> > Any hints or thoughts are appreciated.
> >
> > Regards,
> > Vojta
> >
> >
> >
> >
> >
> >
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: BiqQueryIO.write and Wait.on

2018-07-03 Thread Eugene Kirpichov
Awesome!! Thanks for the heads up, very exciting, this is going to make a
lot of people happy :)

On Tue, Jul 3, 2018, 3:40 AM Carlos Alonso  wrote:

> + dev@beam.apache.org
>
> Just a quick email to let you know that I'm starting developing this.
>
> On Fri, Apr 20, 2018 at 10:30 PM Eugene Kirpichov 
> wrote:
>
>> Hi Carlos,
>>
>> Thank you for expressing interest in taking this on! Let me give you a
>> few pointers to start, and I'll be happy to help everywhere along the way.
>>
>> Basically we want BigQueryIO.write() to return something (e.g. a
>> PCollection) that can be used as input to Wait.on().
>> Currently it returns a WriteResult, which only contains a
>> PCollection of failed inserts - that one can not be used
>> directly, instead we should add another component to WriteResult that
>> represents the result of successfully writing some data.
>>
>> Given that BQIO supports dynamic destination writes, I think it makes
>> sense for that to be a PCollection> so that in theory
>> we could sequence different destinations independently (currently Wait.on()
>> does not provide such a feature, but it could); and it will require
>> changing WriteResult to be WriteResult. As for what the "???"
>> might be - it is something that represents the result of successfully
>> writing a window of data. I think it can even be Void, or "?" (wildcard
>> type) for now, until we figure out something better.
>>
>> Implementing this would require roughly the following work:
>> - Add this PCollection> to WriteResult
>> - Modify the BatchLoads transform to provide it on both codepaths:
>> expandTriggered() and expandUntriggered()
>> ...- expandTriggered() itself writes via 2 codepaths: single-partition
>> and multi-partition. Both need to be handled - we need to get a
>> PCollection> from each of them, and Flatten these two
>> PCollections together to get the final result. The single-partition
>> codepath (writeSinglePartition) under the hood already uses WriteTables
>> that returns a KV so it's directly usable. The
>> multi-partition codepath ends in WriteRenameTriggered - unfortunately, this
>> codepath drops DestinationT along the way and will need to be refactored a
>> bit to keep it until the end.
>> ...- expandUntriggered() should be treated the same way.
>> - Modify the StreamingWriteTables transform to provide it
>> ...- Here also, the challenge is to propagate the DestinationT type all
>> the way until the end of StreamingWriteTables - it will need to be
>> refactored. After such a refactoring, returning a KV
>> should be easy.
>>
>> Another challenge with all of this is backwards compatibility in terms of
>> API and pipeline update.
>> Pipeline update is much less of a concern for the BatchLoads codepath,
>> because it's typically used in batch-mode pipelines that don't get updated.
>> I would recommend to start with this, perhaps even with only the
>> untriggered codepath (it is much more commonly used) - that will pave the
>> way for future work.
>>
>> Hope this helps, please ask more if something is unclear!
>>
>> On Fri, Apr 20, 2018 at 12:48 AM Carlos Alonso 
>> wrote:
>>
>>> Hey Eugene!!
>>>
>>> I’d gladly take a stab on it although I’m not sure how much available
>>> time I might have to put into but... yeah, let’s try it.
>>>
>>> Where should I begin? Is there a Jira issue or shall I file one?
>>>
>>> Thanks!
>>> On Thu, 12 Apr 2018 at 00:41, Eugene Kirpichov 
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> Yes, you're both right - BigQueryIO.write() is currently not
>>>> implemented in a way that it can be used with Wait.on(). It would certainly
>>>> be a welcome contribution to change this - many people expressed interest
>>>> in specifically waiting for BigQuery writes. Is any of you interested in
>>>> helping out?
>>>>
>>>> Thanks.
>>>>
>>>> On Fri, Apr 6, 2018 at 12:36 AM Carlos Alonso 
>>>> wrote:
>>>>
>>>>> Hi Simon, I think your explanation was very accurate, at least to my
>>>>> understanding. I'd also be interested in getting batch load result's
>>>>> feedback on the pipeline... hopefully someone may suggest something,
>>>>> otherwise we could propose submitting a Jira, or even better, a PR!! :)
>>>>>
>>>>> Thanks!
>>>>>
>>>>> On Thu, Apr 5, 2018 at 2:01 PM Simon Kitch

Re: Unbounded source translation for portable pipelines

2018-07-02 Thread Eugene Kirpichov
Updated the Flink section.

To run a basic Python wordcount (sent to you in a separate thread, but
repeating here too for others to play with):

Step 1: Run once to build a container: "./gradlew -p sdks/python/container
docker"
Step 2: ./gradlew :beam-runners-flink_2.11-job-server:runShadow - this
starts up a local Flink portable JobService endpoint on localhost:8099
Step 3: run things using PortableRunner pointed at this endpoint - see e.g.
https://github.com/apache/beam/pull/5824/files

On Thu, Jun 28, 2018 at 1:37 AM Thomas Weise  wrote:

> Ankur/Eugene,
>
> When you have a chance, please also update the Flink section of:
> https://docs.google.com/spreadsheets/d/1KDa_FGn1ShjomGd-UUDOhuh2q73de2tPz6BqHpzqvNI/edit#gid=0
>
> Thanks!
>
> On Thu, Jun 28, 2018 at 10:29 AM Thomas Weise  wrote:
>
>> The command to run the job server appears to be: ./gradlew -p
>> runners/flink/job-server runShadow
>>
>> Can you please provide the equivalent of the super basic Python example
>> from the prototype:
>>
>>
>> https://github.com/bsidhom/beam/blob/hacking-job-server/sdks/python/flink-example.py
>>
>> Looks as if the Python side runner changed:
>>
>> Traceback (most recent call last):
>>   File "flink-example.py", line 7, in 
>> from apache_beam.runners.portability import universal_local_runner
>> ImportError: cannot import name universal_local_runner
>>
>> Thanks,
>> Thomas
>>
>>
>> On Wed, Jun 27, 2018 at 9:34 PM Eugene Kirpichov 
>> wrote:
>>
>>> Hi!
>>>
>>> Those instructions are not current and I think should be discarded as
>>> they referred to a particular effort that is over - +Ankur Goenka
>>>  is, I believe, working on the remaining finishing
>>> touches for running from a clean clone of Beam master and documenting how
>>> to do that; could you help Thomas so we can start looking at what the
>>> streaming runner is missing?
>>>
>>> We'll need to document this in a more prominent place. When we get to a
>>> state where we can run Python WordCount from master, we'll need to document
>>> it somewhere on the main portability page and/or the getting started guide;
>>> when we can run something more serious, e.g. Tensorflow pipelines, that
>>> will be worth a Beam blog post and worth documenting in the TFX
>>> documentation.
>>>
>>> On Wed, Jun 27, 2018 at 5:35 AM Thomas Weise  wrote:
>>>
>>>> Hi Eugene,
>>>>
>>>> The basic streaming translation is already in place from the prototype,
>>>> though I have not verified it on the master branch yet.
>>>>
>>>> Are the user instructions for the portable Flink runner at
>>>> https://s.apache.org/beam-portability-team-doc current?
>>>>
>>>> (I don't have a dependency on SDF since we are going to use custom
>>>> native Flink sources/sinks at this time.)
>>>>
>>>> Thanks,
>>>> Thomas
>>>>
>>>>
>>>> On Tue, Jun 26, 2018 at 2:13 AM Eugene Kirpichov 
>>>> wrote:
>>>>
>>>>> Hi!
>>>>>
>>>>> Wanted to let you know that I've just merged the PR that adds
>>>>> checkpointable SDF support to the portable reference runner (ULR) and the
>>>>> Java SDK harness:
>>>>>
>>>>> https://github.com/apache/beam/pull/5566
>>>>>
>>>>> So now we have a reference implementation of SDF support in a portable
>>>>> runner, and a reference implementation of SDF support in a portable SDK
>>>>> harness.
>>>>> From here on, we need to replicate this support in other portable
>>>>> runners and other harnesses. The obvious targets are Flink and Python
>>>>> respectively.
>>>>>
>>>>> Chamikara was going to work on the Python harness. +Thomas Weise
>>>>>  Would you be interested in the Flink portable
>>>>> streaming runner side? It is of course blocked by having the rest of that
>>>>> runner working in streaming mode though (the batch mode is practically 
>>>>> done
>>>>> - will send you a separate note about the status of that).
>>>>>
>>>>> On Fri, Mar 23, 2018 at 12:20 PM Eugene Kirpichov <
>>>>> kirpic...@google.com> wrote:
>>>>>
>>>>>> Luke is right - unbounded sources should go through SDF. I am
>>>>>> currently working on ad

Re: Unbounded source translation for portable pipelines

2018-06-27 Thread Eugene Kirpichov
Hi!

Those instructions are not current and I think should be discarded as they
referred to a particular effort that is over - +Ankur Goenka
 is, I believe, working on the remaining finishing
touches for running from a clean clone of Beam master and documenting how
to do that; could you help Thomas so we can start looking at what the
streaming runner is missing?

We'll need to document this in a more prominent place. When we get to a
state where we can run Python WordCount from master, we'll need to document
it somewhere on the main portability page and/or the getting started guide;
when we can run something more serious, e.g. Tensorflow pipelines, that
will be worth a Beam blog post and worth documenting in the TFX
documentation.

On Wed, Jun 27, 2018 at 5:35 AM Thomas Weise  wrote:

> Hi Eugene,
>
> The basic streaming translation is already in place from the prototype,
> though I have not verified it on the master branch yet.
>
> Are the user instructions for the portable Flink runner at
> https://s.apache.org/beam-portability-team-doc current?
>
> (I don't have a dependency on SDF since we are going to use custom native
> Flink sources/sinks at this time.)
>
> Thanks,
> Thomas
>
>
> On Tue, Jun 26, 2018 at 2:13 AM Eugene Kirpichov 
> wrote:
>
>> Hi!
>>
>> Wanted to let you know that I've just merged the PR that adds
>> checkpointable SDF support to the portable reference runner (ULR) and the
>> Java SDK harness:
>>
>> https://github.com/apache/beam/pull/5566
>>
>> So now we have a reference implementation of SDF support in a portable
>> runner, and a reference implementation of SDF support in a portable SDK
>> harness.
>> From here on, we need to replicate this support in other portable runners
>> and other harnesses. The obvious targets are Flink and Python respectively.
>>
>> Chamikara was going to work on the Python harness. +Thomas Weise
>>  Would you be interested in the Flink portable streaming
>> runner side? It is of course blocked by having the rest of that runner
>> working in streaming mode though (the batch mode is practically done - will
>> send you a separate note about the status of that).
>>
>> On Fri, Mar 23, 2018 at 12:20 PM Eugene Kirpichov 
>> wrote:
>>
>>> Luke is right - unbounded sources should go through SDF. I am currently
>>> working on adding such support to Fn API.
>>> The relevant document is s.apache.org/beam-breaking-fusion (note: it
>>> focuses on a much more general case, but also considers in detail the
>>> specific case of running unbounded sources on Fn API), and the first
>>> related PR is https://github.com/apache/beam/pull/4743 .
>>>
>>> Ways you can help speed up this effort:
>>> - Make necessary changes to Apex runner per se to support regular SDFs
>>> in streaming (without portability). They will likely largely carry over to
>>> portable world. I recall that the Apex runner had some level of support of
>>> SDFs, but didn't pass the ValidatesRunner tests yet.
>>> - (general to Beam, not Apex-related per se) Implement the translation
>>> of Read.from(UnboundedSource) via impulse, which will require implementing
>>> an SDF that reads from a given UnboundedSource (taking the UnboundedSource
>>> as an element). This should be fairly straightforward and will allow all
>>> portable runners to take advantage of existing UnboundedSource's.
>>>
>>>
>>> On Fri, Mar 23, 2018 at 3:08 PM Lukasz Cwik  wrote:
>>>
>>>> Using impulse is a precursor for both bounded and unbounded SDF.
>>>>
>>>> This JIRA represents the work that would be to add support for
>>>> unbounded SDF using portability APIs:
>>>> https://issues.apache.org/jira/browse/BEAM-2939
>>>>
>>>>
>>>> On Fri, Mar 23, 2018 at 11:46 AM Thomas Weise  wrote:
>>>>
>>>>> So for streaming, we will need the Impulse translation for bounded
>>>>> input, identical with batch, and then in addition to that support for SDF?
>>>>>
>>>>> Any pointers what's involved in adding the SDF support? Is it runner
>>>>> specific? Does the ULR cover it?
>>>>>
>>>>>
>>>>> On Fri, Mar 23, 2018 at 11:26 AM, Lukasz Cwik 
>>>>> wrote:
>>>>>
>>>>>> All "sources" in portability will use splittable DoFns for execution.
>>>>>>
>>>>>> Specifically, runners will need to be able to checkpoint unbounded
>>>>>> sources to get a minimum viable pipeline working.
>>>>>> For bounded pipelines, a DoFn can read the contents of a bounded
>>>>>> source.
>>>>>>
>>>>>>
>>>>>> On Fri, Mar 23, 2018 at 10:52 AM Thomas Weise  wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I'm looking at the portable pipeline translation for streaming. I
>>>>>>> understand that for batch pipelines, it is sufficient to translate 
>>>>>>> Impulse.
>>>>>>>
>>>>>>> What is the intended path to support unbounded sources?
>>>>>>>
>>>>>>> The goal here is to get a minimum translation working that will
>>>>>>> allow streaming wordcount execution.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Thomas
>>>>>>>
>>>>>>>
>>>>>


Re: ErrorProne and -Werror enabled for all Java projects

2018-06-27 Thread Eugene Kirpichov
This is awesome, thanks to everybody involved! It's so good to have
./gradlew compileJava compileTestJava not produce heaps of warnings like it
used to.

On Wed, Jun 27, 2018 at 9:52 AM Andrew Pilloud  wrote:

> Looking at the diff I think you can replace "Default Setting" with "Only
> Setting". This is Awesome! Thanks guys!
>
> Andrew
>
> On Wed, Jun 27, 2018 at 9:50 AM Kenneth Knowles  wrote:
>
>> Awesome! Can we remove the ability to disable it? :-) :-) :-) or anyhow
>> make it more obscure, not like an expected top-level config choice.
>>
>> Kenn
>>
>> On Wed, Jun 27, 2018 at 9:45 AM Tim  wrote:
>>
>>> Thanks also to you Scott
>>>
>>> Tim
>>>
>>> On 27 Jun 2018, at 18:39, Scott Wegner  wrote:
>>>
>>> Six weeks ago [1] we began an effort to improve the quality of the Java
>>> codebase via ErrorProne static analysis, and promoting compiler warnings to
>>> errors. As of today, all of our Java projects have been migrated and this
>>> is now the default setting for Beam [2].
>>>
>>> This was a community effort. The cleanup spanned 48 JIRA issues [3] and
>>> 46 pull requests [4]. I want to give a big thanks to everyone who helped
>>> out: Ismaël Mejía, Tim Robertson, Cade Markegard, and Teng Peng.
>>>
>>> Thanks!
>>>
>>> [1]
>>> https://lists.apache.org/thread.html/cdc729b6349f952d8db78bae99fff74b06b60918cbe09344e075ba35@%3Cdev.beam.apache.org%3E
>>> 
>>> [2] https://github.com/apache/beam/pull/5773
>>> [3]
>>> https://issues.apache.org/jira/issues/?jql=project%20%3D%20BEAM%20AND%20labels%20%3D%20errorprone
>>>
>>> [4]
>>> https://github.com/apache/beam/pulls?utf8=%E2%9C%93=is%3Apr+errorprone+merged%3A%3E%3D2018-05-16+
>>>
>>>
>>>


Re: [DISCUSS] Automation for Java code formatting

2018-06-26 Thread Eugene Kirpichov
+1!

In some cases the temptation to format code manually can be quite strong,
but the ease of just re-running the formatter after any change (especially
after global changes like class/method renames) outweighs it. I lost count
of the times when I wasted a precommit because some line became >100
characters after a refactoring. I especially love that there's a gradle
task that does this for you - I used to manually run
google-java-format-diff.

On Tue, Jun 26, 2018 at 9:38 PM Rafael Fernandez 
wrote:

> +1! Remove guesswork :D
>
>
>
> On Tue, Jun 26, 2018 at 9:15 PM Kenneth Knowles  wrote:
>
>> Hi all,
>>
>> I like readable code, but I don't like formatting it myself. And I
>> _really_ don't like discussing in code review. "Spotless" [1] can enforce -
>> and automatically apply - automatic formatting for Java, Groovy, and some
>> others.
>>
>> This is not about style or wanting a particular layout. This is about
>> automation, contributor experience, and streamlining review
>>
>>  - Contributor experience: MUCH better than checkstyle: error message
>> just says "run ./gradlew :beam-your-module:spotlessApply" instead of
>> telling them to go in and manually edit.
>>
>>  - Automation: You want to use autoformat so you don't have to format
>> code by hand. But if you autoformat a file that was in some other format,
>> then you touch a bunch of unrelated lines. If the file is already
>> autoformatted, it is much better.
>>
>>  - Review: Never talk about code formatting ever again. A PR also needs
>> baseline to already be autoformatted or formatting will make it unclear
>> which lines are really changed.
>>
>> This is already available via applyJavaNature(enableSpotless: true) and
>> it is turned on for SQL and our buildSrc gradle plugins. It is very nice.
>> There is a JIRA [2] to turn it on for the whole code base. Personally, I
>> think (a) every module could make a different choice if the main
>> contributors feel strongly and (b) it is objectively better to always
>> autoformat :-)
>>
>> WDYT? If we do it, it is trivial to add it module-at-a-time or globally.
>> If someone conflicts with a massive autoformat commit, they can just keep
>> their changes and autoformat them and it is done.
>>
>> Kenn
>>
>> [1] https://github.com/diffplug/spotless/tree/master/plugin-gradle
>> [2] https://issues.apache.org/jira/browse/BEAM-4394
>>
>>


Re: bad logger import?

2018-06-26 Thread Eugene Kirpichov
This is definitely a typo. Feel free to send PRs to correct these on sight
:)
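
For anyone fixing these on sight, a minimal sketch of the SLF4J pattern the
SDK and shared libraries use, next to the JUL variant to look out for (the
class name is just for illustration):

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class ExecutableStageDoFnOperator {
  // SLF4J, as used across the SDK and shared libraries:
  private static final Logger LOG =
      LoggerFactory.getLogger(ExecutableStageDoFnOperator.class);

  // The accidental java.util.logging (JUL) variant would instead look like:
  // private static final java.util.logging.Logger LOG =
  //     java.util.logging.Logger.getLogger(
  //         ExecutableStageDoFnOperator.class.getName());
}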

On Tue, Jun 26, 2018 at 8:51 AM Rafael Fernandez 
wrote:

> Filed https://issues.apache.org/jira/browse/BEAM-4644 for this. I
> assigned it to +Ankur Goenka  because it's the first
> name in history :p (please reroute where appropriate).
>
> Thanks!
> r
>
> On Tue, Jun 26, 2018 at 8:23 AM Lukasz Cwik  wrote:
>
>> That is an internal class to the Flink runner. Runners are allowed to
>> choose whichever logging framework they want to use with the understanding
>> that the SDK and shared libraries use SLF4J but most likely its a simple
>> typo.
>>
>> On Tue, Jun 26, 2018 at 7:22 AM Kenneth Knowles  wrote:
>>
>>> Seems like a legit bug to me. Perhaps we can adjust checkstyle, or some
>>> other more semantic analysis, to forbid it.
>>>
>>> Kenn
>>>
>>> On Tue, Jun 26, 2018 at 6:48 AM Rafael Fernandez 
>>> wrote:
>>>
 +Lukasz Cwik  , +Henning Rohde 

 On Tue, Jun 26, 2018 at 1:25 AM Romain Manni-Bucau <
 rmannibu...@gmail.com> wrote:

> Hi guys,
>
> answering a question on slack i realized flink
> ExecutableStageDoFnOperator.java uses JUL instead of SLF4J, not sure it is
> intended so thought I would mention it here.
>
> Side note: archetype and some test code does as well but it is less an
> issue.
>
> Romain Manni-Bucau
> @rmannibucau  |  Blog
>  | Old Blog
>  | Github
>  | LinkedIn
>  | Book
> 
>



Re: Unbounded source translation for portable pipelines

2018-06-25 Thread Eugene Kirpichov
Hi!

Wanted to let you know that I've just merged the PR that adds
checkpointable SDF support to the portable reference runner (ULR) and the
Java SDK harness:

https://github.com/apache/beam/pull/5566

So now we have a reference implementation of SDF support in a portable
runner, and a reference implementation of SDF support in a portable SDK
harness.
From here on, we need to replicate this support in other portable runners
and other harnesses. The obvious targets are Flink and Python respectively.

Chamikara was going to work on the Python harness. +Thomas Weise
 Would you be interested in the Flink portable streaming
runner side? It is of course blocked by having the rest of that runner
working in streaming mode though (the batch mode is practically done - will
send you a separate note about the status of that).

On Fri, Mar 23, 2018 at 12:20 PM Eugene Kirpichov 
wrote:

> Luke is right - unbounded sources should go through SDF. I am currently
> working on adding such support to Fn API.
> The relevant document is s.apache.org/beam-breaking-fusion (note: it
> focuses on a much more general case, but also considers in detail the
> specific case of running unbounded sources on Fn API), and the first
> related PR is https://github.com/apache/beam/pull/4743 .
>
> Ways you can help speed up this effort:
> - Make necessary changes to Apex runner per se to support regular SDFs in
> streaming (without portability). They will likely largely carry over to
> portable world. I recall that the Apex runner had some level of support of
> SDFs, but didn't pass the ValidatesRunner tests yet.
> - (general to Beam, not Apex-related per se) Implement the translation of
> Read.from(UnboundedSource) via impulse, which will require implementing an
> SDF that reads from a given UnboundedSource (taking the UnboundedSource as
> an element). This should be fairly straightforward and will allow all
> portable runners to take advantage of existing UnboundedSource's.
>
>
> On Fri, Mar 23, 2018 at 3:08 PM Lukasz Cwik  wrote:
>
>> Using impulse is a precursor for both bounded and unbounded SDF.
>>
>> This JIRA represents the work that would be to add support for unbounded
>> SDF using portability APIs:
>> https://issues.apache.org/jira/browse/BEAM-2939
>>
>>
>> On Fri, Mar 23, 2018 at 11:46 AM Thomas Weise  wrote:
>>
>>> So for streaming, we will need the Impulse translation for bounded
>>> input, identical with batch, and then in addition to that support for SDF?
>>>
>>> Any pointers what's involved in adding the SDF support? Is it runner
>>> specific? Does the ULR cover it?
>>>
>>>
>>> On Fri, Mar 23, 2018 at 11:26 AM, Lukasz Cwik  wrote:
>>>
>>>> All "sources" in portability will use splittable DoFns for execution.
>>>>
>>>> Specifically, runners will need to be able to checkpoint unbounded
>>>> sources to get a minimum viable pipeline working.
>>>> For bounded pipelines, a DoFn can read the contents of a bounded source.
>>>>
>>>>
>>>> On Fri, Mar 23, 2018 at 10:52 AM Thomas Weise  wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I'm looking at the portable pipeline translation for streaming. I
>>>>> understand that for batch pipelines, it is sufficient to translate 
>>>>> Impulse.
>>>>>
>>>>> What is the intended path to support unbounded sources?
>>>>>
>>>>> The goal here is to get a minimum translation working that will allow
>>>>> streaming wordcount execution.
>>>>>
>>>>> Thanks,
>>>>> Thomas
>>>>>
>>>>>
>>>


Re: "retest this please" no longer working on the beam site repo

2018-06-21 Thread Eugene Kirpichov
It's quite often not working on the main repo either :-/

On Thu, Jun 21, 2018 at 9:59 AM Reuven Lax  wrote:

> Does anyone know why this functionality isn't working?
>
> Reuven
>


Re: Celebrating Pride... in the Apache Beam Logo

2018-06-15 Thread Eugene Kirpichov
Very cool!

On Fri, Jun 15, 2018 at 10:56 AM OrielResearch Eila Arich-Landkof <
e...@orielresearch.org> wrote:

> 
>
> On Fri, Jun 15, 2018 at 1:50 PM, Griselda Cuevas  wrote:
>
>> Someone in my team edited some Open-Source-Projects' logos to celebrate
>> pride and Apache Beam was included!
>>
>>
>> I'm attaching what she did... sprinkling some fun in the mailing list,
>> because it's Friday!
>>
>
>
>
> --
> Eila
> www.orielresearch.org
> https://www.meetup.com/Deep-Learning-In-Production/
> 
>
>
>


Re: [CANCEL][VOTE] Apache Beam, version 2.5.0, release candidate #1

2018-06-13 Thread Eugene Kirpichov
FWIW I have a fix to the flaky test in
https://github.com/apache/beam/pull/5585 (open)

On Wed, Jun 13, 2018 at 5:26 PM Udi Meiri  wrote:

> +1 to ignoring flaky test.
>
> FYI there's a fourth cherrypick: https://github.com/apache/beam/pull/5624
>
> On Wed, Jun 13, 2018 at 3:45 PM Pablo Estrada  wrote:
>
>> Sent out https://github.com/apache/beam/pull/5640 to ignore the flaky
>> test. As JB is the release manager, I'll let him make the call on what to do
>> about it.
>> Best
>> -P.
>>
>> On Wed, Jun 13, 2018 at 3:34 PM Ahmet Altay  wrote:
>>
>>> I would vote for second option, not a release blocker and disable the
>>> test in the release branch. My reasoning is:
>>> - ReferenceRunner is not yet the official alternative to existing direct
>>> runners.
>>> - It is bad to have flaky tests on the release branch, and we would not
>>> get good signal during validation.
>>>
>>> On Wed, Jun 13, 2018 at 3:14 PM, Pablo Estrada 
>>> wrote:
>>>
 Hello all,
 cherrypicks for the release branch seem to be going well, but thanks to
 them we were able to surface a flaky test in the release branch. JIRA is
 filed: https://issues.apache.org/jira/projects/BEAM/issues/BEAM-4558

 Given that test issue, I see the following options:
 - Consider that this test is not a release blocker. Go ahead with RC2
 after cherrypicks are brought in, or
 - Consider that this test is not a release blocker, so we disable it
 before cutting RC2.
 - Consider this test a release blocker, and triage the bug for fixing.

 What do you think?

 Best
 -P.

 On Wed, Jun 13, 2018 at 9:54 AM Pablo Estrada 
 wrote:

> Precommits for PR https://github.com/apache/beam/pull/5609 are now
> passing. For now I've simply set failOnWarning to false to cherrypick into
> the release, and fix in master later on.
> Best
> -P.
>
> On Wed, Jun 13, 2018 at 9:08 AM Scott Wegner 
> wrote:
>
>> From my understanding, the @SuppressFBWarnings usage is in a
>> dependency (ByteBuddy) rather than directly in our code; so we're not 
>> able
>> to modify the usage.
>>
>> Pablo, feel free to disable failOnWarning for the sdks-java-core
>> project temporarily. This isn't a major regression since we've only
>> recently made the change to enable it [1]. We can work separately on
>> figuring out how to resolve the warnings.
>>
>> [1] https://github.com/apache/beam/pull/5319
>>
>> On Tue, Jun 12, 2018 at 11:57 PM Tim Robertson <
>> timrobertson...@gmail.com> wrote:
>>
>>> Hi Pablo,
>>>
>>> I'm afraid I couldn't find one either... there is an issue about it
>>> [1] which is old so it doesn't look likely to be resolved either.
>>>
>>> If you have time (sorry I am a bit busy) could you please verify the
>>> version does work if you install that version locally? I know the maven
>>> version of that [2] but not sure on the gradle equivalent. If we know it
>>> works, we can then find a repository that fits ok with Apache/Beam 
>>> policy.
>>>
>>> Alternatively, we could consider using a fully qualified reference
>>> (i.e. @edu.umd.cs.findbugs.annotations.SuppressWarnings) to the 
>>> deprecated
>>> version and leave the dependency at the 1.3.9-1. I believe our general
>>> direction is to remove findbugs when errorprone covers all aspects so I
>>> *expect* this should be considered reasonable.
>>>
>>> I hope this helps,
>>> Tim
>>>
>>> [1] https://github.com/stephenc/findbugs-annotations/issues/4
>>> [2]
>>> https://maven.apache.org/guides/mini/guide-3rd-party-jars-local.html
>>>
>>> On Wed, Jun 13, 2018 at 8:39 AM, Pablo Estrada 
>>> wrote:
>>>
 Hi Tim,
 you're right. Thanks for pointing that out. There's just one
 problem that I'm running into now: The 3.0.1-1 version does not seem 
 to be
 available in Maven Central[1]. Looking at the website, I am not quite 
 sure
 if there's another repository where they do stage the newer 
 versions?[2]

 -P

 [1]
 https://repo.maven.apache.org/maven2/com/github/stephenc/findbugs/findbugs-annotations/
 [2] http://stephenc.github.io/findbugs-annotations/

 On Tue, Jun 12, 2018 at 11:10 PM Tim Robertson <
 timrobertson...@gmail.com> wrote:

> Hi Pablo,
>
> I took only a quick look.
>
> "- The JAR from the non-LGPL findbugs does not contain the
> SuppressFBWarnings annotation"
>
> Unless I misunderstand you it looks like SuppressFBWarnings was
> added in Stephen's version in this commit [1] which was
> introduced in version 2.0.3-1 -  I've checked is in the 3.0.1-1 build 
> [2]
> I notice in your commits [1] you've been exploring 

Re: Proposing interactive beam runner

2018-06-13 Thread Eugene Kirpichov
This is awesome, thanks Sindy! I hope that the questions related to
portability will get resolved in a way that will allow to reuse some of the
work for other interactive Beam experiences, including SQL as Andrew says,
and providing a REPL e.g. for users of Scala or other JVM-based languages.

+Neville Li  Do I remember correctly that you guys had
some sort of interactivity going in Scio but were looking forward to Beam
developing a native solution?

On Wed, Jun 13, 2018 at 2:22 PM Sindy Li  wrote:

> *Thanks, Andrew!*
>
> *Here is a link to the demo on Youtube for people interested:*
> https://www.youtube.com/watch?v=c5CjA1e3Cqw
>
> On Wed, Jun 13, 2018 at 1:23 PM, Andrew Pilloud 
> wrote:
>
>> This sounds really interesting, thanks for sharing! We've just begun to
>> explore making Beam SQL interactive. The Interactive Runner you've proposed
>> sounds like it would solve a bunch of the problems SQL faces as well. SQL
>> is written in Java right now, so we can't immediately reuse any code.
>>
>> Andrew
>>
>> On Wed, Jun 13, 2018 at 11:48 AM Sindy Li  wrote:
>>
>>> Resending after subscribing to dev list.
>>>
>>> -- Forwarded message --
>>> From: Sindy Li 
>>> Date: Fri, Jun 8, 2018 at 5:57 PM
>>> Subject: Proposing interactive beam runner
>>> To: dev@beam.apache.org
>>> Cc: Harsh Vardhan , Chamikara Jayalath <
>>> chamik...@google.com>, Anand Iyer , Robert Bradshaw <
>>> rober...@google.com>
>>>
>>>
>>> Hello,
>>>
>>> We were exploring ways to provide an interactive notebook experience for
>>> writing Beam Python pipelines. The design doc
>>> 
>>>  provides
>>> an overview/vision of what we would like to achieve. Pull request
>>>  provides a prototype for the
>>> same. The document also provides demo screen shots and instructions for
>>> running a demo in Jupyter. Please take a look. We believe this would be a
>>> useful addition to Beam.
>>>
>>> Thanks!
>>>
>>>
>>>
>>>
>


Re: [DISCUSS] [BEAM-4126] Deleting Maven build files (pom.xml) grace period?

2018-06-06 Thread Eugene Kirpichov
Is it possible for Dataflow to just keep a copy of the pom.xmls and delete
it as soon as Dataflow is migrated?

Overall +1, I've been using Gradle without issues for a while and almost
forgot pom.xml's still existed.

On Wed, Jun 6, 2018, 1:13 PM Pablo Estrada  wrote:

> I agree that we should delete the pom.xml files soon, as they create a
> burden for maintainers.
>
> I'd like to be able to extend the grace period by a bit, to allow the
> internal build systems at Google to move away from using the Beam poms.
>
> We use these pom files to build Dataflow workers, and thus it's critical
> for us that they are available for a few more weeks while we set up a
> gradle build. Perhaps 4 weeks?
> (Calling out+Chamikara Jayalath  who has recently
> worked on internal Dataflow tooling.)
>
> Best
> -P.
>
> On Wed, Jun 6, 2018 at 1:05 PM Lukasz Cwik  wrote:
>
>> Note: Apache Beam will still provide pom.xml for each release it
>> produces. This is only about people using Maven to build Apache Beam
>> themselves and not relying on the released artifacts in Maven Central.
>>
>> With the first release using Gradle as the build system is underway, I
>> wanted to start this thread to remind people that we are going to delete
>> the Maven pom.xml files after the 2.5.0 release is finalized plus a two
>> week grace period.
>>
>> Are there others who would like a shorter/longer grace period?
>>
>> The PR to delete the pom.xml is here:
>> https://github.com/apache/beam/pull/5571
>>
> --
> Got feedback? go/pabloem-feedback
> 
>


Re: Reducing Committer Load for Code Reviews

2018-05-31 Thread Eugene Kirpichov
On Thu, May 31, 2018 at 2:56 PM Ismaël Mejía  wrote:

> If I understood correctly what is proposed is:
>
> - Committers to be able to have their PRs reviewed by non-committers
> and be able to self-merge.
> - For non-committers nothing changes.
>
I think it is being proposed that a non-committer can start their review
with a non-committer reviewer? Of course they still need to involve a
committer to merge.


>
> This enables a committer (wearing contributor head) to merge their own
> changes without committer approval, so we should ensure that no
> shortcuts are taken just to get things done quickly.
>
> I think as Thomas Weise mentioned that mentoring and encouraging more
> contributors to become committers is a better long term solution to
> this issue.
> On Thu, May 31, 2018 at 11:24 PM Eugene Kirpichov 
> wrote:
> >
> > Agreed with all said above - as I understand it, we have consensus on
> the following:
> >
> > Whether you're a committer or not:
> > - Find somebody who's familiar with the code and ask them to review. Use
> your best judgment in whose review would give you good confidence that your
> code is actually good.
> > (it's a well-known problem that for new contributors it's often
> difficult to choose a proper reviewer - we should discuss that separately)
> >
> > If you're a committer:
> > - Once the reviewers are happy, you can merge the change yourself.
> >
> > If you're not a committer:
> > - Once the reviewers are happy: if one of them is a commiter, you're
> done; otherwise, involve a committer. They may give comments, including
> possibly very substantial comments.
> > - To minimize the chance of the latter: if your change is potentially
> controversial, involve a committer early on, or involve the dev@ mailing
> list, write a design doc etc.
> >
> > On Thu, May 31, 2018 at 2:16 PM Kenneth Knowles  wrote:
> >>
> >> Seems like enough consensus, and that this is a policy thing that
> should have an official vote.
> >>
> >> On Thu, May 31, 2018 at 12:01 PM Robert Bradshaw 
> wrote:
> >>>
> >>> +1, this is what I was going to propose.
> >>>
> >>> Code review serves two related, but distinct purposes. The first is
> just getting a second set of eyes on the code to improve quality (call this
> the LGTM). This can be done by anyone. The second is vetting whether this
> contribution, in its current form, should be included in beam (call this
> the approval), and is clearly in the purview, almost by definition, of the
> committers. Often the same person can do both, but that's not required
> (e.g. this is how reviews are handled internally at Google).
> >>>
> >>> I think we should trust committers to be able to give (or if a large
> change, seek, perhaps on the list, as we do now) approval for their own
> change. (Eventually we could delegate different approvers for different
> subcomponents, rather than have every committer own everything.)
> Regardless, we still want LGTMs for all changes. It can also make sense for
> a non-committer to give an LGTM on another non-committers's code, and an
> approver to take this into account, to whatever level at their discretion,
> when approving code. Much of Go was developed this way out of necessity.
> >>>
> >>> I also want to point out that having non-committers review code helps
> more than reducing load: it's a good opportunity for non-committers to get
> to know the codebase (for both technical understandings and conventions),
> interact with the community members, and make non-trivial contributions.
> Reviewing code from a committer is especially valuable on all these points.
> >>>
> >>> - Robert
> >>>
> >>>
> >>> On Thu, May 31, 2018 at 11:35 AM Pablo Estrada 
> wrote:
> >>>>
> >>>> In that case, does it make sense to say:
> >>>>
> >>>> - A code review by a committer is enough to merge.
> >>>> - Committers can have their PRs reviewed by non-committers that are
> familiar with the code
> >>>> - Non-committers may have their code reviewed by non-committers, but
> should have a committer do a lightweight review before merging.
> >>>>
> >>>> Do these seem like reasonable principles?
> >>>> -P.
> >>>>
> >>>> On Thu, May 31, 2018 at 11:25 AM Jean-Baptiste Onofré <
> j...@nanthrax.net> wrote:
> >>>>>
> >>>>> In that case, the contributor should be a committer pretty fast.
> >>>>>
> >>>>> I would pre

Re: Reducing Committer Load for Code Reviews

2018-05-31 Thread Eugene Kirpichov
Agreed with all said above - as I understand it, we have consensus on the
following:

Whether you're a committer or not:
- Find somebody who's familiar with the code and ask them to review. Use
your best judgment in whose review would give you good confidence that your
code is actually good.
(it's a well-known problem that for new contributors it's often difficult
to choose a proper reviewer - we should discuss that separately)

If you're a committer:
- Once the reviewers are happy, you can merge the change yourself.

If you're not a committer:
- Once the reviewers are happy: if one of them is a commiter, you're done;
otherwise, involve a committer. They may give comments, including possibly
very substantial comments.
- To minimize the chance of the latter: if your change is potentially
controversial, involve a committer early on, or involve the dev@ mailing
list, write a design doc etc.

On Thu, May 31, 2018 at 2:16 PM Kenneth Knowles  wrote:

> Seems like enough consensus, and that this is a policy thing that should
> have an official vote.
>
> On Thu, May 31, 2018 at 12:01 PM Robert Bradshaw 
> wrote:
>
>> +1, this is what I was going to propose.
>>
>> Code review serves two related, but distinct purposes. The first is just
>> getting a second set of eyes on the code to improve quality (call this the
>> LGTM). This can be done by anyone. The second is vetting whether this
>> contribution, in its current form, should be included in beam (call this
>> the approval), and is clearly in the purview, almost by definition, of
>> the committers. Often the same person can do both, but that's not required
>> (e.g. this is how reviews are handled internally at Google).
>>
>> I think we should trust committers to be able to give (or if a large
>> change, seek, perhaps on the list, as we do now) approval for their own
>> change. (Eventually we could delegate different approvers for different
>> subcomponents, rather than have every committer own everything.)
>> Regardless, we still want LGTMs for all changes. It can also make sense for
>> a non-committer to give an LGTM on another non-committers's code, and an
>> approver to take this into account, to whatever level at their discretion,
>> when approving code. Much of Go was developed this way out of necessity.
>>
>> I also want to point out that having non-committers review code helps
>> more than reducing load: it's a good opportunity for non-committers to get
>> to know the codebase (for both technical understandings and conventions),
>> interact with the community members, and make non-trivial contributions.
>> Reviewing code from a committer is especially valuable on all these points.
>>
>> - Robert
>>
>>
>> On Thu, May 31, 2018 at 11:35 AM Pablo Estrada 
>> wrote:
>>
>>> In that case, does it make sense to say:
>>>
>>> - A code review by a committer is enough to merge.
>>> - Committers can have their PRs reviewed by non-committers that are
>>> familiar with the code
>>> - Non-committers may have their code reviewed by non-committers, but
>>> should have a committer do a lightweight review before merging.
>>>
>>> Do these seem like reasonable principles?
>>> -P.
>>>
>>> On Thu, May 31, 2018 at 11:25 AM Jean-Baptiste Onofré 
>>> wrote:
>>>
 In that case, the contributor should be a committer pretty fast.

 I would prefer to keep at least a final validation from a committer to
 guarantee the consistency of the project and anyway, only committer role
 can merge a PR.
 However, I fully agree that the most important is the Beam community. I
 have no problem that non committer does the review and ask a committer
 for final one and merge.

 Regards
 JB

 On 31/05/2018 19:33, Andrew Pilloud wrote:
 > If someone is trusted enough to review a committers code shouldn't
 they
 > also be trusted enough to review another contributors code? As a
 > non-committer I would get much quicker reviews if I could have other
 > non-committers do the review, then get a committer who trusts us to
 merge.
 >
 > Andrew
 >
 > On Thu, May 31, 2018 at 9:03 AM Henning Rohde >>> > > wrote:
 >
 > +1
 >
 > On Thu, May 31, 2018 at 8:55 AM Thomas Weise >>> > > wrote:
 >
 > +1 to the goal of increasing review bandwidth
 >
 > In addition to the proposed reviewer requirement change,
 perhaps
 > there are other ways to contribute towards that goal as well?
 >
 > The discussion so far has focused on how more work can get
 done
 > with the same pool of committers or how committers can get
 their
 > work done faster. But ASF is really about "community over
 code"
 > and in that spirit maybe we can consider how community growth
 > can lead to similar effects? One way I can think of is that
 > besides 

Re: The full list of proposals / prototype documents

2018-05-31 Thread Eugene Kirpichov
>>>>>>>>> I'm *not* going to copy the content of this docs to one page or
>>>>>>>>> even web site, let’s keep this as it is, no changes here for the 
>>>>>>>>> moment. I
>>>>>>>>> think, moving to something else than Google docs is a tough question 
>>>>>>>>> and
>>>>>>>>> requires another discussion.
>>>>>>>>>
>>>>>>>>> So, in this case, this task seems not so hard since we don’t add
>>>>>>>>> such docs too often - I'll just have to update this index page on web 
>>>>>>>>> site.
>>>>>>>>> In addition, the authors will be always welcomed to update this page 
>>>>>>>>> by
>>>>>>>>> themselves. In my turn, I’ll try to keep an eye on this to keep it 
>>>>>>>>> synced.
>>>>>>>>> And of course, any help will be welcomed too =)
>>>>>>>>>
>>>>>>>>> WBR,
>>>>>>>>> Alexey
>>>>>>>>>
>>>>>>>>> On 24 May 2018, at 00:01, Griselda Cuevas  wrote:
>>>>>>>>>
>>>>>>>>> Hi Everyone,
>>>>>>>>>
>>>>>>>>> @Alexey, I think this is a great idea, I'd like to understand more
>>>>>>>>> of the motivation behind having all the designs doc under a single 
>>>>>>>>> page. In
>>>>>>>>> my opinion it could become a challenge to maintain a page, so knowing 
>>>>>>>>> what
>>>>>>>>> you want to accomplish could help us think of alternative solutions?
>>>>>>>>>
>>>>>>>>> On Wed, 23 May 2018 at 14:08, Daniel Oliveira <
>>>>>>>>> danolive...@google.com> wrote:
>>>>>>>>>
>>>>>>>>> +1 to web site page (not Google Doc).
>>>>>>>>>
>>>>>>>>> Definitely agree that a common entry point would be excellent. I
>>>>>>>>> don't like the idea of the Google Doc so much because it's not very 
>>>>>>>>> good
>>>>>>>>> for having changes reviewed and keeping track of who added what, 
>>>>>>>>> unlike
>>>>>>>>> Github. Adding an entry to the list in the website would require 
>>>>>>>>> reviews
>>>>>>>>> and leave behind a commit history, which I think is important for an
>>>>>>>>> authoritative source like this.
>>>>>>>>>
>>>>>>>>> PS: I also have a doc I proposed that I didn't see in the lists:
>>>>>>>>> https://s.apache.org/beam-runner-api-combine-model
>>>>>>>>>
>>>>>>>>> On Wed, May 23, 2018 at 12:52 PM Lukasz Cwik 
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> +1, Thanks for picking this up Alexey
>>>>>>>>>
>>>>>>>>> On Wed, May 23, 2018 at 10:41 AM Huygaa Batsaikhan <
>>>>>>>>> bat...@google.com> wrote:
>>>>>>>>>
>>>>>>>>> +1. That is great, Alexey. Robin and I are working on documenting
>>>>>>>>> some missing pieces of Java SDK. We will let you know when we create
>>>>>>>>> polished documents.
>>>>>>>>>
>>>>>>>>> On Wed, May 23, 2018 at 9:28 AM Ismaël Mejía 
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> +1 and thanks for volunteering for this Alexey.
>>>>>>>>> We really need to make this more accesible.
>>>>>>>>> On Wed, May 23, 2018 at 6:00 PM Alexey Romanenko <
>>>>>>>>> aromanenko@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> > Joseph, Eugene - thank you very much for the links!
>>>>>>>>>
>>>>>>>>> > All, regarding one common entry point for all design documents.
>>>>>>>>> Could we
>>>>>>>>> just have a dedicated page on Beam web site with a list of links
>>>>>>>>> to every
>>>>>>>>> proposed document? Every entry (optionally) might contain, in
>>>>>>>>> addition,
>>>>>>>>> short abstract and list of author(s). In this case, it would be
>>>>>>>>> easily
>>>>>>>>> searchable and available for those who are interested in this.
>>>>>>>>>
>>>>>>>>> > In the same time, using a Google doc for writing/discussing the
>>>>>>>>> documents
>>>>>>>>> seems more than reasonable since it’s quite native and easy to
>>>>>>>>> use. I only
>>>>>>>>> propose to have a common entry point to fall of them.
>>>>>>>>>
>>>>>>>>> > If this idea looks feasible, I’d propose myself to collect the
>>>>>>>>> links to
>>>>>>>>> already created documents, create such page and update this list
>>>>>>>>> in the
>>>>>>>>> future.
>>>>>>>>>
>>>>>>>>> > WBR,
>>>>>>>>> > Alexey
>>>>>>>>>
>>>>>>>>> > On 22 May 2018, at 21:34, Eugene Kirpichov 
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> > Making it easier to manage indeed would be good. Could someone
>>>>>>>>> from PMC
>>>>>>>>> please add the following documents of mine to it?
>>>>>>>>>
>>>>>>>>> > SDF related documents:
>>>>>>>>> > http://s.apache.org/splittable-do-fn
>>>>>>>>> > http://s.apache.org/sdf-via-source
>>>>>>>>> > http://s.apache.org/textio-sdf
>>>>>>>>> > http://s.apache.org/beam-watch-transform
>>>>>>>>> > http://s.apache.org/beam-breaking-fusion
>>>>>>>>>
>>>>>>>>> > Non SDF related:
>>>>>>>>> > http://s.apache.org/context-fn
>>>>>>>>> > http://s.apache.org/fileio-write
>>>>>>>>>
>>>>>>>>> > A suggestion: maybe we can establish a convention to send design
>>>>>>>>> document
>>>>>>>>> proposals to dev+desi...@beam.apache.org? Does the Apache mailing
>>>>>>>>> list
>>>>>>>>> management software support this kind of stuff? Then they'd be
>>>>>>>>> quite easy
>>>>>>>>> to find and filter.
>>>>>>>>>
>>>>>>>>> > On Tue, May 22, 2018 at 10:57 AM Kenneth Knowles 
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> >> It is owned by the Beam PMC collectively. Any PMC member can
>>>>>>>>> add things
>>>>>>>>> to it. Ideas for making it easy to manage are welcome.
>>>>>>>>>
>>>>>>>>> >> Probably easier to have a markdown file somewhere with a list
>>>>>>>>> of docs so
>>>>>>>>> we can issue and review PRs. Not sure the web site is the right
>>>>>>>>> place for
>>>>>>>>> it - we have a history of porting docs to markdown but really that
>>>>>>>>> is high
>>>>>>>>> overhead and users/community probably don't gain from it so much.
>>>>>>>>> Some have
>>>>>>>>> suggested a wiki.
>>>>>>>>>
>>>>>>>>> >> Kenn
>>>>>>>>>
>>>>>>>>> >> On Tue, May 22, 2018 at 10:22 AM Scott Wegner <
>>>>>>>>> sweg...@google.com> wrote:
>>>>>>>>>
>>>>>>>>> >>> Thanks for the links. Any details on that Google drive folder?
>>>>>>>>> Who
>>>>>>>>> maintains it? Is it possible for any contributor to add their
>>>>>>>>> design doc?
>>>>>>>>>
>>>>>>>>> >>> On Mon, May 21, 2018 at 8:15 AM Joseph PENG <
>>>>>>>>> josephtengp...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> >>>> Alexey,
>>>>>>>>>
>>>>>>>>> >>>> I do not know where you can find all design docs, but I know
>>>>>>>>> a blog
>>>>>>>>> that has collected some of the major design docs. Hope it helps.
>>>>>>>>>
>>>>>>>>> >>>> https://wtanaka.com/beam/design-doc
>>>>>>>>>
>>>>>>>>> >>>>
>>>>>>>>> https://drive.google.com/drive/folders/0B-IhJZh9Ab52OFBVZHpsNjc4eXc
>>>>>>>>>
>>>>>>>>> >>>> On Mon, May 21, 2018 at 9:28 AM Alexey Romanenko <
>>>>>>>>> aromanenko@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> >>>>> Hi all,
>>>>>>>>>
>>>>>>>>> >>>>> Is it possible to obtain somewhere a list of all proposals /
>>>>>>>>> prototype documents that have been published as a technical /
>>>>>>>>> design
>>>>>>>>> documents for new features? I have links to only some of them
>>>>>>>>> (found in
>>>>>>>>> mail list discussions by chance) but I’m not aware of others.
>>>>>>>>>
>>>>>>>>> >>>>> If yes, could someone share it or point me out where it is
>>>>>>>>> located in
>>>>>>>>> case if I missed this?
>>>>>>>>>
>>>>>>>>> >>>>> If not, don’t you think it would make sense to have such
>>>>>>>>> index of
>>>>>>>>> these documents? I believe it can be useful for Beam contributors
>>>>>>>>> since
>>>>>>>>> these proposals contain information which is absent or not so
>>>>>>>>> detailed on
>>>>>>>>> Beam web site documentation.
>>>>>>>>>
>>>>>>>>> >>>>> WBR,
>>>>>>>>> >>>>> Alexey
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>
>


Re: Launching a Portable Pipeline

2018-05-22 Thread Eugene Kirpichov
Thanks Ankur, I think there's consensus, so it's probably ready to share :)

On Fri, May 18, 2018 at 3:00 PM Ankur Goenka <goe...@google.com> wrote:

> Thanks for all the input.
> I have summarized the discussions at the bottom of the document ( here
> <https://docs.google.com/document/d/1xOaEEJrMmiSHprd-WiYABegfT129qqF-idUBINjxz8s/edit#heading=h.lky5ef6wxo9x>
> ).
> Please feel free to provide comments.
> Once we agree, I will publish the conclusion on the mailing list.
>
> On Mon, May 14, 2018 at 1:51 PM Eugene Kirpichov <kirpic...@google.com>
> wrote:
>
>> Thanks Ankur, this document clarifies a few points and raises some very
>> important questions. I encourage everybody with a stake in Portability to
>> take a look and chime in.
>>
>> +Aljoscha Krettek <aljos...@data-artisans.com> +Thomas Weise
>> <t...@apache.org> +Henning Rohde <hero...@google.com>
>>
>> On Mon, May 14, 2018 at 12:34 PM Ankur Goenka <goe...@google.com> wrote:
>>
>>> Updated link
>>> <https://docs.google.com/document/d/1xOaEEJrMmiSHprd-WiYABegfT129qqF-idUBINjxz8s/edit>
>>>  to
>>> the document as the previous link was not working for some people.
>>>
>>>
>>> On Fri, May 11, 2018 at 7:56 PM Ankur Goenka <goe...@google.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> Recent effort on portability has introduced JobService and
>>>> ArtifactService to the beam stack along with SDK. This has open up a few
>>>> questions around how we start a pipeline in a portable setup (with
>>>> JobService).
>>>> I am trying to document our approach to launching a portable pipeline
>>>> and take binding decisions based on the discussion.
>>>> Please review the document and provide your feedback.
>>>>
>>>> Thanks,
>>>> Ankur
>>>>
>>>


Re: The full list of proposals / prototype documents

2018-05-22 Thread Eugene Kirpichov
Making it easier to manage indeed would be good. Could someone from PMC
please add the following documents of mine to it?

SDF related documents:
http://s.apache.org/splittable-do-fn
http://s.apache.org/sdf-via-source
http://s.apache.org/textio-sdf 
http://s.apache.org/beam-watch-transform
http://s.apache.org/beam-breaking-fusion


Non SDF related:
http://s.apache.org/context-fn
http://s.apache.org/fileio-write

A suggestion: maybe we can establish a convention to send design document
proposals to dev+desi...@beam.apache.org? Does the Apache mailing list
management software support this kind of stuff? Then they'd be quite easy
to find and filter.

On Tue, May 22, 2018 at 10:57 AM Kenneth Knowles  wrote:

> It is owned by the Beam PMC collectively. Any PMC member can add things to
> it. Ideas for making it easy to manage are welcome.
>
> Probably easier to have a markdown file somewhere with a list of docs so
> we can issue and review PRs. Not sure the web site is the right place for
> it - we have a history of porting docs to markdown but really that is high
> overhead and users/community probably don't gain from it so much. Some have
> suggested a wiki.
>
> Kenn
>
> On Tue, May 22, 2018 at 10:22 AM Scott Wegner  wrote:
>
>> Thanks for the links. Any details on that Google drive folder? Who
>> maintains it? Is it possible for any contributor to add their design doc?
>>
>> On Mon, May 21, 2018 at 8:15 AM Joseph PENG 
>> wrote:
>>
>>> Alexey,
>>>
>>> I do not know where you can find all design docs, but I know a blog that
>>> has collected some of the major design docs. Hope it helps.
>>>
>>> https://wtanaka.com/beam/design-doc
>>>
>>> https://drive.google.com/drive/folders/0B-IhJZh9Ab52OFBVZHpsNjc4eXc
>>>
>>> On Mon, May 21, 2018 at 9:28 AM Alexey Romanenko <
>>> aromanenko@gmail.com> wrote:
>>>
 Hi all,

 Is it possible to obtain somewhere a list of all proposals / prototype
 documents that have been published as a technical / design documents for
 new features? I have links to only some of them (found in mail list
 discussions by chance) but I’m not aware of others.

 If yes, could someone share it or point me out where it is located in
 case if I missed this?

 If not, don’t you think it would make sense to have such index of these
 documents? I believe it can be useful for Beam contributors since these
 proposals contain information which is absent or not so detailed on Beam
 web site documentation.

 WBR,
 Alexey
>>>
>>>


Re: Current progress on Portable runners

2018-05-22 Thread Eugene Kirpichov
Thanks all! Yeah, I'll update the Portability page with the status of this
project and other pointers this week or next (mostly out of office this
week).

On Fri, May 18, 2018 at 5:01 PM Thomas Weise <t...@apache.org> wrote:

> - Flink JobService: in review <https://github.com/apache/beam/pull/5262>
>
> That's TODO (above PR was merged, but it doesn't contain the Flink job
> service).
>
> Discussion about it is here:
> https://docs.google.com/document/d/1xOaEEJrMmiSHprd-WiYABegfT129qqF-idUBINjxz8s/edit?ts=5afa1238
>
> Thanks,
> Thomas
>
>
>
> On Fri, May 18, 2018 at 7:01 AM, Thomas Weise <t...@apache.org> wrote:
>
>> Most of it should probably go to
>> https://beam.apache.org/contribute/portability/
>>
>> Also for reference, here is the prototype doc: https://s.apache.org/beam-
>> portability-team-doc
>>
>> Thomas
>>
>> On Fri, May 18, 2018 at 5:35 AM, Kenneth Knowles <k...@google.com> wrote:
>>
>>> This is awesome. Would you be up for adding a brief description at
>>> https://beam.apache.org/contribute/#works-in-progress and maybe a
>>> pointer to a gdoc with something like the contents of this email? (my
>>> reasoning is (a) keep the contribution guide concise but (b) all this
>>> detail is helpful yet (c) the detail may be ever-changing so making a
>>> separate web page is not the best format)
>>>
>>> Kenn
>>>
>>> On Thu, May 17, 2018 at 3:13 PM Eugene Kirpichov <kirpic...@google.com>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> A little over a month ago, a large group of Beam community members has
>>>> been working on a prototype of a portable Flink runner - that is, a runner
>>>> that can execute Beam pipelines on Flink via the Portability API
>>>> <https://s.apache.org/beam-runner-api>. The prototype was developed in
>>>> a separate branch
>>>> <https://github.com/bsidhom/beam/tree/hacking-job-server> and was
>>>> successfully demonstrated at Flink Forward, where it ran Python and Go
>>>> pipelines in a limited setting.
>>>>
>>>> Since then, a smaller group of people (Ankur Goenka, Axel Magnuson, Ben
>>>> Sidhom and myself) have been working on productionizing the prototype to
>>>> address its limitations and do things "the right way", preparing to reuse
>>>> this work for developing other portable runners (e.g. Spark). This involves
>>>> a surprising amount of work, since many important design and implementation
>>>> concerns could be ignored for the purposes of a prototype. I wanted to give
>>>> an update on where we stand now.
>>>>
>>>> Our immediate milestone in sight is *Run Java and Python batch
>>>> WordCount examples against a distributed remote Flink cluster*. That
>>>> involves a few moving parts, roughly in order of appearance:
>>>>
>>>> *Job submission:*
>>>> - The SDK is configured to use a "portable runner", whose
>>>> responsibility is to run the pipeline against a given JobService endpoint.
>>>> - The portable runner converts the pipeline to a portable Pipeline proto
>>>> - The runner finds out which artifacts it needs to stage, and staging
>>>> them against an ArtifactStagingService
>>>> - A Flink-specific JobService receives the Pipeline proto, performs
>>>> some optimizations (e.g. fusion) and translates it to Flink datasets and
>>>> functions
>>>>
>>>> *Job execution:*
>>>> - A Flink function executes a fused chain of Beam transforms (an
>>>> "executable stage") by converting the input and the stage to bundles and
>>>> executing them against an SDK harness
>>>> - The function starts the proper SDK harness, auxiliary services (e.g.
>>>> artifact retrieval, side input handling) and wires them together
>>>> - The function feeds the data to the harness and receives data back.
>>>>
>>>> *And here is our status of implementation for these parts:* basically,
>>>> almost everything is either done or in review.
>>>>
>>>> *Job submission:*
>>>> - General-purpose portable runner in the Python SDK: done
>>>> <https://github.com/apache/beam/pull/5301>; Java SDK: also done
>>>> <https://github.com/apache/beam/pull/5150>
>>>> - Artifact staging from the Python SDK: in review (PR
>>>> <https://github.com/apache/beam/pu

Re: [VOTE] Go SDK

2018-05-22 Thread Eugene Kirpichov
+1!

It is particularly exciting to me that the Go support is
"portability-first" and does everything in the proper "portability way"
from the start, free of legacy non-portable runner support code.

On Tue, May 22, 2018 at 11:32 AM Scott Wegner  wrote:

> +1 (non-binding)
>
> Having a third language will really force us to design Beam constructs in
> a language-agnostic way, and achieve the goals of portability. Thanks to
> all that have helped reach this milestone.
>
> On Tue, May 22, 2018 at 10:19 AM Ahmet Altay  wrote:
>
>> +1 (binding)
>>
>> Congratulations to the team!
>>
>> On Tue, May 22, 2018 at 10:13 AM, Alan Myrvold 
>> wrote:
>>
>>> +1 (non-binding)
>>> Nice work!
>>>
>>> On Tue, May 22, 2018 at 9:18 AM Pablo Estrada 
>>> wrote:
>>>
 +1 (binding)
 Very excited to see this!

 On Tue, May 22, 2018 at 9:09 AM Thomas Weise  wrote:

> +1 and congrats!
>
>
> On Tue, May 22, 2018 at 8:48 AM, Rafael Fernandez  > wrote:
>
>> +1 !
>>
>> On Tue, May 22, 2018 at 7:54 AM Lukasz Cwik  wrote:
>>
>>> +1 (binding)
>>>
>>> On Tue, May 22, 2018 at 6:16 AM Robert Burke 
>>> wrote:
>>>
 +1 (non-binding)

 I'm looking forward to helping gophers solve their big data
 problems in their language of choice, and runner of choice!

 Next stop, a non-java portability runner?

 On Tue, May 22, 2018, 6:08 AM Kenneth Knowles 
 wrote:

> +1 (binding)
>
> This is great. Feels like a phase change in the life of Apache
> Beam, having three languages, with multiple portable runners on the 
> horizon.
>
> Kenn
>
> On Tue, May 22, 2018 at 2:50 AM Ismaël Mejía 
> wrote:
>
>> +1 (binding)
>>
>> Go SDK brings new language support for a community not well
>> supported in
>> the Big Data world the Go developers, so this is a great. Also
>> the fact
>> that this is the first SDK integrated with the portability work
>> makes it an
>> interesting project to learn lessons from for future languages.
>>
>> Now it is the time to start building a community around the Go
>> SDK this is
>> the most important task now, and the only way to do it is to have
>> the SDK
>> as an official part of Beam so +1.
>>
>> Congrats to Henning and all the other contributors for this
>> important
>> milestone.
>> On Tue, May 22, 2018 at 10:21 AM Holden Karau <
>> hol...@pigscanfly.ca> wrote:
>>
>> > +1 (non-binding), I've had a chance to work with the SDK and
>> it's pretty
>> neat to see Beam add support for a language before the most of
>> the big data
>> ecosystem.
>>
>> > On Mon, May 21, 2018 at 10:29 PM, Jean-Baptiste Onofré <
>> j...@nanthrax.net>
>> wrote:
>>
>> >> Hi Henning,
>>
>> >> SGA has been filed for the entire project during the
>> incubation period.
>>
>> >> Here, we have to check if SGA/IP donation is clean for the Go
>> SDK.
>>
>> >> We don't have a lot to do, just checked that we are clean on
>> this front.
>>
>> >> Regards
>> >> JB
>>
>> >> On 22/05/2018 06:42, Henning Rohde wrote:
>>
>> >>> Thanks everyone!
>>
>> >>> Davor -- regarding your two comments:
>> >>> * Robert mentioned that "SGA should have probably already
>> been
>> filed" in the previous thread. I got the impression that nothing
>> further
>> was needed. I'll follow up.
>> >>> * The standard Go tooling basically always pulls directly
>> from
>> github, so there is no real urgency here.
>>
>> >>> Thanks,
>> >>>Henning
>>
>>
>> >>> On Mon, May 21, 2018 at 9:30 PM Jean-Baptiste Onofré <
>> j...@nanthrax.net
>> > wrote:
>>
>> >>>  +1 (binding)
>>
>> >>>  I just want to check about SGA/IP/Headers.
>>
>> >>>  Thanks !
>> >>>  Regards
>> >>>  JB
>>
>> >>>  On 22/05/2018 03:02, Henning Rohde wrote:
>> >>>   > Hi everyone,
>> >>>   >
>> >>>   > Now that the remaining issues have been resolved as
>> discussed,
>> >>>  I'd like
>> >>>   > to propose a formal vote on accepting the Go SDK into
>> master. The
>> 

Re: What is the future of Reshuffle?

2018-05-18 Thread Eugene Kirpichov
Agreed that it should be undeprecated; many users are getting confused by
this.
I know that some people are working on a replacement for at least one of
its use cases (RequiresStableInput), but the use case of breaking fusion
is, as yet, unaddressed, and there's not much to be gained by keeping it
deprecated.
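
For concreteness, the fusion-breaking usage in question looks roughly like
the sketch below (the pipeline, the fanout logic and expensiveCall() are
hypothetical placeholders, not code from Beam itself): without the Reshuffle,
a runner may fuse the cheap fanout step onto the same workers as the
expensive downstream step and lose the extra parallelism.

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.Reshuffle;
import org.apache.beam.sdk.values.PCollection;

PCollection<String> results =
    items  // an existing PCollection<String> of, say, file names
        .apply("CheapFanout", ParDo.of(new DoFn<String, String>() {
          @ProcessElement
          public void process(ProcessContext c) {
            // Hypothetical fanout: each input element expands to many work items.
            for (int i = 0; i < 1000; i++) {
              c.output(c.element() + "#" + i);
            }
          }
        }))
        // Materializes and redistributes the fanned-out elements so the step
        // below is not fused onto the same workers as the fanout.
        .apply(Reshuffle.viaRandomKey())
        .apply("ExpensiveWork", ParDo.of(new DoFn<String, String>() {
          @ProcessElement
          public void process(ProcessContext c) {
            c.output(expensiveCall(c.element()));  // expensiveCall is a placeholder
          }
        }));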

On Fri, May 18, 2018 at 7:45 AM Raghu Angadi  wrote:

> I am interested in more clarity on this as well. It has been deprecated
> for a long time without a replacement, and its usage has only grown, both
> within Beam code base as well as in user applications.
>
> If we are certain that it will not be removed before there is a good
> replacement for it, can we undeprecate it until there are proper plans for
> replacement?
>
> On Fri, May 18, 2018 at 7:12 AM Ismaël Mejía  wrote:
>
>> I saw in a recent thread that the use of the Reshuffle transform was
>> recommended to solve an user issue:
>>
>>
>> https://lists.apache.org/thread.html/87ef575ac67948868648e0a8110be242f811bfff8fdaa7f9b758b933@%3Cdev.beam.apache.org%3E
>>
>> I can see why it may fix the reported issue. I am just curious about
>> the fact that the Reshuffle transform is marked as both @Internal and
>> @Deprecated in Beam's SDK.
>>
>> Do we have some alternative? So far the class documentation does not
>> recommend any replacement.
>>
>


Re: What is the Impulse and why do we need it?

2018-05-18 Thread Eugene Kirpichov
Hi Ismael,
Impulse is a primitive necessary for the Portability world, where sources
do not exist. Impulse is the only possible root of the pipeline: it emits a
single empty byte array, and it's all DoFns and SDFs from there. E.g.
when using the Fn API, Read.from(BoundedSource) is translated into: Impulse +
ParDo(emit source) + ParDo(call .split()) + reshuffle + ParDo(call
.createReader() and read from it).
Agreed that it makes sense to document it somewhere on the portability page.
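
To make that concrete, here is a minimal sketch of Impulse as the root of a
pipeline (illustrative only - the DoFn and its outputs are made up, and the
real Read translation lives in the SDK/runner translation code):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.Impulse;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

Pipeline p = Pipeline.create();
PCollection<String> splits =
    p.apply(Impulse.create())  // emits exactly one empty byte[]
     .apply("GenerateSplits", ParDo.of(new DoFn<byte[], String>() {
       @ProcessElement
       public void process(ProcessContext c) {
         // In the real translation this DoFn would emit the source and its
         // splits; here it just emits placeholder elements.
         c.output("split-1");
         c.output("split-2");
       }
     }));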

On Fri, May 18, 2018 at 7:21 AM Jean-Baptiste Onofré 
wrote:

> Fully agree.
>
> I already started to take a look.
>
> Regards
> JB
>
> On 18/05/2018 16:12, Ismaël Mejía wrote:
> > I have seen multiple mentions of 'Impulse' in JIRAs and some on other
> > discussions, but have not seen any document or concrete explanation on
> > what's Impulse and why we need it. This seems like an internal
> > implementation detail but it is probably a good idea to explain it
> > somewhere (my excuses if this is in some document and I missed it).
> >
>


Current progress on Portable runners

2018-05-17 Thread Eugene Kirpichov
Hi all,

A little over a month ago, a large group of Beam community members began
working on a prototype of a portable Flink runner - that is, a runner that can
execute Beam pipelines on Flink via the Portability API. The prototype was
developed in a separate branch and was successfully demonstrated at Flink
Forward, where it ran Python and Go pipelines in a limited setting.

Since then, a smaller group of people (Ankur Goenka, Axel Magnuson, Ben
Sidhom and myself) have been working on productionizing the prototype to
address its limitations and do things "the right way", preparing to reuse
this work for developing other portable runners (e.g. Spark). This involves
a surprising amount of work, since many important design and implementation
concerns could be ignored for the purposes of a prototype. I wanted to give
an update on where we stand now.

Our immediate milestone in sight is *Run Java and Python batch WordCount
examples against a distributed remote Flink cluster*. That involves a few
moving parts, roughly in order of appearance:

*Job submission:*
- The SDK is configured to use a "portable runner", whose responsibility is
to run the pipeline against a given JobService endpoint.
- The portable runner converts the pipeline to a portable Pipeline proto
- The runner finds out which artifacts it needs to stage, and stages them
against an ArtifactStagingService
- A Flink-specific JobService receives the Pipeline proto, performs some
optimizations (e.g. fusion) and translates it to Flink datasets and
functions

*Job execution:*
- A Flink function executes a fused chain of Beam transforms (an
"executable stage") by converting the input and the stage to bundles and
executing them against an SDK harness
- The function starts the proper SDK harness, auxiliary services (e.g.
artifact retrieval, side input handling) and wires them together
- The function feeds the data to the harness and receives data back.

*And here is our status of implementation for these parts:* basically,
almost everything is either done or in review.

*Job submission:*
- General-purpose portable runner in the Python SDK: done; Java SDK: also done
- Artifact staging from the Python SDK: in review (two PRs); in Java, it's
also done
- Flink JobService: in review
- Translation from a Pipeline proto to Flink datasets and functions: done
- ArtifactStagingService implementation that stages artifacts to a location
on a distributed filesystem: in development (design is clear)

*Job execution:*
- Flink function for executing via an SDK harness: done
- APIs for managing lifecycle of an SDK harness: done
- Specific implementation of those APIs using Docker: part done, part in review
- ArtifactRetrievalService that retrieves artifacts from the location where
ArtifactStagingService staged them: in development.

We expect that the in-review parts will be done, and the in-development
parts be developed, in the next 2-3 weeks. We will, of course, update the
community when this important milestone is reached.

*After that, the next milestones include:*
- Set up Java, Python and Go ValidatesRunner tests to run against the
portable Flink runner, and get them to pass
- Expand Python and Go to parity in terms of such test coverage
- Implement the portable Spark runner, with a similar lifecycle but reusing
almost all of the Flink work
- Add support for streaming to both (which requires SDF - that work is
progressing in parallel and by this point should be in a suitable place)

*For people who would like to get involved in this effort: *You can already
help out by improving ValidatesRunner test coverage in Python and Go. Java
has >300 such tests, Python has only a handful. There'll be a large amount
of parallelizable work once we get the VR test suites running - stay tuned.
SDF+Portability is also expected to produce a lot of parallelizable work up
for grabs within several weeks.

Thanks!


Re: Wait.on() - "Do this, then that" transform

2018-05-17 Thread Eugene Kirpichov
I mean it has to return a PCollection of something that contains elements
representing the result of completing the processing of the respective window.
E.g. FileIO.write() returns a PCollection of filenames; SpannerIO.write()
simply returns a PCollection of Void.

However, connectors such as BigtableIO.write() and BigQueryIO.write() don't
return such a PCollection. The former returns PDone; the latter returns a
PCollection of failed inserts that in some cases is unconnected to the
actual processing (when using load jobs).
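
To illustrate the connector side (a minimal sketch - WriteToMyDatabase and
its write logic are hypothetical, not actual Beam connector code): instead
of returning PDone, the write's expand() returns a PCollection whose windows
close only once the corresponding writes have been processed, which is
exactly what Wait.on() needs.

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

class WriteToMyDatabase extends PTransform<PCollection<String>, PCollection<Void>> {
  @Override
  public PCollection<Void> expand(PCollection<String> input) {
    // The (empty) output inherits the input's windowing; a window of it is
    // "done" exactly when all writes for that window have completed, so
    // callers can apply Wait.on(writeResults) downstream.
    return input.apply("WriteRecords", ParDo.of(new DoFn<String, Void>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        // ... issue the write for c.element() to the external system ...
      }
    }));
  }
}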

On Thu, May 17, 2018 at 1:55 PM Ismaël Mejía <ieme...@gmail.com> wrote:

> This sounds super interesting and useful !
>
> Eugene can you please elaborate on this phrase 'has to return a result that
> can be waited on'. It is not clear for me what this means and I would like
> to understand this to evaluate what other IOs could potentially support
> this.
>
>
> On Thu, May 17, 2018 at 10:13 PM Eugene Kirpichov <kirpic...@google.com>
> wrote:
>
> > Thanks Kenn, forwarding to user@ is a good idea; just did that.
>
> > JB - this is orthogonal to SDF, because I'd expect this transform to be
> primarily used for waiting on the results of SomethingIO.write(), whereas
> SDF is primarily useful for implementing SomethingIO.read().
>
> > On Mon, May 14, 2018 at 10:25 PM Jean-Baptiste Onofré <j...@nanthrax.net>
> wrote:
>
> >> Cool !!!
>
> >> I guess we can leverage this in IOs with SDF.
>
> >> Thanks
> >> Regards
> >> JB
>
> >> On 14/05/2018 23:48, Eugene Kirpichov wrote:
> >> > Hi folks,
> >> >
> >> > Wanted to give a heads up about the existence of a commonly requested
> >> > feature and its first successful production usage.
> >> >
> >> > The feature is the Wait.on() transform [1] , and the first successful
> >> > production usage is in Spanner [2] .
> >> >
> >> > The Wait.on() transform allows you to "do this, then that" - in the
> >> > sense that a.apply(Wait.on(signal)) re-emits PCollection "a", but only
> >> > after the PCollection "signal" is "done" in the same window (i.e. when
> >> > no more elements can arrive into the same window of "signal"). The
> >> > PCollection "signal" is typically a collection of results of some
> >> > operation - so Wait.on(signal) allows you to wait until that operation
> >> > is done. It transparently works correctly in streaming pipelines too.
> >> >
> >> > This may sound a little convoluted, so the example from documentation
> >> > should help.
> >> >
> >> > PCollection firstWriteResults = data.apply(ParDo.of(...write to
> >> > first database...));
> >> > data.apply(Wait.on(firstWriteResults))
> >> >   // Windows of this intermediate PCollection will be processed no
> >> > earlier than when
> >> >   // the respective window of firstWriteResults closes.
> >> >   .apply(ParDo.of(...write to second database...));
> >> >
> >> > This is indeed what Spanner folks have done, and AFAIK they intend
> this
> >> > for importing multiple dependent database tables - e.g. first import a
> >> > parent table; when it's done, import the child table - all within one
> >> > pipeline. You can see example code in the tests [3].
> >> >
> >> > Please note that this kind of stuff requires support from the IO
> >> > connector - IO.write() has to return a result that can be waited on.
> The
> >> > code of SpannerIO is a great example; another example is
> FileIO.write().
> >> >
> >> > People have expressed wishes for similar support in Bigtable and
> >> > BigQuery connectors but it's not there yet. It would be really cool if
> >> > somebody added it to these connectors or others (I think there was a
> >> > recent thread discussing how to add it to BigQueryIO).
> >> >
> >> > [1]
> >> >
>
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/Wait.java
> >> > [2] https://github.com/apache/beam/pull/4264
> >> > [3]
> >> >
>
> https://github.com/apache/beam/blob/a3ce091b3bbebf724c63be910bd3bc4cede4d11f/sdks/java/io/google-cloud-platform/src/test/java/org/apache/beam/sdk/io/gcp/spanner/SpannerWriteIT.java#L158
> >> >
>


Re: Wait.on() - "Do this, then that" transform

2018-05-17 Thread Eugene Kirpichov
Thanks Kenn, forwarding to user@ is a good idea; just did that.

JB - this is orthogonal to SDF, because I'd expect this transform to be
primarily used for waiting on the results of SomethingIO.write(), whereas
SDF is primarily useful for implementing SomethingIO.read().

On Mon, May 14, 2018 at 10:25 PM Jean-Baptiste Onofré <j...@nanthrax.net>
wrote:

> Cool !!!
>
> I guess we can leverage this in IOs with SDF.
>
> Thanks
> Regards
> JB
>
> On 14/05/2018 23:48, Eugene Kirpichov wrote:
> > Hi folks,
> >
> > Wanted to give a heads up about the existence of a commonly requested
> > feature and its first successful production usage.
> >
> > The feature is the Wait.on() transform [1] , and the first successful
> > production usage is in Spanner [2] .
> >
> > The Wait.on() transform allows you to "do this, then that" - in the
> > sense that a.apply(Wait.on(signal)) re-emits PCollection "a", but only
> > after the PCollection "signal" is "done" in the same window (i.e. when
> > no more elements can arrive into the same window of "signal"). The
> > PCollection "signal" is typically a collection of results of some
> > operation - so Wait.on(signal) allows you to wait until that operation
> > is done. It transparently works correctly in streaming pipelines too.
> >
> > This may sound a little convoluted, so the example from documentation
> > should help.
> >
> > PCollection firstWriteResults = data.apply(ParDo.of(...write to
> > first database...));
> > data.apply(Wait.on(firstWriteResults))
> >   // Windows of this intermediate PCollection will be processed no
> > earlier than when
> >   // the respective window of firstWriteResults closes.
> >   .apply(ParDo.of(...write to second database...));
> >
> > This is indeed what Spanner folks have done, and AFAIK they intend this
> > for importing multiple dependent database tables - e.g. first import a
> > parent table; when it's done, import the child table - all within one
> > pipeline. You can see example code in the tests [3].
> >
> > Please note that this kind of stuff requires support from the IO
> > connector - IO.write() has to return a result that can be waited on. The
> > code of SpannerIO is a great example; another example is FileIO.write().
> >
> > People have expressed wishes for similar support in Bigtable and
> > BigQuery connectors but it's not there yet. It would be really cool if
> > somebody added it to these connectors or others (I think there was a
> > recent thread discussing how to add it to BigQueryIO).
> >
> > [1]
> >
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/Wait.java
> > [2] https://github.com/apache/beam/pull/4264
> > [3]
> >
> https://github.com/apache/beam/blob/a3ce091b3bbebf724c63be910bd3bc4cede4d11f/sdks/java/io/google-cloud-platform/src/test/java/org/apache/beam/sdk/io/gcp/spanner/SpannerWriteIT.java#L158
> >
>


Re: Checkpointing and Restoring BoundedSource

2018-05-16 Thread Eugene Kirpichov
Hi Shen,
The only guarantee made by splitAtFraction is that the primary source +
residual source = original source - there are no other guarantees.
For checkpointing, you can use the following pattern:
When you want to checkpoint, call splitAtFraction(getFractionConsumed() +
epsilon) or something like that [the API is not great - ideally it would
have been splitAfterFractionOfRemainder(0.0) if there was such a method].
That means "stop as soon as possible"; after this call succeeds, its return
value is a checkpoint to resume from; however you still need to let the
current reader complete - it may have a few more records or even blocks
left [with SDF, there's a similar but more explicit method checkpoint()
that guarantees that there'll be no more blocks].
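
In code, the pattern looks roughly like this (a sketch only - "reader" is an
already-started BoundedSource.BoundedReader<T>, persistCheckpoint() and
process() are placeholders, and the epsilon value is arbitrary):

double epsilon = 0.001;
Double consumed = reader.getFractionConsumed();  // may be null for some sources
BoundedSource<T> residual =
    (consumed == null) ? null : reader.splitAtFraction(consumed + epsilon);
if (residual != null) {
  // "residual" is the checkpoint: persist it, and later resume by calling
  // residual.createReader(options) to obtain a fresh reader.
  persistCheckpoint(residual);
}
// Either way, drain the current reader - it may still have a few more
// records (or blocks) before it reaches the (possibly shrunk) end of its range.
while (reader.advance()) {
  process(reader.getCurrent());
}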

On Wed, May 16, 2018 at 11:14 AM Shen Li  wrote:

> Hi,
>
> After recovering from a checkpoint, is it correct to use
> BoundedSource.BoundedReader#splitAtFraction(double) to resume a
> BoundedSource? My concern is that the doc says "the new range would contain
> *approximately* the given fraction of the amount of data in the current
> range." Does the word *approximately* here mean that the application could
> potentially miss some data from the BoundedSource if resume
> from reader.splitAtFraction(fraction)?
>
> Thanks,
> Shen
>


Wait.on() - "Do this, then that" transform

2018-05-14 Thread Eugene Kirpichov
Hi folks,

Wanted to give a heads up about the existence of a commonly requested
feature and its first successful production usage.

The feature is the Wait.on() transform [1] , and the first successful
production usage is in Spanner [2] .

The Wait.on() transform allows you to "do this, then that" - in the sense
that a.apply(Wait.on(signal)) re-emits PCollection "a", but only after the
PCollection "signal" is "done" in the same window (i.e. when no more
elements can arrive into the same window of "signal"). The PCollection
"signal" is typically a collection of results of some operation - so
Wait.on(signal) allows you to wait until that operation is done. It
transparently works correctly in streaming pipelines too.

This may sound a little convoluted, so the example from documentation
should help.

PCollection firstWriteResults = data.apply(ParDo.of(...write to first
database...));
data.apply(Wait.on(firstWriteResults))
 // Windows of this intermediate PCollection will be processed no
earlier than when
 // the respective window of firstWriteResults closes.
 .apply(ParDo.of(...write to second database...));

This is indeed what Spanner folks have done, and AFAIK they intend this for
importing multiple dependent database tables - e.g. first import a parent
table; when it's done, import the child table - all within one pipeline.
You can see example code in the tests [3].

Please note that this kind of stuff requires support from the IO connector
- IO.write() has to return a result that can be waited on. The code of
SpannerIO is a great example; another example is FileIO.write().

People have expressed wishes for similar support in Bigtable and BigQuery
connectors but it's not there yet. It would be really cool if somebody
added it to these connectors or others (I think there was a recent thread
discussing how to add it to BigQueryIO).

[1]
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/Wait.java
[2] https://github.com/apache/beam/pull/4264
[3]
https://github.com/apache/beam/blob/a3ce091b3bbebf724c63be910bd3bc4cede4d11f/sdks/java/io/google-cloud-platform/src/test/java/org/apache/beam/sdk/io/gcp/spanner/SpannerWriteIT.java#L158


Re: "Radically modular data ingestion APIs in Apache Beam" @ Strata - slides available

2018-05-14 Thread Eugene Kirpichov
Hi Matthias,

Thank you. Here is the raw file on Google Drive:
https://drive.google.com/file/d/1gUoe6UrpNNO3ijYSgTvD5GBy-14Zx6dT/view?usp=sharing

And yes, I have permission from O'Reilly/Strata to use this file whichever
way I want, so it's ok to share on YouTube.

On Fri, May 11, 2018 at 12:13 PM Matthias Baetens <baetensmatth...@gmail.com>
wrote:

> Hey Eugene,
>
> Apologies for picking this up so late, but I could help uploading your
> video to the Beam channel.
> Are you able to send me the raw file and do you have sign-off to go ahead
> with sharing it on YouTube?
>
> Thanks.
> Matthias
>
> On Sat, 14 Apr 2018 at 21:45 Eugene Kirpichov <kirpic...@google.com>
> wrote:
>
>> Hi all,
>>
>> The video is now available. I got it from my Strata account and I have
>> permission to use and share it freely, so I published it on my own YouTube
>> page (where there's nothing else...). Perhaps it makes sense to add to the
>> Beam YouTube channel, but AFAIK only a PMC member can do that.
>>
>> https://www.youtube.com/watch?v=NIn9E5TVoCA
>>
>>
>> On Tue, Mar 13, 2018 at 3:33 AM James <xumingmi...@gmail.com> wrote:
>>
>>> Very informative, thanks!
>>>
>>> On Fri, Mar 9, 2018 at 4:49 PM Etienne Chauchot <echauc...@apache.org>
>>> wrote:
>>>
>>>> Great !
>>>>
>>>> Thanks for sharing.
>>>>
>>>> Etienne
>>>>
>>>
>>>> Le jeudi 08 mars 2018 à 19:49 +, Eugene Kirpichov a écrit :
>>>>
>>>> Hey all,
>>>>
>>>> The slides for my yesterday's talk at Strata San Jose
>>>> https://conferences.oreilly.com/strata/strata-ca/public/schedule/detail/63696
>>>>  have
>>>> been posted on the talk page. They may be of interest both to users and IO
>>>> authors.
>>>>
>>>> Thanks.
>>>>
>>>> --
>
>


Re: Eventual PAssert

2018-05-14 Thread Eugene Kirpichov
Thanks Anton, this is a really important topic because currently we have no
way at all for unit-testing IOs that emit unbounded output.
Regardless of the proposed PAssert API itself, even if we just figure out a
way to make the pipeline terminate on some condition from within the
pipeline, that'll be great.

On Mon, May 14, 2018 at 1:57 PM Anton Kedin  wrote:

> Hi,
>
> While working on an integration test
>  for Pubsub-related
> functionality I couldn't find a good solution to test the pipelines that
> don't reliably stop.
>
> I propose we extend PAssert to support eventual verification. In this
> case some success/failure predicate is being constantly evaluated against
> all elements of the pipeline until it's met. At that point the result gets
> communicated to the main program/test.
>
> Example API:
>
> *PAssert  .thatEventually(pcollection)  .containsInAnyOrder(e1, e2, e3)
>  .synchronizingOver(signalOverPubsub());  .timeoutAfter(10 min)*
>
> Details doc
> 
>
> Comments, thoughts, things that I missed?
>
> Regards,
> Anton
>


Re: Launching a Portable Pipeline

2018-05-14 Thread Eugene Kirpichov
Thanks Ankur, this document clarifies a few points and raises some very
important questions. I encourage everybody with a stake in Portability to
take a look and chime in.

+Aljoscha Krettek  +Thomas Weise
 +Henning Rohde 

On Mon, May 14, 2018 at 12:34 PM Ankur Goenka  wrote:

> Updated link
> 
>  to
> the document as the previous link was not working for some people.
>
>
> On Fri, May 11, 2018 at 7:56 PM Ankur Goenka  wrote:
>
>> Hi,
>>
>> Recent effort on portability has introduced JobService and
>> ArtifactService to the beam stack along with SDK. This has open up a few
>> questions around how we start a pipeline in a portable setup (with
>> JobService).
>> I am trying to document our approach to launching a portable pipeline and
>> take binding decisions based on the discussion.
>> Please review the document
>> 
>>  and
>> provide your feedback.
>>
>> Thanks,
>> Ankur
>>
>


Re: Graal instead of docker?

2018-05-11 Thread Eugene Kirpichov
On Fri, May 11, 2018 at 11:48 AM Romain Manni-Bucau <rmannibu...@gmail.com>
wrote:

>
>
> Le ven. 11 mai 2018 18:15, Andrew Pilloud <apill...@google.com> a écrit :
>
>> Json and Protobuf aren't the same thing. Json is for exchanging
>> unstructured data, Protobuf is for exchanging structured data. The point of
>> Portability is to define a protocol for exchanging structured messages
>> across languages. What do you propose using on top of Json to define
>> message structure?
>>
>
> Im fine with protobuf contracts, not with all the rest (libs*). Json has
> the advantage to not require much for consumers and be easy to integrate
> and proxy. Protobuf imposes a lot for that layer which will be typed by the
> runner anyway so no need of 2 typings layers.
>
>
>> I'd like to see the generic runner rewritten in Golang so we can
>> eliminate the significant overhead imposed by the JVM. I would argue that
>> Go is the best language for low overhead infrastructure, and is already
>> widely used by projects in this space such as Docker, Kubernetes, InfluxDB.
>> Even SQL can take advantage of this. For example, several runners could be
>> passed raw SQL and use their own SQL engines to implement more efficient
>> transforms then generic Beam can. Users will save significant $$$ on
>> infrastructure by not having to involve the JVM at all.
>>
>
> Yes...or no. Jvm overhead is very low gor such infra, less than 64M of ram
> and almost no cpu so will not help much for cluster or long lived processes
> like the ones we talk about.
>
> Also beam community is java - dont answer it is python or go without
> checking ;). Not sure adding a new language will help and give a face
> people will like to contribute or use the project.
>
> Currently id say the way the router runner is done is a detail but the
> choice to rethink current impl a crucial atchitectural point.
>
> No issue having N router impls too, one in java, one in go (but thought we
> had very few go resources/lovers in beam?), one in python where it would
> make a lot of sense and would show a router without actual primitive impl
> (delegating to direct java runner), etc...
>
> But one thing at a time, anyone against stopping current impl track,
> revert it and move to a higher level runner?
>
I am still not clear as to exactly what kind of change you are proposing.
First it looked like you were proposing to not have a hard dependency on
Docker, and that got resolved (we don't). Now it sounds like you're against
Java protobuf libraries, but that doesn't strike me as something warranting
an architecture change. If you have something else in mind, I'm afraid from
your recent emails I can't tell what it is.

Could you please create a document detailing:
- What precisely are the issues you see with some of the current
portability APIs
- What you think those APIs should look like instead
- How you think the path from the current implementation to your desired
state would look like
- If you're proposing major changes to the direction of work of a large
number of people, then please also elaborate in your document as to what
impact your proposal will have on the current work, or how this impact can
be minimized.

Please make sure to scan the pre-existing portability design documents to
see if similar concerns had already been discussed before. As others have
pointed out in this thread several times, almost everything that you're
asking has already been discussed. If you believe the discussion of a
particular issue has been insufficient, feel free to re-raise it on the
mailing list, by linking to the previous discussion and elaborating what
aspect you think has been missed; if you can't find the discussion of a
particular crucial design decision, feel free to raise that on the mailing
list too and people will be happy to help you find it.

I would also like to ask you to adjust the tone of your comments, such as
"beam is being driven by an implementation instead of a clear and scalable
architecture", "All the work done looks like an implemzntation detail of
one runner+vendor corrupting all the project" and "A bad architecture which
doeent embrace the community". These kinds of comments, to me, sound not
only unconstructively vague, but dismissive of the years of design and
implementation work done by dozens of people in this area. We are doing
something that's never done before, and the APIs and implementation are not
perfect and will continue evolving, but there are much more effective and
friendly ways to point out use cases where they fail or ways in which they
can be improved.


>
>
>> Andrew
>>
>> On Fri, May 11, 2018 at 8:53 AM Romain Manni-Bucau <rmannibu...@gmail.com>
>> wrote:
>>
>>>
>>&

Re: Support close of the iterator/iterable created from MapState/SetState

2018-05-11 Thread Eugene Kirpichov
I'm not sure if this has been proposed in this thread, but if the common
case is that users consume the whole iterator, then you can close resources
at !hasNext(). And for cleanup of incompletely consumed iterators, rely on
what Kenn suggested. Since you're making your own runner, you can add
additional places where cleanup happens automatically, e.g. across
ProcessElement or bundle boundaries.
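
A tiny sketch of the !hasNext() idea (the wrapped AutoCloseable stands in for
e.g. a RocksDB JNI iterator; this is illustrative, not runner code):

import java.util.Iterator;

class AutoClosingIterator<T> implements Iterator<T> {
  private final Iterator<T> delegate;
  private final AutoCloseable resource;
  private boolean closed = false;

  AutoClosingIterator(Iterator<T> delegate, AutoCloseable resource) {
    this.delegate = delegate;
    this.resource = resource;
  }

  @Override
  public boolean hasNext() {
    boolean has = !closed && delegate.hasNext();
    if (!has && !closed) {
      try {
        resource.close();  // fully consumed: release the resource right away
      } catch (Exception e) {
        throw new RuntimeException(e);
      }
      closed = true;
    }
    return has;
  }

  @Override
  public T next() {
    return delegate.next();
  }
}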

On Fri, May 11, 2018, 10:12 AM Lukasz Cwik  wrote:

> Alternatively to using weak/phantom reference:
> * Can you configure RocksDb's memory usage/limits?
> * Inside the iterator, periodically close and re-open the RocksDb
> connection seeking back to where the user was?
> * Use the ParDo/DoFn lifecycle and clean up after each
> processElement/finishBundle call.
>
>
> On Fri, May 11, 2018 at 9:51 AM Xinyu Liu  wrote:
>
>> Thanks for drafting the details about the two approaches, Kenn. Now I
>> understand Luke's proposal better. The approach looks neat, but the
>> uncertainty of *when* GC is going to kick in will make users' life hard.
>> If the user happens to configure a large JVM heap size, and since rocksDb
>> uses off-heap memory, GC might start very late and less frequent than what
>> we want. If we don't have a *definitive* way to let user close the
>> underlying resources, then there is no good way to handle such failures of
>> a critical application in production.
>>
>> Having a close method in the iterator might be a little unorthodox :). To
>> some degree, this is actually a resource we are holding underneath, and I
>> think it's pretty common to have close() for a resource in Java, e.g.
>> BufferedReader and BufferedWriter. So I would imagine that we also define a
>> resource for the state iterator and make the interface implements
>> *AutoCloseable*. Here is my sketch:
>>
>> // StateResource MUST be closed after use.
>> try (StateResource> st = bagState.iteratorResource())
>> {
>> Iterator iter = st.iterator();
>> while (iter.hasNext() {
>>.. do stuff ...
>> }
>> } catch (Exception e) {
>> ... user code
>> }
>>
>> The type/method name are just for illustrating here, so please don't
>> laugh at them. Please feel free to comment and let me know if you have
>> thoughts about the programming patterns here.
>>
>> Thanks,
>> Xinyu
>>
>> On Thu, May 10, 2018 at 8:59 PM, Kenneth Knowles  wrote:
>>
>>> It is too soon to argue whether an API is complex or not. There has been
>>> no specific API proposed.
>>>
>>> I think the problem statement is real - you need to be able to read and
>>> write bigger-than-memory state. It seems we have multiple runners that
>>> don't support it, perhaps because of our API. You might be able to build
>>> something good enough with phantom references, but you might not.
>>>
>>> If I understand the idea, it might look something like this:
>>>
>>> new DoFn<>() {
>>>@StateId("foo")
>>>private final StateSpec myBagSpec = ...
>>>
>>>@ProcessElement
>>>public void proc(@StateId("foo") BagState myBag, ...) {
>>>  CloseableIterator iterator = myBag.get().iterator();
>>>  while(iterator.hasNext() && ... special condition ...) {
>>>... do stuff ...
>>>  }
>>>  iterator.close();
>>>}
>>>  }
>>>
>>> So it has no impact on users who don't choose to close() since they
>>> iterate with for ( : ) as usual. And a runner that has the 10x funding to
>>> try out a ReferenceQueue can be resilient to users that forget. On the
>>> other hand, I haven't seen this pattern much in the wild, so I think it is
>>> valuable to discuss other methods.
>>>
>>> While Luke's proposal is something like this if I understand his sketch
>>> (replacing WeakReference with PhantomReference seems to be what you really
>>> want):
>>>
>>> ... in RocksDb state implementation ...
>>> class RocksDbBagState {
>>>   static ReferenceQueue rocksDbIteratorQueue = new ReferenceQueue();
>>>
>>>   class Iterator {
>>>  PhantomReference cIter;
>>>  .next() {
>>>return cIter.next();
>>>  }
>>>   }
>>>
>>>  class Iterable {
>>> .iterator() {
>>>   return new Iterator(new PhantomReference<>(rocksDbJniIterator,
>>> rocksDbIteratorQueue));
>>> }
>>>   }
>>> }
>>>
>>> ... on another thread ...
>>> while(true) {
>>>   RocksDbIterator deadRef = (RocksDbIterator)
>>> rocksDbIteratorQueue.remove();
>>>   deadRef.close();
>>> }
>>>
>>> When the iterator is GC'd, the phantom reference will pop onto the queue
>>> for being closed. This might not be too bad. You'll have delayed resource
>>> release, and potentially masked errors that are hard to debug. It is less
>>> error-prone than WeakReference, which is asking for trouble when objects
>>> are collected en masse. Anecdotally I have heard that performance of this
>>> kind of approach is poor, but I 

Re: Graal instead of docker?

2018-05-09 Thread Eugene Kirpichov
On Wed, May 9, 2018 at 1:08 AM Romain Manni-Bucau 
wrote:

>
>
> Le mer. 9 mai 2018 00:57, Henning Rohde  a écrit :
>
>> There are indeed lots of possibilities for interesting docker
>> alternatives with different tradeoffs and capabilities, but in generally
>> both the runner as well as the SDK must support them for it to work. As
>> mentioned, docker (as used in the container contract) is meant as a
>> flexible main option but not necessarily the only option. I see no problem
>> with certain pipeline-SDK-runner combinations additionally supporting a
>> specialized setup. Pipeline can be a factor, because that some transforms
>> might depend on aspects of the runtime environment -- such as system
>> libraries or shelling out to a /bin/foo.
>>
>> The worker boot code is tied to the current container contract, so
>> pre-launched workers would presumably not use that code path and are not be
>> bound by its assumptions. In particular, such a setup might want to invert
>> who initiates the connection from the SDK worker to the runner. Pipeline
>> options and global state in the SDK and user functions process might make
>> it difficult to safely reuse worker processes across pipelines, but also
>> doable in certain scenarios.
>>
>
> This is not that hard actually and most java env do it.
>
> Main concern is 1. Being tight to an impl detail and 2. A bad architecture
> which doeent embrace the community
>
Could you please be more specific? Concerns about Docker dependency have
already been repeatedly addressed in this thread.


>
>
>
>> Henning
>>
>> On Tue, May 8, 2018 at 3:51 PM Thomas Weise  wrote:
>>
>>>
>>>
>>> On Sat, May 5, 2018 at 3:58 PM, Robert Bradshaw 
>>> wrote:
>>>

 I would welcome changes to

 https://github.com/apache/beam/blob/v2.4.0/model/pipeline/src/main/proto/beam_runner_api.proto#L730
 that would provide alternatives to docker (one of which comes to mind
 is "I
 already brought up a worker(s) for you (which could be the same process
 that handled pipeline construction in testing scenarios), here's how to
 connect to it/them.") Another option, which would seem to appeal to you
 in
 particular, would be "the worker code is linked into the runner's
 binary,
 use this process as the worker" (though note even for java-on-java, it
 can
 be advantageous to shield the worker and runner code from each others
 environments, dependencies, and version requirements.) This latter
 should
 still likely use the FnApi to talk to itself (either over GRPC on local
 ports, or possibly better via direct function calls eliminating the RPC
 overhead altogether--this is how the fast local runner in Python works).
 There may be runner environments well controlled enough that "start up
 the
 workers" could be specified as "run this command line." We should make
 this
 environment message extensible to other alternatives than "docker
 container
 url," though of course we don't want the set of options to grow too
 large
 or we loose the promise of portability unless every runner supports
 every
 protocol.


>>> The pre-launched worker would be an interesting option, which might work
>>> well for a sidecar deployment.
>>>
>>> The current worker boot code though makes the assumption that the runner
>>> endpoint to phone home to is known when the process is launched. That
>>> doesn't work so well with a runner that establishes its endpoint
>>> dynamically. Also, the assumption is baked in that a worker will only serve
>>> a single pipeline (provisioning API etc.).
>>>
>>> Thanks,
>>> Thomas
>>>
>>>
>>


Re: Graal instead of docker?

2018-05-08 Thread Eugene Kirpichov
>> up.
>>
>> >> I would welcome changes to
>>
>>
>> https://github.com/apache/beam/blob/v2.4.0/model/pipeline/src/main/proto/beam_runner_api.proto#L730
>> >> that would provide alternatives to docker (one of which comes to mind
>> is
>> "I
>> >> already brought up a worker(s) for you (which could be the same process
>> >> that handled pipeline construction in testing scenarios), here's how to
>> >> connect to it/them.") Another option, which would seem to appeal to you
>> in
>> >> particular, would be "the worker code is linked into the runner's
>> binary,
>> >> use this process as the worker" (though note even for java-on-java, it
>> can
>> >> be advantageous to shield the worker and runner code from each others
>> >> environments, dependencies, and version requirements.) This latter
>> should
>> >> still likely use the FnApi to talk to itself (either over GRPC on local
>> >> ports, or possibly better via direct function calls eliminating the RPC
>> >> overhead altogether--this is how the fast local runner in Python
>> works).
>> >> There may be runner environments well controlled enough that "start up
>> the
>> >> workers" could be specified as "run this command line." We should make
>> this
>> >> environment message extensible to other alternatives than "docker
>> container
>> >> url," though of course we don't want the set of options to grow too
>> large
>> >> or we loose the promise of portability unless every runner supports
>> every
>> >> protocol.
>>
>> >> Of course, the runner is always free to execute any Fn for which it
>> >> completely understands the URN and the environment any way it pleases,
>> e.g.
>> >> directly in process, or even via lighter-weight mechanism like Jython
>> or
>> >> Graal, rather than asking an external process to do it. But we need a
>> >> lowest common denominator for executing arbitrary URNs runners are not
>> >> expected to understand.
>>
>> >> As an aside, there are also technical limitations in implementing
>> >> Portability
>> >> by simply requiring all runners to be Java and the portable layer
>> simply
>> >> being wrappers of UserFnInLangaugeX in an equivalent
>> UserFnObjectInJava,
>> >> executing everything as if it were pure Java. In particular the
>> overheads
>> >> of unnecessarily crossing the language boundaries many times in a
>> single
>> >> fused graph are often prohibitive.
>>
>> >> Sorry for the long email, but hopefully this helps shed some light on
>> (at
>> >> least how I see) the portability effort (at the core of the Beam vision
>> >> statement) as well as concrete actions we can take to decouple it from
>> >> specific technologies.
>>
>> >> - Robert
>>
>>
>> >> On Sat, May 5, 2018 at 2:06 PM Romain Manni-Bucau <
>> rmannibu...@gmail.com>
>> >> wrote:
>>
>> >> > All are good points.
>>
>> >> > The only "?" I keep is: why beam doesnt uses its visitor api to make
>> the
>> >> portability transversal to all runners "mutating" the user model before
>> >> translation? Technically it sounds easy and avoid hacking all impl. Was
>> it
>> >> tested and failed?
>>
>> >> > Le 5 mai 2018 22:50, "Thomas Weise" <t...@apache.org> a écrit :
>>
>> >> >> Docker isn't a silver bullet and may not be the best choice for all
>> >> environments (I'm also looking at potentially launching SDK workers in
>> a
>> >> different way), but AFAIK there has not been any alternative proposal
>> for
>> >> default SDK execution that can handle all of Python, Go and Java.
>>
>> >> >> Regardless of the default implementation, we should strive to keep
>> the
>> >> implementation modular so users can plug in their own replacement as
>> >> needed. Looking at the prototype implementation, Docker comes
>> downstream
>> of
>> >> FlinkExecutableStageFunction, and it will be possible to supply a
>> custom
>> >> implementation by making the translator pluggable (which I intend to
>> work
>> >> on once backporting to master is complete), and possibly
>> >> "SDKHarnessManager&qu

Re: Graal instead of docker?

2018-05-05 Thread Eugene Kirpichov
Not sure what you mean? Can you point to a piece of code in Beam that
you're currently characterizing as "hacking" and suggest how it could be
refactored?

On Sat, May 5, 2018 at 2:06 PM Romain Manni-Bucau <rmannibu...@gmail.com>
wrote:

> All are good points.
>
> The only "?" I keep is: why beam doesnt uses its visitor api to make the
> portability transversal to all runners "mutating" the user model before
> translation? Technically it sounds easy and avoid hacking all impl. Was it
> tested and failed?
>
> Le 5 mai 2018 22:50, "Thomas Weise" <t...@apache.org> a écrit :
>
>> Docker isn't a silver bullet and may not be the best choice for all
>> environments (I'm also looking at potentially launching SDK workers in a
>> different way), but AFAIK there has not been any alternative proposal for
>> default SDK execution that can handle all of Python, Go and Java.
>>
>> Regardless of the default implementation, we should strive to keep the
>> implementation modular so users can plug in their own replacement as
>> needed. Looking at the prototype implementation, Docker comes downstream of
>> FlinkExecutableStageFunction, and it will be possible to supply a custom
>> implementation by making the translator pluggable (which I intend to work
>> on once backporting to master is complete), and possibly
>> "SDKHarnessManager" itself can also be swapped out.
>>
>> I would also prefer that for Flink and other Java based runners we retain
>> the option to inline executable stages that are in Java. I would expect a
>> good number of use cases to benefit from direct execution in the task
>> manager, and it may be good to offer the user that optimization.
>>
>> Thanks,
>> Thomas
>>
>>
>>
>> On Sat, May 5, 2018 at 12:54 PM, Eugene Kirpichov <kirpic...@google.com>
>> wrote:
>>
>>> To add on that: Romain, if you are really excited about Graal as a
>>> project, here are some constructive suggestions as to what you can do on a
>>> reasonably short timeframe:
>>> - Propose/prototype a design for writing UDFs in Beam SQL using Graal
>>> - Go through the portability-related design documents, come up with a
>>> more precise assessment of what parts are actually dependent on Docker's
>>> container format and/or on Docker itself, and propose a plan for untangling
>>> this dependency and opening the door to other mechanisms of cross-language
>>> execution
>>>
>>> On Sat, May 5, 2018 at 12:50 PM Eugene Kirpichov <kirpic...@google.com>
>>> wrote:
>>>
>>>> Graal is a very young project, currently nowhere near the level of
>>>> maturity or completeness as to be sufficient for Beam to fully bet its
>>>> portability vision on it:
>>>> - Graal currently only claims to support Java and Javascript, with Ruby
>>>> and R in the status of "some applications may run", Python support "just
>>>> beginning", and Go lacking altogether.
>>>> - Regarding existing production usage, the Graal FAQ says it is "a
>>>> project with new innovative technology in its early stages."
>>>>
>>>> That said, as Graal matures, I think it would be reasonable to keep an
>>>> eye on it as a potential future lightweight alternative to containers for
>>>> pipelines where Graal's level of support is sufficient for this particular
>>>> pipeline.
>>>>
>>>> Please also keep in mind that execution of user code is only a small
>>>> part of the overall portability picture, and dependency on Docker is an
>>>> even smaller part of that (there is only 1 mention of the word "Docker" in
>>>> all of Beam's portability protos, and the mention is in an out-of-date TODO
>>>> comment). I hope this addresses your concerns.
>>>>
>>>> On Sat, May 5, 2018 at 11:49 AM Romain Manni-Bucau <
>>>> rmannibu...@gmail.com> wrote:
>>>>
>>>>> Agree
>>>>>
>>>>> The jvm is still mainstream for big data and it is trivial to have a
>>>>> remote facade to support natives but no point to have it in runners, it is
>>>>> some particular transforms or even dofn and sources only...
>>>>>
>>>>>
>>>>> Le 5 mai 2018 19:03, "Andrew Pilloud" <apill...@google.com> a écrit :
>>>>>
>>>>>> Thanks for the examples earlier, I think Hazelcast is a great
>>>>>>

Re: Graal instead of docker?

2018-05-05 Thread Eugene Kirpichov
To add on that: Romain, if you are really excited about Graal as a project,
here are some constructive suggestions as to what you can do on a
reasonably short timeframe:
- Propose/prototype a design for writing UDFs in Beam SQL using Graal
- Go through the portability-related design documents, come up with a more
precise assessment of what parts are actually dependent on Docker's
container format and/or on Docker itself, and propose a plan for untangling
this dependency and opening the door to other mechanisms of cross-language
execution

On Sat, May 5, 2018 at 12:50 PM Eugene Kirpichov <kirpic...@google.com>
wrote:

> Graal is a very young project, currently nowhere near the level of
> maturity or completeness as to be sufficient for Beam to fully bet its
> portability vision on it:
> - Graal currently only claims to support Java and Javascript, with Ruby
> and R in the status of "some applications may run", Python support "just
> beginning", and Go lacking altogether.
> - Regarding existing production usage, the Graal FAQ says it is "a project
> with new innovative technology in its early stages."
>
> That said, as Graal matures, I think it would be reasonable to keep an eye
> on it as a potential future lightweight alternative to containers for
> pipelines where Graal's level of support is sufficient for this particular
> pipeline.
>
> Please also keep in mind that execution of user code is only a small part
> of the overall portability picture, and dependency on Docker is an even
> smaller part of that (there is only 1 mention of the word "Docker" in all
> of Beam's portability protos, and the mention is in an out-of-date TODO
> comment). I hope this addresses your concerns.
>
> On Sat, May 5, 2018 at 11:49 AM Romain Manni-Bucau <rmannibu...@gmail.com>
> wrote:
>
>> Agree
>>
>> The jvm is still mainstream for big data and it is trivial to have a
>> remote facade to support natives but no point to have it in runners, it is
>> some particular transforms or even dofn and sources only...
>>
>>
>> Le 5 mai 2018 19:03, "Andrew Pilloud" <apill...@google.com> a écrit :
>>
>>> Thanks for the examples earlier, I think Hazelcast is a great example
>>> of something portability might make more difficult. I'm not working on
>>> portability, but my understanding is that the data sent to the runner is a
>>> blob of code and the name of the container to run it in. A runner with a
>>> native language (java on Hazelcast for example) could run the code directly
>>> without the container if it is in a language it supports. So when Hazelcast
>>> sees a known java container specified, it just loads the java blob and runs
>>> it. When it sees another container it rejects the pipeline. You could use
>>> Graal in the Hazelcast runner to do this for a number of languages. I would
>>> expect that this could also be done in the direct runner, which similarly
>>> provides a native java environment, so portable Java pipelines can be
>>> tested without docker?
>>>
>>> For another way to frame this: if Beam was originally written in Go, we
>>> would be having a different discussion. A pipeline written entirely in java
>>> wouldn't be possible, so instead to enable Hazelcast, we would have to be
>>> able to run the java from portability without running the container.
>>>
>>> Andrew
>>>
>>> On Sat, May 5, 2018 at 1:48 AM Romain Manni-Bucau <rmannibu...@gmail.com>
>>> wrote:
>>>
>>>>
>>>>
>>>> 2018-05-05 9:27 GMT+02:00 Ismaël Mejía <ieme...@gmail.com>:
>>>>
>>>>> Graal would not be a viable solution for the reasons Henning and Andrew
>>>>> mentioned, or put in other words, when users choose a programming
>>>>> language
>>>>> they don’t choose only a ‘friendly’ syntax or programming model, they
>>>>> choose also the ecosystem that comes with it, and the libraries that
>>>>> make
>>>>> their life easier. However isolating these user libraries/dependencies
>>>>> is a
>>>>> hard problem and so far the standard solution to this problem is to use
>>>>> operating systems containers via docker.
>>>>>
>>>>
>>>> Graal solves that Ismael. Same kind of experience than running npm libs
>>>> on nashorn but with a more unified API to run any language soft.
>>>>
>>>>
>>>>>
>>>>> The Beam vision from day zero is to run pipelines written in multiple

Re: Graal instead of docker?

2018-05-05 Thread Eugene Kirpichov
Graal is a very young project, currently nowhere near the level of maturity
or completeness as to be sufficient for Beam to fully bet its portability
vision on it:
- Graal currently only claims to support Java and Javascript, with Ruby and
R in the status of "some applications may run", Python support "just
beginning", and Go lacking altogether.
- Regarding existing production usage, the Graal FAQ says it is "a project
with new innovative technology in its early stages."

That said, as Graal matures, I think it would be reasonable to keep an eye
on it as a potential future lightweight alternative to containers for
pipelines where Graal's level of support is sufficient for this particular
pipeline.

Please also keep in mind that execution of user code is only a small part
of the overall portability picture, and dependency on Docker is an even
smaller part of that (there is only 1 mention of the word "Docker" in all
of Beam's portability protos, and the mention is in an out-of-date TODO
comment). I hope this addresses your concerns.

On Sat, May 5, 2018 at 11:49 AM Romain Manni-Bucau 
wrote:

> Agree
>
> The jvm is still mainstream for big data and it is trivial to have a
> remote facade to support natives but no point to have it in runners, it is
> some particular transforms or even dofn and sources only...
>
>
> Le 5 mai 2018 19:03, "Andrew Pilloud"  a écrit :
>
>> Thanks for the examples earlier, I think Hazelcast is a great example of
>> something portability might make more difficult. I'm not working on
>> portability, but my understanding is that the data sent to the runner is a
>> blob of code and the name of the container to run it in. A runner with a
>> native language (java on Hazelcast for example) could run the code directly
>> without the container if it is in a language it supports. So when Hazelcast
>> sees a known java container specified, it just loads the java blob and runs
>> it. When it sees another container it rejects the pipeline. You could use
>> Graal in the Hazelcast runner to do this for a number of languages. I would
>> expect that this could also be done in the direct runner, which similarly
>> provides a native java environment, so portable Java pipelines can be
>> tested without docker?
>>
>> For another way to frame this: if Beam was originally written in Go, we
>> would be having a different discussion. A pipeline written entirely in java
>> wouldn't be possible, so instead to enable Hazelcast, we would have to be
>> able to run the java from portability without running the container.
>>
>> Andrew
>>
>> On Sat, May 5, 2018 at 1:48 AM Romain Manni-Bucau 
>> wrote:
>>
>>>
>>>
>>> 2018-05-05 9:27 GMT+02:00 Ismaël Mejía :
>>>
 Graal would not be a viable solution for the reasons Henning and Andrew
 mentioned, or put in other words, when users choose a programming
 language
 they don’t choose only a ‘friendly’ syntax or programming model, they
 choose also the ecosystem that comes with it, and the libraries that
 make
 their life easier. However isolating these user libraries/dependencies
 is a
 hard problem and so far the standard solution to this problem is to use
 operating systems containers via docker.

>>>
>>> Graal solves that Ismael. Same kind of experience than running npm libs
>>> on nashorn but with a more unified API to run any language soft.
>>>
>>>

 The Beam vision from day zero is to run pipelines written in multiple
 languages in runners in multiple systems, and so far we are not doing
 this
 in particular in the Apache runners. The portability work is the
 cleanest
 way to achieve this vision given the constraints.

>>>
>>> Hmm, did I read it wrong and we don't have specific integration of the
>>> portable API in runners? This is what is messing up the runners and
>>> limiting beam adoption on existing runners.
>>> Portable API is a feature buildable on top of runner, not in runners.
>>> Same as a runner implementing the 5-6 primitives can run anything, the
>>> portable API should just rely on that and not require more integration.
>>> It doesn't prevent more deep integrations as for some higher level
>>> primitives existing in runners but it is not the case today for runners so
>>> shouldn't exist IMHO.
>>>
>>>

 I agree however that for the Java SDK to Java runner case this can
 represent additional pain, docker ideally should not be a requirement
 for
 Java users with the Direct runner and debugging a pipeline should be as
 easy as it is today. I think the Univerrsal Local Runner exists to cover
 the Portable case, but after looking at this JIRA I am not sure if
 unification is coming (and by consequence if docker would be mandatory).
 https://issues.apache.org/jira/browse/BEAM-4239

 I suppose for the distributed runners that they must implement the full

Re: How to create a runtime ValueProvider

2018-05-03 Thread Eugene Kirpichov
There is no way to achieve this using ValueProvider. Its value is either
fixed at construction time (StaticValueProvider), or completely dynamic
(evaluated every time you call .get()).
You'll need to implement this using a side input. E.g. take a look at
the implementation of BigQueryIO and how it generates unique job id tokens -
https://github.com/apache/beam/blob/bf94e36f67a8bc5d24c795e40697ad2504c8594c/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java#L756
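
For illustration, here is a rough sketch of the side-input approach (names
are made up; "p" is the Pipeline and "data" is an existing
PCollection<String>): the value is computed when the pipeline actually
executes, then consumed as a singleton side input.

import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionView;
import org.joda.time.DateTime;
import org.joda.time.DateTimeZone;

PCollectionView<String> nowView =
    p.apply("Seed", Create.of("seed"))
     .apply("CaptureNow", ParDo.of(new DoFn<String, String>() {
       @ProcessElement
       public void process(ProcessContext c) {
         // Evaluated at execution time, not when the template is created.
         c.output(DateTime.now(DateTimeZone.UTC).toString());
       }
     }))
     .apply(View.asSingleton());

PCollection<String> output =
    data.apply("UseNow", ParDo.of(new DoFn<String, String>() {
      @ProcessElement
      public void process(ProcessContext c) {
        String now = c.sideInput(nowView);
        c.output(now + ": " + c.element());
      }
    }).withSideInputs(nowView));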


On Thu, May 3, 2018 at 5:42 PM Frank Yellin  wrote:

> [Sorry, I accidentally hit send before I had finished typing . .]
>
> Is there any way to achieve what I'm looking for?  Or is this just beyond
> the scope of ValueProvider and templates?
>
>
>
> On Thu, May 3, 2018 at 5:36 PM, Frank Yellin  wrote:
>
>> I'm attempting to create a dataflow template, and within the template
>> have a variable
>> ValueProvider now
>> such that now is the time the dataflow is started, note the time that the
>> template was created.
>>
>> My first attempt was
>> ValueProvider now =
>> StaticValueProvider.of(DateTime.now(DateTimeZone.UTC));
>>
>> My second attempt was
>>
>>   public interface MyOptions extends PipelineOptions,
>> DataflowPipelineOptions {
>> @Description("Now")
>> @Default.InstanceFactory(GetNow.class)
>> ValueProvider getNow();
>> void setNow(ValueProvider value);
>>   }
>>
>>   static class GetNow implements DefaultValueFactory {
>> @Override
>> public DateTime create(PipelineOptions options) {
>>   return DateTime.now(DateTimeZone.UTC);
>> }
>>   }
>>
>>   ValueProvider now = options.getNow()
>>
>> My final attempt was:
>>
>>ValueProvider> nowFn =
>> StaticValueProvider.of(x -> DateTime.now(DateTimeZone.UTC));
>>
>> ValueProvider now = NestedValueProvider.of(nowFn, x ->
>> x.apply(null));
>>
>>
>>
>> In every case, it was clear that "now" was being set to template-creation
>> time rather than actual runtime.
>>
>> I note that the documentation talks about a RuntimeValueProvider, but
>> there is no user-visible constructor for this.
>>
>>
>>
>>
>


Re: ValidatesRunner test cleanup

2018-05-03 Thread Eugene Kirpichov
Thanks Kenn! Note though that we should have VR tests for transforms that
have a runner-specific override, such as the TextIO.write() and Create
transforms that you mentioned.

Agreed that it'd be good to have a clearer packaging separation between
the two.
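
For anyone unsure what the distinction looks like in code, here's a
hypothetical test (the transform under test is made up): tagging it
NeedsRunner runs it only on the DirectRunner, while switching the category
to ValidatesRunner.class would run it against every runner's suite.

import org.apache.beam.sdk.testing.NeedsRunner;
import org.apache.beam.sdk.testing.PAssert;
import org.apache.beam.sdk.testing.TestPipeline;
import org.apache.beam.sdk.transforms.Create;
import org.junit.Rule;
import org.junit.Test;
import org.junit.experimental.categories.Category;

public class MyTransformTest {
  @Rule public final transient TestPipeline p = TestPipeline.create();

  @Test
  @Category(NeedsRunner.class)
  public void testMyTransform() {
    // Exercises only SDK-level behavior, so a single runner is enough.
    PAssert.that(p.apply(Create.of(1, 2, 3))).containsInAnyOrder(1, 2, 3);
    p.run();
  }
}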

On Thu, May 3, 2018, 10:35 AM Kenneth Knowles <k...@google.com> wrote:

> Since I went over the PR and dropped a lot of random opinions about what
> should be VR versus NR, I'll answer too:
>
> VR - all primitives: ParDo, GroupByKey, Flatten.pCollections
> (Flatten.iterables is an unrelated composite), Metrics
> VR - critical special composites: Combine
> VR - test infrastructure that ensures tests aren't vacuous: PAssert
> NR - everything else in the core SDK that needs a runner but is really
> only testing the transform, not the runner, notably Create, TextIO,
> extended bits of Combine
> (nothing) - everything in modules that depend on the core SDK can use
> TestPipeline without an annotation; personally I think NR makes sense to
> annotate these, but it has no effect
>
> And it is a good time to mention that it might be very cool for someone to
> take on the task of conceiving of a more independent runner validation
> suite. This framework is clever, but a bit deceptive as runner tests look
> like unit tests of the primitives.
>
> Kenn
>
> On Thu, May 3, 2018 at 9:24 AM Eugene Kirpichov <kirpic...@google.com>
> wrote:
>
>> Thanks Scott, this is awesome!
>> However, we should be careful when choosing what should be
>> ValidatesRunner and what should be NeedsRunner.
>> Could you briefly describe how you made the call and roughly what are the
>> statistics before/after your PR (number of tests in both categories)?
>>
>> On Thu, May 3, 2018 at 9:18 AM Jean-Baptiste Onofré <j...@nanthrax.net>
>> wrote:
>>
>>> Thanks for the update Scott. That's really a great job.
>>>
>>> I will ping you on slack about some points as I'm preparing the build
>>> for the release (and I have some issues ).
>>>
>>> Thanks again
>>> Regards
>>> JB
>>> Le 3 mai 2018, à 17:54, Scott Wegner <sweg...@google.com> a écrit:
>>>>
>>>> Note: if you don't care about Java runner tests, you can stop reading
>>>> now.
>>>>
>>>> tl;dr: I've made a pass over all @ValidatesRunner tests in pr/5218 [1]
>>>> and converted many to @NeedsRunner in order to reduce post-commit runtime.
>>>>
>>>> This is work that was long overdue and finally got my attention due to
>>>> the Gradle migration. As context, @ValidatesRunner [2] tests construct a
>>>> TestPipeline and exercise runner behavior through SDK constructs. The tests
>>>> are written runner-agnostic so that they can be run on and validate all
>>>> supported runners.
>>>>
>>>> The framework for these tests is great and writing them is super-easy.
>>>> But as a result, we have way too many of them-- over 250. These tests run
>>>> against all runners, and even when parallelized we see Dataflow post-commit
>>>> times exceeding 3-5 hours [3].
>>>>
>>>> When reading through these tests, we found many of them don't actually
>>>> exercise runner-specific behavior, and were simply using the TestPipeline
>>>> framework to validate SDK components. This is a valid pattern, but tests
>>>> should be annotated with @NeedsRunner instead. With this annotation, the
>>>> tests will run on only a single runner, currently DirectRunner.
>>>>
>>>> So, PR/5218 looks at all existing @ValidatesRunner tests and
>>>> conservatively converts tests which don't need to validate all runners into
>>>> @NeedsRunner. I've also sharded out some very large test classes into
>>>> scenario-based sub-classes. This is because Gradle parallelizes tests at
>>>> the class-level, and we found a couple very large test classes (ParDoTest)
>>>> became stragglers for the entire execution. Hopefully Gradle will soon
>>>> implement dynamic splitting :)
>>>>
>>>> So, the action I'd like to request from others:
>>>> 1) If you are an author of @ValidatesRunner tests, feel free to look
>>>> over the PR and let me know if I missed anything. Kenn Knowles is also
>>>> helping out here.
>>>> 2) If you find yourself writing new @ValidatesRunner tests, please
>>>> consider whether your test is validating runner-provided behavior. If not,
>>>> use @NeedsRunner instead.
>>>>
>>>>
>>>> [1] https://github.com/apache/beam/pull/5218
>>>> [2]
>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/testing/ValidatesRunner.java
>>>>
>>>> [3]
>>>> https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow_Gradle/buildTimeTrend
>>>>
>>>>
>>>
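
A minimal sketch of the distinction discussed above, assuming the JUnit
categories used in the Beam Java SDK (the transform, class, and method names
here are made up for illustration): a test that only exercises SDK-level
transform logic gets @NeedsRunner, so it runs on a single runner rather than
the whole ValidatesRunner matrix.

  import org.apache.beam.sdk.testing.NeedsRunner;
  import org.apache.beam.sdk.testing.PAssert;
  import org.apache.beam.sdk.testing.TestPipeline;
  import org.apache.beam.sdk.transforms.Create;
  import org.apache.beam.sdk.transforms.MapElements;
  import org.apache.beam.sdk.values.PCollection;
  import org.apache.beam.sdk.values.TypeDescriptors;
  import org.junit.Rule;
  import org.junit.Test;
  import org.junit.experimental.categories.Category;

  public class UpperCaseTransformTest {
    @Rule public final transient TestPipeline p = TestPipeline.create();

    // Only exercises SDK transform logic, so a single runner (the DirectRunner)
    // is enough: tag it NeedsRunner rather than ValidatesRunner.
    @Test
    @Category(NeedsRunner.class)
    public void testUpperCase() {
      PCollection<String> result =
          p.apply(Create.of("a", "b"))
              .apply(MapElements.into(TypeDescriptors.strings())
                  .via((String s) -> s.toUpperCase()));
      PAssert.that(result).containsInAnyOrder("A", "B");
      p.run();
    }
  }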


Re: ValidatesRunner test cleanup

2018-05-03 Thread Eugene Kirpichov
Thanks Scott, this is awesome!
However, we should be careful when choosing what should be ValidatesRunner
and what should be NeedsRunner.
Could you briefly describe how you made the call, and roughly what the
statistics are before/after your PR (number of tests in both categories)?

On Thu, May 3, 2018 at 9:18 AM Jean-Baptiste Onofré  wrote:

> Thanks for the update Scott. That's really a great job.
>
> I will ping you on slack about some points as I'm preparing the build for
> the release (and I have some issues ).
>
> Thanks again
> Regards
> JB
> Le 3 mai 2018, à 17:54, Scott Wegner  a écrit:
>>
>> Note: if you don't care about Java runner tests, you can stop reading now.
>>
>> tl;dr: I've made a pass over all @ValidatesRunner tests in pr/5218 [1]
>> and converted many to @NeedsRunner in order to reduce post-commit runtime.
>>
>> This is work that was long overdue and finally got my attention due to
>> the Gradle migration. As context, @ValidatesRunner [2] tests construct a
>> TestPipeline and exercise runner behavior through SDK constructs. The tests
>> are written runner-agnostic so that they can be run on and validate all
>> supported runners.
>>
>> The framework for these tests is great and writing them is super-easy.
>> But as a result, we have way too many of them-- over 250. These tests run
>> against all runners, and even when parallelized we see Dataflow post-commit
>> times exceeding 3-5 hours [3].
>>
>> When reading through these tests, we found many of them don't actually
>> exercise runner-specific behavior, and were simply using the TestPipeline
>> framework to validate SDK components. This is a valid pattern, but tests
>> should be annotated with @NeedsRunner instead. With this annotation, the
>> tests will run on only a single runner, currently DirectRunner.
>>
>> So, PR/5218 looks at all existing @ValidatesRunner tests and
>> conservatively converts tests which don't need to validate all runners into
>> @NeedsRunner. I've also sharded out some very large test classes into
>> scenario-based sub-classes. This is because Gradle parallelizes tests at
>> the class-level, and we found a couple very large test classes (ParDoTest)
>> became stragglers for the entire execution. Hopefully Gradle will soon
>> implement dynamic splitting :)
>>
>> So, the action I'd like to request from others:
>> 1) If you are an author of @ValidatesRunner tests, feel free to look over
>> the PR and let me know if I missed anything. Kenn Knowles is also helping
>> out here.
>> 2) If you find yourself writing new @ValidatesRunner tests, please
>> consider whether your test is validating runner-provided behavior. If not,
>> use @NeedsRunner instead.
>>
>>
>> [1] https://github.com/apache/beam/pull/5218
>> [2]
>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/testing/ValidatesRunner.java
>>
>> [3]
>> https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow_Gradle/buildTimeTrend
>>
>>
>


Re: Java compiler OOMs on Jenkins/Gradle

2018-05-01 Thread Eugene Kirpichov
Thanks! FWIW it seems that my other Jenkins build is about to fail with the
same issue
https://builds.apache.org/job/beam_PreCommit_Java_GradleBuild/4806/ -
"Expiring Daemon because JVM Tenured space is exhausted"

On Tue, May 1, 2018 at 1:36 PM Lukasz Cwik <lc...@google.com> wrote:

> +sweg...@google.com who is currently messing around with tuning some
> Gradle flags related to the JVM and its memory usage.
>
> On Tue, May 1, 2018 at 1:34 PM Eugene Kirpichov <kirpic...@google.com>
> wrote:
>
>> Hi,
>>
>> I've seen the same issue twice in a row on PR
>> https://github.com/apache/beam/pull/4264 : the Java precommit fails with
>> messages like:
>>
>> > Task :beam-sdks-java-core:compileTestJava
>> An exception has occurred in the compiler ((version info not available)).
>> Please file a bug against the Java compiler via the Java bug reporting page
>> (http://bugreport.java.com) after checking the Bug Database (
>> http://bugs.java.com) for duplicates. Include your program and the
>> following diagnostic in your report. Thank you.
>> java.lang.OutOfMemoryError: GC overhead limit exceeded
>>
>> Full build link:
>> https://builds.apache.org/job/beam_PreCommit_Java_GradleBuild/4803/consoleFull
>>
>> Anybody know what's up with that? I thought we got new powerful Jenkins
>> executors and we shouldn't be running out of memory? However, I see that
>> the build specifies *-Dorg.gradle.jvmargs=-Xmx512m* - this seems too
>> small. Should we increase this?
>>
>> Thanks.
>>
>


Java compiler OOMs on Jenkins/Gradle

2018-05-01 Thread Eugene Kirpichov
Hi,

I've seen the same issue twice in a row on PR
https://github.com/apache/beam/pull/4264 : the Java precommit fails with
messages like:

> Task :beam-sdks-java-core:compileTestJava
An exception has occurred in the compiler ((version info not available)).
Please file a bug against the Java compiler via the Java bug reporting page
(http://bugreport.java.com) after checking the Bug Database (
http://bugs.java.com) for duplicates. Include your program and the
following diagnostic in your report. Thank you.
java.lang.OutOfMemoryError: GC overhead limit exceeded

Full build link:
https://builds.apache.org/job/beam_PreCommit_Java_GradleBuild/4803/consoleFull

Anybody know what's up with that? I thought we got new powerful Jenkins
executors and we shouldn't be running out of memory? However, I see that
the build specifies *-Dorg.gradle.jvmargs=-Xmx512m* - this seems too small.
Should we increase this?

Thanks.
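
For reference, this kind of Gradle daemon heap setting normally lives in
gradle.properties (or is passed as a -D flag on the command line). A sketch
with purely illustrative values, not the ones the project eventually settled on:

  # gradle.properties -- illustrative values only
  org.gradle.jvmargs=-Xmx4g -XX:MaxMetaspaceSize=512m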


Re: Splittable DoFN in Spark discussion

2018-04-30 Thread Eugene Kirpichov
I think this stuff is happening in SparkGroupAlsoByWindowViaWindowSet:
https://github.com/apache/beam/blob/master/runners/spark/src/main/java/org/apache/beam/runners/spark/stateful/SparkGroupAlsoByWindowViaWindowSet.java#L610

As far as I can tell, there is no infinite stream of pings involved.
However, Spark documentation says under
https://spark.apache.org/docs/latest/streaming-programming-guide.html#updatestatebykey-operation
:
"In every batch, Spark will apply the state update function for all
existing keys, regardless of whether they have new data in a batch or not"

It even provides a way to GC the state - " If the update function returns
None then the key-value pair will be eliminated."

This looks promising. Does Spark streaming always periodically create some
batches, and they just turn out empty if there's no data? If so, then we
probably won't even need an infinite stream of pings.
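
For concreteness, a minimal sketch of the updateStateByKey behavior referred to
above, written against the classic Spark Streaming Java API (the socket source,
checkpoint path, and batch interval are illustrative): the update function runs
for every known key in every batch, even when that key has no new data, and
returning an empty Optional drops the key's state.

  import java.util.List;
  import org.apache.spark.SparkConf;
  import org.apache.spark.api.java.Optional;
  import org.apache.spark.api.java.function.Function2;
  import org.apache.spark.streaming.Durations;
  import org.apache.spark.streaming.api.java.JavaPairDStream;
  import org.apache.spark.streaming.api.java.JavaStreamingContext;
  import scala.Tuple2;

  public class UpdateStateByKeySketch {
    public static void main(String[] args) throws Exception {
      SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("update-state-sketch");
      JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));
      jssc.checkpoint("/tmp/spark-checkpoint"); // required for stateful ops; path is illustrative

      // Called once per batch for every key ever seen, whether or not newValues is empty.
      Function2<List<Integer>, Optional<Integer>, Optional<Integer>> updateFn =
          (newValues, state) -> {
            int sum = state.isPresent() ? state.get() : 0;
            for (Integer v : newValues) {
              sum += v;
            }
            // Returning Optional.empty() here would eliminate ("GC") this key's state.
            return Optional.of(sum);
          };

      JavaPairDStream<String, Integer> counts =
          jssc.socketTextStream("localhost", 9999)      // illustrative source
              .mapToPair(line -> new Tuple2<>(line, 1))
              .updateStateByKey(updateFn);

      counts.print();
      jssc.start();
      jssc.awaitTermination();
    }
  }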

On Fri, Apr 27, 2018 at 12:14 PM Kenneth Knowles <k...@google.com> wrote:

> On Fri, Apr 27, 2018 at 12:06 PM Robert Bradshaw <rober...@google.com>
> wrote:
>
>> On Fri, Apr 27, 2018 at 11:56 AM Kenneth Knowles <k...@google.com> wrote:
>>
>> > I'm still pretty shallow on this topic & this thread, so forgive if I'm
>> restating or missing things.
>>
>> > My understanding is that the Spark runner does support Beam's triggering
>> semantics for unbounded aggregations, using the same support code from
>> runners/core that all runners use. Relevant code in SparkTimerInternals
>> and
>> SparkGroupAlsoByWindowViaWindowSet.
>>
>> > IIRC timers are stored in state, scanned each microbatch to see which
>> are
>> eligible.
>>
>> I think the issue (which is more severe in the case of sources) is what to
>> do if no more data comes in to trigger another microbatch.
>>
>
> So will a streaming pipeline fail to trigger in this case? I have this
> feeling the "join with an infinite stream of pings" might already be
> happening.
>
> Kenn
>
>
>
>> > I don't see an immediate barrier to having timer loops. I don't know
>> about performance of this approach, but currently the number of timers per
>> shard (key+window) is bounded by their declarations in code, so it is a
>> tiny number unless codegenerated. We do later want to have dynamic timers
>> (some people call it a TimerMap by analogy with MapState) but I haven't
>> seen a design or even a sketch that I can recall.
>>
>> > Kenn
>>
>> > On Thu, Apr 26, 2018 at 1:48 PM Holden Karau <hol...@pigscanfly.ca>
>> wrote:
>>
>> >> Yeah that's been the implied source of being able to be continuous, you
>> union with a receiver which produces an infinite number of batches (the
>> "never ending queue stream", though not actually a queue stream since they have
>> some limitations, but our own implementation thereof).
>>
>> >> On Tue, Apr 24, 2018 at 11:54 PM, Reuven Lax <re...@google.com> wrote:
>>
>> >>> Could we do this behind the scenes by writing a Receiver that
>> publishes
>> periodic pings?
>>
>> >>> On Tue, Apr 24, 2018 at 10:09 PM Eugene Kirpichov <
>> kirpic...@google.com>
>> wrote:
>>
>> >>>> Kenn - I'm arguing that in Spark SDF style computation can not be
>> expressed at all, and neither can Beam's timers.
>>
>> >>>> Spark, unlike Flink, does not have a timer facility (only state), and
>> as far as I can tell its programming model has no other primitive that can
>> map a finite RDD into an infinite DStream - the only way to create a new
>> infinite DStream appears to be to write a Receiver.
>>
>> >>>> I cc'd you because I'm wondering whether you've already investigated
>> this when considering whether timers can be implemented on the Spark
>> runner.
>>
>> >>>> On Tue, Apr 24, 2018 at 2:53 PM Kenneth Knowles <k...@google.com>
>> wrote:
>>
>> >>>>> I don't think I understand what the limitations of timers are that
>> you are referring to. FWIW I would say implementing other primitives like
>> SDF is an explicit non-goal for Beam state & timers.
>>
>> >>>>> I got lost at some point in this thread, but is it actually
>> necessary
>> that a bounded PCollection maps to a finite/bounded structure in Spark?
>> Skimming, I'm not sure if the problem is that we can't transliterate Beam
>> to Spark (this might be a good sign) or that we can't express SDF style
>> computation at all (seems far-fetched, but I could be convinced). D

Re: Kafka connector for Beam Python SDK

2018-04-30 Thread Eugene Kirpichov
cting function to be able to maintain a watermark. Notably,
>>> PubsubIO does not accept such a function, but requires the timestamp to be
>>> in a metadata field that any language can describe (versus having to parse
>>> the message to pull out the timestamp).
>>>
>>> Kenn
>>>
>>> [1]
>>> https://beam.apache.org/contribute/ptransform-style-guide/#choosing-types-of-input-and-output-pcollections
>>>
>>> On Mon, Apr 30, 2018 at 9:27 AM Reuven Lax <re...@google.com> wrote:
>>>
>>>> Another point: cross-language IOs might add a performance penalty in
>>>> many cases. For an example of this look at BigQueryIO. The user can
>>>> register a SerializableFunction that is evaluated on every record, and
>>>> determines which destination to write the record to. Now a Python user
>>>> would want to register a Python function for this of course. this means
>>>> that the Java IO would have to invoke Python code for each record it sees,
>>>> which will likely be a big performance hit.
>>>>
>>>> Of course the downside of duplicating IOs is exactly as you say -
>>>> multiple versions to maintain, and potentially duplicate bugs. I think the
>>>> right answer will need to be on a case-by-case basis.
>>>>
>>>> Reuven
>>>>
>>>> On Mon, Apr 30, 2018 at 8:05 AM Chamikara Jayalath <
>>>> chamik...@google.com> wrote:
>>>>
>>>>> Hi Aljoscha,
>>>>>
>>>>> I tried to cover this in the doc. Once we have full support for
>>>>> cross-language IO, we can decide this on a case-by-case basis. But I don't
>>>>> think we should cease defining new sources/sinks for Beam Python SDK till
>>>>> we get to that point. I think there are good reasons for adding Kafka
>>>>> support for Python today and many Beam users have request this. Also, note
>>>>> that proposed Python Kafka source will be based on the Splittable DoFn
>>>>> framework while the current Java version is based on the UnboundedSource
>>>>> framework. Here are the reasons that are currently listed in the doc.
>>>>>
>>>>>
>>>>> - Users might find it useful to have at least one unbounded source and
>>>>>   sink combination implemented in Python SDK, and Kafka is the streaming
>>>>>   system that makes most sense to support if we just want to add support
>>>>>   for only one such system in Python SDK.
>>>>> - Not all runners might support cross-language IO. Also, some
>>>>>   user/runner/deployment combinations might require an unbounded
>>>>>   source/sink implemented in Python SDK.
>>>>> - We recently added Splittable DoFn support to Python SDK. It will be
>>>>>   good to have at least one production-quality Splittable DoFn that will
>>>>>   serve as a good example for any users who wish to implement new
>>>>>   Splittable DoFn implementations on top of Beam Python SDK.
>>>>> - The cross-language transform feature is currently in the initial
>>>>>   discussion phase, and it could be some time before we can offer the
>>>>>   existing Java implementation of Kafka to Python SDK users.
>>>>> - Cross-language IO might take even longer to reach the point where it's
>>>>>   fully equivalent in expressive power to a transform written in the host
>>>>>   language - e.g. supporting host-language lambdas as part of the
>>>>>   transform configuration is likely to take a lot longer than
>>>>>   "first-order" cross-language IO. KafkaIO in Java uses lambdas as part
>>>>>   of transform configuration, e.g. timestamp functions.
>>>>>
>>>>>
>>>>> Thanks,
>>>>> Cham
>>>>>
>>>>> On Mon, Apr 30, 2018 at 2:14 AM Aljoscha Krettek <aljos...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> Is this what we want to do in the long run, i.e. implement copies of
>>>>>> connectors for different SDKs? I thought the plan was to enable using
>>>>>> connectors written in different languages, i.e. use the Java Kafka I/O 
>>>>>> from
>>>>>> python. This way we wouldn't duplicate bugs for three different language
>>>>>> (Java, Python, and Go for now).
>>>>>>
>>>>>> Best,
>>>>>> Aljoscha
>>>>>>
>>>>>>
>>>>>> On 29. Apr 2018, at 20:46, Eugene Kirpichov <kirpic...@google.com>
>>>>>> wrote:
>>>>>>
>>>>>> Thanks Cham, this is great! I left just a couple of comments on the
>>>>>> doc.
>>>>>>
>>>>>> On Fri, Apr 27, 2018 at 10:06 PM Chamikara Jayalath <
>>>>>> chamik...@google.com> wrote:
>>>>>>
>>>>>>> Hi All,
>>>>>>>
>>>>>>> I'm looking into adding a Kafka connector to Beam Python SDK. I
>>>>>>> think this will benefit many Python SDK users and will serve as a good
>>>>>>> example for recently added Splittable DoFn API (Fn API support which 
>>>>>>> will
>>>>>>> allow all runners to use Python Splittable DoFn is in active 
>>>>>>> development).
>>>>>>> I created a document [1] that makes the case for adding this connector 
>>>>>>> and
>>>>>>> compares the performance of available Python Kafka client libraries. 
>>>>>>> Also I
>>>>>>> created a POC [2] that illustrates the API and how Python SDF API can be
>>>>>>> used to implement a Kafka source. I extremely appreciate any feedback
>>>>>>> related to this.
>>>>>>>
>>>>>>> [1]
>>>>>>> https://docs.google.com/document/d/1ogRS-e-HYYTHsXi_l2zDUUOnvfzEbub3BFkPrYIOawU/edit?usp=sharing
>>>>>>> [2]
>>>>>>> https://github.com/chamikaramj/beam/commit/982767b69198579b22522de6794242142d12c5f9
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Cham
>>>>>>>
>>>>>>
>>>>>>
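
To make the "lambdas as part of transform configuration" point above concrete,
a rough Java sketch of KafkaIO roughly as it looked around the time of this
thread (the broker address, topic, and the key-carries-the-event-time
convention are illustrative assumptions; treat the exact method names as
approximate):

  import org.apache.beam.sdk.Pipeline;
  import org.apache.beam.sdk.io.kafka.KafkaIO;
  import org.apache.beam.sdk.values.KV;
  import org.apache.beam.sdk.values.PCollection;
  import org.apache.kafka.common.serialization.LongDeserializer;
  import org.apache.kafka.common.serialization.StringDeserializer;
  import org.joda.time.Instant;

  public class KafkaTimestampLambdaSketch {
    public static void main(String[] args) {
      Pipeline p = Pipeline.create();
      PCollection<KV<Long, String>> records =
          p.apply(
              KafkaIO.<Long, String>read()
                  .withBootstrapServers("broker:9092")           // illustrative address
                  .withTopic("events")                           // illustrative topic
                  .withKeyDeserializer(LongDeserializer.class)
                  .withValueDeserializer(StringDeserializer.class)
                  // A host-language lambda used as transform configuration: here the
                  // record key is assumed to carry the event time in epoch millis.
                  .withTimestampFn(kv -> new Instant(kv.getKey()))
                  .withoutMetadata());
      p.run().waitUntilFinish();
    }
  }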


Re: Kafka connector for Beam Python SDK

2018-04-29 Thread Eugene Kirpichov
Thanks Cham, this is great! I left just a couple of comments on the doc.

On Fri, Apr 27, 2018 at 10:06 PM Chamikara Jayalath 
wrote:

> Hi All,
>
> I'm looking into adding a Kafka connector to Beam Python SDK. I think this
> will benefit many Python SDK users and will serve as a good example for
> recently added Splittable DoFn API (Fn API support which will allow all
> runners to use Python Splittable DoFn is in active development).  I created
> a document [1] that makes the case for adding this connector and compares
> the performance of available Python Kafka client libraries. Also I created
> a POC [2] that illustrates the API and how Python SDF API can be used to
> implement a Kafka source. I extremely appreciate any feedback related to
> this.
>
> [1]
> https://docs.google.com/document/d/1ogRS-e-HYYTHsXi_l2zDUUOnvfzEbub3BFkPrYIOawU/edit?usp=sharing
> [2]
> https://github.com/chamikaramj/beam/commit/982767b69198579b22522de6794242142d12c5f9
>
> Thanks,
> Cham
>


Re: Custom URNs and runner translation

2018-04-26 Thread Eugene Kirpichov
I agree with Thomas' sentiment that cross-language IO is very important
because of how much work it takes to produce a mature connector
implementation in a language. Looking at implementations of BigQueryIO,
PubSubIO, KafkaIO, FileIO in Java, only a very daring soul would be tempted
to reimplement them entirely in Python and Go.

I'm imagining pretty much what Kenn is describing: a pipeline would specify
some transforms by URN + payload, and rely on the runner to do whatever it
takes to run this - either by expanding it into a Beam implementation of
this transform that the runner chooses to use (could be in the same
language or in a different language; either way, the runner would indeed
need to invoke the respective SDK to expand it given the parameters), or by
doing something entirely runner-specific (e.g. using the built-in Flink
Kafka connector).

I don't see a reason to require that there *must* exist a Beam
implementation of this transform. There only, ideally, must be a runner-
and language-agnostic spec for the URN and payload; of course, then the
transform is only as portable as the set of runners that implement this URN.

I actually really like the idea that the transform can be implemented in a
completely runner-specific way without a Beam expansion to back it up - it
would let us unblock a lot of the work earlier than full-blown
cross-language IO is delivered or even than SDFs work in all
languages/runners.

On Wed, Apr 25, 2018 at 10:02 PM Kenneth Knowles  wrote:

> It doesn't have to be 1:1 swapping KafkaIO for a Flink Kafka connector,
> right? I was imagining: Python SDK submits pipeline with a KafkaIO (with
> URN + payload) maybe bogus contents. It is replaced with a small Flink
> subgraph, including the native Flink Kafka connector and some compensating
> transfoms to match the required semantics. To me, this is preferable to
> making single-runner transform URNs, since that breaks runner portability
> by definition.
>
> Kenn
>
> On Wed, Apr 25, 2018 at 7:40 PM Chamikara Jayalath 
> wrote:
>
>>
>>
>> On Wed, Apr 25, 2018 at 6:57 PM Reuven Lax  wrote:
>>
>>> On Wed, Apr 25, 2018 at 6:51 PM Kenneth Knowles  wrote:
>>>
 The premise of URN + payload is that you can establish a spec. A native
 override still needs to meet the spec - it may still require some
 compensating code. Worrying about weird differences between runners seems
 more about worrying that an adequate spec cannot be determined.

>>>
>>> My point exactly. a SDF-based KafkaIO can run in the middle of a
>>> pipeline. E.g. we could have TextIO producing a list of topics, and the SDF
>>> KafkaIO run after that on this dynamic (not known until runtime) list of
>>> topics. If the native Flink source doesn't work this way, then it doesn't
>>> share the same spec and should have a different URN.
>>>
>>
>> Agree that if they cannot share the same spec, SDF and native transforms
>> warrant different URNs. Native Kafka might be able to support a PCollection
>> of topics/partitions as an input though by utilizing underlying native
>> Flink Kafka connector as a library. On the other hand, we might decide to
>> expand SDF-based ParDos into other transforms before a runner gets a
>> chance to override in which case this kind of replacements will not be
>> possible.
>>
>> Thanks,
>> Cham
>>
>>
>>>
 Runners will already invoke the SDF differently, so users treating
 every detail of some implementation as the spec are doomed.

 Kenn

 On Wed, Apr 25, 2018, 17:04 Reuven Lax  wrote:

>
>
> On Tue, Apr 24, 2018 at 5:52 PM Chamikara Jayalath <
> chamik...@google.com> wrote:
>
>>
>>
>> On Tue, Apr 24, 2018 at 3:44 PM Henning Rohde 
>> wrote:
>>
>>> > Note that a KafkaDoFn still needs to be provided, but could be a
>>> DoFn that
>>> > fails loudly if it's actually called in the short term rather than
>>> a full
>>> > Python implementation.
>>>
>>> For configurable runner-native IO, for now, I think it is reasonable
>>> to use a URN + special data payload directly without a KafkaDoFn --
>>> assuming it's a portable pipeline. That's what we do in Go for
>>> PubSub-on-Dataflow and something similar would work for Kafka-on-Flink 
>>> as
>>> well. I agree that non-native alternative implementation is desirable, 
>>> but
>>> if one is not present we should IMO rather fail at job submission 
>>> instead
>>> of at runtime. I could imagine connectors intrinsic to an execution 
>>> engine
>>> where non-native implementations are not possible.
>>>
>>
>> I think, in this case, KafkaDoFn can be a SDF that would expand
>> similar to any other SDF by default (initial splitting, GBK, and a 
>> map-task
>> equivalent, for example) but a runner (Flink in this case) will be 

Re: Apache Beam Jenkins Machines Upgrade

2018-04-26 Thread Eugene Kirpichov
This sounds awesome, thanks to everybody involved!

On Thu, Apr 26, 2018 at 4:28 PM Yifan Zou  wrote:

> Greetings,
>
> Most of you already know about upgrades on Jenkins machines this week. I
> still want to share some details you might be interested in.
>
> On April 24th, we set up 16 new GCE instances (beam9 - 24) under the gcloud
> project 'apache-beam-testing' for Beam Jenkins. The old machines are
> currently disconnected and will be removed next week. Major updates
> are as follows:
>
> -- Upgraded machine type from n1-standard-4 (4 vCPUs, 15GB RAM) to
> n1-highmem-16 (16 vCPUs, 104GB Ram);
> -- Upgraded OS from Ubuntu 14.04 to 16.04;
> -- Increased quantity of instances to 16;
> -- Installed the latest version of packages and underlying tools;
>
> The overall build duration is dropping and stability has improved since the
> switch to the more powerful machines. In addition, Python tests are no longer
> limited to specific nodes, which significantly reduces the waiting time in the
> job queue. For example, before the swap, the average waiting time of the 10
> latest Python PreCommits was 22.2 minutes; now it is down to 10 seconds!
>
> Special thanks to Alan Myrvold and Jason Kuster for helping with the smooth
> swapping process.
>
> Regards.
> Yifan Zou
>


Re: Splittable DoFN in Spark discussion

2018-04-24 Thread Eugene Kirpichov
Kenn - I'm arguing that in Spark SDF style computation can not be expressed
at all, and neither can Beam's timers.

Spark, unlike Flink, does not have a timer facility (only state), and as
far as I can tell its programming model has no other primitive that can map
a finite RDD into an infinite DStream - the only way to create a new
infinite DStream appears to be to write a Receiver.

I cc'd you because I'm wondering whether you've already investigated this
when considering whether timers can be implemented on the Spark runner.

On Tue, Apr 24, 2018 at 2:53 PM Kenneth Knowles <k...@google.com> wrote:

> I don't think I understand what the limitations of timers are that you are
> referring to. FWIW I would say implementing other primitives like SDF is an
> explicit non-goal for Beam state & timers.
>
> I got lost at some point in this thread, but is it actually necessary that
> a bounded PCollection maps to a finite/bounded structure in Spark?
> Skimming, I'm not sure if the problem is that we can't transliterate Beam
> to Spark (this might be a good sign) or that we can't express SDF style
> computation at all (seems far-fetched, but I could be convinced). Does
> doing a lightweight analysis and just promoting some things to be some kind
> of infinite representation help?
>
> Kenn
>
> On Tue, Apr 24, 2018 at 2:37 PM Eugene Kirpichov <kirpic...@google.com>
> wrote:
>
>> Would like to revive this thread one more time.
>>
>> At this point I'm pretty certain that Spark can't support this out of the
>> box and we're gonna have to make changes to Spark.
>>
>> Holden, could you advise who would be some Spark experts (yourself
>> included :) ) who could advise what kind of Spark change would both support
>> this AND be useful to the regular Spark community (non-Beam) so that it has
>> a chance of finding support? E.g. is there any plan in Spark regarding
>> adding timers similar to Flink's or Beam's timers, maybe we could help out
>> with that?
>>
>> +Kenneth Knowles <k...@google.com> because timers suffer from the same
>> problem.
>>
>> On Thu, Apr 12, 2018 at 2:28 PM Eugene Kirpichov <kirpic...@google.com>
>> wrote:
>>
>>> (resurrecting thread as I'm back from leave)
>>>
>>> I looked at this mode, and indeed as Reuven points out it seems that it
>>> affects execution details, but doesn't offer any new APIs.
>>> Holden - your suggestions of piggybacking an unbounded-per-element SDF
>>> on top of an infinite stream would work if 1) there was just 1 element and
>>> 2) the work was guaranteed to be infinite.
>>>
>>> Unfortunately, both of these assumptions are insufficient. In particular:
>>>
>>> - 1: The SDF is applied to a PCollection; the PCollection itself may be
>>> unbounded; and the unbounded work done by the SDF happens for every
>>> element. E.g. we might have a Kafka topic on which names of Kafka topics
>>> arrive, and we may end up concurrently reading a continuously growing
>>> number of topics.
>>> - 2: The work per element is not necessarily infinite, it's just *not
>>> guaranteed to be finite* - the SDF is allowed at any moment to say
>>> "Okay, this restriction is done for real" by returning stop() from the
>>> @ProcessElement method. Continuing the Kafka example, e.g., it could do
>>> that if the topic/partition being watched is deleted. Having an infinite
>>> stream as a driver of this process would require being able to send a
>>> signal to the stream to stop itself.
>>>
>>> Is it looking like there's any other way this can be done in Spark
>>> as-is, or are we going to have to make changes to Spark to support this?
>>>
>>> On Sun, Mar 25, 2018 at 9:50 PM Holden Karau <hol...@pigscanfly.ca>
>>> wrote:
>>>
>>>> I mean the new mode is very much in the Dataset not the DStream API
>>>> (although you can use the Dataset API with the old modes too).
>>>>
>>>> On Sun, Mar 25, 2018 at 9:11 PM, Reuven Lax <re...@google.com> wrote:
>>>>
>>>>> But this new mode isn't a semantic change, right? It's moving away
>>>>> from micro batches into something that looks a lot like what Flink does -
>>>>> continuous processing with asynchronous snapshot boundaries.
>>>>>
>>>>> On Sun, Mar 25, 2018 at 9:01 PM Thomas Weise <t...@apache.org> wrote:
>>>>>
>>>>>> Hopefully the new "continuous processing mode" in Spark will enable
>>>>>> SDF implementation (and re

Re: Splittable DoFN in Spark discussion

2018-04-24 Thread Eugene Kirpichov
Would like to revive this thread one more time.

At this point I'm pretty certain that Spark can't support this out of the
box and we're gonna have to make changes to Spark.

Holden, could you advise who would be some Spark experts (yourself included
:) ) who could advise what kind of Spark change would both support this AND
be useful to the regular Spark community (non-Beam) so that it has a chance
of finding support? E.g. is there any plan in Spark regarding adding timers
similar to Flink's or Beam's timers, maybe we could help out with that?

+Kenneth Knowles <k...@google.com> because timers suffer from the same
problem.

On Thu, Apr 12, 2018 at 2:28 PM Eugene Kirpichov <kirpic...@google.com>
wrote:

> (resurrecting thread as I'm back from leave)
>
> I looked at this mode, and indeed as Reuven points out it seems that it
> affects execution details, but doesn't offer any new APIs.
> Holden - your suggestions of piggybacking an unbounded-per-element SDF on
> top of an infinite stream would work if 1) there was just 1 element and 2)
> the work was guaranteed to be infinite.
>
> Unfortunately, both of these assumptions are insufficient. In particular:
>
> - 1: The SDF is applied to a PCollection; the PCollection itself may be
> unbounded; and the unbounded work done by the SDF happens for every
> element. E.g. we might have a Kafka topic on which names of Kafka topics
> arrive, and we may end up concurrently reading a continuously growing
> number of topics.
> - 2: The work per element is not necessarily infinite, it's just *not
> guaranteed to be finite* - the SDF is allowed at any moment to say "Okay,
> this restriction is done for real" by returning stop() from the
> @ProcessElement method. Continuing the Kafka example, e.g., it could do
> that if the topic/partition being watched is deleted. Having an infinite
> stream as a driver of this process would require being able to send a
> signal to the stream to stop itself.
>
> Is it looking like there's any other way this can be done in Spark as-is,
> or are we going to have to make changes to Spark to support this?
>
> On Sun, Mar 25, 2018 at 9:50 PM Holden Karau <hol...@pigscanfly.ca> wrote:
>
>> I mean the new mode is very much in the Dataset not the DStream API
>> (although you can use the Dataset API with the old modes too).
>>
>> On Sun, Mar 25, 2018 at 9:11 PM, Reuven Lax <re...@google.com> wrote:
>>
>>> But this new mode isn't a semantic change, right? It's moving away from
>>> micro batches into something that looks a lot like what Flink does -
>>> continuous processing with asynchronous snapshot boundaries.
>>>
>>> On Sun, Mar 25, 2018 at 9:01 PM Thomas Weise <t...@apache.org> wrote:
>>>
>>>> Hopefully the new "continuous processing mode" in Spark will enable SDF
>>>> implementation (and real streaming)?
>>>>
>>>> Thanks,
>>>> Thomas
>>>>
>>>>
>>>> On Sat, Mar 24, 2018 at 3:22 PM, Holden Karau <hol...@pigscanfly.ca>
>>>> wrote:
>>>>
>>>>>
>>>>> On Sat, Mar 24, 2018 at 1:23 PM Eugene Kirpichov <kirpic...@google.com>
>>>>> wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Mar 23, 2018, 11:17 PM Holden Karau <hol...@pigscanfly.ca>
>>>>>> wrote:
>>>>>>
>>>>>>> On Fri, Mar 23, 2018 at 7:00 PM Eugene Kirpichov <
>>>>>>> kirpic...@google.com> wrote:
>>>>>>>
>>>>>>>> On Fri, Mar 23, 2018 at 6:49 PM Holden Karau <hol...@pigscanfly.ca>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> On Fri, Mar 23, 2018 at 6:20 PM Eugene Kirpichov <
>>>>>>>>> kirpic...@google.com> wrote:
>>>>>>>>>
>>>>>>>>>> On Fri, Mar 23, 2018 at 6:12 PM Holden Karau <
>>>>>>>>>> hol...@pigscanfly.ca> wrote:
>>>>>>>>>>
>>>>>>>>>>> On Fri, Mar 23, 2018 at 5:58 PM Eugene Kirpichov <
>>>>>>>>>>> kirpic...@google.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Reviving this thread. I think SDF is a pretty big risk for
>>>>>>>>>>>> Spark runner streaming. Holden, is it correct that Spark appears 
>>>>>>>>>>>> to have no
>>>>>>>>>>>> way at all to produce 

Re: Plan for a Parquet new release and writing Parquet file with outputstream

2018-04-19 Thread Eugene Kirpichov
Very cool! JB, time to update your PR?

On Thu, Apr 19, 2018 at 9:17 AM Alexey Romanenko 
wrote:

> FYI: Apache Parquet 1.10.0 was released recently.
> It contains *org.apache.parquet.io.OutputFile *and updated
> *org.apache.parquet.hadoop.ParquetFileWriter*
>
> WBR,
> Alexey
>
>
> On 14 Feb 2018, at 20:10, Jean-Baptiste Onofré  wrote:
>
> Great !!
>
> In the meantime, I started a PoC directly around parquet-common to see if I
> can implement a BeamParquetReader and a BeamParquetWriter.
>
> I might also propose some PRs.
>
> I will continue tomorrow around that.
>
> Thanks again !
> Regards
> JB
>
> On 02/14/2018 08:04 PM, Ryan Blue wrote:
>
> Additions to the builders are easy enough that we can get that in. There's
> a PR out there that needs to be fixed:
> https://github.com/apache/parquet-mr/pull/446
>
> I've asked the author for just the builder changes. If we don't hear back,
> we can add another PR but I'd like to give the author some time to update.
>
> rb
>
> On Tue, Feb 13, 2018 at 9:20 PM, Jean-Baptiste Onofré 
> wrote:
>
> Hi  Ryan,
>
> Thanks for the update.
>
> Ideally for Beam, it would be great to have the AvroParquetReader and
> AvroParquetWriter using the InputFile/OutputFile interfaces. It would
> allow me
> to directly leverage Beam FileIO.
>
> Do you have a rough date for the Parquet release with that ?
>
> Thanks
> Regards
> JB
>
> On 02/14/2018 02:01 AM, Ryan Blue wrote:
>
> Jean-Baptiste,
>
> We're planning a release that will include the new OutputFile class, which I
> think you should be able to use. Is there anything you'd change to make this
> work more easily with Beam?
>
> rb
>
> On Tue, Feb 13, 2018 at 12:31 PM, Jean-Baptiste Onofré  > wrote:
>
>    Hi guys,
>
>    I'm working on the Apache Beam ParquetIO:
>
>    https://github.com/apache/beam/pull/1851
>
>    In Beam, thanks to FileIO, we support several filesystems (HDFS, S3, ...).
>
>    While I was able to implement the Read part using AvroParquetReader
>    leveraging Beam FileIO, I'm struggling with the writing part.
>
>    I have to create a ParquetSink implementing FileIO.Sink. Especially, I
>    have to implement the open(WritableByteChannel channel) method.
>
>    It's not possible to use AvroParquetWriter here as it takes a Path as
>    argument (and from the channel, I can only have an OutputStream).
>
>    As a workaround, I wanted to use org.apache.parquet.hadoop.ParquetFileWriter,
>    providing my own implementation of org.apache.parquet.io.OutputFile.
>
>    Unfortunately OutputFile (and the updated method in ParquetFileWriter)
>    exists on the Parquet master branch, but it was different in Parquet 1.9.0.
>
>    So, I have two questions:
>    - do you plan a Parquet 1.9.1 release including org.apache.parquet.io.OutputFile
>      and the updated org.apache.parquet.hadoop.ParquetFileWriter ?
>    - using Parquet 1.9.0, do you have any advice on how to use
>      AvroParquetWriter/ParquetFileWriter with an OutputStream (or any object
>      that I can get from a WritableByteChannel) ?
>
>    Thanks !
>
>    Regards
>    JB
>--
>Jean-Baptiste Onofré
>jbono...@apache.org 
>http://blog.nanthrax.net
>Talend - http://www.talend.com
>
>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>
>
>
>
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>
>
>
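
Following up on the workaround discussed in this thread, a rough sketch of
adapting the OutputStream obtained from FileIO.Sink's WritableByteChannel into
a Parquet OutputFile. It assumes Parquet 1.10.0's org.apache.parquet.io.OutputFile
and PositionOutputStream, with method names recalled from memory, so treat them
as approximate rather than definitive:

  import java.io.IOException;
  import java.io.OutputStream;
  import java.nio.channels.Channels;
  import java.nio.channels.WritableByteChannel;
  import org.apache.parquet.io.OutputFile;
  import org.apache.parquet.io.PositionOutputStream;

  // Adapts a Beam WritableByteChannel to Parquet's OutputFile abstraction so a
  // Parquet writer can be driven from FileIO.Sink#open(WritableByteChannel).
  class ChannelOutputFile implements OutputFile {
    private final WritableByteChannel channel;

    ChannelOutputFile(WritableByteChannel channel) {
      this.channel = channel;
    }

    @Override
    public PositionOutputStream create(long blockSizeHint) throws IOException {
      return new CountingPositionOutputStream(Channels.newOutputStream(channel));
    }

    @Override
    public PositionOutputStream createOrOverwrite(long blockSizeHint) throws IOException {
      return create(blockSizeHint);
    }

    @Override
    public boolean supportsBlockSize() {
      return false;
    }

    @Override
    public long defaultBlockSize() {
      return 0;
    }

    // Wraps a plain OutputStream and tracks the byte position Parquet asks for.
    private static class CountingPositionOutputStream extends PositionOutputStream {
      private final OutputStream out;
      private long position = 0;

      CountingPositionOutputStream(OutputStream out) {
        this.out = out;
      }

      @Override
      public long getPos() {
        return position;
      }

      @Override
      public void write(int b) throws IOException {
        out.write(b);
        position++;
      }

      @Override
      public void write(byte[] b, int off, int len) throws IOException {
        out.write(b, off, len);
        position += len;
      }

      @Override
      public void flush() throws IOException {
        out.flush();
      }

      @Override
      public void close() throws IOException {
        out.close();
      }
    }
  }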


Re: "Radically modular data ingestion APIs in Apache Beam" @ Strata - slides available

2018-04-14 Thread Eugene Kirpichov
Hi all,

The video is now available. I got it from my Strata account and I have
permission to use and share it freely, so I published it on my own YouTube
page (where there's nothing else...). Perhaps it makes sense to add to the
Beam YouTube channel, but AFAIK only a PMC member can do that.

https://www.youtube.com/watch?v=NIn9E5TVoCA


On Tue, Mar 13, 2018 at 3:33 AM James <xumingmi...@gmail.com> wrote:

> Very informative, thanks!
>
> On Fri, Mar 9, 2018 at 4:49 PM Etienne Chauchot <echauc...@apache.org>
> wrote:
>
>> Great !
>>
>> Thanks for sharing.
>>
>> Etienne
>>
>> Le jeudi 08 mars 2018 à 19:49 +, Eugene Kirpichov a écrit :
>>
>> Hey all,
>>
>> The slides for my yesterday's talk at Strata San Jose
>> https://conferences.oreilly.com/strata/strata-ca/public/schedule/detail/63696
>>  have
>> been posted on the talk page. They may be of interest both to users and IO
>> authors.
>>
>> Thanks.
>>
>>


Re: Splittable DoFN in Spark discussion

2018-04-12 Thread Eugene Kirpichov
(resurrecting thread as I'm back from leave)

I looked at this mode, and indeed as Reuven points out it seems that it
affects execution details, but doesn't offer any new APIs.
Holden - your suggestions of piggybacking an unbounded-per-element SDF on
top of an infinite stream would work if 1) there was just 1 element and 2)
the work was guaranteed to be infinite.

Unfortunately, both of these assumptions are insufficient. In particular:

- 1: The SDF is applied to a PCollection; the PCollection itself may be
unbounded; and the unbounded work done by the SDF happens for every
element. E.g. we might have a Kafka topic on which names of Kafka topics
arrive, and we may end up concurrently reading a continuously growing
number of topics.
- 2: The work per element is not necessarily infinite, it's just *not
guaranteed to be finite* - the SDF is allowed at any moment to say "Okay,
this restriction is done for real" by returning stop() from the
@ProcessElement method. Continuing the Kafka example, e.g., it could do
that if the topic/partition being watched is deleted. Having an infinite
stream as a driver of this process would require being able to send a
signal to the stream to stop itself.

Is it looking like there's any other way this can be done in Spark as-is,
or are we going to have to make changes to Spark to support this?
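
For readers less familiar with the SDF contract being described, a minimal,
hypothetical skeleton using the Java SDF API roughly as it existed at the time
(WatchTopicFn, topicWasDeleted, and readRecordAt are made-up placeholders, not
a real Kafka reader). It shows both outcomes: returning stop() when the work is
genuinely finished, and resume() when there may be more work later.

  import org.apache.beam.sdk.io.range.OffsetRange;
  import org.apache.beam.sdk.transforms.DoFn;
  import org.apache.beam.sdk.transforms.splittabledofn.OffsetRangeTracker;
  import org.joda.time.Duration;

  @DoFn.UnboundedPerElement
  class WatchTopicFn extends DoFn<String, String> {

    @GetInitialRestriction
    public OffsetRange getInitialRestriction(String topic) {
      return new OffsetRange(0, Long.MAX_VALUE);
    }

    @NewTracker
    public OffsetRangeTracker newTracker(OffsetRange restriction) {
      return new OffsetRangeTracker(restriction);
    }

    @ProcessElement
    public ProcessContinuation processElement(ProcessContext c, OffsetRangeTracker tracker) {
      String topic = c.element();
      long offset = tracker.currentRestriction().getFrom();
      // Poll a bounded batch per invocation; the runner re-invokes us via resume().
      for (int i = 0; i < 100; i++, offset++) {
        if (topicWasDeleted(topic)) {
          // "This restriction is done for real": no more output will ever be produced.
          return ProcessContinuation.stop();
        }
        if (!tracker.tryClaim(offset)) {
          // The runner checkpointed/split the restriction; the residual is processed elsewhere.
          return ProcessContinuation.stop();
        }
        c.output(readRecordAt(topic, offset));
      }
      return ProcessContinuation.resume().withResumeDelay(Duration.standardSeconds(5));
    }

    private boolean topicWasDeleted(String topic) {
      return false; // placeholder for a real metadata check
    }

    private String readRecordAt(String topic, long offset) {
      return topic + "@" + offset; // placeholder for a real record fetch
    }
  }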

On Sun, Mar 25, 2018 at 9:50 PM Holden Karau <hol...@pigscanfly.ca> wrote:

> I mean the new mode is very much in the Dataset not the DStream API
> (although you can use the Dataset API with the old modes too).
>
> On Sun, Mar 25, 2018 at 9:11 PM, Reuven Lax <re...@google.com> wrote:
>
>> But this new mode isn't a semantic change, right? It's moving away from
>> micro batches into something that looks a lot like what Flink does -
>> continuous processing with asynchronous snapshot boundaries.
>>
>> On Sun, Mar 25, 2018 at 9:01 PM Thomas Weise <t...@apache.org> wrote:
>>
>>> Hopefully the new "continuous processing mode" in Spark will enable SDF
>>> implementation (and real streaming)?
>>>
>>> Thanks,
>>> Thomas
>>>
>>>
>>> On Sat, Mar 24, 2018 at 3:22 PM, Holden Karau <hol...@pigscanfly.ca>
>>> wrote:
>>>
>>>>
>>>> On Sat, Mar 24, 2018 at 1:23 PM Eugene Kirpichov <kirpic...@google.com>
>>>> wrote:
>>>>
>>>>>
>>>>>
>>>>> On Fri, Mar 23, 2018, 11:17 PM Holden Karau <hol...@pigscanfly.ca>
>>>>> wrote:
>>>>>
>>>>>> On Fri, Mar 23, 2018 at 7:00 PM Eugene Kirpichov <
>>>>>> kirpic...@google.com> wrote:
>>>>>>
>>>>>>> On Fri, Mar 23, 2018 at 6:49 PM Holden Karau <hol...@pigscanfly.ca>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> On Fri, Mar 23, 2018 at 6:20 PM Eugene Kirpichov <
>>>>>>>> kirpic...@google.com> wrote:
>>>>>>>>
>>>>>>>>> On Fri, Mar 23, 2018 at 6:12 PM Holden Karau <hol...@pigscanfly.ca>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> On Fri, Mar 23, 2018 at 5:58 PM Eugene Kirpichov <
>>>>>>>>>> kirpic...@google.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Reviving this thread. I think SDF is a pretty big risk for Spark
>>>>>>>>>>> runner streaming. Holden, is it correct that Spark appears to have 
>>>>>>>>>>> no way
>>>>>>>>>>> at all to produce an infinite DStream from a finite RDD? Maybe we 
>>>>>>>>>>> can
>>>>>>>>>>> somehow dynamically create a new DStream for every initial 
>>>>>>>>>>> restriction,
>>>>>>>>>>> said DStream being obtained using a Receiver that under the hood 
>>>>>>>>>>> actually
>>>>>>>>>>> runs the SDF? (this is of course less efficient than a 
>>>>>>>>>>> timer-capable runner
>>>>>>>>>>> would do, and I have doubts about the fault tolerance)
>>>>>>>>>>>
>>>>>>>>>> So on the streaming side we could simply do it with a fixed
>>>>>>>>>> number of levels on DStreams. It’s not great but it would work.
>>>>>>>>>>
>>>>>>>>> Not sure I understand this. Let me 

Re: How can runners make use of sink parallelism?

2018-04-03 Thread Eugene Kirpichov
Hi Shen,

There is no "IO connector API" in Beam (not counting the deprecated Source
API), IO is merely an informal term for a PTransform that interacts in some
way with some external storage system. So whatever question you're asking
about IO connectors, you might as well be asking it about PTransforms in
general. See
https://conferences.oreilly.com/strata/strata-ca/public/schedule/detail/63696


To answer your question, then: is it the responsibility of a PTransform author
to make sure their code works correctly when different elements of various
PCollection's are processed by downstream ParDo's in parallel? Yes, of
course.

Things like "writing to a single file" are simply implemented by
non-parallel code - e.g. GBK the data onto a single key, and write a ParDo
that takes the single KV and writes the Iterable to the
file. This is, by definition, sequential (modulo windowing/triggering -
different windows and different firings for the same key can still be
processed in parallel).
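
A minimal sketch of the "GBK onto a single key" pattern described above (the
output path and element values are illustrative; a real transform would write
through Beam's filesystem APIs rather than a local file):

  import java.io.PrintWriter;
  import org.apache.beam.sdk.Pipeline;
  import org.apache.beam.sdk.transforms.Create;
  import org.apache.beam.sdk.transforms.DoFn;
  import org.apache.beam.sdk.transforms.GroupByKey;
  import org.apache.beam.sdk.transforms.ParDo;
  import org.apache.beam.sdk.transforms.WithKeys;
  import org.apache.beam.sdk.values.KV;

  public class SingleFileWriteSketch {
    public static void main(String[] args) {
      Pipeline p = Pipeline.create();
      p.apply(Create.of("a", "b", "c"))
          // Force everything onto one key so the write below is sequential.
          .apply(WithKeys.of("singleton"))
          .apply(GroupByKey.create())
          .apply(ParDo.of(new DoFn<KV<String, Iterable<String>>, Void>() {
            @ProcessElement
            public void processElement(ProcessContext c) throws Exception {
              // All values arrive here as a single Iterable and are written sequentially.
              try (PrintWriter out = new PrintWriter("/tmp/single-output.txt")) {
                for (String line : c.element().getValue()) {
                  out.println(line);
                }
              }
            }
          }));
      p.run().waitUntilFinish();
    }
  }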

On Tue, Apr 3, 2018 at 8:56 PM Shen Li  wrote:

> Hi Kenn,
>
> Thanks for the response.
>
> I haven't hit any specific issue yet. I think if the IO connector
> implementation does take parallelism into consideration, runners can
> parallelize primitive transforms in the connector (key-partitioned for GBK
> and stateful ParDo, and round robin for stateless ParDo). For example,
> TextIO first writes a temp file for every bundle, then uses a void key to
> prevent parallelism, and then finalizes the result. It should work properly
> in a distributed environment.
>
> But applications can provide any custom IO connectors, and the runner does
> not know whether a connector can be safely parallelized. Can I assume that
> it is the applications' responsibility to make sure their IO connector
> works correctly when running in parallel?
>
> Thanks,
> Shen
>
> On Tue, Apr 3, 2018 at 6:11 PM, Kenneth Knowles  wrote:
>
>> The runner should generally not need to be aware of any getNumShard() API
>> on a connector. The connector itself is probably a composite transform
>> (with a ParDo or two or three somewhere doing the actual writes) and should
>> be designed to expose available parallelism. Specifying the number of
>> shards actually usually limits the parallelism, versus letting the runner
>> use the maximum allowed parallelism.
>>
>> If the connector does a GBK to gather input elements into a single
>> iterable, then it is a single element and cannot be processed in parallel
>> (except through splittable DoFn, but in that case you may not need to do
>> the GBK in the first place). And converse to that, if the connector does
>> not do a GBK to gather input elements, then the runner is permitted to
>> bundle them any way it wants and process all of them as though in parallel
>> (except for stateful DoFn, in which case you probably don't need the GBK).
>>
>> Bundling is an important way that this works, too, since the
>> @FinishBundle method is really a "flush" method, with @ProcessElement
>> perhaps buffering up elements to be written to e.g. the same file shard. It
>> is not this simple in practice but that gives the idea of how even with
>> unrestricted elementwise parallelism you don't get one shard per element.
>>
>> These are all just ideas, and I'm not the connector expert. But I think
>> the TL;DR is that a runner shouldn't need to know this - have you hit
>> specific issues with a particular connector? That could make this a very
>> productive discussion.
>>
>> Kenn
>>
>> On Mon, Apr 2, 2018 at 1:41 PM Shen Li  wrote:
>>
>>> Hi,
>>>
>>> It seems that there is no Sink base class. Some IO connectors (e.g.,
>>> KafkaIO and TextIO) provide a getNumShard() API. But it is not generally
>>> available for all existing Beam IO connectors and potential custom ones. 
>>> Although
>>> some IO connectors are implemented using ParDo/GBK, it is unclear whether
>>> the runner can directly parallelize those transforms (e.g., what if it only
>>> writes to a single file). Is there a general way for runners to take
>>> advantage of sink parallelism?
>>>
>>> Thanks,
>>> Shen
>>>
>>>
>>>
>


Re: Source split consistency in distributed environment

2018-03-26 Thread Eugene Kirpichov
On Mon, Mar 26, 2018, 2:08 PM Shen Li <cs.she...@gmail.com> wrote:

> Hi Eugene,
>
> Thanks. Does it mean the application cannot dynamically change the
> parallel width of an UnboundedSource during runtime?
>
Correct, it's limited by how many parts it was split into initially. So it
makes sense to initially split into a fairly large number of parts,
assigning more than one at the same time to each worker, so if you need
more workers then you have more parts ready.


> > Correct. split() is applied to a single argument, so there's nothing to
> execute in parallel here. It executes sequentially, and produces a number
> of sources that can then be executed in parallel. It's pretty similar to
> executing a DoFn on a single element.
>
> I was thinking about the following scenario where the split() will be
> called in parallel. Suppose initially the translation phase invoked the
> UnboundedSource#split() API and got 3 sub-sources. It then starts the
> runtime where the source operator has 3 instances running in parallel, one
> for each sub-source. This part works fine, and there will only be one
> split() invocation during the translation. However, say after a while, the
> application would like to increase the source parallelism from 3 to 4. But,
> as it has already finished the translation, this change will be done
> dynamically during runtime. The runtime will add another source instance.
> Then, the four source instances will all call the split() API in parallel.
> If this API consistently returns 4 sub-sources, then each source operator
> instance can retrieve its own sub-source and proceed from there.
>

>
>
> Thanks,
> Shen
>
>
> On Mon, Mar 26, 2018 at 4:48 PM, Eugene Kirpichov <kirpic...@google.com>
> wrote:
>
>>
>>
>> On Mon, Mar 26, 2018 at 1:09 PM Shen Li <cs.she...@gmail.com> wrote:
>>
>>> Hi Lukasz,
>>>
>>> Thanks for your response.
>>>
>>> > Each call to split may return a different set of sub sources but they
>>> always represent the entire original source.
>>>
>>> Inconsistent sets of sub-sources prevent runners/engines from calling
>>> the split API in a distributed manner during runtime.
>>>
>> Correct. split() is applied to a single argument, so there's nothing to
>> execute in parallel here. It executes sequentially, and produces a number
>> of sources that can then be executed in parallel. It's pretty similar to
>> executing a DoFn on a single element.
>>
>>
>>> Besides, the splitAtFraction(double fraction) is only available in
>>> BoundedSources. How do you perform dynamic splitting for UnboundedSources?
>>>
>> There is no analogous API for unbounded sources.
>>
>>
>>>
>>> Another question: will Source transforms eventually become deprecated
>>> and be replaced by the SplittableParDo?
>>>
>> Yes; this is already the case in the Portability framework.
>>
>>
>>>
>>> Thanks,
>>> Shen
>>>
>>>
>>>
>>> On Mon, Mar 26, 2018 at 3:41 PM, Lukasz Cwik <lc...@google.com> wrote:
>>>
>>>> Contractually, the sources returned by splitting must represent the
>>>> original source. Each call to split may return a different set of sub
>>>> sources but they always represent the entire original source.
>>>>
>>>> Note that Dataflow does call split effectively during translation and
>>>> then later calls APIs on sources to perform dynamic splitting[1].
>>>>
>>>> Note, that this is being replaced with SplittableDoFn. Worthwhile to
>>>> look at this doc[2] and presentation[3].
>>>>
>>>> 1:
>>>> https://github.com/apache/beam/blob/28665490f6ad0cad091f4f936a8f113617fd3f27/sdks/java/core/src/main/java/org/apache/beam/sdk/io/BoundedSource.java#L387
>>>> 2: https://s.apache.org/splittable-do-fn
>>>> 3:
>>>> https://conferences.oreilly.com/strata/strata-ca/public/schedule/detail/63696
>>>>
>>>>
>>>>
>>>> On Mon, Mar 26, 2018 at 11:33 AM Shen Li <cs.she...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Does the split API in Bounded/UnboundedSource guarantee to return the
>>>>> same result if invoked in different parallel instances in a distributed
>>>>> environment?
>>>>>
>>>>> For example, assume the original source can split into 3 sub-sources.
>>>>> Say the runner creates 3 parallel source operator instances (perhaps

Re: Source split consistency in distributed environment

2018-03-26 Thread Eugene Kirpichov
On Mon, Mar 26, 2018 at 1:09 PM Shen Li  wrote:

> Hi Lukasz,
>
> Thanks for your response.
>
> > Each call to split may return a different set of sub sources but they
> always represent the entire original source.
>
> Inconsistent sets of sub-sources prevent runners/engines from calling the
> split API in a distributed manner during runtime.
>
Correct. split() is applied to a single argument, so there's nothing to
execute in parallel here. It executes sequentially, and produces a number
of sources that can then be executed in parallel. It's pretty similar to
executing a DoFn on a single element.


> Besides, the splitAtFraction(double fraction) is only available in
> BoundedSources. How do you perform dynamic splitting for UnboundedSources?
>
There is no analogous API for unbounded sources.


>
> Another question: will Source transforms eventually become deprecated and
> be replaced by the SplittableParDo?
>
Yes; this is already the case in the Portability framework.


>
> Thanks,
> Shen
>
>
>
> On Mon, Mar 26, 2018 at 3:41 PM, Lukasz Cwik  wrote:
>
>> Contractually, the sources returned by splitting must represent the
>> original source. Each call to split may return a different set of sub
>> sources but they always represent the entire original source.
>>
>> Note that Dataflow does call split effectively during translation and
>> then later calls APIs on sources to perform dynamic splitting[1].
>>
>> Note, that this is being replaced with SplittableDoFn. Worthwhile to look
>> at this doc[2] and presentation[3].
>>
>> 1:
>> https://github.com/apache/beam/blob/28665490f6ad0cad091f4f936a8f113617fd3f27/sdks/java/core/src/main/java/org/apache/beam/sdk/io/BoundedSource.java#L387
>> 2: https://s.apache.org/splittable-do-fn
>> 3:
>> https://conferences.oreilly.com/strata/strata-ca/public/schedule/detail/63696
>>
>>
>>
>> On Mon, Mar 26, 2018 at 11:33 AM Shen Li  wrote:
>>
>>> Hi,
>>>
>>> Does the split API in Bounded/UnboundedSource guarantee to return the
>>> same result if invoked in different parallel instances in a distributed
>>> environment?
>>>
>>> For example, assume the original source can split into 3 sub-sources.
>>> Say the runner creates 3 parallel source operator instances (perhaps
>>> running in different servers) and uses each instance to handle 1 of the 3
>>> sub-sources. In this case, if each operator instance invokes the split
>>> method in a distributed manner, will they get the same split result?
>>>
>>> My understanding is that the current API does not guarantee the 3
>>> operator instances will receive the same split result. It is possible that
>>> 1 of the 3 instances receives 4 sub-sources and the other two receive 3.
>>> Or, even if they all get 3 sub-sources, there could be gaps and overlaps in
>>> the data streams. If so, shall we add an API to indicate that whether a
>>> source can split at runtime?
>>>
>>> One way to avoid this problem is to split the source at
>>> translation time and directly pass sub-sources to operator instances. But
>>> this is not ideal. The server that runs the translation might not have access to
>>> the source (DB, KV, MQ, etc). Or the application may want to dynamically
>>> change the source parallel width at runtime. Hence, the runner/engine
>>> sometimes has to split the source during runtime in a distributed
>>> environment.
>>>
>>> Thanks,
>>> Shen
>>>
>>>
>


Re: Splittable DoFN in Spark discussion

2018-03-24 Thread Eugene Kirpichov
On Fri, Mar 23, 2018, 11:17 PM Holden Karau <hol...@pigscanfly.ca> wrote:

> On Fri, Mar 23, 2018 at 7:00 PM Eugene Kirpichov <kirpic...@google.com>
> wrote:
>
>> On Fri, Mar 23, 2018 at 6:49 PM Holden Karau <hol...@pigscanfly.ca>
>> wrote:
>>
>>> On Fri, Mar 23, 2018 at 6:20 PM Eugene Kirpichov <kirpic...@google.com>
>>> wrote:
>>>
>>>> On Fri, Mar 23, 2018 at 6:12 PM Holden Karau <hol...@pigscanfly.ca>
>>>> wrote:
>>>>
>>>>> On Fri, Mar 23, 2018 at 5:58 PM Eugene Kirpichov <kirpic...@google.com>
>>>>> wrote:
>>>>>
>>>>>> Reviving this thread. I think SDF is a pretty big risk for Spark
>>>>>> runner streaming. Holden, is it correct that Spark appears to have no way
>>>>>> at all to produce an infinite DStream from a finite RDD? Maybe we can
>>>>>> somehow dynamically create a new DStream for every initial restriction,
>>>>>> said DStream being obtained using a Receiver that under the hood actually
>>>>>> runs the SDF? (this is of course less efficient than a timer-capable 
>>>>>> runner
>>>>>> would do, and I have doubts about the fault tolerance)
>>>>>>
>>>>> So on the streaming side we could simply do it with a fixed number of
>>>>> levels on DStreams. It’s not great but it would work.
>>>>>
>>>> Not sure I understand this. Let me try to clarify what SDF demands of
>>>> the runner. Imagine the following case: a file contains a list of "master"
>>>> Kafka topics, on which there are published additional Kafka topics to read.
>>>>
>>>> PCollection masterTopics = TextIO.read().from(masterTopicsFile)
>>>> PCollection nestedTopics =
>>>> masterTopics.apply(ParDo(ReadFromKafkaFn))
>>>> PCollection records = nestedTopics.apply(ParDo(ReadFromKafkaFn))
>>>>
>>>> This exemplifies both use cases of a streaming SDF that emits infinite
>>>> output for every input:
>>>> - Applying it to a finite set of inputs (in this case to the result of
>>>> reading a text file)
>>>> - Applying it to an infinite set of inputs (i.e. having an unbounded
>>>> number of streams being read concurrently, each of the streams themselves
>>>> is unbounded too)
>>>>
>>>> Does the multi-level solution you have in mind work for this case? I
>>>> suppose the second case is harder, so we can focus on that.
>>>>
>>> So none of those are a SplittableDoFn, right?
>>>
>> Not sure what you mean? ReadFromKafkaFn in these examples is a splittable
>> DoFn and we're trying to figure out how to make Spark run it.
>>
>>
> Ah ok, sorry I saw that and for some reason parsed them as old style DoFns
> in my head.
>
> To effectively allow us to union back into the “same” DStream  we’d have
> to end up using Spark's queue streams (or their equivalent custom source
> because of some queue stream limitations), which invites some reliability
> challenges. This might be at the point where I should send a diagram/some
> sample code since it’s a bit convoluted.
>
> The more I think about the jumps required to make the “simple” union
> approach work, the more it seems just using the state mapping for streaming
> is probably more reasonable. Although the state tracking in Spark can be
> somewhat expensive so it would probably make sense to benchmark to see if
> it meets our needs.
>
So the problem is, I don't think this can be made to work using
mapWithState. It doesn't allow a mapping function that emits infinite
output for an input element, directly or not.

Dataflow and Flink, for example, had timer support even before SDFs, and a
timer can set another timer and thus end up doing an infinite amount of
work in a fault tolerant way - so SDF could be implemented on top of that.
But AFAIK spark doesn't have a similar feature, hence my concern.


> But these still are both DStream based rather than Dataset which we might
> want to support (depends on what direction folks take with the runners).
>
> If we wanted to do this in the dataset world looking at a custom
> sink/source would also be an option, (which is effectively what a custom
> queue stream like thing for dstreams requires), but the datasource APIs are
> a bit influx so if we ended up doing things at the edge of what’s allowed
> there’s a good chance we’d have to rewrite it a few times.
>
>
>>> Assuming that we have a given dstream though in Spark we can get the
