I would also like to suggest that transforms that implement ReadAll via Read should also provide methods like:

  // Uses the specified values if unspecified in the input element from the PCollection<Read>.
  withDefaults(Read read);

  // Uses the specified values regardless of what the input element from the PCollection<Read> specifies.
  withOverrides(Read read);

and only add methods that are required at construction time (e.g. coders). This way the majority of documentation sits on the Read transform.

On Fri, Jun 26, 2020 at 9:58 AM Luke Cwik <[email protected]> wrote:

> Ismael, it is good to hear that using Read as the input didn't have a bunch of parameters that were being skipped/ignored. Also, for the polymorphism issue you have to rely on the user correctly telling you the type in such a way where it is a common ancestor of all the runtime types that will ever be used. This usually boils down to something like Serializable or DynamicMessage such that the coder that is chosen works for all the runtime types. Using multiple types is a valid use case and would allow for a simpler graph with fewer flattens merging the output from multiple sources.
>
> Boyuan, as you have mentioned we can have a coder for KafkaIO.Read which uses schemas even if some of the parameters can't be represented in a meaningful way beyond "bytes". This would be helpful for cross language as well since every parameter would become available if a language could support it (e.g. it could serialize a java function up front and keep it saved as raw bytes within said language). Even if we figure out a better way to do this in the future, we'll have to change the schema for the new way anyway. This would mean that the external version of the transform adopts Row to Read and we drop KafkaSourceDescriptor. The conversion from Row to Read could validate that the parameters make sense (e.g. the bytes are valid serialized functions). The addition of an endReadTime/endReadOffset would make sense for KafkaIO.Read as well and this would enable having a bounded version that could be used for backfills (this doesn't have to be done as part of any current ongoing PR). Essentially any parameter that could be added for a single instance of a Kafka element+restriction would also make sense to the KafkaIO.Read transform since it too is a single instance. There are parameters that would apply to the ReadAll that wouldn't apply to a read and these would be global parameters across all element+restriction pairs such as config overrides or default values.
>
> I am convinced that we should do as Ismael is suggesting and use KafkaIO.Read as the type.
>
> On Thu, Jun 25, 2020 at 6:00 PM Chamikara Jayalath <[email protected]> wrote:
>
>> Discussion regarding cross-language transforms is a slight tangent here. But I think, in general, it's great if we can use existing transforms (for example, IO connectors) as cross-language transforms without having to build more composites (irrespective of whether in ExternalTransformBuilders or user pipelines) just to make them cross-language compatible. A future cross-language compatible SchemaCoder might help (assuming that works for the Read transform) but I'm not sure we have a good idea when we'll get to that state.
>>
>> Thanks,
>> Cham
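To make the withDefaults/withOverrides idea above concrete, here is a rough sketch of how the defaults/overrides resolution could look on a made-up FooIO-style ReadAll. All names here are illustrative assumptions and the actual read step is elided; this is not an existing Beam API:

  import org.apache.beam.sdk.transforms.MapElements;
  import org.apache.beam.sdk.transforms.PTransform;
  import org.apache.beam.sdk.values.PCollection;
  import org.apache.beam.sdk.values.TypeDescriptor;

  // Hypothetical read spec: a table name plus an optional fetch size (null means "unspecified").
  class FooRead implements java.io.Serializable {
    final String table;
    final Integer fetchSize;
    FooRead(String table, Integer fetchSize) { this.table = table; this.fetchSize = fetchSize; }
  }

  class FooReadAll extends PTransform<PCollection<FooRead>, PCollection<FooRead>> {
    private FooRead defaults = new FooRead(null, null);
    private FooRead overrides = new FooRead(null, null);

    // Uses the specified values if unspecified in the input element.
    FooReadAll withDefaults(FooRead read) { this.defaults = read; return this; }

    // Uses the specified values regardless of what the input element specifies.
    FooReadAll withOverrides(FooRead read) { this.overrides = read; return this; }

    @Override
    public PCollection<FooRead> expand(PCollection<FooRead> input) {
      FooRead d = defaults;
      FooRead o = overrides;
      // Resolve each element's configuration; the resolved specs would then feed the actual read logic.
      return input.apply(MapElements.into(TypeDescriptor.of(FooRead.class)).via((FooRead r) ->
          new FooRead(
              o.table != null ? o.table : (r.table != null ? r.table : d.table),
              o.fetchSize != null ? o.fetchSize : (r.fetchSize != null ? r.fetchSize : d.fetchSize))));
    }
  }

The point is simply that per-element values sit between pipeline-level defaults and pipeline-level overrides, so the per-parameter documentation stays on Read itself.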
>> On Thu, Jun 25, 2020 at 3:13 PM Boyuan Zhang <[email protected]> wrote:
>>
>>> For unbounded SDF in Kafka, we also consider the upgrading/downgrading compatibility in the pipeline update scenario (for detailed discussion, please refer to https://lists.apache.org/thread.html/raf073b8741317244339eb5b2bce844c0f9e0d700c3e4de392fc648d6%40%3Cdev.beam.apache.org%3E). In order to obtain the compatibility, it requires that the input of the read SDF be schema-aware.
>>>
>>> Thus the major constraint of mapping KafkaSourceDescriptor to PCollection<Read> is that KafkaIO.Read also needs to be schema-aware, otherwise pipeline updates might fail unnecessarily. Looking into KafkaIO.Read, not all necessary fields are compatible with schema, for example, SerializedFunction.
>>>
>>> I'm kind of confused by why ReadAll<Read, OutputT> is a common pattern for SDF based IO. The Read can be a common pattern because the input is always a PBegin. But for an SDF based IO, the input can be anything. By using Read as input, we will still have the maintenance cost when SDF IO supports a new field but Read doesn't consume it. For example, we are discussing adding endOffset and endReadTime to KafkaSourceDescriptor, which is not used in KafkaIO.Read.
>>>
>>> On Thu, Jun 25, 2020 at 2:19 PM Ismaël Mejía <[email protected]> wrote:
>>>
>>>> We forgot to mention (5) External.Config used in cross-lang, see KafkaIO ExternalTransformBuilder. This approach is the predecessor of (4) and probably a really good candidate to be replaced by the Row based Configuration Boyuan is envisioning (so good to be aware of this).
>>>>
>>>> Thanks for the clear explanation Luke, you mention the real issue(s). All the approaches discussed so far in the end could be easily transformed to produce a PCollection<Read> and those Read elements could be read by the generic ReadAll transform. Notice that this can be internal in some IOs e.g. KafkaIO if they decide not to expose it. I am not saying that we should force every IO to support ReadAll in its public API but if we do it is probably a good idea to be consistent with naming the transform that expects an input PCollection<Read> in the same way. Also notice that using it will save us from the maintenance issues discussed in my previous email.
>>>>
>>>> Back to the main concern: the consequences of expansion based on Read. So far I have not seen consequences for the Splitting part which maps really nicely assuming the Partition info / Restriction is available as part of Read. So far there are no Serialization issues because Beam is already enforcing this. Notice that ReadAll expansion is almost ‘equivalent’ to a poor man's SDF at least for the Bounded case (see the code in my previous email). For the other points:
>>>>
>>>> > a) Do all properties set on a Read apply to the ReadAll? For example, the Kafka Read implementation allows you to set the key and value deserializers which are also used to dictate the output PCollection type. It also allows you to set how the watermark should be computed. Technically a user may want the watermark computation to be configurable per Read and they may also want an output type which is polymorphic (e.g. PCollection<Serializable>).
>>>> Most of the time they do, but for parametric types we cannot support different types in the outputs of the Read, or at least I did not find how to do so (is there a way to use multiple output Coders on Beam?); we saw this in CassandraIO and we were discussing explicitly adding these Coders or Serializer specific methods to the ReadAll transform. This is less nice because it will imply some repeated methods, but it is still a compromise to gain the other advantages. I suppose the watermark case you mention is similar because you may want the watermark to behave differently in each Read and we probably don’t support this, so it corresponds to the polymorphic category.
>>>>
>>>> > b) Read extends PTransform which brings its own object modelling concerns.
>>>>
>>>> > During the implementations of ReadAll(PCollection<Read>), was it discovered that some properties became runtime errors or were ignored if they were set? If no, then the code deduplication is likely worth it because we also get a lot of javadoc deduplication, but if yes is this an acceptable user experience?
>>>>
>>>> No, not so far. This is an interesting part, notice that the Read translation ends up delegating the read bits to the ReadFn part of ReadAll, so the ReadFn is the real read and must be aware of and use all the parameters.
>>>>
>>>>   @Override
>>>>   public PCollection<SolrDocument> expand(PBegin input) {
>>>>     return input.apply("Create", Create.of(this)).apply("ReadAll", readAll());
>>>>   }
>>>>
>>>> I might be missing something for the Unbounded SDF case which is the only case we have not explored so far. I think one easy way to see the limitations would be in the ongoing KafkaIO SDF based implementation to try to map KafkaSourceDescriptor to do the extra PCollection<Read> and the Read logic on the ReadAll with the SDF to see which constraints we hit; the polymorphic ones will be there for sure, maybe others will appear (not sure). However it would be interesting to see if we have a real gain in the maintenance points, but let’s not forget also that KafkaIO has a LOT of knobs so probably the generic implementation could be relatively complex.
>>>>
>>>> On Thu, Jun 25, 2020 at 6:30 PM Luke Cwik <[email protected]> wrote:
>>>> >
>>>> > I had mentioned that approach 1 and approach 2 work for cross language. The difference being that the cross language transform would take a well known definition and convert it to the Read transform. A normal user would have a pipeline that would look like:
>>>> > 1: PCollection<Read> -> PTransform(ReadAll) -> PCollection<Output>
>>>> > 2: PCollection<SourceDescriptor> -> PTransform(ReadAll) -> PCollection<Output>
>>>> >
>>>> > And in the cross language case this would look like:
>>>> > 1: PCollection<Row of SourceDescriptor> -> PTransform(Convert Row to Read) -> PCollection<Read> -> PTransform(ReadAll) -> PCollection<Output>
>>>> > 2: PCollection<Row of SourceDescriptor> -> PTransform(Convert Row to SourceDescriptor) -> PCollection<SourceDescriptor> -> PTransform(ReadAll) -> PCollection<Output>*
>>>> > * note that PTransform(Convert Row to SourceDescriptor) only exists since we haven't solved how to use schemas with language bound types in a cross language way. SchemaCoder isn't portable but RowCoder is, which is why the conversion step exists. We could have a solution for this at some point in time.
>>>> >
>>>> > My concern with using Read was around:
>>>> > a) Do all properties set on a Read apply to the ReadAll? For example, the Kafka Read implementation allows you to set the key and value deserializers which are also used to dictate the output PCollection type. It also allows you to set how the watermark should be computed. Technically a user may want the watermark computation to be configurable per Read and they may also want an output type which is polymorphic (e.g. PCollection<Serializable>).
>>>> > b) Read extends PTransform which brings its own object modelling concerns.
>>>> >
>>>> > During the implementations of ReadAll(PCollection<Read>), was it discovered that some properties became runtime errors or were ignored if they were set? If no, then the code deduplication is likely worth it because we also get a lot of javadoc deduplication, but if yes is this an acceptable user experience?
>>>> >
>>>> > On Thu, Jun 25, 2020 at 7:55 AM Alexey Romanenko <[email protected]> wrote:
>>>> >>
>>>> >> I believe that the initial goal of unifying ReadAll as a general "PTransform<PCollection<Read>, PCollection<OutputType>>" was to reduce the amount of code duplication and the error-prone approach related to this. It makes much sense since usually we have all needed configuration set in Read objects and, as Ismaël mentioned, ReadAll will consist mostly of only Split-Shuffle-Read stages. So this case usually can be unified by using PCollection<Read> as input.
>>>> >>
>>>> >> On the other hand, we have another need to use Java IOs as cross-language transforms (as Luke described) which seems only partly in common with the previous pattern of ReadAll usage.
>>>> >>
>>>> >> I’d be more in favour of having only one concept of read configuration for all needs, but it seems it’s not easy, so I’d be more in favour of Luke and Boyuan’s approach with schema. Though, maybe ReadAll is not a very suitable name in this case because it can bring some confusion related to the previous pattern of ReadAll uses.
>>>> >>
>>>> >> On 25 Jun 2020, at 05:00, Boyuan Zhang <[email protected]> wrote:
>>>> >>
>>>> >> Sorry for the typo. I mean I think we can go with (3) and (4): use the data type that is schema-aware as the input of ReadAll.
>>>> >>
>>>> >> On Wed, Jun 24, 2020 at 7:42 PM Boyuan Zhang <[email protected]> wrote:
>>>> >>>
>>>> >>> Thanks for the summary, Cham!
>>>> >>>
>>>> >>> I think we can go with (2) and (4): use the data type that is schema-aware as the input of ReadAll.
>>>> >>>
>>>> >>> Converting Read into ReadAll helps us to stick with SDF-like IO. But only having (3) is not enough to solve the problem of using ReadAll in the x-lang case.
>>>> >>>
>>>> >>> The key point of ReadAll is that the input type of ReadAll should be able to cross language boundaries and have compatibility for updating/downgrading. After investigating some possibilities (pure Java POJO with custom coder, protobuf, row/schema) in Kafka usage, we find that row/schema fits our needs most. Here comes (4). I believe that using Read as input of ReadAll makes sense in some cases, but I also think not all IOs have the same need. I would treat Read as a special type as long as the Read is schema-aware.
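To illustrate the row/schema approach Boyuan describes, a hedged sketch follows: a schema-aware AutoValue source descriptor gets a SchemaCoder from Beam's AutoValue schema inference, and Rows arriving from another SDK can be converted back with Convert.fromRows. The descriptor below is made up for illustration and is not the actual KafkaSourceDescriptor:

  import com.google.auto.value.AutoValue;
  import org.apache.beam.sdk.schemas.AutoValueSchema;
  import org.apache.beam.sdk.schemas.annotations.DefaultSchema;
  import org.apache.beam.sdk.schemas.transforms.Convert;
  import org.apache.beam.sdk.values.PCollection;
  import org.apache.beam.sdk.values.Row;
  import org.checkerframework.checker.nullness.qual.Nullable;

  // Schema inferred from the AutoValue getters, so the element coder is a SchemaCoder.
  @DefaultSchema(AutoValueSchema.class)
  @AutoValue
  abstract class MySourceDescriptor {
    abstract String getTopic();
    abstract @Nullable Long getStartOffset();

    static MySourceDescriptor of(String topic, @Nullable Long startOffset) {
      return new AutoValue_MySourceDescriptor(topic, startOffset);
    }
  }

  // On the Java side of a cross-language expansion, rows can be turned back into descriptors:
  // PCollection<Row> rows = ...;  // produced by another SDK
  // PCollection<MySourceDescriptor> descriptors = rows.apply(Convert.fromRows(MySourceDescriptor.class));

Whether KafkaIO.Read itself can be made schema-aware in the same way is exactly the open question above, since fields like the deserializer functions don't map cleanly onto schema types.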
>>>> >>> On Wed, Jun 24, 2020 at 6:34 PM Chamikara Jayalath <[email protected]> wrote:
>>>> >>>>
>>>> >>>> I see. So it seems like there are three options discussed so far when it comes to defining source descriptors for ReadAll type transforms:
>>>> >>>>
>>>> >>>> (1) Use the Read PTransform as the element type of the input PCollection
>>>> >>>> (2) Use a POJO that describes the source as the data element of the input PCollection
>>>> >>>> (3) Provide a converter as a function to the Read transform which essentially will convert it to a ReadAll (what Eugene mentioned)
>>>> >>>>
>>>> >>>> I feel like (3) is more suitable for a related set of source descriptions such as files.
>>>> >>>> (1) will allow most code-reuse but seems like it will make it hard to use the ReadAll transform as a cross-language transform and will break the separation of construction time and runtime constructs.
>>>> >>>> (2) could result in less code reuse if not careful but will make the transform easier to be used as a cross-language transform without additional modifications.
>>>> >>>>
>>>> >>>> Also, with SDF, we can create ReadAll-like transforms that are more efficient. So we might be able to just define all sources in that format and make Read transforms just an easy-to-use composite built on top of that (by adding a preceding Create transform).
>>>> >>>>
>>>> >>>> Thanks,
>>>> >>>> Cham
>>>> >>>>
>>>> >>>> On Wed, Jun 24, 2020 at 11:10 AM Luke Cwik <[email protected]> wrote:
>>>> >>>>>
>>>> >>>>> I believe we do require PTransforms to be serializable since anonymous DoFns typically capture the enclosing PTransform.
>>>> >>>>>
>>>> >>>>> On Wed, Jun 24, 2020 at 10:52 AM Chamikara Jayalath <[email protected]> wrote:
>>>> >>>>>>
>>>> >>>>>> Seems like Read in PCollection<Read> refers to a transform, at least here: https://github.com/apache/beam/blob/master/sdks/java/io/hbase/src/main/java/org/apache/beam/sdk/io/hbase/HBaseIO.java#L353
>>>> >>>>>>
>>>> >>>>>> I'm in favour of separating construction time transforms from execution time data objects that we store in PCollections as Luke mentioned. Also, we don't guarantee that PTransform is serializable so users have the additional complexity of providing a coder whenever a PTransform is used as a data object.
>>>> >>>>>> Also, agree with Boyuan that using simple Java objects that are convertible to Beam Rows allows us to make these transforms available to other SDKs through the cross-language transforms. Using transforms or complex sources as data objects will probably make this difficult.
>>>> >>>>>>
>>>> >>>>>> Thanks,
>>>> >>>>>> Cham
>>>> >>>>>>
>>>> >>>>>> On Wed, Jun 24, 2020 at 10:32 AM Boyuan Zhang <[email protected]> wrote:
>>>> >>>>>>>
>>>> >>>>>>> Hi Ismael,
>>>> >>>>>>>
>>>> >>>>>>> I think the ReadAll in the IO connector refers to the IO with SDF implementation regardless of the type of input, where Read refers to UnboundedSource. One major pushback of using KafkaIO.Read as source description is that not all configurations of KafkaIO.Read are meaningful to populate during execution time. Also when thinking about x-lang usage, making the source description work across language boundaries is also necessary. As Luke mentioned, it's quite easy to infer a Schema from an AutoValue object: KafkaSourceDescription.java. Then the coder of this schema-aware object will be a SchemaCoder. When crossing language boundaries, it's also easy to convert a Row into the source description: Convert.fromRows.
>>>> >>>>>>>
>>>> >>>>>>> On Wed, Jun 24, 2020 at 9:51 AM Luke Cwik <[email protected]> wrote:
>>>> >>>>>>>>
>>>> >>>>>>>> To provide additional context, the KafkaIO ReadAll transform takes a PCollection<KafkaSourceDescriptor>. This KafkaSourceDescriptor is a POJO that contains the configurable parameters for reading from Kafka. This is different from the pattern that Ismael listed because they take PCollection<Read> as input and the Read is the same as the Read PTransform class used for the non-read-all case.
>>>> >>>>>>>>
>>>> >>>>>>>> The KafkaSourceDescriptor does lead to duplication since parameters used to configure the transform have to be copied over to the source descriptor, but decouples how a transform is specified from the object that describes what needs to be done. I believe Ismael's point is that we wouldn't need such a decoupling.
>>>> >>>>>>>>
>>>> >>>>>>>> Another area that hasn't been discussed and I believe is a non-issue is that the Beam Java SDK has the most IO connectors and we would want to use the IO implementations within Beam Go and Beam Python. This brings in its own set of issues related to versioning and compatibility for the wire format and how one parameterizes such transforms. The wire format issue can be solved with either approach by making sure that the cross language expansion always takes the well known format (whatever it may be) and converts it into the Read/KafkaSourceDescriptor/... object that is then passed to the ReadAll transform. Boyuan has been looking to make the KafkaSourceDescriptor have a schema so it can be represented as a row and this can be done easily using the AutoValue integration (I don't believe there is anything preventing someone from writing a schema row -> Read -> row adapter or also using the AutoValue configuration if the transform is also an AutoValue).
>>>> >>>>>>>>
>>>> >>>>>>>> I would be more for the code duplication and separation of concerns provided by using a different object to represent the contents of the PCollection from the pipeline construction time PTransform.
>>>> >>>>>>>>
>>>> >>>>>>>> On Wed, Jun 24, 2020 at 9:09 AM Eugene Kirpichov <[email protected]> wrote:
>>>> >>>>>>>>>
>>>> >>>>>>>>> Hi Ismael,
>>>> >>>>>>>>>
>>>> >>>>>>>>> Thanks for taking this on. Have you considered an approach similar (or dual) to FileIO.write(), where we in a sense also have to configure a dynamic number of different IO transforms of the same type (file writes)?
>>>> >>>>>>>>>
>>>> >>>>>>>>> E.g. how in this example we configure many aspects of many file writes:
>>>> >>>>>>>>>
>>>> >>>>>>>>> transactions.apply(FileIO.<TransactionType, Transaction>writeDynamic()
>>>> >>>>>>>>>     .by(Transaction::getType)
>>>> >>>>>>>>>     .via(tx -> tx.getType().toFields(tx),  // Convert the data to be written to CSVSink
>>>> >>>>>>>>>          type -> new CSVSink(type.getFieldNames()))
>>>> >>>>>>>>>     .to(".../path/to/")
>>>> >>>>>>>>>     .withNaming(type -> defaultNaming(type + "-transactions", ".csv"));
>>>> >>>>>>>>>
>>>> >>>>>>>>> we could do something similar for many JdbcIO reads:
>>>> >>>>>>>>>
>>>> >>>>>>>>> PCollection<Bar> bars;  // user-specific type from which all the read parameters can be inferred
>>>> >>>>>>>>> PCollection<Moo> moos = bars.apply(JdbcIO.<Bar, Moo>readAll()
>>>> >>>>>>>>>     .fromQuery(bar -> ...compute query for this bar...)
>>>> >>>>>>>>>     .withMapper((bar, resultSet) -> new Moo(...))
>>>> >>>>>>>>>     .withBatchSize(bar -> ...compute batch size for this bar...)
>>>> >>>>>>>>>     ...etc);
>>>> >>>>>>>>>
>>>> >>>>>>>>> On Wed, Jun 24, 2020 at 6:53 AM Ismaël Mejía <[email protected]> wrote:
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> Hello,
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> (my excuses for the long email but this requires context)
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> As part of the move from Source based IOs to DoFn based ones, one pattern emerged due to the composable nature of DoFn. The idea is to have a different kind of composable reads where we take a PCollection of different sorts of intermediate specifications e.g. tables, queries, etc, for example:
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> JdbcIO:
>>>> >>>>>>>>>> ReadAll<ParameterT, OutputT> extends PTransform<PCollection<ParameterT>, PCollection<OutputT>>
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> RedisIO:
>>>> >>>>>>>>>> ReadAll extends PTransform<PCollection<String>, PCollection<KV<String, String>>>
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> HBaseIO:
>>>> >>>>>>>>>> ReadAll extends PTransform<PCollection<HBaseQuery>, PCollection<Result>>
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> These patterns enabled richer use cases like doing multiple queries in the same Pipeline, querying based on key patterns or querying from multiple tables at the same time, but came with some maintenance issues:
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> - We ended up needing to add to the ReadAll transforms the parameters for missing information, so we ended up with lots of duplicated with-methods and error-prone code copied from the Read transforms into the ReadAll transforms.
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> - When you require new parameters you have to expand the input parameters of the intermediary specification into something that resembles the full `Read` definition. For example, imagine you want to read from multiple tables or servers as part of the same pipeline but this was not in the intermediate specification; you end up adding those extra methods (duplicating more code) just to get close to being like the full Read spec.
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> - If new parameters are added to the Read method we end up adding them systematically to the ReadAll transform too so they are taken into account.
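A hedged illustration of the duplication described in the first bullet above, using a made-up FooIO rather than any real connector: every knob added to Read (here a fetchSize) historically had to be mirrored on the ReadAll that takes raw descriptors and kept in sync by hand.

  import org.apache.beam.sdk.transforms.Create;
  import org.apache.beam.sdk.transforms.DoFn;
  import org.apache.beam.sdk.transforms.PTransform;
  import org.apache.beam.sdk.transforms.ParDo;
  import org.apache.beam.sdk.values.PBegin;
  import org.apache.beam.sdk.values.PCollection;

  class FooIOOldStyle {
    // Full Read spec: knows the query and the fetchSize knob.
    static class Read extends PTransform<PBegin, PCollection<String>> {
      private String query = "";
      private int fetchSize = 100;
      Read withQuery(String query) { this.query = query; return this; }
      Read withFetchSize(int fetchSize) { this.fetchSize = fetchSize; return this; }
      @Override
      public PCollection<String> expand(PBegin input) {
        return input.apply(Create.of(query)).apply(ParDo.of(new FetchFn(fetchSize)));
      }
    }

    // Old-style ReadAll over raw query strings: the fetchSize knob has to be duplicated here.
    static class ReadAll extends PTransform<PCollection<String>, PCollection<String>> {
      private int fetchSize = 100;  // duplicated, must be kept in sync with Read by hand
      ReadAll withFetchSize(int fetchSize) { this.fetchSize = fetchSize; return this; }
      @Override
      public PCollection<String> expand(PCollection<String> queries) {
        return queries.apply(ParDo.of(new FetchFn(fetchSize)));
      }
    }

    static class FetchFn extends DoFn<String, String> {
      private final int fetchSize;
      FetchFn(int fetchSize) { this.fetchSize = fetchSize; }
      @ProcessElement
      public void process(@Element String query, OutputReceiver<String> out) {
        out.output(query + " (fetchSize=" + fetchSize + ")");  // stand-in for a real per-query read
      }
    }
  }

With PCollection<Read> as the ReadAll input, as in the approach described next, fetchSize lives only on Read and the duplication disappears.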
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> Due to these issues I recently did a change to test a new approach that is simpler, more complete and maintainable. The code became:
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> HBaseIO:
>>>> >>>>>>>>>> ReadAll extends PTransform<PCollection<Read>, PCollection<Result>>
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> With this approach users gain the benefits of improvements on parameters of the normal Read because they have the full Read parameters available. But of course there are some minor caveats:
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> 1. You need to push some information into normal Reads, for example partition boundary information or Restriction information (in the SDF case). Notice that this consistent approach of ReadAll produces a simple pattern that ends up being almost reusable between IOs (e.g. the non-SDF case):
>>>> >>>>>>>>>>
>>>> >>>>>>>>>>   public static class ReadAll extends PTransform<PCollection<Read>, PCollection<SolrDocument>> {
>>>> >>>>>>>>>>     @Override
>>>> >>>>>>>>>>     public PCollection<SolrDocument> expand(PCollection<Read> input) {
>>>> >>>>>>>>>>       return input
>>>> >>>>>>>>>>           .apply("Split", ParDo.of(new SplitFn()))
>>>> >>>>>>>>>>           .apply("Reshuffle", Reshuffle.viaRandomKey())
>>>> >>>>>>>>>>           .apply("Read", ParDo.of(new ReadFn()));
>>>> >>>>>>>>>>     }
>>>> >>>>>>>>>>   }
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> 2. If you are using generic types for the ReadAll results you must have the Coders used in its definition and require consistent types from the data sources; in practice this means we need to add extra withCoder method(s) on ReadAll but not the full specs.
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> At the moment HBaseIO and SolrIO already follow this ReadAll pattern. RedisIO and CassandraIO have already WIP PRs to do so. So I wanted to bring this subject to the mailing list to see your opinions, and if you see any sort of issues that we might be missing with this idea.
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> Also I would like to see if we have consensus to start consistently using the terminology of ReadAll transforms based on Read and the readAll() method for new IOs (at this point probably doing this in the only remaining inconsistent place, JdbcIO, might not be a good idea but apart from this we should be ok).
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> I mention this because the recent PR on KafkaIO based on SDF is doing something similar to the old pattern but being called ReadAll, and maybe it is worth being consistent for the benefit of users.
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> Regards,
>>>> >>>>>>>>>> Ismaël
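For caveat 2 above, a hedged sketch of what the extra withCoder method(s) on a generic ReadAll could look like. All names are illustrative; splitting is elided and the per-element read is reduced to a user-supplied function:

  import org.apache.beam.sdk.coders.Coder;
  import org.apache.beam.sdk.transforms.DoFn;
  import org.apache.beam.sdk.transforms.PTransform;
  import org.apache.beam.sdk.transforms.ParDo;
  import org.apache.beam.sdk.transforms.Reshuffle;
  import org.apache.beam.sdk.transforms.SerializableFunction;
  import org.apache.beam.sdk.values.PCollection;

  // Generic ReadAll: OutputT cannot be inferred from the Read element alone,
  // so the caller supplies the output Coder explicitly via withCoder.
  class GenericReadAll<ReadT, OutputT> extends PTransform<PCollection<ReadT>, PCollection<OutputT>> {
    private final SerializableFunction<ReadT, Iterable<OutputT>> readFn;
    private Coder<OutputT> outputCoder;

    GenericReadAll(SerializableFunction<ReadT, Iterable<OutputT>> readFn) {
      this.readFn = readFn;
    }

    GenericReadAll<ReadT, OutputT> withCoder(Coder<OutputT> coder) {
      this.outputCoder = coder;
      return this;
    }

    @Override
    public PCollection<OutputT> expand(PCollection<ReadT> input) {
      PCollection<OutputT> output = input
          .apply("Reshuffle", Reshuffle.viaRandomKey())
          .apply("Read", ParDo.of(new DoFn<ReadT, OutputT>() {
            @ProcessElement
            public void process(@Element ReadT read, OutputReceiver<OutputT> out) {
              for (OutputT record : readFn.apply(read)) {
                out.output(record);
              }
            }
          }));
      // Without an explicit coder, Beam has to infer one for OutputT, which generally
      // fails for generic element types; hence the withCoder compromise.
      return outputCoder != null ? output.setCoder(outputCoder) : output;
    }
  }

This is the kind of repeated method the thread mentions: it buys the otherwise fully generic PCollection<Read> expansion at the cost of a coder knob on ReadAll itself.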