Re: [Go SDK] User Defined Coders

Andrew Pilloud Mon, 07 Jan 2019 16:27:50 -0800

+1 on this. I think we are somewhat lacking on a written design for Schemas
in Java. This would be really useful in driving adoption and expanding to
other languages.


Andrew

On Mon, Jan 7, 2019 at 3:43 PM Robert Burke <rob...@frantil.com> wrote:

> Might I see the design doc (not code) for how they're supposed to look and
> work in Java first? I'd rather not write a document based on a speculative
> understanding of Schemas based on the litany of assumptions I'm making
> about them.
>
>
> On Mon, Jan 7, 2019, 2:35 PM Reuven Lax <re...@google.com> wrote:
>
>> I suggest that we write out a design of what schemas in Go would look
>> like and how they would interact with coders. We'll then be in a much better
>> position to decide what the right short-term path forward is. Even if we
>> decide it makes more sense to build up the coder support first, I think
>> this will guide us; e.g. we can build up the coder support in a way that
>> can be extended to full schemas later.
>>
>> Writing up an overview design shouldn't take too much time and I think is
>> definitely worth it.
>>
>> Reuven
>>
>> On Mon, Jan 7, 2019 at 2:12 PM Robert Burke <rob...@frantil.com> wrote:
>>
>>> Kenn has pointed out to me that Coders are not likely to vanish in
>>> the next while, in particular over the FnAPI, so having a coder registry
>>> does remain useful, as described by an early adopter in another thread.
>>>
>>> On Fri, Jan 4, 2019, 10:51 AM Robert Burke <rob...@frantil.com> wrote:
>>>
>>>> I think you're right Kenn.
>>>>
>>>> Reuven alluded to the difficulty in inference of what to use between
>>>> AtomicType and the rest, in particular Struct<Schema>.
>>>>
>>>> Go has the additional concern of pointer vs. non-pointer types, which
>>>> neither Python nor Java has. It has implications for pipeline efficiency
>>>> that need addressing, in particular being able to use them in a useful
>>>> fashion in the Go SDK.
>>>>
>>>> I agree that long term, having schemas as a default codec would be
>>>> hugely beneficial for readability and composability, and would allow more
>>>> processing to be on the Runner Harness side of a worker. (I'll save the
>>>> rest of my thoughts on Schemas in Go for the other thread, and say no more
>>>> of it here.)
>>>>
>>>> *Regarding my proposal for User Defined Coders:*
>>>>
>>>> To avoid users accidentally preventing themselves from using Schemas in
>>>> the future, I need to remove the ability to override the default coder
>>>> *(4)*. Then instead of JSON coding by default *(5)*, the SDK should be doing
>>>> Schema coding. The SDK is already doing the recursive type analysis on
>>>> types at pipeline construction time, so it's not a huge stretch to support
>>>> Schemas using that information in the future, once Runner & FnAPI support
>>>> begins to exist.
>>>>
>>>> *(1)* doesn't seem to need changing, as this is the existing
>>>> AtomicType definition Kenn pointed out.
>>>>
>>>> *(2)* is the specific AtomicType override.
>>>>
>>>> *(3)* is the broader Go-specific override for Go's unique interface
>>>> semantics. This covers most of the cases *(4)* would have covered anyway,
>>>> but in a targeted way.
>>>>
>>>> This should still allow Go users to better control their pipeline, and
>>>> associated performance implications (which is my goal in this change),
>>>> while not making a choice that is incompatible with powerful Beam features
>>>> for the common case in the future.
>>>>
>>>> Does that sound right?
>>>>
>>>> On Fri, 4 Jan 2019 at 10:05 Kenneth Knowles <k...@apache.org> wrote:
>>>>
>>>>> On Thu, Jan 3, 2019 at 4:33 PM Reuven Lax <re...@google.com> wrote:
>>>>>
>>>>>> If a user wants custom encoding for a primitive type, they can create
>>>>>> a byte-array field and wrap that field with a Coder
>>>>>>
>>>>>
>>>>> This is the crux of the issue, right?
>>>>>
>>>>> Roughly, today, we've got:
>>>>>
>>>>>         Schema ::= [ (fieldname, Type) ]
>>>>>
>>>>>         Type ::= AtomicType | Array<Type> | Map<Type, Type> | Struct<Schema>
>>>>>
>>>>>         AtomicType ::= bytes | int{16, 32, 64} | datetime | string | ...
>>>>>
>>>>> To fully replace custom encodings as they exist, you need:
>>>>>
>>>>>         AtomicType ::= bytes<CustomCoder> | ...
>>>>>
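
The grammar above can be written down directly as Go types. This is only a
minimal sketch for illustration: Kind, Type, Field, Schema and the Coder
string used to model bytes<CustomCoder> are hypothetical names and are not
part of any Beam SDK.

```go
package main

import "fmt"

// Kind enumerates the type constructors from the grammar.
// Every name in this sketch is hypothetical, not Beam SDK API.
type Kind int

const (
	Atomic Kind = iota
	Array
	Map
	Struct
)

// Type mirrors: Type ::= AtomicType | Array<Type> | Map<Type, Type> | Struct<Schema>
type Type struct {
	Kind   Kind
	Name   string  // atomic type name such as "bytes" or "int64" (Kind == Atomic)
	Coder  string  // optional custom coder name, modelling bytes<CustomCoder>
	Key    *Type   // key type (Kind == Map)
	Elem   *Type   // element type for Array, value type for Map
	Fields *Schema // nested schema (Kind == Struct)
}

// Field is one (fieldname, Type) pair.
type Field struct {
	Name string
	Type Type
}

// Schema mirrors: Schema ::= [ (fieldname, Type) ]
type Schema struct {
	Fields []Field
}

// String renders a type in the notation used by the grammar.
func (t Type) String() string {
	switch t.Kind {
	case Atomic:
		if t.Coder != "" {
			return t.Name + "<" + t.Coder + ">"
		}
		return t.Name
	case Array:
		return "Array<" + t.Elem.String() + ">"
	case Map:
		return "Map<" + t.Key.String() + ", " + t.Elem.String() + ">"
	case Struct:
		return "Struct<" + t.Fields.String() + ">"
	}
	return "?"
}

// String renders a schema as a list of (fieldname, Type) pairs.
func (s *Schema) String() string {
	out := "["
	for i, f := range s.Fields {
		if i > 0 {
			out += ", "
		}
		out += "(" + f.Name + ", " + f.Type.String() + ")"
	}
	return out + "]"
}

func main() {
	str := Type{Kind: Atomic, Name: "string"}
	custom := Type{Kind: Atomic, Name: "bytes", Coder: "CustomCoder"}
	s := &Schema{Fields: []Field{
		{Name: "user", Type: str},
		{Name: "tags", Type: Type{Kind: Array, Elem: &str}},
		{Name: "payload", Type: custom},
	}}
	fmt.Println(s)
}
```

In this model a custom encoding is just an atomic bytes type carrying a coder
name, so the schema stays the outer structure and the coder is confined to a
single field.
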
>>>>> At this point, an SDK need not surface the concept of "Coder" to a
>>>>> user at all outside the bytes field concept, and the wire encoding and
>>>>> efficiency should be identical, or nearly so, to what we do with coders today.
>>>>> PCollections in such an SDK have schemas, not coders, so we have
>>>>> successfully turned it completely inside-out relative to how the Java SDK
>>>>> does it. Is that what you have in mind?
>>>>>
>>>>> I really like this, but I agree with Robert that this is a major
>>>>> change that takes a bunch of work and a lot more collaborative thinking in
>>>>> design docs if we hope to get it right/stable.
>>>>>
>>>>> Kenn
>>>>>
>>>>>
>>>>>> (this is why I said that today's Coders are simply special cases);
>>>>>> this should be very rare though, as users rarely should care how Beam
>>>>>> encodes a long or a double.
>>>>>>
>>>>>>>
>>>>>>> Offhand, Schemas seem to be an alternative to pipeline construction,
>>>>>>> rather than coders for value serialization, allowing manual field
>>>>>>> extraction code to be omitted. They do not appear to be a fundamental
>>>>>>> approach to achieve it. For example, the grouping operation still needs
>>>>>>> to encode the whole of the object as a value.
>>>>>>>
>>>>>>
>>>>>> Schemas are properties of the data - essentially a Schema is the data
>>>>>> type of a PCollection. In Java, Schemas are also understood by ParDo, so
>>>>>> you can write a ParDo like this:
>>>>>>
>>>>>> @ProcessElement
>>>>>> public void process(@Field("user") String userId,
>>>>>>                     @Field("country") String countryCode) {
>>>>>> }
>>>>>>
>>>>>> These extra functionalities are part of the graph, but they are
>>>>>> enabled by schemas.
>>>>>>
>>>>>>>
>>>>>>> As mentioned, I'm hoping to have a solution for existing coders by
>>>>>>> January's end, so waiting for your documentation doesn't work on that
>>>>>>> timeline.
>>>>>>>
>>>>>>
>>>>>> I don't think we need to wait for all the documentation to be
>>>>>> written.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> That said, they aren't incompatible ideas as demonstrated by the
>>>>>>> Java implementation. The Go SDK remains in an experimental state. We can
>>>>>>> change things should the need arise in the next few months. Further,
>>>>>>> whenever Generics in Go
>>>>>>> <https://go.googlesource.com/proposal/+/master/design/go2draft-generics-overview.md>
>>>>>>> crop up, the existing user surface and execution stack will need to be
>>>>>>> re-written to take advantage of them anyway. That provides an
>>>>>>> opportunity to invert Coder vs Schema dependence while getting a nice
>>>>>>> performance boost, and cleaner code (and deleting much of my code
>>>>>>> generator).
>>>>>>>
>>>>>>> ----
>>>>>>>
>>>>>>> Were I to implement schemas to get the same syntactic benefits as the
>>>>>>> Java API, I'd be leveraging the field annotations Go has. This satisfies
>>>>>>> the protocol buffer issue as well, since generated Go protos have name &
>>>>>>> json annotations. Schemas could be extracted that way. These are also
>>>>>>> available to anything using static analysis for more direct generation
>>>>>>> of accessors. The reflective approach would also work, which is
>>>>>>> excellent for development purposes.
>>>>>>>
>>>>>>> The rote code that the schemas were replacing would be able to be
>>>>>>> cobbled together into efficient DoFns and CombineFns for serialization.
>>>>>>> At present, it seems like it could be implemented as a side package that
>>>>>>> uses beam, rather than changing portions of the core beam Go packages.
>>>>>>> The real trick would be to do so without "apply" since that's not how
>>>>>>> the Go SDK is shaped.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, 3 Jan 2019 at 15:34 Gleb Kanterov <g...@spotify.com> wrote:
>>>>>>>
>>>>>>>> Reuven, it sounds great. I see there is a similar thing to Row
>>>>>>>> coders happening in Apache Arrow <https://arrow.apache.org>, and
>>>>>>>> there is a similarity between Apache Arrow Flight
>>>>>>>> <https://www.slideshare.net/wesm/apache-arrow-at-dataengconf-barcelona-2018/23>
>>>>>>>> and the data exchange service in portability. How do you see these two
>>>>>>>> things relating to each other in the long term?
>>>>>>>>
>>>>>>>> On Fri, Jan 4, 2019 at 12:13 AM Reuven Lax <re...@google.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> The biggest advantage is actually readability and usability. A
>>>>>>>>> secondary advantage is that it means that Go will be able to interact
>>>>>>>>> seamlessly with BeamSQL, which would be a big win for Go.
>>>>>>>>>
>>>>>>>>> A schema is basically a way of saying that a record has a specific
>>>>>>>>> set of (possibly nested, possibly repeated) fields. So for instance,
>>>>>>>>> let's say that the user's type is a struct with fields named user,
>>>>>>>>> country, purchaseCost. This allows us to provide transforms that
>>>>>>>>> operate on field names. Some examples (using the Java API):
>>>>>>>>>
>>>>>>>>> // Select out only the user field.
>>>>>>>>> PCollection users = events.apply(Select.fields("user"));
>>>>>>>>>
>>>>>>>>> // Join two PCollections by user.
>>>>>>>>> PCollection joinedEvents =
>>>>>>>>>     queries.apply(Join.innerJoin(clicks).byFields("user"));
>>>>>>>>>
>>>>>>>>> // For each country, calculate the total purchase cost as well as
>>>>>>>>> // the top 10 purchases. A new schema is created containing fields
>>>>>>>>> // total_cost and top_purchases, and rows are created with the
>>>>>>>>> // aggregation results.
>>>>>>>>> PCollection purchaseStatistics = events.apply(
>>>>>>>>>     Group.byFieldNames("country")
>>>>>>>>>         .aggregateField("purchaseCost", Sum.ofLongs(), "total_cost")
>>>>>>>>>         .aggregateField("purchaseCost", Top.largestLongs(10), "top_purchases"));
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> This is far more readable than what we have today, and what
>>>>>>>>> unlocks this is that Beam actually knows the structure of the record
>>>>>>>>> instead of assuming records are uncrackable blobs.
>>>>>>>>>
>>>>>>>>> Note that a coder is basically a special case of a schema that has
>>>>>>>>> a single field.
>>>>>>>>>
>>>>>>>>> In BeamJava we have a SchemaRegistry which knows how to turn user
>>>>>>>>> types into schemas. We use reflection to analyze many user types (e.g.
>>>>>>>>> simple POJO structs, JavaBean classes, Avro records, protocol buffers,
>>>>>>>>> etc.) to determine the schema; however, this is done only when the
>>>>>>>>> graph is initially generated. We do use code generation (in Java we do
>>>>>>>>> bytecode generation) to make this somewhat more efficient. I'm willing
>>>>>>>>> to bet that the code generator you've written for structs could be very
>>>>>>>>> easily modified for schemas instead, so it would not be wasted work if
>>>>>>>>> we went with schemas.
>>>>>>>>>
>>>>>>>>> One of the things I'm working on now is documenting Beam schemas.
>>>>>>>>> They are already very powerful and useful, but since there is still
>>>>>>>>> nothing in our documentation about them, they are not yet widely used.
>>>>>>>>> I expect to finish draft documentation by the end of January.
>>>>>>>>>
>>>>>>>>> Reuven
>>>>>>>>>
>>>>>>>>> On Thu, Jan 3, 2019 at 11:32 PM Robert Burke <r...@google.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> That's an interesting idea. I must confess I don't rightly know
>>>>>>>>>> the difference between a schema and coder, but here's what I've got
>>>>>>>>>> with a bit of searching through memory and the mailing list. Please
>>>>>>>>>> let me know if I'm off track.
>>>>>>>>>>
>>>>>>>>>> As near as I can tell, a schema, as far as Beam takes it
>>>>>>>>>> <https://github.com/apache/beam/blob/f66eb5fe23b2500b396e6f711cdf4aeef6b31ab8/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/Schema.java>
>>>>>>>>>> is a mechanism to define what data is extracted from a given row of
>>>>>>>>>> data. So in principle, there's an opportunity to be more efficient
>>>>>>>>>> with data with many columns that aren't being used, and only extract
>>>>>>>>>> the data that's meaningful to the pipeline.
>>>>>>>>>> The trick then is how to apply the schema to a given
>>>>>>>>>> serialization format, which is something I'm missing in my mental
>>>>>>>>>> model (and then how to do it efficiently in Go).
>>>>>>>>>>
>>>>>>>>>> I do know that the Go client package for BigQuery
>>>>>>>>>> <https://godoc.org/cloud.google.com/go/bigquery#hdr-Schemas>
>>>>>>>>>> does something like that, using field tags. Similarly, the
>>>>>>>>>> "encoding/json"
>>>>>>>>>> <https://golang.org/doc/articles/json_and_go.html> package in
>>>>>>>>>> the Go Standard Library permits annotating fields and it will read
>>>>>>>>>> out and deserialize the JSON fields and that's it.
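
As a small illustration of the "encoding/json" behaviour described above, the
sketch below decodes a record into a struct that declares only two tagged
fields. The Event type, the decodeEvent helper and the sample record are
hypothetical; json.Unmarshal itself is the standard library call and, by
default, ignores JSON keys that have no matching struct field.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Event declares only the fields the pipeline cares about. Keys present in
// the input but absent here are ignored by json.Unmarshal by default.
type Event struct {
	User    string `json:"user"`
	Country string `json:"country"`
}

// decodeEvent extracts the tagged fields from one JSON encoded record.
func decodeEvent(data []byte) (Event, error) {
	var e Event
	err := json.Unmarshal(data, &e)
	return e, err
}

func main() {
	raw := []byte(`{"user":"alice","country":"CA","purchaseCost":42,"sessionId":"x1"}`)
	e, err := decodeEvent(raw)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", e)
}
```
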
>>>>>>>>>>
>>>>>>>>>> A concern I have is that Go (at present) would require
>>>>>>>>>> pre-compile time code generation for schemas to be efficient, and
>>>>>>>>>> they would still mostly boil down to turning []bytes into real
>>>>>>>>>> structs. Go reflection doesn't keep up.
>>>>>>>>>> Go has no mechanism I'm aware of to Just In Time compile more
>>>>>>>>>> efficient processing of values.
>>>>>>>>>> It's also not 100% clear how Schemas would play with protocol
>>>>>>>>>> buffers or similar.
>>>>>>>>>> BigQuery has a mechanism of generating a JSON schema from a proto
>>>>>>>>>> file
>>>>>>>>>> <https://github.com/GoogleCloudPlatform/protoc-gen-bq-schema>,
>>>>>>>>>> but that's only the specification half, not the using half.
>>>>>>>>>>
>>>>>>>>>> As it stands, the code generator I've been building these last
>>>>>>>>>> months could (in principle) statically analyze a user's struct, and
>>>>>>>>>> then generate an efficient dedicated coder for it. It just has
>>>>>>>>>> nowhere to put them such that the Go SDK would use it.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, Jan 3, 2019 at 1:39 PM Reuven Lax <re...@google.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> I'll make a different suggestion. There's been some chatter that
>>>>>>>>>>> schemas are a better tool than coders, and that in Beam 3.0 we
>>>>>>>>>>> should make schemas the basic semantics instead of coders. Schemas
>>>>>>>>>>> provide everything a coder provides, but also allow for far more
>>>>>>>>>>> readable code. We can't make such a change in Beam Java 2.X for
>>>>>>>>>>> compatibility reasons, but maybe in Go we're better off starting
>>>>>>>>>>> with schemas instead of coders?
>>>>>>>>>>>
>>>>>>>>>>> Reuven
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Jan 3, 2019 at 8:45 PM Robert Burke <rob...@frantil.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> One area that the Go SDK currently lacks is the ability for
>>>>>>>>>>>> users to specify their own coders for types.
>>>>>>>>>>>>
>>>>>>>>>>>> I've written a proposal document,
>>>>>>>>>>>> <https://docs.google.com/document/d/1kQwx4Ah6PzG8z2ZMuNsNEXkGsLXm6gADOZaIO7reUOg/edit#>
>>>>>>>>>>>> and while I'm confident about the core, there are certainly some
>>>>>>>>>>>> edge cases that require discussion before getting on with the
>>>>>>>>>>>> implementation.
>>>>>>>>>>>>
>>>>>>>>>>>> At present, the SDK only permits primitive value types (all
>>>>>>>>>>>> numeric types but complex, strings, and []bytes), which are coded
>>>>>>>>>>>> with beam coders, and structs whose exported fields are of those
>>>>>>>>>>>> types, which are then encoded as JSON. Protocol buffer support is
>>>>>>>>>>>> hacked in to avoid the type analyzer, and represents the current
>>>>>>>>>>>> workaround for this issue.
>>>>>>>>>>>>
>>>>>>>>>>>> The high level proposal is to catch up with Python and Java,
>>>>>>>>>>>> and have a coder registry. In addition, arrays and maps should be
>>>>>>>>>>>> permitted as well.
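
To make the idea of a coder registry concrete, here is a minimal sketch of
one possible shape. It is not the design from the proposal document: Coder,
Registry, NewRegistry, Register, Lookup, UserID and userIDCoder are all
hypothetical names, and only the standard reflect and encoding/binary
packages are used.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"reflect"
)

// Coder is a hypothetical pair of encode/decode functions for one Go type.
type Coder struct {
	Encode func(v interface{}) ([]byte, error)
	Decode func(b []byte) (interface{}, error)
}

// Registry maps Go types to user supplied coders (hypothetical, not SDK API).
type Registry struct {
	coders map[reflect.Type]Coder
}

// NewRegistry returns an empty registry.
func NewRegistry() *Registry {
	return &Registry{coders: map[reflect.Type]Coder{}}
}

// Register associates a coder with the dynamic type of sample.
func (r *Registry) Register(sample interface{}, c Coder) {
	r.coders[reflect.TypeOf(sample)] = c
}

// Lookup returns the coder registered for the dynamic type of v, if any.
func (r *Registry) Lookup(v interface{}) (Coder, bool) {
	c, ok := r.coders[reflect.TypeOf(v)]
	return c, ok
}

// UserID is a hypothetical user type with a compact fixed-width encoding.
type UserID uint32

// userIDCoder encodes a UserID as 4 big-endian bytes.
var userIDCoder = Coder{
	Encode: func(v interface{}) ([]byte, error) {
		id, ok := v.(UserID)
		if !ok {
			return nil, fmt.Errorf("want UserID, got %T", v)
		}
		b := make([]byte, 4)
		binary.BigEndian.PutUint32(b, uint32(id))
		return b, nil
	},
	Decode: func(b []byte) (interface{}, error) {
		if len(b) != 4 {
			return nil, fmt.Errorf("want 4 bytes, got %d", len(b))
		}
		return UserID(binary.BigEndian.Uint32(b)), nil
	},
}

func main() {
	reg := NewRegistry()
	reg.Register(UserID(0), userIDCoder)

	c, ok := reg.Lookup(UserID(7))
	if !ok {
		panic("no coder registered for UserID")
	}
	b, err := c.Encode(UserID(7))
	if err != nil {
		panic(err)
	}
	v, err := c.Decode(b)
	if err != nil {
		panic(err)
	}
	fmt.Println(b, v)
}
```

Because the map is keyed by reflect.Type, UserID and *UserID are distinct
entries in this sketch, so a real design would need to decide how pointer and
non-pointer forms of the same type are treated.
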
>>>>>>>>>>>>
>>>>>>>>>>>> If you have alternatives, or other suggestions and opinions,
>>>>>>>>>>>> I'd love to hear them! Otherwise my intent is to get a PR ready by
>>>>>>>>>>>> the end of January.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks!
>>>>>>>>>>>> Robert Burke
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> http://go/where-is-rebo
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Cheers,
>>>>>>>> Gleb
>>>>>>>>
>>>>>>>
