Re: Beam Schemas: current status

Maximilian Michels Thu, 30 Aug 2018 06:51:41 -0700

That's a cool feature. Are there any limitations on schema inference apart from the class being a POJO/Bean? Does it support nested POJOs, e.g. "wrapper.field"?


-Max


On 29.08.18 07:40, Reuven Lax wrote:

I wanted to send a quick note to the community about the current status of schema-aware PCollections in Beam. As some might remember, we had a good discussion last year about the design of these schemas, involving many folks from different parts of the community. I sent a summary earlier this year explaining how schemas have been integrated into the DoFn framework. Much has happened since then, and here are some of the highlights.
First, I want to emphasize that all the schema-aware classes are currently marked @Experimental. Nothing is set in stone yet, so if you have questions about any decisions made, please start a discussion!
      SQL
The first big milestone for schemas was porting all of BeamSQL to use the framework, which was done in pr/5956. This was a lot of work and exposed many bugs in the schema implementation, but it now provides great evidence that schemas work!
      Schema inference
Beam can automatically infer schemas from Java POJOs (objects with public fields) or JavaBean objects (objects with getter/setter methods). Often you can do this by simply annotating the class. For example:
@DefaultSchema(JavaFieldSchema.class)
public class UserEvent {
  public String userId;
  public LatLong location;
  public String countryCode;
  public long transactionCost;
  public double transactionDuration;
  public List<String> traceMessages;
}

@DefaultSchema(JavaFieldSchema.class)
public class LatLong {
  public double latitude;
  public double longitude;
}
Beam will automatically infer schemas for these classes! So if you have a PCollection<UserEvent>, it will automatically get the following schema:
UserEvent:
  userId: STRING
  location: ROW(LatLong)
  countryCode: STRING
  transactionCost: INT64
  transactionDuration: DOUBLE
  traceMessages: ARRAY[STRING]

LatLong:
  latitude: DOUBLE
  longitude: DOUBLE
Now, it’s not always possible to annotate the class like this (you may not own the class definition), so you can also explicitly register the class using Pipeline:getSchemaRegistry:registerPOJO, and the same for JavaBeans.
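To make the inference step concrete, here is a minimal sketch in plain Java reflection. It is not Beam's implementation: SchemaInferenceSketch, inferSchema and typeName are hypothetical names used only for illustration, and only the few types from the example above are handled. It also shows how a nested POJO field could map to a ROW type.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; none of these names are Beam APIs.
final class SchemaInferenceSketch {

  static class LatLong {
    public double latitude;
    public double longitude;
  }

  static class UserEvent {
    public String userId;
    public LatLong location;
    public long transactionCost;
    public List<String> traceMessages;
  }

  // Maps each public instance field to a schema type name. A TreeMap is used
  // because reflection does not guarantee the order of the returned fields.
  static Map<String, String> inferSchema(Class<?> clazz) {
    Map<String, String> schema = new TreeMap<>();
    for (Field field : clazz.getFields()) {
      if (!Modifier.isStatic(field.getModifiers())) {
        schema.put(field.getName(), typeName(field.getGenericType()));
      }
    }
    return schema;
  }

  // Translates a Java field type into the type names used in this thread.
  static String typeName(Type type) {
    if (type == String.class) {
      return "STRING";
    }
    if (type == long.class || type == Long.class) {
      return "INT64";
    }
    if (type == double.class || type == Double.class) {
      return "DOUBLE";
    }
    if (type instanceof ParameterizedType) {
      ParameterizedType parameterized = (ParameterizedType) type;
      if (parameterized.getRawType() == List.class) {
        return "ARRAY[" + typeName(parameterized.getActualTypeArguments()[0]) + "]";
      }
    }
    if (type instanceof Class) {
      // Assumption of this sketch: any other class is a nested POJO (a row).
      return "ROW(" + ((Class<?>) type).getSimpleName() + ")";
    }
    throw new IllegalArgumentException("Unsupported type: " + type);
  }

  public static void main(String[] args) {
    inferSchema(UserEvent.class)
        .forEach((name, type) -> System.out.println(name + ": " + type));
  }
}
```

Calling inferSchema(LatLong.class) in the same way yields the nested row's own fields, which is how a nested schema can be resolved one level at a time.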
      Coders
Beam has a built-in coder for any schema-aware PCollection, largely removing the need for users to care about coders. We generate low-level bytecode (using ByteBuddy) to implement the coder for each schema, so these coders are quite performant. This provides a better default coder for Java POJO objects as well. In the past, users were advised to use AvroCoder for POJOs, which many have found inefficient. Now there’s a more efficient solution.
      Utility Transforms
Schemas are already useful for implementers of extensions such as SQL, but the goal was to use them to make Beam itself easier to use. To this end, I’ve been implementing a library of transforms that allow for easy manipulation of schema PCollections. So far Filter and Select are merged, Group is about to go out for review (it needs some more javadoc and unit tests), and Join is being developed but doesn’t yet have a final interface.
Filter
Given a PCollection<LatLong>, I want to keep only those in an area of southern Manhattan. Well, this is easy!
// Keep only points whose latitude and longitude fall inside the bounding box.
PCollection<LatLong> manhattanEvents = allEvents.apply(Filter
    .whereFieldName("latitude", lat -> lat < 40.720 && lat > 40.699)
    .whereFieldName("longitude", lng -> lng < -73.969 && lng > -74.747));
Schemas along with lambdas allow us to write this transform declaratively. The Filter transform also allows you to register filter functions that operate on multiple fields at the same time.
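The idea behind field-name filtering can be sketched without Beam at all. The following is a rough stand-alone illustration, assuming rows are plain maps from field name to double; FieldFilterSketch and its methods are hypothetical names and not the Filter transform itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical illustration only; this is not Beam's Filter transform.
final class FieldFilterSketch {
  // One predicate per field name; a row must satisfy all of them.
  private final Map<String, Predicate<Double>> predicates = new LinkedHashMap<>();

  FieldFilterSketch whereFieldName(String field, Predicate<Double> predicate) {
    // Predicates registered for the same field are combined with AND.
    predicates.merge(field, predicate, (first, second) -> first.and(second));
    return this;
  }

  boolean matches(Map<String, Double> row) {
    for (Map.Entry<String, Predicate<Double>> entry : predicates.entrySet()) {
      Double value = row.get(entry.getKey());
      if (value == null || !entry.getValue().test(value)) {
        return false;
      }
    }
    return true;
  }

  List<Map<String, Double>> apply(List<Map<String, Double>> rows) {
    List<Map<String, Double>> kept = new ArrayList<>();
    for (Map<String, Double> row : rows) {
      if (matches(row)) {
        kept.add(row);
      }
    }
    return kept;
  }

  // Helper that builds a row with the two fields of LatLong.
  static Map<String, Double> point(double latitude, double longitude) {
    Map<String, Double> row = new HashMap<>();
    row.put("latitude", latitude);
    row.put("longitude", longitude);
    return row;
  }

  public static void main(String[] args) {
    FieldFilterSketch filter = new FieldFilterSketch()
        .whereFieldName("latitude", lat -> lat < 40.720 && lat > 40.699)
        .whereFieldName("longitude", lng -> lng < -73.969 && lng > -74.747);
    List<Map<String, Double>> rows =
        Arrays.asList(point(40.710, -74.000), point(40.800, -73.950));
    System.out.println(filter.apply(rows).size() + " of " + rows.size() + " rows kept");
  }
}
```

Because each predicate is looked up by field name, the filter never needs to know the concrete user type, which is the property that schemas make possible.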
Select
Let’s say that I don’t need all the fields in a row. For instance, I’m only interested in the userId and traceMessages, and don’t care about the location. In that case I can write the following:
PCollection<Row> selected = allEvents.apply(Select.fieldNames("userId", "traceMessages"));
BTW, Beam also keeps track of which fields are accessed by a transform. In the future we can automatically insert Selects in front of subgraphs to drop fields that are not referenced in that subgraph.
Group
Group is one of the more advanced transforms. In its most basic form, it provides a convenient way to group by key:
PCollection<KV<Row, Iterable<UserEvent>>> byUserAndCountry =
    allEvents.apply(Group.byFieldNames("userId", "countryCode"));


Notice how much more concise this is than using GroupByKey directly!
The Group transform really starts to shine, however, when you start specifying aggregations. You can aggregate any field (or fields) and build up an output schema based on these aggregations. For example:
// Group by user and country, then aggregate individual fields of UserEvent.
PCollection<KV<Row, Row>> aggregated = allEvents.apply(
    Group.byFieldNames("userId", "countryCode")
        .aggregateField("transactionCost", Sum.ofLongs(), "total_cost")
        .aggregateField("transactionCost", Top.<Long>largestFn(10), "top_purchases")
        .aggregateField("transactionDuration", ApproximateQuantilesCombineFn.create(21),
            "durationHistogram"));
This will individually aggregate the specified fields of the input items (by user and country), and generate an output schema for these aggregations. In this case, the output schema will be the following:
AggregatedSchema:
    total_cost: INT64
    top_purchases: ARRAY[INT64]
    durationHistogram: ARRAY[DOUBLE]
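As a rough picture of what grouping by two fields and summing a third amounts to, here is a stand-alone sketch in plain Java. It only covers the sum aggregation, and GroupSketch, Event and totalCostByUserAndCountry are hypothetical names rather than anything in the Group transform.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; this is not Beam's Group transform.
final class GroupSketch {

  static final class Event {
    final String userId;
    final String countryCode;
    final long transactionCost;

    Event(String userId, String countryCode, long transactionCost) {
      this.userId = userId;
      this.countryCode = countryCode;
      this.transactionCost = transactionCost;
    }
  }

  // Uses the (userId, countryCode) pair as a composite key and sums
  // transactionCost per key, like the "total_cost" aggregation above.
  static Map<List<String>, Long> totalCostByUserAndCountry(List<Event> events) {
    Map<List<String>, Long> totals = new LinkedHashMap<>();
    for (Event event : events) {
      List<String> key = Arrays.asList(event.userId, event.countryCode);
      totals.merge(key, event.transactionCost, Long::sum);
    }
    return totals;
  }

  public static void main(String[] args) {
    List<Event> events = Arrays.asList(
        new Event("u1", "US", 10L),
        new Event("u1", "US", 20L),
        new Event("u2", "DE", 5L));
    totalCostByUserAndCountry(events)
        .forEach((key, total) -> System.out.println(key + " -> " + total));
  }
}
```

The composite key plays the role of the key Row, and the map value plays the role of a one-field output Row.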
There are some more utility transforms I've written that are worth looking at, such as Convert (which can convert between user types that share a schema) and Unnest (which flattens nested schemas). There are also some others, such as Pivot, that we should consider writing.
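For readers unfamiliar with the idea of flattening a nested schema, here is a small stand-alone sketch. It assumes rows are nested maps and that flattened names are built by joining the path with a caller-supplied separator; UnnestSketch and that naming rule are assumptions of this example, not a description of how Unnest names its output fields.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; this is not Beam's Unnest transform.
final class UnnestSketch {

  // Returns a single-level row whose field names encode the nesting path.
  static Map<String, Object> unnest(Map<String, ?> row, String separator) {
    Map<String, Object> flat = new LinkedHashMap<>();
    flatten("", row, separator, flat);
    return flat;
  }

  private static void flatten(
      String prefix, Map<String, ?> row, String separator, Map<String, Object> out) {
    for (Map.Entry<String, ?> entry : row.entrySet()) {
      String name = prefix.isEmpty() ? entry.getKey() : prefix + separator + entry.getKey();
      Object value = entry.getValue();
      if (value instanceof Map) {
        // Nested row: recurse, carrying the path so far as the prefix.
        @SuppressWarnings("unchecked")
        Map<String, ?> nested = (Map<String, ?>) value;
        flatten(name, nested, separator, out);
      } else {
        out.put(name, value);
      }
    }
  }

  public static void main(String[] args) {
    Map<String, Object> location = new LinkedHashMap<>();
    location.put("latitude", 40.71);
    location.put("longitude", -74.0);
    Map<String, Object> event = new LinkedHashMap<>();
    event.put("userId", "u1");
    event.put("location", location);
    System.out.println(unnest(event, "_").keySet());
  }
}
```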
There is still a lot to do. All the todo items are reflected in JIRA; however, here are some examples of current gaps:
  * Support for read-only POJOs (those with final fields) and JavaBeans
    (objects without setters).

  * Automatic schema inference from more Java types: protocol buffers,
    Avro, AutoValue, etc.

  * Integration with sources (BigQueryIO, JdbcIO, AvroIO, etc.)

  * Support for JsonPath expressions so users can better express nested
    fields. E.g. support expressions of the form
    Select.fields("field1.field2", "field3.*", "field4[0].field5");

  * Schemas still need to be defined in our portability layer so they
    can be used cross-language.
If anyone is interested in helping close these gaps, you'll be helping make Beam a better, more usable system!
Reuven
