Query expressions for schema fields

Reuven Lax Sun, 06 Jan 2019 03:46:42 -0800

Some time ago, @Jean-Baptiste Onofré <j...@nanthrax.net> made the excellent
suggestion that we look into using JsonPath as a selector format for schema
fields. This provides a simple and natural way for users to select nested
schema fields, as well as wildcards. This would allow users to more simply
select nested fields using the Select transform, e.g.:


p.apply(Select.fields("event.userid", "event.location.*"));

It would also fit into NewDoFn (Java) like this:

@ProcessElement
public void process(@Field("userid") String userId,
                    @Field("action.location.*") Location location) {
}
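
As a rough illustration of the selector semantics described above, here is
a minimal sketch in plain Java. It is not Beam code and not the actual
implementation: FieldPathSketch, select, and sampleRow are hypothetical
names, rows are modeled as nested Maps rather than Beam Rows, and keying the
output by the full dotted path is an assumption made only for readability.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: resolves dotted selectors like "event.location.*" on nested maps. */
final class FieldPathSketch {

  /** Returns every selected value, keyed by its full dotted field path. */
  static Map<String, Object> select(Map<String, Object> row, String... selectors) {
    Map<String, Object> out = new LinkedHashMap<>();
    for (String selector : selectors) {
      collect(row, selector.split("\\."), 0, "", out);
    }
    return out;
  }

  private static void collect(
      Object node, String[] parts, int i, String prefix, Map<String, Object> out) {
    if (i == parts.length) {
      out.put(prefix, node); // Reached the end of the selector: emit this value.
      return;
    }
    if (!(node instanceof Map)) {
      throw new IllegalArgumentException("Not a nested field: " + prefix);
    }
    Map<?, ?> fields = (Map<?, ?>) node;
    if (parts[i].equals("*")) {
      // Wildcard: descend into every field at this level.
      for (Map.Entry<?, ?> e : fields.entrySet()) {
        collect(e.getValue(), parts, i + 1, join(prefix, String.valueOf(e.getKey())), out);
      }
    } else {
      if (!fields.containsKey(parts[i])) {
        throw new IllegalArgumentException("Unknown field: " + join(prefix, parts[i]));
      }
      collect(fields.get(parts[i]), parts, i + 1, join(prefix, parts[i]), out);
    }
  }

  private static String join(String prefix, String name) {
    return prefix.isEmpty() ? name : prefix + "." + name;
  }

  /** Illustrative data only: an event with a user id and a nested location. */
  static Map<String, Object> sampleRow() {
    Map<String, Object> location = new LinkedHashMap<>();
    location.put("lat", 37.4);
    location.put("lon", -122.1);
    Map<String, Object> event = new LinkedHashMap<>();
    event.put("userid", "u1");
    event.put("location", location);
    Map<String, Object> row = new LinkedHashMap<>();
    row.put("event", event);
    return row;
  }

  public static void main(String[] args) {
    // prints {event.userid=u1, event.location.lat=37.4, event.location.lon=-122.1}
    System.out.println(select(sampleRow(), "event.userid", "event.location.*"));
  }
}
```

Note that the selectors here carry no leading $, and an unknown field name
fails fast instead of silently selecting nothing; both are choices made for
this sketch.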

After some investigation, I believe that we're better off with something
very close to a subset of JsonPath, but not precisely JsonPath. JsonPath
has many features that are JavaScript specific (e.g. the ability to embed
JavaScript expressions). JsonPath also includes the ability to do complex
filtering and aggregation, which I don't think we want here, since Beam
already provides ways to do such filtering and aggregation. One example of
a change: JsonPath queries always begin with $ (representing the root
node), and I think we're better off not requiring that, so that these
queries look more like BeamSql queries.

I've created a small ANTLR grammar (which has the advantage that it's easy
to extend) for these expressions and have everything working in a branch.
However, there are a few more features of JsonPath that might be useful
here, and I wanted community feedback to see whether it's worth
implementing them.

The first is array/map slicing and selection. Currently if a schema
contains an array (or map) field, you can only select all elements of the
array or map. JsonPath, however, supports selecting and slicing the array.
For example, consider the following:

@DefaultSchema(JavaFieldSchema.class)
public class Event {
  public final String userId;
  public final List<Action> actions;
}

Currently you can apply Select.fields("actions.location"), and that will
return a schema containing a list of Locations, one for every action in the
original event. If we allowed slicing, you could instead write
Select.fields("actions[0:10].location"), which would do the same but only
for the first 10 elements of the array (as in JsonPath, the end index of a
slice is exclusive).
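
To make the proposed slicing behavior concrete, here is a small sketch in
plain Java. It assumes the JsonPath convention that a slice [start:end]
includes start and excludes end, so that [0:10] covers the first 10
elements. ArraySliceSketch, slice, sliceField, and sampleActions are
hypothetical names, actions are modeled as Maps rather than Beam Rows, and
negative indices and step values are deliberately left out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of "actions[start:end].field" over a list of map-like elements. */
final class ArraySliceSketch {

  /** Returns elements in [start, end), clamped to the bounds of the list. */
  static <T> List<T> slice(List<T> values, int start, int end) {
    int from = Math.max(0, Math.min(start, values.size()));
    int to = Math.max(from, Math.min(end, values.size()));
    return new ArrayList<>(values.subList(from, to));
  }

  /** Slices the list first, then selects one field from each remaining element. */
  static List<Object> sliceField(
      List<Map<String, Object>> elements, int start, int end, String field) {
    List<Object> out = new ArrayList<>();
    for (Map<String, Object> element : slice(elements, start, end)) {
      out.add(element.get(field));
    }
    return out;
  }

  /** Illustrative data only: n actions whose location is "loc0", "loc1", ... */
  static List<Map<String, Object>> sampleActions(int n) {
    List<Map<String, Object>> actions = new ArrayList<>();
    for (int i = 0; i < n; i++) {
      Map<String, Object> action = new HashMap<>();
      action.put("location", "loc" + i);
      actions.add(action);
    }
    return actions;
  }

  public static void main(String[] args) {
    // Twelve actions in, ten locations out: loc0 through loc9.
    System.out.println(sliceField(sampleActions(12), 0, 10, "location"));
  }
}
```

Clamping means an event with fewer than 10 actions simply yields all of its
locations rather than failing, which seems like the least surprising
behavior for a selector, though that is a design choice open to discussion.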

Is this useful in Beam? It would not be hard to implement, but I want to
see what folks think first.

The second feature is recursive field selection. The example often given in
JsonPath is a JSON document containing the inventory for a store. There are
lists of subobjects representing books, bicycles, tables, chairs, etc.
The JsonPath query "$..price" recursively finds every object that has a
field named price, and returns those prices; in this case it returns the
price of every item in the store.
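
For readers unfamiliar with recursive descent, the following plain-Java
sketch shows what a query like "$..price" does to a nested document. The
class RecursiveFieldSketch, its methods, and the sample inventory are all
hypothetical and purely illustrative; the document is modeled with Maps and
Lists rather than Beam Rows.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of recursive field selection over nested maps and lists. */
final class RecursiveFieldSketch {

  /** Collects the value of every field called name, at any depth, in document order. */
  static List<Object> findAll(Object node, String name) {
    List<Object> out = new ArrayList<>();
    walk(node, name, out);
    return out;
  }

  private static void walk(Object node, String name, List<Object> out) {
    if (node instanceof Map) {
      for (Map.Entry<?, ?> e : ((Map<?, ?>) node).entrySet()) {
        if (name.equals(e.getKey())) {
          out.add(e.getValue());
        }
        walk(e.getValue(), name, out); // Keep descending below matching fields too.
      }
    } else if (node instanceof List) {
      for (Object child : (List<?>) node) {
        walk(child, name, out);
      }
    }
    // Scalars have no children, so there is nothing to do for them.
  }

  private static Map<String, Object> item(String label, double price) {
    Map<String, Object> m = new LinkedHashMap<>();
    m.put("label", label);
    m.put("price", price);
    return m;
  }

  /** Illustrative data only: a store with a list of books and a single bicycle. */
  static Map<String, Object> sampleStore() {
    Map<String, Object> store = new LinkedHashMap<>();
    store.put("book", Arrays.asList(item("novel", 8.95), item("atlas", 12.99)));
    store.put("bicycle", item("red", 19.95));
    Map<String, Object> root = new LinkedHashMap<>();
    root.put("store", store);
    return root;
  }

  public static void main(String[] args) {
    // prints [8.95, 12.99, 19.95]
    System.out.println(findAll(sampleStore(), "price"));
  }
}
```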

I'm a bit less convinced that recursive field selection is useful in Beam.
The usual example for JSON involves a document that represents an entire
corpus, e.g. a store inventory. In Beam, the schemas are applied to
individual records, and I don't know how often there will be a use for this
sort of recursive selection. However, I could be wrong here, so if anyone
has a good use case for this sort of selection, please let me know.

Reuven
