Some time ago, @Jean-Baptiste Onofré <j...@nanthrax.net> made the excellent suggestion that we look into using JsonPath as a selector format for schema fields. JsonPath gives users a simple and natural way to select nested schema fields, and it supports wildcards. With it, nested fields could be selected more simply using the Select transform, e.g.:
    p.apply(Select.fields("event.userid", "event.location.*"));

It would also fit into NewDoFn (Java) like this:

    @ProcessElement
    public void process(
        @Field("userid") String userId,
        @Field("action.location.*") Location location) {
    }

After some investigation, I believe that we're better off with something very close to a subset of JsonPath, but not precisely JsonPath. JsonPath has many features that are JavaScript-specific (e.g. the ability to embed JavaScript expressions). It also includes the ability to do complex filtering and aggregation, which I don't think we want here, since Beam already provides ways to do both.

One example of a change: JsonPath queries always begin with $ (representing the root node), and I think we're better off not requiring that, so that these queries look more like BeamSql queries.

I've created a small ANTLR grammar (which has the advantage that it's easy to extend) for these expressions and have everything working in a branch. However, there are a few more features of JsonPath that might be useful here, and I wanted community feedback to see whether it's worth implementing them.

The first is array/map slicing and element selection. Currently, if a schema contains an array (or map) field, you can only select all elements of the array or map. JsonPath, however, supports selecting individual elements and slicing the array. For example, consider the following:

    @DefaultSchema(JavaFieldSchema.class)
    public class Event {
      public final String userId;
      public final List<Action> actions;
    }

Currently you can apply Select.fields("actions.location"), and that will return a schema containing a list of Locations, one for every action in the original event. If we allowed slicing, you could instead write Select.fields("actions[0:9].location"), which would do the same but only for the first 10 elements of the array. Is this useful in Beam? It would not be hard to implement, but I want to see what folks think first.
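To make the slicing question concrete, here is a minimal sketch in plain Java. It is not the code in the branch and not a Beam API: the class SliceSelectorSketch, its methods, and the regular expression are all hypothetical. It assumes the end index is inclusive, matching the [0:9] example that yields the first 10 elements. JsonPath implementations generally treat the end of a slice as exclusive, so that is one of the details that would need to be settled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of slice handling; not part of Beam. */
public class SliceSelectorSketch {
  // One path component with an optional slice, e.g. "actions" or "actions[0:9]".
  private static final Pattern COMPONENT =
      Pattern.compile("(\\w+)(?:\\[(\\d+):(\\d+)\\])?");

  /** Returns the field name of a component such as "actions[0:9]". */
  public static String fieldName(String component) {
    return match(component).group(1);
  }

  /**
   * Applies the slice in the component (if any) to the given list.
   * Assumption: the end index is inclusive, so "[0:9]" keeps 10 elements.
   * Indices past the end of the list are clamped rather than rejected.
   */
  public static <T> List<T> applySlice(String component, List<T> values) {
    Matcher m = match(component);
    if (m.group(2) == null) {
      return new ArrayList<>(values); // no slice: keep every element
    }
    int start = Math.min(Integer.parseInt(m.group(2)), values.size());
    int endExclusive =
        Math.min(Integer.parseInt(m.group(3)), values.size() - 1) + 1;
    if (start >= endExclusive) {
      return new ArrayList<>(); // empty or inverted range
    }
    return new ArrayList<>(values.subList(start, endExclusive));
  }

  private static Matcher match(String component) {
    Matcher m = COMPONENT.matcher(component);
    if (!m.matches()) {
      throw new IllegalArgumentException("Unsupported component: " + component);
    }
    return m;
  }

  public static void main(String[] args) {
    List<String> locations = new ArrayList<>();
    for (int i = 0; i < 12; i++) {
      locations.add("loc" + i);
    }
    System.out.println(applySlice("actions[0:9]", locations).size()); // prints 10
    System.out.println(applySlice("actions", locations).size());      // prints 12
  }
}
```

Clamping out-of-range indices instead of throwing is a design choice in this sketch; it means a slice over an event with fewer actions simply returns what is there.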
The second feature is recursive field selection. The example often given in JsonPath is a JSON document containing the inventory for a store. There are lists of subobjects representing books, bicycles, tables, chairs, etc. The JsonPath query "$..price" recursively finds every object that has a field named price, and returns those prices; in this case it returns the price of every item in the store.

I'm a bit less convinced that recursive field selection is useful in Beam. The usual JSON example involves a document that represents an entire corpus, e.g. a store inventory. In Beam, schemas are applied to individual records, and I don't know how often there will be a use for this sort of recursive selection. However, I could be wrong here, so if anyone has a good use case for this sort of selection, please let me know.

Reuven