I'll take a look. Honestly though, if we leave out features such as array slices, this is a dirt-simple path syntax that pretty much matches what SQL does. It's basically just field1.field2, or field1.*.
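To make the scope concrete, here is a minimal sketch of what evaluating such a path could look like. It models a row as nested Maps rather than a Beam Row, and the names FieldPathSketch and select are hypothetical; this is not the ANTLR-based implementation discussed in the thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: evaluates "field1.field2" and "field1.*" over nested Maps. */
final class FieldPathSketch {
  private FieldPathSketch() {}

  /** Returns every value reached by following the dotted path from the root. */
  static List<Object> select(Map<String, Object> root, String path) {
    List<Object> current = new ArrayList<>();
    current.add(root);
    for (String step : path.split("\\.")) {
      List<Object> next = new ArrayList<>();
      for (Object node : current) {
        if (!(node instanceof Map)) {
          continue; // a leaf value has no sub-fields to select
        }
        Map<?, ?> fields = (Map<?, ?>) node;
        if (step.equals("*")) {
          next.addAll(fields.values()); // wildcard: every field of this node
        } else if (fields.containsKey(step)) {
          next.add(fields.get(step)); // named field
        }
      }
      current = next;
    }
    return current;
  }
}
```

With a row whose "event" field holds "userid" and a nested "location", select(row, "event.userid") yields the single user id and select(row, "event.location.*") yields each field of the location. A path that names a missing field yields an empty list in this sketch; a real implementation would more likely reject it against the schema up front.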
JMESPath, like JsonPath, also supports various aggregations, which I think is beyond the scope of what we want here; all we need is a selector expression. AFAICT what I have is already a strict subset of JMESPath, though I'll take a closer look to make sure there are no semantic incompatibilities.

Reuven

On Mon, Jan 7, 2019 at 10:21 AM Jeff Klukas <jklu...@mozilla.com> wrote:

> There is also JMESPath (http://jmespath.org/) which is quite similar to JsonPath, but does have a spec and lacks the leading $ character. The AWS CLI uses JMESPath for defining queries.
>
> On Mon, Jan 7, 2019 at 1:05 PM Reuven Lax <re...@google.com> wrote:
>
>> On Mon, Jan 7, 2019 at 1:44 AM Robert Bradshaw <rober...@google.com> wrote:
>>
>>> On Sun, Jan 6, 2019 at 12:46 PM Reuven Lax <re...@google.com> wrote:
>>> >
>>> > Some time ago, @Jean-Baptiste Onofré made the excellent suggestion that we look into using JsonPath as a selector format for schema fields. This provides a simple and natural way for users to select nested schema fields, as well as wildcards. This would allow users to more simply select nested fields using the Select transform, e.g.:
>>> >
>>> > p.apply(Select.fields("event.userid", "event.location.*"));
>>> >
>>> > It would also fit into NewDoFn (Java) like this:
>>> >
>>> > @ProcessElement
>>> > public void process(@Field("userid") String userId,
>>> >                     @Field("action.location.*") Location location) {
>>> > }
>>> >
>>> > After some investigation, I believe that we're better off with something very close to a subset of JsonPath, but not precisely JsonPath.
>>>
>>> I am very wary of creating something that's very close to, but not quite, (a subset of) a well-established standard. Is there a disadvantage to not being a strict actual subset? If we go this route, we should at least ensure that any divergence is illegal JsonPath rather than having different semantic meaning.
>>
>> As far as I can tell, JsonPath isn't much of a "standard." There doesn't seem to be much of a spec other than the implementation.
>>
>> For the most part, I am speaking of a strict subset of JsonPath. The only incompatibility is that JsonPath expressions all start with a '$' (which represents the root node). So in the above expression you would write "$.action.location.*" instead. I think staying closer to BeamSql syntax makes more sense here, and I would like to dispense with the need to begin with a $ character. JsonPath also assumes that each object is also a JavaScript object (which makes no sense here), and some of the JsonPath features are based on that.
>>
>>> > JsonPath has many features that are JavaScript-specific (e.g. the ability to embed JavaScript expressions). JsonPath also includes the ability to do complex filtering and aggregation, which I don't think we want here; Beam already provides such filtering and aggregation, so it's not needed in a selector. One example of a change: JsonPath queries always begin with $ (representing the root node), and I think we're better off not requiring that so that these queries look more like BeamSql queries.
>>> >
>>> > I've created a small ANTLR grammar (which has the advantage that it's easy to extend) for these expressions and have everything working in a branch. However there are a few more features of JsonPath that might be useful here, and I wanted community feedback to see whether it's worth implementing them.
>>> >
>>> > The first is array/map slices and selectors. Currently if a schema contains an array (or map) field, you can only select all elements of the array or map. JsonPath however supports selecting and slicing the array.
>>> > For example, consider the following:
>>> >
>>> > @DefaultSchema(JavaFieldSchema.class)
>>> > public class Event {
>>> >   public final String userId;
>>> >   public final List<Action> actions;
>>> > }
>>> >
>>> > Currently you can apply Select.fields("actions.location"), and that will return a schema containing a list of Locations, one for every action in the original event. If we allowed slicing, you could instead write Select.fields("actions[0:9].location"), which would do the same but only for the first 10 elements of the array.
>>> >
>>> > Is this useful in Beam? It would not be hard to implement, but I want to see what folks think first.
>>> >
>>> > The second feature is recursive field selection. The example often given in JsonPath is a JSON document containing the inventory for a store. There are lists of subobjects representing books, bicycles, tables, chairs, etc. The JsonPath query "$..price" recursively finds every object that has a field named price, and returns those prices; in this case it returns the price of every element in the store.
>>> >
>>> > I'm a bit less convinced that recursive field selection is useful in Beam. The usual example for JSON involves a document that represents an entire corpus, e.g. a store inventory. In Beam, the schemas are applied to individual records, and I don't know how often there will be a use for this sort of recursive selection. However I could be wrong here, so if anyone has a good use case for this sort of selection, please let me know.
>>>
>>> Records often contain lists, e.g. the record could be an order, and it could be useful to select on the price of the items (just to throw it out there).
>>
>> BTW, that already works. The .. operator in JsonPath is a recursive field search, across any lists or records that are lower in the tree.
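
For reference, the recursive search described above (the ".." operator, as in "$..price") can be sketched roughly as follows. This assumes records are modelled as nested Maps and Lists; the names RecursiveFieldSketch and findAll are hypothetical and are not part of Beam or of any JsonPath library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a "..name" style recursive field search. */
final class RecursiveFieldSketch {
  private RecursiveFieldSketch() {}

  /** Collects the value of every field called name, at any depth below node. */
  static List<Object> findAll(Object node, String name) {
    List<Object> found = new ArrayList<>();
    walk(node, name, found);
    return found;
  }

  private static void walk(Object node, String name, List<Object> found) {
    if (node instanceof Map) {
      for (Map.Entry<?, ?> field : ((Map<?, ?>) node).entrySet()) {
        if (name.equals(field.getKey())) {
          found.add(field.getValue()); // this record has the field itself
        }
        walk(field.getValue(), name, found); // keep searching lower in the tree
      }
    } else if (node instanceof List) {
      for (Object element : (List<?>) node) {
        walk(element, name, found); // search every element of a list field
      }
    }
    // Scalars have nothing below them, so the search stops here.
  }
}
```

For an order record whose "items" field is a list of records that each carry a "price", findAll(order, "price") collects the price of every item, which is the order use case raised above.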