Re: Query expressions for schema fields

Robert Burke Mon, 07 Jan 2019 14:14:25 -0800

In the eventual future where the Go SDK supports schemas, it should be
possible to use struct field tags to specify paths for extraction from
schema data, much as Java uses parameter annotations.


e.g.

type MyKey struct {
    K string `jsonpath:"userid"`
}
type MyValue struct {
    K   string     `jsonpath:"userid"`
    Loc []Location `jsonpath:"action.location.*"`
}

func MyDoFn(k MyKey, v MyValue) (...) {...}

One could likely access any number of schema fields this way, and it would
be statically analyzable, so fast extraction would be possible at runtime
instead of going through the default reflection paths.

This would be agnostic to whichever path syntax is chosen as the Beam
standard.
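
For concreteness, here is a rough sketch of how the simple selector form
discussed further down this thread (field1.field2 or field1.*) might be
evaluated. It is an illustration under assumptions of my own, not the
ANTLR-based implementation mentioned below: records are modelled as nested
map[string]any and []any values, a step applied to a list is applied to
each element, and "*" returns a record's fields sorted by name. The
selectPath and step helpers are hypothetical names.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// selectPath walks v along a dotted path and returns every value reached.
func selectPath(v any, path string) ([]any, error) {
	if path == "" {
		return nil, fmt.Errorf("empty path")
	}
	cur := []any{v}
	for _, seg := range strings.Split(path, ".") {
		if seg == "" {
			return nil, fmt.Errorf("empty segment in %q", path)
		}
		var next []any
		for _, c := range cur {
			out, err := step(c, seg)
			if err != nil {
				return nil, err
			}
			next = append(next, out...)
		}
		cur = next
	}
	return cur, nil
}

// step applies one path segment to one value.
func step(v any, seg string) ([]any, error) {
	switch t := v.(type) {
	case []any:
		// Assumed semantics: a step on a list fans out over its elements.
		var out []any
		for _, e := range t {
			r, err := step(e, seg)
			if err != nil {
				return nil, err
			}
			out = append(out, r...)
		}
		return out, nil
	case map[string]any:
		if seg == "*" {
			// Sort keys so the wildcard result is deterministic.
			keys := make([]string, 0, len(t))
			for k := range t {
				keys = append(keys, k)
			}
			sort.Strings(keys)
			out := make([]any, 0, len(keys))
			for _, k := range keys {
				out = append(out, t[k])
			}
			return out, nil
		}
		f, ok := t[seg]
		if !ok {
			return nil, fmt.Errorf("no field %q", seg)
		}
		return []any{f}, nil
	default:
		return nil, fmt.Errorf("cannot apply %q to %T", seg, v)
	}
}

func main() {
	event := map[string]any{
		"userid": "u1",
		"actions": []any{
			map[string]any{"location": map[string]any{"lat": 1.0, "lng": 2.0}},
			map[string]any{"location": map[string]any{"lat": 3.0, "lng": 4.0}},
		},
	}
	locs, err := selectPath(event, "actions.location")
	if err != nil {
		panic(err)
	}
	fmt.Println(len(locs)) // one location per action

	vals, err := selectPath(event, "actions.location.*")
	if err != nil {
		panic(err)
	}
	fmt.Println(vals) // every field of every location
}
```

A tag-driven extractor would only need the parsed segments, so swapping in
a different path syntax should only affect the parsing step.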

On Mon, Jan 7, 2019, 11:59 AM Reuven Lax <re...@google.com> wrote:

> I'll take a look.
>
> Honestly though, if we leave out features such as array slices this is a
> dirt-simple path syntax, that pretty much matches what SQL does.  It's
> basically just field1.field2, or field1.*.
>
> JMESPath along with JsonPath also supports various aggregations, which I
> think is beyond the scope of what we want here; all that's needed here is a
> selector expression. AFAICT what I have is already a strict subset of
> JMESPath, though I'll take a closer look to make sure there are no semantic
> incompatibilities.
>
> Reuven
>
> On Mon, Jan 7, 2019 at 10:21 AM Jeff Klukas <jklu...@mozilla.com> wrote:
>
>> There is also JMESPath (http://jmespath.org/) which is quite similar to
>> JsonPath, but does have a spec and lacks the leading $ character. The AWS
>> CLI uses JMESPath for defining queries.
>>
>>
>>
>> On Mon, Jan 7, 2019 at 1:05 PM Reuven Lax <re...@google.com> wrote:
>>
>>>
>>>
>>> On Mon, Jan 7, 2019 at 1:44 AM Robert Bradshaw <rober...@google.com>
>>> wrote:
>>>
>>>> On Sun, Jan 6, 2019 at 12:46 PM Reuven Lax <re...@google.com> wrote:
>>>> >
>>>> > Some time ago, @Jean-Baptiste Onofré made the excellent suggestion
>>>> that we look into using JsonPath as a selector format for schema fields.
>>>> This provides a simple and natural way for users to select nested schema
>>>> fields, as well as wildcards. This would allow users to more simply select
>>>> nested fields using the Select transform, e.g.:
>>>> >
>>>> > p.apply(Select.fields("event.userid", "event.location.*"));
>>>> >
>>>> > It would also fit into NewDoFn (Java) like this:
>>>> >
>>>> > @ProcessElement
>>>> > public void process(@Field("userid") String userId,
>>>> >                     @Field("action.location.*") Location location) {
>>>> > }
>>>> >
>>>> > After some investigation, I believe that we're better off with
>>>> something very close to a subset of JsonPath, but not precisely JsonPath.
>>>>
>>>> I am very wary of creating something that's very close to, but not
>>>> quite, a (subset of a) well-established standard. Is there a
>>>> disadvantage to not being a strict actual subset? If we go this route,
>>>> we should at least ensure that any divergence is illegal JsonPath
>>>> rather than having different semantic meaning.
>>>>
>>>
>>> As far as I can tell, JsonPath isn't much of a "standard." There doesn't
>>> seem to be much of a spec other than implementation.
>>>
>>> For the most part, I am speaking of a strict subset of JsonPath. The
>>> only incompatibility is that JsonPath expressions all start with a '$'
>>> (which represents the root node). So in the above expression you would
>>> write "$.action.location.*" instead. I think staying closer to BeamSql
>>> syntax makes more sense here, and I would like to dispense with the need to
>>> begin with a $ character. JsonPath also assumes that each object is also a
>>> JavaScript object (which makes no sense here), and some of the JsonPath
>>> features are based on that.
>>>
>>>
>>>> > JsonPath has many features that are JavaScript-specific (e.g. the
>>>> ability to embed JavaScript expressions). JsonPath also includes the
>>>> ability to do complex filtering and aggregation, which I don't think we
>>>> want here, since Beam already provides such filtering and aggregation.
>>>> One example of a change: JsonPath queries always begin with $
>>>> (representing the root node), and I think we're better off not requiring
>>>> that so that these queries look more like BeamSql queries.
>>>> >
>>>> > I've created a small ANTLR grammar (which has the advantage that it's
>>>> easy to extend) for these expressions and have everything working in a
>>>> branch. However there are a few more features of JsonPath that might be
>>>> useful here, and I wanted community feedback to see whether it's worth
>>>> implementing them.
>>>> >
>>>> > The first are array/map slices and selectors. Currently if a schema
>>>> contains an array (or map) field, you can only select all elements of the
>>>> array or map. JsonPath however supports selecting and slicing the array.
>>>> For example, consider the following:
>>>> >
>>>> > @DefaultSchema(JavaFieldSchema.class)
>>>> > public class Event {
>>>> >   public final String userId;
>>>> >   public final List<Action> actions;
>>>> > }
>>>> >
>>>> > Currently you can apply Select.fields("actions.location"), and that
>>>> will return a schema containing a list of Locations, one for every action
>>>> in the original event. If we allowed slicing, you could instead write
>>>> Select.fields("actions[0:9].location"), which would do the same but only
>>>> for the first 10 elements of the array.
>>>> >
>>>> > Is this useful in Beam? It would not be hard to implement, but I want
>>>> to see what folks think first.
>>>> >
>>>> > The second feature is recursive field selection. The example often
>>>> given in JsonPath is a Json document containing the inventory for a store.
>>>> There are lists of subobjects representing books, bicycles, tables, chairs,
>>>> etc. etc. The JsonPath query "$..price" recursively finds every object that
>>>> has a field named price, and returns those prices; in this case it returns
>>>> the price of every element in the store.
>>>> >
>>>> > I'm a bit less convinced that recursive field selection is useful in
>>>> Beam. The usual example for Json involves a document that represents an
>>>> entire corpus, e.g. a store inventory. In Beam, the schemas are applied to
>>>> individual records, and I don't know how often there will be a use for this
>>>> sort of recursive selection. However I could be wrong here, so if anyone
>>>> has a good use case for this sort of selection, please let me know.
>>>>
>>>> Records often contain lists, e.g. the record could be an order, and it
>>>> could be useful to select on the price of the items (just to throw it
>>>> out there).
>>>>
>>>
>>> BTW, that already works. The .. operator in JsonPath is a recursive
>>> field search, across any lists or records that are lower in the tree.
>>>
>>
