Re: Query expressions for schema fields

Reuven Lax Mon, 07 Jan 2019 11:59:20 -0800

I'll take a look.

Honestly though, if we leave out features such as array slices, this is a
dirt-simple path syntax that pretty much matches what SQL does.  It's
basically just field1.field2, or field1.*.
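
To make the field1.field2 / field1.* shape concrete, here is a minimal
sketch of how such a selector could be resolved. It is not the ANTLR grammar
or Beam's implementation: the class name FieldPathSketch and the use of
nested Maps as stand-ins for schema rows are assumptions made purely for
illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not Beam's implementation: resolves a dotted selector
// such as "event.location.*" against nested Maps standing in for schema rows.
final class FieldPathSketch {
  private FieldPathSketch() {}

  /** Returns the values selected by path; "*" selects every field at that level. */
  static List<Object> select(Map<String, Object> row, String path) {
    List<Object> current = new ArrayList<>();
    current.add(row);
    for (String segment : path.split("\\.")) {
      List<Object> next = new ArrayList<>();
      for (Object node : current) {
        if (!(node instanceof Map)) {
          throw new IllegalArgumentException(
              "Cannot select '" + segment + "' from a non-record value");
        }
        Map<?, ?> record = (Map<?, ?>) node;
        if (segment.equals("*")) {
          // Wildcard: take every field of this record.
          next.addAll(record.values());
        } else if (record.containsKey(segment)) {
          next.add(record.get(segment));
        } else {
          throw new IllegalArgumentException("No such field: " + segment);
        }
      }
      current = next;
    }
    return current;
  }

  public static void main(String[] args) {
    Map<String, Object> location = new LinkedHashMap<>();
    location.put("lat", 47.6);
    location.put("lon", -122.3);
    Map<String, Object> event = new LinkedHashMap<>();
    event.put("userid", "u1");
    event.put("location", location);
    Map<String, Object> row = new LinkedHashMap<>();
    row.put("event", event);

    System.out.println(select(row, "event.userid"));     // prints [u1]
    System.out.println(select(row, "event.location.*")); // prints [47.6, -122.3]
  }
}
```

Note there is no leading $ here; the path starts directly at the first field
name, the way a SQL column reference would.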


JMESPath, like JsonPath, also supports various aggregations, which I
think is beyond the scope of what we want here; all that's needed here is a
selector expression. AFAICT what I have is already a strict subset of
JMESPath, though I'll take a closer look to make sure there are no semantic
incompatibilities.

Reuven

On Mon, Jan 7, 2019 at 10:21 AM Jeff Klukas <jklu...@mozilla.com> wrote:

> There is also JMESPath (http://jmespath.org/) which is quite similar to
> JsonPath, but does have a spec and lacks the leading $ character. The AWS
> CLI uses JMESPath for defining queries.
>
>
>
> On Mon, Jan 7, 2019 at 1:05 PM Reuven Lax <re...@google.com> wrote:
>
>>
>>
>> On Mon, Jan 7, 2019 at 1:44 AM Robert Bradshaw <rober...@google.com>
>> wrote:
>>
>>> On Sun, Jan 6, 2019 at 12:46 PM Reuven Lax <re...@google.com> wrote:
>>> >
>>> > Some time ago, @Jean-Baptiste Onofré made the excellent suggestion
>>> that we look into using JsonPath as a selector format for schema fields.
>>> This provides a simple and natural way for users to select nested schema
>>> fields, as well as wildcards. This would allow users to more simply select
>>> nested fields using the Select transform, e.g.:
>>> >
>>> > p.apply(Select.fields("event.userid", "event.location.*"));
>>> >
>>> > It would also fit into NewDoFn (Java) like this:
>>> >
>>> > @ProcessElement
>>> > public void process(@Field("userid") String userId,
>>> >                     @Field("action.location.*") Location location) {
>>> > }
>>> >
>>> > After some investigation, I believe that we're better off with
>>> something very close to a subset of JsonPath, but not precisely JsonPath.
>>>
>>> I am very wary of creating something that's very close to, but not
>>> quite, a (subset of a) well-established standard. Is there a
>>> disadvantage to not being a strict actual subset? If we go this route,
>>> we should at least ensure that any divergence is illegal JsonPath
>>> rather than having different semantic meaning.
>>>
>>
>> As far as I can tell, JsonPath isn't much of a "standard." There doesn't
>> seem to be much of a spec other than implementation.
>>
>> For the most part, I am speaking of a strict subset of JsonPath. The only
>> incompatibility is that JsonPath expressions all start with a '$' (which
>> represents the root node). So in the above expression you would write
>> "$.action.location.*" instead. I think staying closer to BeamSql syntax
>> makes more sense here, and I would like to dispense with the need to begin
>> with a $ character. JsonPath also assumes that each object is a
>> JavaScript object (which makes no sense here), and some of the JsonPath
>> features are based on that.
>>
>>
>>> > JsonPath has many features that are JavaScript-specific (e.g. the
>>> ability to embed JavaScript expressions). JsonPath also includes the
>>> ability to do complex filtering and aggregation, which I don't think we
>>> want here, since Beam already provides the ability to do such filtering
>>> and aggregation. One example of a change: JsonPath
>>> queries always begin with $ (representing the root node), and I think we're
>>> better off not requiring that so that these queries look more like BeamSql
>>> queries.
>>> >
>>> > I've created a small ANTLR grammar (which has the advantage that it's
>>> easy to extend) for these expressions and have everything working in a
>>> branch. However there are a few more features of JsonPath that might be
>>> useful here, and I wanted community feedback to see whether it's worth
>>> implementing them.
>>> >
>>> > The first are array/map slices and selectors. Currently if a schema
>>> contains an array (or map) field, you can only select all elements of the
>>> array or map. JsonPath however supports selecting and slicing the array.
>>> For example, consider the following:
>>> >
>>> > @DefaultSchema(JavaFieldSchema.class)
>>> > public class Event {
>>> >   public final String userId;
>>> >   public final List<Action> actions;
>>> > }
>>> >
>>> > Currently you can apply Select.fields("actions.location"), and that
>>> will return a schema containing a list of Locations, one for every action
>>> in the original event. If we allowed slicing, you could instead write
>>> Select.fields("actions[0:9].location"), which would do the same but only
>>> for the first 10 elements of the array.
>>> >
>>> > Is this useful in Beam? It would not be hard to implement, but I want
>>> to see what folks think first.
>>> >
>>> > The second feature is recursive field selection. The example often
>>> given in JsonPath is a JSON document containing the inventory for a store.
>>> There are lists of subobjects representing books, bicycles, tables, chairs,
>>> etc. The JsonPath query "$..price" recursively finds every object that
>>> has a field named price, and returns those prices; in this case it returns
>>> the price of every element in the store.
>>> >
>>> > I'm a bit less convinced that recursive field selection is useful in
>>> Beam. The usual example for JSON involves a document that represents an
>>> entire corpus, e.g. a store inventory. In Beam, the schemas are applied to
>>> individual records, and I don't know how often there will be a use for this
>>> sort of recursive selection. However, I could be wrong here, so if anyone
>>> has a good use case for this sort of selection, please let me know.
>>>
>>> Records often contain lists, e.g. the record could be an order, and it
>>> could be useful to select on the price of the items (just to throw it
>>> out there).
>>>
>>
>> BTW, that already works. The .. operator in JsonPath is a recursive field
>> search, across any lists or records that are lower in the tree.
>>
>
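
For readers unfamiliar with the ".." operator discussed in the thread, the
recursive field search it describes can be sketched roughly as follows. This
is an illustration only, not code from JsonPath or Beam: the class name
RecursiveSelectSketch, the item helper, and the Map/List stand-ins for
records and arrays are all assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a "$..field"-style search: collect the value of every
// field with the given name, at any depth, descending through records and lists.
final class RecursiveSelectSketch {
  private RecursiveSelectSketch() {}

  static List<Object> findAll(Object node, String field) {
    List<Object> matches = new ArrayList<>();
    collect(node, field, matches);
    return matches;
  }

  private static void collect(Object node, String field, List<Object> matches) {
    if (node instanceof Map) {
      for (Map.Entry<?, ?> entry : ((Map<?, ?>) node).entrySet()) {
        if (field.equals(entry.getKey())) {
          matches.add(entry.getValue());
        }
        // Keep descending: nested records may contain the field as well.
        collect(entry.getValue(), field, matches);
      }
    } else if (node instanceof List) {
      for (Object element : (List<?>) node) {
        collect(element, field, matches);
      }
    }
    // Primitive values have no fields, so there is nothing to do for them.
  }

  // Builds one order line item; purely illustrative test data.
  static Map<String, Object> item(String name, double price) {
    Map<String, Object> m = new LinkedHashMap<>();
    m.put("name", name);
    m.put("price", price);
    return m;
  }

  public static void main(String[] args) {
    Map<String, Object> order = new LinkedHashMap<>();
    order.put("orderId", "o-1");
    order.put("items", Arrays.asList(item("book", 8.95), item("bicycle", 19.95)));

    System.out.println(findAll(order, "price")); // prints [8.95, 19.95]
  }
}
```

With a single order record as input, the search returns the price of each
item in the order's list, which is the per-record case raised in the thread.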
