Re: Drill SQL questions - JSON context

Paul Rogers Fri, 18 Aug 2023 14:24:26 -0700

Hi Mike,

Good progress! There are a number of factors to consider. Let's work
through them one by one.

First, try the simplest possible query:

SELECT * FROM <the json file>

If you are using the row set mechanism, grab the schema and print it. (My
memory is hazy, but I do believe that there are methods and classes that
will do this for you.) What you should see is the nested structure you
created. The JSON reader has a super-complex parser that will work out
structure and types based on the first value seen. In your example, it
should guess VARCHAR and INT for your data items.

Once you confirm that the JSON parser has correctly interpreted your data,
you can move onto the second question: how SQL works with structured data.
Here we have to realize that SQL wasn't designed for structured data: SQL
only knows how to work with variables projected to the top level. This
leads to a quantum-like result that observing a variable changes its
structure. The JSON reader uses something called the Enhanced Vector
Framework (EVF) to do the projecting at scan time. Let's work out what it
should be doing. (The code is gawd-awful complex, so there is a chance that
something might be broken.)

In your query, you are projecting c.f.g to a top level variable g. Fields c
and f are an array of maps. I can't recall testing this kind of projection,
but I'd expect it to result in the projected variable g being an array or
arrays: what Drill calls a repeated list. Although I wrote this stuff, I
don't recall any code that will convert a repeated map into a repeated
list: so this area may be a bit tender. Or, maybe it just punts and leaves
the repeated map, but with a single entry? That wouldn't quite work. This
could use a bit of testing.

The third question is how to flatten rows. Flattening occurs via a separate
flatten operator. You'd need to flatten twice: once for each level. This
whole area is a bit hazy for me (I'm not super familiar with the details),
but I suspect you'd need to use a set of nested SELECT statements, each of
which flattens the outermost level, which will project the result to the
top level where it can be manipulated by the SELECT at the next outer
level. To try this, extend your SELECT * to select just a top-level field
(a) and flatten a top-level repeated map (f). The result should be rows
with a scalar and a repeated map. Then, add another level of SELECT to
flatten the repeated map: you'll get a scalar and a map. Then, use yet
another SELECT to pick out the map fields to top-level fields, and do the
WHERE clause. I *think* that should more-or-less work.

- Paul

On Fri, Aug 18, 2023 at 2:01 PM Mike Beckerle <mbecke...@apache.org> wrote:

> I'm using Apache Daffodil in the mode where it outputs JSON data. (For the
> moment, until we build a tighter integration. This is my conceptual test
> framework for that integration.)
>
> I have parsed data to create this JSON which represents 2-level nested
> repeating subrecords.
>
> All the simple fields are int.
>
> [{"a":1,  "b":2,  "c":[{"d":3,  "e":4,  "f":[{"g":5,  "h":6 },
>                                              {"g":7,  "h":8 }]},
>                        {"d":9,  "e":10, "f":[{"g":11, "h":12},
>                                              {"g":13, "h":14}]}]},
>  {"a":21, "b":22, "c":[{"d":23, "e":24, "f":[{"g":25, "h":26 },
>                                              {"g":27, "h":28 }]},
>                        {"d":29, "e":30, "f":[{"g":31, "h":32},
>                                              {"g":33, "h":34}]}]}]
>
> So, the top level is a vector of maps,
> within that, field "c" is a vector of maps,
> and within "c" is a field f which is a vector of maps.
>
> The reason I created this is I'm trying to understand the arrays and how
> they work with Drill SQL.
>
> I'm trying to figure out how to get this rowset of 3 rows from a query, and
> I'm stumped.
>
>   a   b   d   e   g   h
> ( 1,  2,  3,  4,  5,  6)
> ( 1,  2,  9, 10, 13, 14)
> (21, 22, 29, 30, 33, 34)
>
> This is the SQL that is my conceptual framework, but I'm sure it won't
> work.
>
> SELECT a, b, c.d AS d, c.e AS e, c.f.g AS g, c.f.h AS h
> FROM ... the json file...
> WHERE g mod 10 == 3 OR g == 5
>
> But I know it's not going to be that easy to get the query to traverse the
> vector inside the vector.
>
> From the doc, the FLATTEN operator seems to be needed, but I can't really
> figure it out.
>
> This is what all my data is like. Trees of nested vectors of sub-records.
>
> Can anyone advise on what the SQL might look like, or where there's an
> example doing something like this I can learn from?
>
> Thanks for any help
>
> Mike Beckerle
> Apache Daffodil PMC | daffodil.apache.org
> OGF DFDL Workgroup Co-Chair | www.ogf.org/ogf/doku.php/standards/dfdl/dfdl
> Owl Cyber Defense | www.owlcyberdefense.com
>

Re: Drill SQL questions - JSON context

Reply via email to