Hello,
A few format handled by Drill enable to work with document, meaning nested
and repeated structure instead of just tables. Json and Parquet are the two
that come to my mind right now. Document modeling is a great way to express
complex object and is used a lot in my company. Drill is able to handle
them but unfortunately, it cannot make much computation on it. By
computation I mean, filtering branches of the document, computing
statistics (avg, min, max) on part of the document … That would be very
useful as an analytic tools.
_What can be done_
The question then is how to express the computation we want to do on the
document. I have found multiple ways to handle that and I don't really know
which one is the best hence the mail to expose what I have found to
initiate discussion, maybe.
First, in we look back at the Dremel paper which is the base of the
parquet format and also one of the example for drill, dremel is adding the
special keyword "WITHIN" to SQL to specify that the computation has to be
done within a document. What is very powerful with this keyword is that it
allows you to generate document and doesn't force you to flatten
everything. You can find exemple of it usage in the google successor of
Dremel: BigQuery and its documentation : https://cloud.google.com/bigqu
ery/docs/legacy-nested-repeated.
But it seems that it was problematic for Google, because they now propose
a SQL that seems to be compliant with SQL 2011 for Bigquery to handle such
computation. I am not familiar with SQL 2011 but it is told in BigQuery
documentation to integrated the keywords for nested and repeated structure.
You can have a view about how this is done in BigQuery here:
https://cloud.google.com/bigquery/docs/reference/standard-sql/arrays .
Basically, what I have seen is that they leverage UNNEST and ARRAY keyword
and then are able to use JOIN or CROSS JOIN to describe the aggregation.
In Impala, they have added a way to add a subquery on a complex type in
such a way that the subquery only act intra-document. I have no idea if
this is standard SQL or not. In page https://www.cloudera.com/docum
entation/enterprise/5-5-x/topics/impala_complex_types.html#complex_types
look at the phrase: “The subquery labelled SUBQ1 is correlated:” for
example.
In Presto, you can apply lambda function to map/array to transform the
structure and apply filter on it. So you have filter, map_filter function
to filter array and map respectively. (cf https://prestodb.io/docs/curre
nt/functions/lambda.html#filter)
_Example_
If I want to make a short example, let’s say we have a flight with a group
of passengers in it. A document would be :
{ “flightnb”:1234, “group”:[{“age”:30,”gender”:”M”},{“age”:15,”gender”:”F”},
{“age”:10,”gender”:”F”},{“age”:30,”gender”:”F”}]}
The database would be millions of such document and I want to know the
average age of the male passenger for every flight.
In Dremel, the query would be something like: select flightnb,
avg(male_age) within record from (select groups.age as male_age from flight
where group.gender = "M")
With sql, it would be something like: select flightnb, avg(male_age) from
(array(select g.age as male_age from unnest(group)as g where g.gender =
"M") as male_age)
With impala it would be something like: select flightnb, avg(male) from
flight, select g.age from groups as g where g.gender = “M” as male
With presto, it would be something like: select flightnb, avg(male) from
flight, filter(group,x->x.gender = "M")as male
I am not sure at all about my SQL queries but it should give you a rough
idea about the different ways to express the inital query.
So many different ways to express the same query… I would personally go
for the SQL way of expressing things to implement it in Drill, especially
because calcite is already able to parse unnest, array, but that’s only my
first thought.
Best regards,
Damien