Re: [EXT] Re: Food for thought about intra-document operation

Damien Profeta Tue, 26 Sep 2017 00:13:41 -0700

Hello Aman,

AsterixDb seems to follow the standard SQL with a few minormodifications and add functions to ease aggregations (array_count,array_avg…)

That would tend to confirm at least that the support of unnest is a goodidea to improve Drill.


Best regards

Damien

**
On 09/25/2017 07:53 PM, Aman Sinha wrote:

Damien,
thanks for initiating the discussion..indeed this would be a very useful
enhancement.  Currently, Drill provides repeated_contains()  for filtering
and repeated_count() for count aggregates on arrays but not the general
purpose intra-document operations that you need based on your example.
I haven't gone through all the alternatives but in addition to what you
have described,  you might also want to look at SQL++ (
https://ci.apache.org/projects/asterixdb/sqlpp/manual.html) which has been
adopted by AsterixDB and has syntax extensions to SQL for unstructured
data.

-Aman

On Mon, Sep 25, 2017 at 6:10 AM, Damien Profeta <damien.prof...@amadeus.com>
wrote:

Hello,

A few format handled by Drill enable to work with document, meaning nested
and repeated structure instead of just tables. Json and Parquet are the two
that come to my mind right now. Document modeling is a great way to express
complex object and is used a lot in my company. Drill is able to handle
them but unfortunately, it cannot make much computation on it. By
computation I mean, filtering branches of the document, computing
statistics (avg, min, max) on part of the document … That would be very
useful as an analytic tools.

_What can be done_

The question then is how to express the computation we want to do on the
document. I have found multiple ways to handle that and I don't really know
which one is the best hence the mail to expose what I have found to
initiate discussion, maybe.

First, in we look back at the Dremel paper which is the base of the
parquet format and also one of the example for drill, dremel is adding the
special keyword "WITHIN" to SQL to specify that the computation has to be
done within a document. What is very powerful with this keyword is that it
allows you to generate document and doesn't force you to flatten
everything. You can find exemple of it usage in the google successor of
Dremel: BigQuery and its documentation : https://cloud.google.com/bigqu
ery/docs/legacy-nested-repeated.

But it seems that it was problematic for Google, because they now propose
a SQL that seems to be compliant with SQL 2011 for Bigquery to handle such
computation. I am not familiar with SQL 2011 but it is told in BigQuery
documentation to integrated the keywords for nested and repeated structure.
You can have a view about how this is done in BigQuery here:
https://cloud.google.com/bigquery/docs/reference/standard-sql/arrays .
Basically, what I have seen is that they leverage UNNEST and ARRAY keyword
and then are able to use JOIN or CROSS JOIN to describe the aggregation.

In Impala, they have added a way to add a subquery on a complex type in
such a way that the subquery only act intra-document. I have no idea if
this is standard SQL or not. In page https://www.cloudera.com/docum
entation/enterprise/5-5-x/topics/impala_complex_types.html#complex_types
look at the phrase: “The subquery labelled SUBQ1 is correlated:” for
example.

In Presto, you can apply lambda function to map/array to transform the
structure and apply filter on it. So you have filter, map_filter function
to filter array and map respectively. (cf https://prestodb.io/docs/curre
nt/functions/lambda.html#filter)

_Example_

If I want to make a short example, let’s say we have a flight with a group
of passengers in it. A document would be :

{ “flightnb”:1234, “group”:[{“age”:30,”gender”:”M”},{“age”:15,”gender”:”F”},
{“age”:10,”gender”:”F”},{“age”:30,”gender”:”F”}]}

The database would be millions of such document and I want to know the
average age of the male passenger for every flight.

In Dremel, the query would be something like: select flightnb,
avg(male_age) within record from (select groups.age as male_age from flight
where group.gender = "M")

With sql, it would be something like: select flightnb, avg(male_age) from
(array(select g.age as male_age from unnest(group)as g where g.gender =
"M") as male_age)

With impala it would be something like: select flightnb, avg(male) from
flight, select g.age from groups as g where g.gender = “M” as male

With presto, it would be something like: select flightnb, avg(male) from
flight, filter(group,x->x.gender = "M")as male

I am not sure at all about my SQL queries but it should give you a rough
idea about the different ways to express the inital query.

So many different ways to express the same query… I would personally go
for the SQL way of expressing things to implement it in Drill, especially
because calcite is already able to parse unnest, array, but that’s only my
first thought.

Best regards,

Damien

Re: [EXT] Re: Food for thought about intra-document operation

Reply via email to