Damien,
for the intra-document operations, it would be useful to add support for
LATERAL joins (SQL standard), which in conjunction with UNNEST (or FLATTEN)
should address the use case you have. I have filed a JIRA for this:
https://issues.apache.org/jira/browse/DRILL-5999.
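
For illustration only, here is a rough sketch of how the flight example
from the thread below might look once LATERAL and UNNEST are supported
(the exact syntax is not settled, and the aliases are only indicative):

  -- sketch only: average age of the male passengers, per flight
  select f.flightnb, m.avg_male_age
  from flight f,
       lateral (select avg(p.age) as avg_male_age
                from unnest(f.`group`) as p   -- `group` is a reserved word
                where p.gender = 'M') as m;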
-Aman

On Tue, Sep 26, 2017 at 12:04 AM, Damien Profeta <[email protected]> wrote:

> Hello Aman,
>
> AsterixDB seems to follow standard SQL with a few minor modifications and
> adds functions to ease aggregations (array_count, array_avg…).
>
> That would at least tend to confirm that supporting UNNEST is a good way
> to improve Drill.
>
> Best regards
>
> Damien
>
> On 09/25/2017 07:53 PM, Aman Sinha wrote:
>
>> Damien,
>> thanks for initiating the discussion; indeed, this would be a very
>> useful enhancement. Currently, Drill provides repeated_contains() for
>> filtering and repeated_count() for count aggregates on arrays, but not
>> the general-purpose intra-document operations that you need based on
>> your example.
>> I haven't gone through all the alternatives, but in addition to what you
>> have described, you might also want to look at SQL++
>> (https://ci.apache.org/projects/asterixdb/sqlpp/manual.html), which has
>> been adopted by AsterixDB and has syntax extensions to SQL for
>> unstructured data.
>>
>> -Aman
>>
>> On Mon, Sep 25, 2017 at 6:10 AM, Damien Profeta
>> <[email protected]> wrote:
>>
>>> Hello,
>>>
>>> A few of the formats handled by Drill allow working with documents,
>>> meaning nested and repeated structures instead of just tables. JSON and
>>> Parquet are the two that come to mind right now. Document modeling is a
>>> great way to express complex objects and is used a lot in my company.
>>> Drill is able to handle such documents but, unfortunately, it cannot do
>>> much computation on them. By computation I mean filtering branches of
>>> the document, computing statistics (avg, min, max) on parts of the
>>> document, etc. That would be very useful for an analytics tool.
>>>
>>> _What can be done_
>>>
>>> The question then is how to express the computation we want to do on
>>> the document. I have found multiple ways to handle that, and I don't
>>> really know which one is best, hence this mail to expose what I have
>>> found and to initiate a discussion.
>>>
>>> First, if we look back at the Dremel paper, which is the basis of the
>>> Parquet format and also one of the inspirations for Drill, Dremel adds
>>> the special keyword "WITHIN" to SQL to specify that the computation has
>>> to be done within a document. What is very powerful about this keyword
>>> is that it allows you to generate documents and doesn't force you to
>>> flatten everything. You can find examples of its usage in the Google
>>> successor of Dremel, BigQuery, and its documentation:
>>> https://cloud.google.com/bigquery/docs/legacy-nested-repeated.
>>>
>>> But it seems this was problematic for Google, because for BigQuery they
>>> now propose a SQL dialect that seems to be compliant with SQL 2011 to
>>> handle such computation. I am not familiar with SQL 2011, but the
>>> BigQuery documentation says it integrates the keywords for nested and
>>> repeated structures. You can get an idea of how this is done in
>>> BigQuery here:
>>> https://cloud.google.com/bigquery/docs/reference/standard-sql/arrays.
>>> Basically, what I have seen is that they leverage the UNNEST and ARRAY
>>> keywords and are then able to use JOIN or CROSS JOIN to describe the
>>> aggregation.
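>>>
>>> Just to illustrate that style on the flight example further below (a
>>> rough sketch only, I am not sure of the exact syntax):
>>>
>>>   -- rough sketch, not validated
>>>   select f.flightnb, avg(g.age) as avg_male_age
>>>   from flight f cross join unnest(f.group) as g
>>>   where g.gender = 'M'
>>>   group by f.flightnb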
>>>
>>> In Impala, they have added a way to write a subquery on a complex type
>>> such that the subquery only acts intra-document. I have no idea whether
>>> this is standard SQL or not. On the page
>>> https://www.cloudera.com/documentation/enterprise/5-5-x/topics/impala_complex_types.html#complex_types
>>> look, for example, at the sentence "The subquery labelled SUBQ1 is
>>> correlated:".
>>>
>>> In Presto, you can apply lambda functions to maps/arrays to transform
>>> the structure and apply filters on it. So you have the filter and
>>> map_filter functions to filter arrays and maps respectively (cf.
>>> https://prestodb.io/docs/current/functions/lambda.html#filter).
>>>
>>> _Example_
>>>
>>> To give a short example, let's say we have a flight with a group of
>>> passengers in it. A document would be:
>>>
>>> { "flightnb":1234, "group":[{"age":30,"gender":"M"},{"age":15,"gender":"F"},
>>> {"age":10,"gender":"F"},{"age":30,"gender":"F"}]}
>>>
>>> The database would be millions of such documents, and I want to know
>>> the average age of the male passengers for every flight.
>>>
>>> In Dremel, the query would be something like:
>>>
>>>   select flightnb, avg(male_age) within record
>>>   from (select group.age as male_age from flight where group.gender = "M")
>>>
>>> With SQL, it would be something like:
>>>
>>>   select flightnb, avg(male_age)
>>>   from (array(select g.age as male_age from unnest(group) as g
>>>               where g.gender = "M") as male_age)
>>>
>>> With Impala, it would be something like:
>>>
>>>   select flightnb, avg(male)
>>>   from flight, (select g.age from group as g where g.gender = "M") as male
>>>
>>> With Presto, it would be something like:
>>>
>>>   select flightnb, avg(male)
>>>   from flight, filter(group, x -> x.gender = "M") as male
>>>
>>> I am not at all sure about my SQL queries, but they should give you a
>>> rough idea of the different ways to express the initial query.
>>>
>>> So many different ways to express the same query… I would personally go
>>> for the SQL way of expressing things to implement it in Drill,
>>> especially because Calcite is already able to parse UNNEST and ARRAY,
>>> but that's only my first thought.
>>>
>>> Best regards,
>>>
>>> Damien
