Rahul Challapalli created DRILL-1716:
----------------------------------------
Summary: Nested Data : There should be an easy way to apply
aggregate functions on repeated types
Key: DRILL-1716
URL: https://issues.apache.org/jira/browse/DRILL-1716
Project: Apache Drill
Issue Type: New Feature
Components: Functions - Drill, Storage - JSON
Reporter: Rahul Challapalli
{code}
{
"company_id": 1,
"evnts": [
{
"evnt_id": 999,
"evnt_duration": 60
},
{
"evnt_id": 998,
"evnt_duration": 30
},
{
"evnt_id": 997,
"evnt_duration": 45
}
]
}
{code}
For the above dataset, if I want to find the longest duration for each company
id, below is how I would do it now
{code}
select sub.company_id , max(sub.evnt.evnt_duration) max_duration
from (
select company_id, flatten(evnts) evnt from `nested.json`
) sub
group by sub.company_id;
+------------+--------------+
| company_id | max_duration |
+------------+--------------+
| 1 | 60 |
+------------+--------------+
{code}
Now if I want the evnt_id associated with the longest duration then we need one
more join
{code}
select a.company_id, b.evnt.evnt_id
from (
select sub.company_id company_id, max(sub.evnt.evnt_duration) max_duration
from (
select company_id, flatten(evnts) evnt from `nested.json`
) sub
group by sub.company_id
) a
join
(
select flatten(evnts) evnt from `nested.json`
) b
on a.max_duration = b.evnt.evnt_duration;
{code}
The above query currently fails (DRILL-1649). But that is how we have to do it
currently.
It would be much simpler if I can do something like the below
{code}
select company_id, nested_agg('evnts', 'max', 'evnt_duration','evnt_id') as
evnt_id from `nested.json`;
{code}
Apart from making the query much simpler to write this might enhance drill's
performance as well.
Thoughts?
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)