I have the following JSON file containing CPU metrics for multiple hosts.
[
{"measurement": "cpu", "hostname": "host_0", "usage_irq":
63.6499207106198313, "timestamp_ns": 1451606400000000000},
{"measurement": "cpu", "hostname": "host_0", "usage_irq":
58.0710619831364992, "timestamp_ns": 1451606500000000000},
{"measurement": "cpu", "hostname": "host_1", "usage_irq":
76.0710649926198313, "timestamp_ns": 1451606400000000000},
{"measurement": "cpu", "hostname": "host_1", "usage_irq":
45.9831364992071061, "timestamp_ns": 1451606500000000000}
]
I want to query this data to get the most recent entry for each host based
on the timestamp.
I assumed I would be able to use a LATERAL JOIN to perform this query
because the docs state that "A lateral subquery iterates through each row
in the table reference, evaluating the inner subquery for each row, like a
foreach loop.".
Here is the query that I thought would work:
SELECT *
FROM (SELECT DISTINCT hostname FROM dfs.root.`data/tsbs1.json`) t
LEFT OUTER JOIN LATERAL (SELECT timestamp_ns, usage_irq FROM
dfs.root.`data/tsbs1.json` c WHERE c.hostname = t.hostname ORDER BY
timestamp_ns DESC LIMIT 1) AS b ON TRUE
However, this query does not work as expected and returns:
hostname,timestamp_ns,usage_irq
host_1,null,null
host_0,1451606500000000000,58.0710619831365
It seems like the subquery is evaluated once and not per row in the FROM
table reference and therefore just a single row is returned rather than a
single row per outer row.
Here is the query plan:
00-00 Screen
00-01 Project(hostname=[$0], timestamp_ns=[$1], usage_irq=[$2])
00-02 Project(hostname=[$0], timestamp_ns=[$1], usage_irq=[$2])
00-03 Project(hostname=[$3], timestamp_ns=[$0], usage_irq=[$1],
hostname0=[$2])
00-04 SelectionVectorRemover
00-05 Sort(sort0=[$0], dir0=[DESC])
00-06 MergeJoin(condition=[=($3, $2)], joinType=[right])
00-08 SelectionVectorRemover
00-10 Sort(sort0=[$2], dir0=[ASC])
00-12 Limit(fetch=[1])
00-14 SelectionVectorRemover
00-15 TopN(limit=[1])
00-16 Project(timestamp_ns=[$1], usage_irq=[$2],
hostname=[$0])
00-17 SelectionVectorRemover
00-18 Filter(condition=[IS NOT NULL($0)])
00-19 Scan(table=[[dfs, root,
data/tsbs1.json]], groupscan=[EasyGroupScan
[selectionRoot=file:/data/tsbs1.json, numFiles=1, columns=[`hostname`,
`timestamp_ns`, `usage_irq`], files=[file:/data/tsbs1.json], schema=null]])
00-07 Project(hostname0=[$0])
00-09 StreamAgg(group=[{0}])
00-11 Sort(sort0=[$0], dir0=[ASC])
00-13 Scan(table=[[dfs, root, data/tsbs1.json]],
groupscan=[EasyGroupScan [selectionRoot=file:/data/tsbs1.json, numFiles=1,
columns=[`hostname`], files=[file:/data/tsbs1.json], schema=null]])
I'd appreciate any pointers on what I'm doing wrong here, or if this is a
bug in Drill.
Thanks,
Andy.