Re: Strange behavior with JOIN against correlated subquery

Paul Rogers Thu, 23 Jan 2020 16:13:26 -0800

Hi Andy,

When you have an array at the top level of your file, Drill treats the members 
as top-level tuples. Lateral is for the case where the data contains nested 
tuples. Try removing the LATERAL keyword.



Thanks,
- Paul

 

    On Thursday, January 23, 2020, 6:24:37 AM PST, Andy Grove 
<[email protected]> wrote:  
 
 I have the following JSON  file containing CPU metrics for multiple hosts.

[
  {"measurement": "cpu", "hostname": "host_0", "usage_irq":
63.6499207106198313, "timestamp_ns": 1451606400000000000},
  {"measurement": "cpu", "hostname": "host_0", "usage_irq":
58.0710619831364992, "timestamp_ns": 1451606500000000000},
  {"measurement": "cpu", "hostname": "host_1", "usage_irq":
76.0710649926198313, "timestamp_ns": 1451606400000000000},
  {"measurement": "cpu", "hostname": "host_1", "usage_irq":
45.9831364992071061, "timestamp_ns": 1451606500000000000}
]

I want to query this data to get the most recent entry for each host based
on the timestamp.

I assumed I would be able to use a LATERAL JOIN to perform this query
because the docs state that "A lateral subquery iterates through each row
in the table reference, evaluating the inner subquery for each row, like a
foreach loop.".

Here is the query that I thought would work:

SELECT *
FROM (SELECT DISTINCT hostname FROM dfs.root.`data/tsbs1.json`) t
LEFT OUTER JOIN LATERAL (SELECT timestamp_ns, usage_irq FROM
dfs.root.`data/tsbs1.json` c WHERE c.hostname = t.hostname ORDER BY
timestamp_ns DESC LIMIT 1) AS b ON TRUE

However, this query does not work as expected and returns:

hostname,timestamp_ns,usage_irq
host_1,null,null
host_0,1451606500000000000,58.0710619831365

It seems like the subquery is evaluated once and not per row in the FROM
table reference and therefore just a single row is returned rather than a
single row per outer row.

Here is the query plan:

00-00    Screen
00-01      Project(hostname=[$0], timestamp_ns=[$1], usage_irq=[$2])
00-02        Project(hostname=[$0], timestamp_ns=[$1], usage_irq=[$2])
00-03          Project(hostname=[$3], timestamp_ns=[$0], usage_irq=[$1],
hostname0=[$2])
00-04            SelectionVectorRemover
00-05              Sort(sort0=[$0], dir0=[DESC])
00-06                MergeJoin(condition=[=($3, $2)], joinType=[right])
00-08                  SelectionVectorRemover
00-10                    Sort(sort0=[$2], dir0=[ASC])
00-12                      Limit(fetch=[1])
00-14                        SelectionVectorRemover
00-15                          TopN(limit=[1])
00-16                            Project(timestamp_ns=[$1], usage_irq=[$2],
hostname=[$0])
00-17                              SelectionVectorRemover
00-18                                Filter(condition=[IS NOT NULL($0)])
00-19                                  Scan(table=[[dfs, root,
data/tsbs1.json]], groupscan=[EasyGroupScan
[selectionRoot=file:/data/tsbs1.json, numFiles=1, columns=[`hostname`,
`timestamp_ns`, `usage_irq`], files=[file:/data/tsbs1.json], schema=null]])
00-07                  Project(hostname0=[$0])
00-09                    StreamAgg(group=[{0}])
00-11                      Sort(sort0=[$0], dir0=[ASC])
00-13                        Scan(table=[[dfs, root, data/tsbs1.json]],
groupscan=[EasyGroupScan [selectionRoot=file:/data/tsbs1.json, numFiles=1,
columns=[`hostname`], files=[file:/data/tsbs1.json], schema=null]])

I'd appreciate any pointers on what I'm doing wrong here, or if this is a
bug in Drill.

Thanks,

Andy.

Re: Strange behavior with JOIN against correlated subquery

Reply via email to