Hi Andy, When you have an array at the top level of your file, Drill treats the members as top-level tuples. Lateral is for the case where the data contains nested tuples. Try removing the LATERAL keyword.
Thanks, - Paul On Thursday, January 23, 2020, 6:24:37 AM PST, Andy Grove <andygrov...@gmail.com> wrote: I have the following JSON file containing CPU metrics for multiple hosts. [ {"measurement": "cpu", "hostname": "host_0", "usage_irq": 63.6499207106198313, "timestamp_ns": 1451606400000000000}, {"measurement": "cpu", "hostname": "host_0", "usage_irq": 58.0710619831364992, "timestamp_ns": 1451606500000000000}, {"measurement": "cpu", "hostname": "host_1", "usage_irq": 76.0710649926198313, "timestamp_ns": 1451606400000000000}, {"measurement": "cpu", "hostname": "host_1", "usage_irq": 45.9831364992071061, "timestamp_ns": 1451606500000000000} ] I want to query this data to get the most recent entry for each host based on the timestamp. I assumed I would be able to use a LATERAL JOIN to perform this query because the docs state that "A lateral subquery iterates through each row in the table reference, evaluating the inner subquery for each row, like a foreach loop.". Here is the query that I thought would work: SELECT * FROM (SELECT DISTINCT hostname FROM dfs.root.`data/tsbs1.json`) t LEFT OUTER JOIN LATERAL (SELECT timestamp_ns, usage_irq FROM dfs.root.`data/tsbs1.json` c WHERE c.hostname = t.hostname ORDER BY timestamp_ns DESC LIMIT 1) AS b ON TRUE However, this query does not work as expected and returns: hostname,timestamp_ns,usage_irq host_1,null,null host_0,1451606500000000000,58.0710619831365 It seems like the subquery is evaluated once and not per row in the FROM table reference and therefore just a single row is returned rather than a single row per outer row. Here is the query plan: 00-00 Screen 00-01 Project(hostname=[$0], timestamp_ns=[$1], usage_irq=[$2]) 00-02 Project(hostname=[$0], timestamp_ns=[$1], usage_irq=[$2]) 00-03 Project(hostname=[$3], timestamp_ns=[$0], usage_irq=[$1], hostname0=[$2]) 00-04 SelectionVectorRemover 00-05 Sort(sort0=[$0], dir0=[DESC]) 00-06 MergeJoin(condition=[=($3, $2)], joinType=[right]) 00-08 SelectionVectorRemover 00-10 Sort(sort0=[$2], dir0=[ASC]) 00-12 Limit(fetch=[1]) 00-14 SelectionVectorRemover 00-15 TopN(limit=[1]) 00-16 Project(timestamp_ns=[$1], usage_irq=[$2], hostname=[$0]) 00-17 SelectionVectorRemover 00-18 Filter(condition=[IS NOT NULL($0)]) 00-19 Scan(table=[[dfs, root, data/tsbs1.json]], groupscan=[EasyGroupScan [selectionRoot=file:/data/tsbs1.json, numFiles=1, columns=[`hostname`, `timestamp_ns`, `usage_irq`], files=[file:/data/tsbs1.json], schema=null]]) 00-07 Project(hostname0=[$0]) 00-09 StreamAgg(group=[{0}]) 00-11 Sort(sort0=[$0], dir0=[ASC]) 00-13 Scan(table=[[dfs, root, data/tsbs1.json]], groupscan=[EasyGroupScan [selectionRoot=file:/data/tsbs1.json, numFiles=1, columns=[`hostname`], files=[file:/data/tsbs1.json], schema=null]]) I'd appreciate any pointers on what I'm doing wrong here, or if this is a bug in Drill. Thanks, Andy.