[COMP] Few questions about Query Optimizer

Wail Alkowaileet Sat, 24 Jun 2017 17:51:43 -0700

Hi Devs,

I have a few questions about the query optimizer.


*- Given the query:*
use dataverse TwitterDataverse

for $x in dataset Tweets
where $x.name = "Alice"
let $geo := $x.geo
group by $name:=$x.name with $geo
return {"name": $name, "geo":$geo[0].coordinates.coordinates}

*- Logical Plan:*
distribute result [$$10] -- |UNPARTITIONED|
  project ([$$10]) -- |UNPARTITIONED|
    assign [$$10] <- [{"name": $$name, "geo": get-item($$9, 0).getField("coordinates").getField("coordinates")}] -- |UNPARTITIONED|
      group by ([$$name := $$x.getField("name")]) decor ([]) {
                aggregate [$$9] <- [listify($$geo)] -- |UNPARTITIONED|
                  nested tuple source -- |UNPARTITIONED|
             } -- |UNPARTITIONED|
        assign [$$geo] <- [$$x.getField("geo")] -- |UNPARTITIONED|
          select (eq($$x.getField("name"), "Alice")) -- |UNPARTITIONED|
            unnest $$x <- dataset("Tweets") -- |UNPARTITIONED|
              empty-tuple-source -- |UNPARTITIONED|

*- Optimized Logical Plan:*
distribute result [$$10]
-- DISTRIBUTE_RESULT  |PARTITIONED|
  exchange
  -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
    project ([$$10])
    -- STREAM_PROJECT  |PARTITIONED|
      assign [$$10] <- [{"name": $$name, "geo": $$19.getField("coordinates")}]
      -- ASSIGN  |PARTITIONED|
        project ([$$name, $$19])
        -- STREAM_PROJECT  |PARTITIONED|
          assign [$$19, $$22] <- [get-item($$9, 0).getField("coordinates"), get-item($$9, 0)]
          -- ASSIGN  |PARTITIONED|
            exchange
            -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
              group by ([$$name := $$15]) decor ([]) {
                        aggregate [$$9] <- [listify($$geo)]
                        -- AGGREGATE  |LOCAL|
                          nested tuple source
                          -- NESTED_TUPLE_SOURCE  |LOCAL|
                     }
              -- PRE_CLUSTERED_GROUP_BY[$$15]  |PARTITIONED|
                exchange
                -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
                  order (ASC, $$15)
                  -- STABLE_SORT [$$15(ASC)]  |PARTITIONED|
                    exchange
                    -- HASH_PARTITION_EXCHANGE [$$15]  |PARTITIONED|
                      select (eq($$15, "Alice"))
                      -- STREAM_SELECT  |PARTITIONED|
                        project ([$$geo, $$15])
                        -- STREAM_PROJECT  |PARTITIONED|
                          assign [$$geo, $$15] <- [$$x.getField("geo"), $$x.getField("name")]
                          -- ASSIGN  |PARTITIONED|
                            project ([$$x])
                            -- STREAM_PROJECT  |PARTITIONED|
                              exchange
                              -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
                                data-scan []<-[$$16, $$x] <- TwitterDataverse.Tweets
                                -- DATASOURCE_SCAN  |PARTITIONED|
                                  exchange
                                  -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
                                    empty-tuple-source
                                    -- EMPTY_TUPLE_SOURCE  |PARTITIONED|

*- Questions:*
$$22:

   - Why is the variable $$22 produced even though nothing uses it? Is it
   just a harmless bug, or is there some intuition I might be missing?
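
To make the $$22 observation concrete, here is a minimal Python sketch (not
AsterixDB code). The operator list is hand-transcribed from the top of the
optimized plan above, down to the group by, and the helper name
unused_variables is hypothetical. It simply reports variables that some
operator produces but no operator consumes.

```python
# Operators from the optimized plan, consumer first, down to the group by.
# Each entry: (operator name, variables produced, variables used).
# This is a hand transcription for illustration, not optimizer output.
PLAN = [
    ("distribute-result", [], ["$$10"]),
    ("project", [], ["$$10"]),
    ("assign", ["$$10"], ["$$name", "$$19"]),
    ("project", [], ["$$name", "$$19"]),
    ("assign", ["$$19", "$$22"], ["$$9"]),
    ("group-by", ["$$name", "$$9"], ["$$15", "$$geo"]),
]


def unused_variables(plan):
    """Return the variables that are produced but never used in `plan`."""
    used = set()
    for _name, _produced, uses in plan:
        used.update(uses)
    return sorted(
        var
        for _name, produced, _uses in plan
        for var in produced
        if var not in used
    )


if __name__ == "__main__":
    print(unused_variables(PLAN))
```

On this transcription the only variable reported is $$22, which matches what
the project ([$$name, $$19]) operator directly above the assign suggests.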

$$19:

   - It seems that getField function calls are (sometimes) split. Is there a
   reason for that? (There is another example that reproduces the same
   behavior.)
   - That leads to my next question: I see no rule for "FieldAccessNested",
   which could be exploited here to save a few function calls. Can this
   function interfere with other functions or access methods?
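
To illustrate the kind of rewrite I have in mind, here is a small Python
sketch. It is not an AsterixDB rule: expressions are modeled as plain tuples,
and both the function name collapse_field_access and the "field-access-nested"
tag are placeholders I made up for this example. It folds a chain of getField
calls into a single access that carries the whole field path.

```python
def collapse_field_access(expr):
    """Fold chained ("getField", inner, name) tuples into one nested access.

    Anything that is not a getField tuple is treated as the base expression
    and returned untouched.
    """
    path = []
    # Peel getField calls from the outside in, collecting field names.
    while isinstance(expr, tuple) and expr and expr[0] == "getField":
        _tag, expr, name = expr
        path.append(name)
    path.reverse()  # names were collected outermost first

    if len(path) > 1:
        return ("field-access-nested", expr, path)
    if len(path) == 1:
        return ("getField", expr, path[0])  # nothing to collapse
    return expr


if __name__ == "__main__":
    # get-item($$9, 0).getField("coordinates").getField("coordinates")
    chained = (
        "getField",
        ("getField", ("get-item", "$$9", 0), "coordinates"),
        "coordinates",
    )
    print(collapse_field_access(chained))
```

With such a rewrite the two getField calls in the original assign would
become one call over the path ["coordinates", "coordinates"].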


-- 

*Regards,*
Wail Alkowaileet
