Hi Devs,
I've been in the Algebricks vicinity lately and I think there are a few
things we can do to reduce the plan size and probably the execution time. I
will file a JIRA issue for the other things I noticed.
First I want to discuss the current use of the Assign operator as I need it
for my current work.
Let's see an example:
*-- Query:*
SELECT t.text as text, t.place.full_name as city
FROM Tweets as t
WHERE t.retweet_count > 10
AND spatial_intersect (t.geo.coordinates.coordinates,
    create_rectangle(create_point(-107.27, 33.06), create_point(-89.1, 38.9)));
*-- Plan:*
distribute result [$$19]
-- DISTRIBUTE_RESULT |PARTITIONED|
  exchange
  -- ONE_TO_ONE_EXCHANGE |PARTITIONED|
    project ([$$19])
    -- STREAM_PROJECT |PARTITIONED|
      assign [$$19] <- [{"text": $$t.getField("text"), "city": $$25.getField("full_name")}]
      -- ASSIGN |PARTITIONED|
        project ([$$t, $$25])
        -- STREAM_PROJECT |PARTITIONED|
          select (and(gt($$t.getField("retweet_count"), 10), spatial-intersect($$27.getField("coordinates"), rectangle: { p1: point: { x: -107.27, y: 33.06 }, p2: point: { x: -89.1, y: 38.9 }})))
          -- STREAM_SELECT |PARTITIONED|
            assign [$$27, $$25] <- [$$t.getField("geo").getField("coordinates"), $$t.getField("place")]
            -- ASSIGN |PARTITIONED|
              project ([$$t])
              -- STREAM_PROJECT |PARTITIONED|
                exchange
                -- ONE_TO_ONE_EXCHANGE |PARTITIONED|
                  data-scan []<-[$$20, $$t] <- TwitterDataverse.Tweets
                  -- DATASOURCE_SCAN |PARTITIONED|
                    exchange
                    -- ONE_TO_ONE_EXCHANGE |PARTITIONED|
                      empty-tuple-source
                      -- EMPTY_TUPLE_SOURCE |PARTITIONED|
*-- Observation:*
- In this example, *assign [$$27, $$25]* evaluates
*$$t.getField("geo").getField("coordinates")* ($$27) even though it might
not be used (the AND can short-circuit when the retweet_count check fails).
- Similarly, because *assign [$$27, $$25]* evaluates *$$t.getField("place")*
($$25) much earlier, the tuples produced by project ([$$t, $$25]) are larger
than those of project ([$$t]), even though $$25 can be evaluated from $$t.
- We can see that this Assign does not help in this case and probably
should be removed.
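
To make the first observation concrete, here is a rough Python sketch (not Algebricks code) that just counts getField evaluations for the two shapes of the plan. Everything in it is made up for illustration: the Stats counter, get_field, intersects, the helper names and the four sample tweets. It also assumes that the AND really does short-circuit.

```python
# Toy model of the plan above. This is NOT Algebricks code; every name here
# is hypothetical and only counts how often a field access is evaluated.

RECT = (-107.27, 33.06, -89.1, 38.9)  # x1, y1, x2, y2 taken from the query


class Stats:
    def __init__(self):
        self.field_accesses = 0


def get_field(record, name, stats):
    """Stand-in for getField(): counts each evaluation."""
    stats.field_accesses += 1
    return record[name]


def intersects(point):
    """Stand-in for spatial-intersect(point, rectangle)."""
    x, y = point
    x1, y1, x2, y2 = RECT
    return x1 <= x <= x2 and y1 <= y <= y2


def assign_below_select(tweets, stats):
    """Current plan: assign [$$27, $$25] runs for every tuple before the select."""
    out = []
    for t in tweets:
        v27 = get_field(get_field(t, "geo", stats), "coordinates", stats)
        v25 = get_field(t, "place", stats)
        if (get_field(t, "retweet_count", stats) > 10
                and intersects(get_field(v27, "coordinates", stats))):
            out.append((t["text"], v25))
    return out


def access_inside_select(tweets, stats):
    """Alternative: each field access is evaluated only where it is needed."""
    out = []
    for t in tweets:
        if (get_field(t, "retweet_count", stats) > 10
                and intersects(get_field(get_field(get_field(
                    t, "geo", stats), "coordinates", stats), "coordinates", stats))):
            out.append((t["text"], get_field(t, "place", stats)))
    return out


def make_tweet(text, retweets, point):
    return {"text": text, "retweet_count": retweets,
            "geo": {"coordinates": {"coordinates": point}},
            "place": {"full_name": text + "-city"}}


SAMPLE = [
    make_tweet("a", 50, (-100.0, 35.0)),  # passes both predicates
    make_tweet("b", 1, (-100.0, 35.0)),   # fails retweet_count > 10
    make_tweet("c", 0, (0.0, 0.0)),       # fails retweet_count > 10
    make_tweet("d", 20, (0.0, 0.0)),      # fails the spatial predicate
]

if __name__ == "__main__":
    for plan in (assign_below_select, access_inside_select):
        stats = Stats()
        plan(SAMPLE, stats)
        print(plan.__name__, stats.field_accesses)
```

Both functions return the same rows; only the number of counted field accesses differs, and the gap grows with the fraction of tuples that fail the first predicate.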
I see two possible policies, but I'm not sure which one is better:
1- Aggressively push down field access to fit more tuples/frame, at the risk
of doing unnecessary evaluation, as in the example above.
2- Push down the SELECT, evaluate only the expressions in common with the
SELECT, and then do the remaining field access afterwards. This might fit
fewer tuples/frame.
Also:
1- An Assign whose variable is used only once should be inlined (inline if
the upper operator can do scalar evaluation, such as select/assign). Some
plans have two consecutive assigns.
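
As a rough illustration of the inlining idea, here is a toy rewriter in Python. It is not the Algebricks rule framework: the tuple-based expression encoding, the dict-based operators and all function names are hypothetical. It only models the assign/select/project fragment of the plan above, and it skips the checks a real rule would need (for example, that the inputs of the inlined expression are still live at the use site, and that the expression is safe to move).

```python
# Toy plan rewriter, not the Algebricks rule framework. Expressions are tuples:
#   ("var", name) | ("const", value) | ("call", function_name, [args])
# Operators are dicts, listed from the bottom of the plan to the top.


def gf(expr, name):
    """Build a getField(expr, name) call."""
    return ("call", "getField", [expr, ("const", name)])


def count_uses(expr, counts):
    """Count variable references inside one expression tree."""
    if expr[0] == "var":
        counts[expr[1]] = counts.get(expr[1], 0) + 1
    elif expr[0] == "call":
        for arg in expr[2]:
            count_uses(arg, counts)


def substitute(expr, bindings):
    """Replace inlined variables with their defining expressions."""
    if expr[0] == "var":
        return bindings.get(expr[1], expr)
    if expr[0] == "call":
        return ("call", expr[1], [substitute(a, bindings) for a in expr[2]])
    return expr


def inline_single_use(plan):
    """Inline every assigned variable that is referenced exactly once."""
    counts = {}
    for op in plan:
        for e in op.get("exprs", []):
            count_uses(e, counts)

    bindings, result = {}, []
    for op in plan:
        exprs = [substitute(e, bindings) for e in op.get("exprs", [])]
        if op["op"] == "assign":
            kept = []
            for v, e in zip(op["vars"], exprs):
                if counts.get(v, 0) == 1:
                    bindings[v] = e          # remember the definition, drop the variable
                else:
                    kept.append((v, e))      # unused here or used more than once: keep
            if kept:                         # drop the assign when nothing is left
                result.append({"op": "assign",
                               "vars": [v for v, _ in kept],
                               "exprs": [e for _, e in kept]})
        elif op["op"] == "project":
            # Inlined variables no longer exist, so stop projecting them.
            result.append({"op": "project",
                           "vars": [v for v in op["vars"] if v not in bindings]})
        else:
            result.append({"op": op["op"], "exprs": exprs})
    return result


def example_plan():
    """The assign/select/project/assign fragment of the plan in this mail."""
    t = ("var", "$$t")
    return [
        {"op": "assign", "vars": ["$$27", "$$25"],
         "exprs": [gf(gf(t, "geo"), "coordinates"), gf(t, "place")]},
        {"op": "select", "exprs": [("call", "and", [
            ("call", "gt", [gf(t, "retweet_count"), ("const", 10)]),
            ("call", "spatial-intersect",
             [gf(("var", "$$27"), "coordinates"), ("const", "rectangle")])])]},
        {"op": "project", "vars": ["$$t", "$$25"]},
        {"op": "assign", "vars": ["$$19"],
         "exprs": [("call", "record",
                    [gf(t, "text"), gf(("var", "$$25"), "full_name")])]},
    ]


if __name__ == "__main__":
    for op in inline_single_use(example_plan()):
        print(op)
```

On this fragment the lower assign disappears, the project keeps only $$t, and the field accesses end up inside the select and the upper assign. $$19 is kept because its consumers (the top project and distribute result) are not modeled here.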
I'm leaning toward policy (2) because IScalarEvaluators are chained and
work on a per-tuple basis (almost an iterator model within a frame), which
can be more expensive in terms of function calls.
Any suggestions?
--
*Regards,*
Wail Alkowaileet