Paul Rogers commented on DRILL-1162:

The challenge here is that we are looking for a quick bug fix solution to a 
problem which appears to be quite difficult. So, there is no good answer.

On the one hand, it seems that the planner cannot correctly predict the size of 
the two sides of the join. I am leery of a simple solution that just flips the 
sides. That is likely to fix the current issue, but break many queries that now 
work. This is true because, here, we are only worried about this one query. 
But, Drill has to handle all queries, even those that this ticket is not 
concerned with. Flipping the sides is likely to cause regressions, which will 
cause us to back out the change, and put us back where we started.

Is there a principled way to make the decision to flip? Perhaps based on the 
analysis above about the effect of cascaded joins?

Second, none of this addresses the real issue: that the hash join operator uses 
too much memory (heap in one scenario, direct in another.) There is no analysis 
of why we exhaust each resource, so it is not possible to identify if any 
particular hack is likely to solve the issue.

If direct memory exhaustion is caused by excessive hash join table size, then 
spill-to-disk may solve it. But, if we are building large tables unnecessarily 
(we've seen this in other cases), then smarter planning rules, or better 
run-time adjustment, may be needed.

If the heap is exhausted, then we have no understanding of why that should be 
so. What is using heap? The hash tables use direct memory. Do we understand why 
heap was exhausted?

Without understanding these fundamentals, we are only hacking and, IMHO, one 
hack is as good as another; they are just random shots in the dark. If we have 
very limited time to fix a deep issue, then hacking is all we can do, of course.

> 25 way join ended up with OOM
> -----------------------------
>                 Key: DRILL-1162
>                 URL: https://issues.apache.org/jira/browse/DRILL-1162
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Flow, Query Planning & Optimization
>            Reporter: Rahul Challapalli
>            Assignee: Volodymyr Vysotskyi
>            Priority: Critical
>             Fix For: Future
>         Attachments: error.log, oom_error.log
> git.commit.id.abbrev=e5c2da0
> The below query results in 0 results being returned 
> {code:sql}
> select count(*) from `lineitem1.parquet` a 
> inner join `part.parquet` j on a.l_partkey = j.p_partkey 
> inner join `orders.parquet` k on a.l_orderkey = k.o_orderkey 
> inner join `supplier.parquet` l on a.l_suppkey = l.s_suppkey 
> inner join `partsupp.parquet` m on j.p_partkey = m.ps_partkey and l.s_suppkey 
> = m.ps_suppkey 
> inner join `customer.parquet` n on k.o_custkey = n.c_custkey 
> inner join `lineitem2.parquet` b on a.l_orderkey = b.l_orderkey 
> inner join `lineitem2.parquet` c on a.l_partkey = c.l_partkey 
> inner join `lineitem2.parquet` d on a.l_suppkey = d.l_suppkey 
> inner join `lineitem2.parquet` e on a.l_extendedprice = e.l_extendedprice 
> inner join `lineitem2.parquet` f on a.l_comment = f.l_comment 
> inner join `lineitem2.parquet` g on a.l_shipdate = g.l_shipdate 
> inner join `lineitem2.parquet` h on a.l_commitdate = h.l_commitdate 
> inner join `lineitem2.parquet` i on a.l_receiptdate = i.l_receiptdate 
> inner join `lineitem2.parquet` o on a.l_receiptdate = o.l_receiptdate 
> inner join `lineitem2.parquet` p on a.l_receiptdate = p.l_receiptdate 
> inner join `lineitem2.parquet` q on a.l_receiptdate = q.l_receiptdate 
> inner join `lineitem2.parquet` r on a.l_receiptdate = r.l_receiptdate 
> inner join `lineitem2.parquet` s on a.l_receiptdate = s.l_receiptdate 
> inner join `lineitem2.parquet` t on a.l_receiptdate = t.l_receiptdate 
> inner join `lineitem2.parquet` u on a.l_receiptdate = u.l_receiptdate 
> inner join `lineitem2.parquet` v on a.l_receiptdate = v.l_receiptdate 
> inner join `lineitem2.parquet` w on a.l_receiptdate = w.l_receiptdate 
> inner join `lineitem2.parquet` x on a.l_receiptdate = x.l_receiptdate;
> {code}
> However when we remove the last 'inner join' and run the query it returns 
> '716372534'. Since the last inner join is similar to the one's before it, it 
> should match some records and return the data appropriately.
> The logs indicated that it actually returned 0 results. Attached the log file.

This message was sent by Atlassian JIRA

Reply via email to