Paul Rogers commented on DRILL-1162:
The challenge here is that we are looking for a quick bug fix solution to a
problem which appears to be quite difficult. So, there is no good answer.
On the one hand, it seems that the planner cannot correctly predict the size of
the two sides of the join. I am leery of a simple solution that just flips the
sides. That is likely to fix the current issue, but break many queries that now
work. This is true because, here, we are only worried about this one query.
But, Drill has to handle all queries, even those that this ticket is not
concerned with. Flipping the sides is likely to cause regressions, which will
cause us to back out the change, and put us back where we started.
Is there a principled way to make the decision to flip? Perhaps based on the
analysis above about the effect of cascaded joins?
Second, none of this addresses the real issue: that the hash join operator uses
too much memory (heap in one scenario, direct in another.) There is no analysis
of why we exhaust each resource, so it is not possible to identify if any
particular hack is likely to solve the issue.
If direct memory exhaustion is caused by excessive hash join table size, then
spill-to-disk may solve it. But, if we are building large tables unnecessarily
(we've seen this in other cases), then smarter planning rules, or better
run-time adjustment, may be needed.
If the heap is exhausted, then we have no understanding of why that should be
so. What is using heap? The hash tables use direct memory. Do we understand why
heap was exhausted?
Without understanding these fundamentals, we are only hacking and, IMHO, one
hack is as good as another; they are just random shots in the dark. If we have
very limited time to fix a deep issue, then hacking is all we can do, of course.
> 25 way join ended up with OOM
> Key: DRILL-1162
> URL: https://issues.apache.org/jira/browse/DRILL-1162
> Project: Apache Drill
> Issue Type: Bug
> Components: Execution - Flow, Query Planning & Optimization
> Reporter: Rahul Challapalli
> Assignee: Volodymyr Vysotskyi
> Priority: Critical
> Fix For: Future
> Attachments: error.log, oom_error.log
> The below query results in 0 results being returned
> select count(*) from `lineitem1.parquet` a
> inner join `part.parquet` j on a.l_partkey = j.p_partkey
> inner join `orders.parquet` k on a.l_orderkey = k.o_orderkey
> inner join `supplier.parquet` l on a.l_suppkey = l.s_suppkey
> inner join `partsupp.parquet` m on j.p_partkey = m.ps_partkey and l.s_suppkey
> = m.ps_suppkey
> inner join `customer.parquet` n on k.o_custkey = n.c_custkey
> inner join `lineitem2.parquet` b on a.l_orderkey = b.l_orderkey
> inner join `lineitem2.parquet` c on a.l_partkey = c.l_partkey
> inner join `lineitem2.parquet` d on a.l_suppkey = d.l_suppkey
> inner join `lineitem2.parquet` e on a.l_extendedprice = e.l_extendedprice
> inner join `lineitem2.parquet` f on a.l_comment = f.l_comment
> inner join `lineitem2.parquet` g on a.l_shipdate = g.l_shipdate
> inner join `lineitem2.parquet` h on a.l_commitdate = h.l_commitdate
> inner join `lineitem2.parquet` i on a.l_receiptdate = i.l_receiptdate
> inner join `lineitem2.parquet` o on a.l_receiptdate = o.l_receiptdate
> inner join `lineitem2.parquet` p on a.l_receiptdate = p.l_receiptdate
> inner join `lineitem2.parquet` q on a.l_receiptdate = q.l_receiptdate
> inner join `lineitem2.parquet` r on a.l_receiptdate = r.l_receiptdate
> inner join `lineitem2.parquet` s on a.l_receiptdate = s.l_receiptdate
> inner join `lineitem2.parquet` t on a.l_receiptdate = t.l_receiptdate
> inner join `lineitem2.parquet` u on a.l_receiptdate = u.l_receiptdate
> inner join `lineitem2.parquet` v on a.l_receiptdate = v.l_receiptdate
> inner join `lineitem2.parquet` w on a.l_receiptdate = w.l_receiptdate
> inner join `lineitem2.parquet` x on a.l_receiptdate = x.l_receiptdate;
> However when we remove the last 'inner join' and run the query it returns
> '716372534'. Since the last inner join is similar to the one's before it, it
> should match some records and return the data appropriately.
> The logs indicated that it actually returned 0 results. Attached the log file.
This message was sent by Atlassian JIRA