[
https://issues.apache.org/jira/browse/PHOENIX-3224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15453323#comment-15453323
]
Maryann Xue commented on PHOENIX-3224:
--------------------------------------
The sort-merge join takes inputs from both sides of the join in a streaming
fashion, so the performance should be close to that of sorting both tables.
There is a chance, though, that the sort-merge join has to cache a lot of data
on the client side and performs really badly. This happens when a large number
of rows on both sides share the same join key. In that case there will be
caching and backtracking to cross join all those rows.
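
To illustrate the point about duplicate join keys, here is a minimal sketch of
a streaming merge join in Python. It is not Phoenix's implementation: the
function name, the (key, value) row shape, and the stats dictionary are all
assumptions made for the example. As a simplification it buffers only the
right-hand rows for the current key. It also assumes both inputs are already
sorted by key and that keys are comparable and never None.

```python
from itertools import groupby
from operator import itemgetter


def sort_merge_join(left, right, stats=None):
    """Inner-join two iterables of (key, value) rows, each sorted by key.

    Yields (key, left_value, right_value). If a dict is passed as `stats`,
    stats["max_buffered"] records the largest number of rows held in memory
    at once (hypothetical bookkeeping, only for illustration).
    """
    if stats is None:
        stats = {}
    stats["max_buffered"] = 0
    done = (None, None)  # sentinel returned when an input is exhausted

    left_groups = groupby(left, key=itemgetter(0))
    right_groups = groupby(right, key=itemgetter(0))
    lk, lg = next(left_groups, done)
    rk, rg = next(right_groups, done)

    while lg is not None and rg is not None:
        if lk < rk:
            # No match possible for this left key: skip ahead, nothing cached.
            lk, lg = next(left_groups, done)
        elif lk > rk:
            rk, rg = next(right_groups, done)
        else:
            # Equal keys: the right-hand rows must be cached so they can be
            # replayed ("backtracked") for every matching left-hand row.
            buffered = [value for _, value in rg]
            stats["max_buffered"] = max(stats["max_buffered"], len(buffered))
            for _, left_value in lg:
                for right_value in buffered:
                    yield (lk, left_value, right_value)
            lk, lg = next(left_groups, done)
            rk, rg = next(right_groups, done)


left_rows = [(1, "a"), (2, "b"), (2, "c"), (3, "d")]
right_rows = [(2, "x"), (2, "y"), (4, "z")]
join_stats = {}
joined = list(sort_merge_join(left_rows, right_rows, join_stats))
```

With unique keys the buffer never holds more than one row, so memory stays
flat. When m left rows and n right rows share one key, the sketch caches n rows
and emits m * n output rows, which is the case described above.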
> Observations from large scale testing.
> --------------------------------------
>
> Key: PHOENIX-3224
> URL: https://issues.apache.org/jira/browse/PHOENIX-3224
> Project: Phoenix
> Issue Type: Task
> Reporter: Lars Hofhansl
>
> We have a >1000 node physical cluster at our disposal for a short time,
> before it'll be handed off to its intended use.
> Loaded a bunch of data (TPCs LINEITEM table, among others) and ran a bunch of
> queries. Most tables are between 100G and 500G (uncompressed) and between
> 600m and 2bn rows.
> The good news is that many things just worked. We sorted > 400G in < 5s with
> HBase and Phoenix. Scans work. Joins work (as long as one side is kept under
> 1m rows or so).
> For the issues we observed I'll file sub jiras under this.
> I'm going to write a blog post about this and attach a link here.