Ning,

Thanks for the explanation. I was able to take your example semijoin query and adapt it to my use case. Hive builds 3 map/reduce jobs from the query. The first one completes, but the second one still times out on the reducer. Perhaps my cluster is too small for this? It's only 4 nodes.
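For reference, a working form of the simulated semijoin from the thread below looks roughly like this (a sketch only: SELECT DISTINCT instead of SELECT (DISTINCT ...), the WHERE moved ahead of the GROUP BY, plus the subquery alias Hive requires and the ON clause the semijoin semantics need):

  SELECT DISTINCT ut.userid
  FROM usertracking ut
  JOIN (
    SELECT usertrackingid
    FROM streamtransfers
    WHERE usertrackingid IS NOT NULL AND usertrackingid <> 0
    GROUP BY usertrackingid
  ) st ON (ut.usertrackingid = st.usertrackingid)
  WHERE ut.userid IS NOT NULL AND ut.userid <> 0;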
On Mon, Oct 26, 2009 at 1:47 PM, Ning Zhang <[email protected]> wrote:

> This query is using the regular join to simulate a semijoin. The proposed
> semijoin syntax is something like:
>
>   select A.id from A left semi join B on A.id = B.id;
>
> Semantically, the semijoin is the same as the simulated query using JOIN,
> but the proposed semijoin operator is more efficient in two ways:
> 1) The simulation has a group by, which requires a separate map-reduce
> task. In the semijoin, the group by is map-side only, so we can eliminate
> one map-reduce task.
> 2) The semijoin implements early-exit semantics: as soon as a match is
> found in B, the remaining tuples in B with that key are skipped, whereas
> the JOIN keeps matching even after a match has been found.
>
> Thanks,
> Ning
>
> On Oct 26, 2009, at 9:56 AM, Ryan LeCompte wrote:
>
> I got the query working (my bad syntax caused it to error out)... However,
> can you please explain what Hive is doing in your example query for the
> semijoin?
>
> Thanks,
> Ryan
>
> On Mon, Oct 26, 2009 at 12:52 PM, Ryan LeCompte <[email protected]> wrote:
>
>> Also, if I try your query I get:
>>
>>   FAILED: Parse Error: line 1:8 cannot recognize input 'DISTINCT' in
>>   expression specification
>>
>> Seems like it doesn't like SELECT (DISTINCT ...)
>>
>> On Mon, Oct 26, 2009 at 12:47 PM, Ryan LeCompte <[email protected]> wrote:
>>
>>> Ah, got it!
>>>
>>> Can you explain your semijoin query though? I see that you are using
>>> JOIN but I don't see the ON (...) part. What does Hive do in this case
>>> when ON (...) doesn't exist?
>>>
>>> On Mon, Oct 26, 2009 at 12:41 PM, Ning Zhang <[email protected]> wrote:
>>>
>>>> I see. If you just want all userids in one table that also appear in
>>>> another table, a semijoin would help (it is still under implementation
>>>> though). You can simulate a semijoin using JOIN as follows:
>>>>
>>>>   select (distinct ut.userid) from usertracking ut join (select
>>>>   usertrackingid from streamtransfers group by usertrackingid where
>>>>   usertrackingid is not null and usertrackingid <> 0) where userid is
>>>>   not null and userid <> 0;
>>>>
>>>> BTW, the rewritten query using UNION ALL is not semantically equivalent
>>>> to the original query: it doesn't return the "common" user IDs in both
>>>> tables, just the "union" of them.
>>>>
>>>> Ning
>>>>
>>>> On Oct 26, 2009, at 9:33 AM, Ryan LeCompte wrote:
>>>>
>>>> Hi Ning,
>>>>
>>>> Basically, I have roughly 60GB of log data that is logically broken up
>>>> into the two Hive tables I mentioned. Both tables share a common key,
>>>> and this key appears in virtually every row of each table. Each table
>>>> has just one partition (by date). My query is very similar (although
>>>> with different data/columns) to Chris Bates' query:
>>>>
>>>>   SELECT COUNT(DISTINCT UT.UserID) FROM usertracking UT JOIN
>>>>   streamtransfers ST ON (ST.usertrackingid = UT.usertrackingid) WHERE
>>>>   UT.UserID IS NOT NULL AND UT.UserID <> 0;
>>>>
>>>> However, in the above example imagine that the same users appear a lot
>>>> in both tables, so there are lots of matches.
>>>>
>>>> On Mon, Oct 26, 2009 at 12:27 PM, Ning Zhang <[email protected]> wrote:
>>>>
>>>>> If it really is a Cartesian product, there is no better way other than
>>>>> increasing the timeout for the reducers. You can do a
>>>>> back-of-the-envelope calculation of how long it will take (e.g., your
>>>>> log shows it takes 19 sec. to get 6 million rows out of the join).
>>>>> This calculation can also be done if it is not a Cartesian product, as
>>>>> long as you have a good estimate of how many rows will be produced.
>>>>>
>>>>> In general, joining two huge tables can be avoided by partitioning the
>>>>> fact tables, so that you don't need to join the whole tables.
>>>>>
>>>>> BTW, are you joining two fact tables, or is one dimension table just
>>>>> huge?
>>>>>
>>>>> Thanks,
>>>>> Ning
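To put numbers on that back-of-the-envelope suggestion: the log excerpt further down shows the join forwarding about 6 million rows every 19 seconds, roughly 315,000 rows per second, so a join producing a billion rows needs on the order of 3,000 seconds of reduce time, close to an hour. The partitioning suggestion might look like the following sketch; the "dt" partition column is hypothetical (the thread says each table currently has only a single date partition, so the data would first need loading into finer-grained partitions):

  -- Sketch only: assumes both tables are repartitioned by day on a
  -- hypothetical `dt` column, so the join reads one day of data
  -- instead of the full 33GB and 13GB tables.
  SELECT COUNT(DISTINCT UT.UserID)
  FROM usertracking UT
  JOIN streamtransfers ST ON (ST.usertrackingid = UT.usertrackingid)
  WHERE UT.dt = '2009-10-25' AND ST.dt = '2009-10-25'
    AND UT.UserID IS NOT NULL AND UT.UserID <> 0;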
>>>>> On Oct 26, 2009, at 5:16 AM, Ryan LeCompte wrote:
>>>>>
>>>>> So it turns out that the JOIN key of my query basically results in a
>>>>> match/join on all rows of each table! There really is no extra
>>>>> filtering that I can do to exclude invalid rows, etc. The mappers fly
>>>>> by and complete, but the reducers are moving extremely slowly (my
>>>>> guess: due to what Zheng said about the Cartesian product of all the
>>>>> matching rows).
>>>>>
>>>>> Is there some other way that I could re-write the JOIN, or is my only
>>>>> option to increase the timeout on the task trackers so that they
>>>>> don't time out/kill the reducers? I've already upped their timeouts
>>>>> to 30 minutes (as opposed to the default of 10), and it doesn't seem
>>>>> to be sufficient... Again, this is joining a 33GB table with a 13GB
>>>>> table where the join key is shared by virtually all rows in both
>>>>> tables.
>>>>>
>>>>> Thanks,
>>>>> Ryan
>>>>>
>>>>> On Mon, Oct 26, 2009 at 7:35 AM, Ryan LeCompte <[email protected]> wrote:
>>>>>
>>>>>> Thanks guys, very useful information. I will modify my query a bit
>>>>>> and get back to you on whether it worked or not.
>>>>>>
>>>>>> Thanks,
>>>>>> Ryan
>>>>>>
>>>>>> On Mon, Oct 26, 2009 at 4:34 AM, Chris Bates
>>>>>> <[email protected]> wrote:
>>>>>>
>>>>>>> Ryan,
>>>>>>>
>>>>>>> I asked this question a couple of days ago, but in a slightly
>>>>>>> different form. What you have to do is make sure the table you're
>>>>>>> joining is smaller than the leftmost table. As an example:
>>>>>>>
>>>>>>>   SELECT COUNT(DISTINCT UT.UserID) FROM usertracking UT JOIN
>>>>>>>   streamtransfers ST ON (ST.usertrackingid = UT.usertrackingid)
>>>>>>>   WHERE UT.UserID IS NOT NULL AND UT.UserID <> 0;
>>>>>>>
>>>>>>> In this query, usertracking is a table that is about 8 or 9 gigs,
>>>>>>> and streamtransfers is a table that is about 4 gigs. As per Zheng's
>>>>>>> comment, I omitted UserIDs of NULL or zero, as there are many rows
>>>>>>> with those keys, and the join worked as intended.
>>>>>>>
>>>>>>> Chris
>>>>>>>
>>>>>>> PS. As an aside, Hive is proving to be quite useful to all of our
>>>>>>> database hackers here at Grooveshark. Thanks to everyone who has
>>>>>>> committed... I hope to contribute soon.
>>>>>>>
>>>>>>> On Mon, Oct 26, 2009 at 2:08 AM, Zheng Shao <[email protected]> wrote:
>>>>>>>
>>>>>>>> It's probably caused by the Cartesian product of many rows from
>>>>>>>> the two tables with the same key.
>>>>>>>>
>>>>>>>> Zheng
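The timeout being raised in this thread is Hadoop's task timeout, which kills a task attempt that neither finishes nor reports progress within the window. A minimal sketch of raising it for one Hive session (property name from the Hadoop 0.20 era; the value is in milliseconds, so 1800000 is 30 minutes, up from the 600000 default); the same property can also be set cluster-wide in hadoop-site.xml:

  -- A sketch; run from the Hive CLI before submitting the join query.
  set mapred.task.timeout=1800000;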
>>>>>>>> On Sun, Oct 25, 2009 at 7:22 PM, Ryan LeCompte <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> It also looks like the reducers never stop outputting lines like
>>>>>>>>> the following, causing them to ultimately time out and get killed
>>>>>>>>> by the system:
>>>>>>>>>
>>>>>>>>>   2009-10-25 22:21:18,879 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 100000000 rows
>>>>>>>>>   2009-10-25 22:21:22,009 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 101000000 rows
>>>>>>>>>   2009-10-25 22:21:22,010 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 101000000 rows
>>>>>>>>>   2009-10-25 22:21:25,141 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 102000000 rows
>>>>>>>>>   2009-10-25 22:21:25,142 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 102000000 rows
>>>>>>>>>   2009-10-25 22:21:28,263 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 103000000 rows
>>>>>>>>>   2009-10-25 22:21:28,263 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 103000000 rows
>>>>>>>>>   2009-10-25 22:21:31,387 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 104000000 rows
>>>>>>>>>   2009-10-25 22:21:31,387 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 104000000 rows
>>>>>>>>>   2009-10-25 22:21:34,510 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 105000000 rows
>>>>>>>>>   2009-10-25 22:21:34,510 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 105000000 rows
>>>>>>>>>   2009-10-25 22:21:37,633 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 106000000 rows
>>>>>>>>>   2009-10-25 22:21:37,633 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 106000000 rows
>>>>>>>>>
>>>>>>>>> On Sun, Oct 25, 2009 at 9:39 PM, Ryan LeCompte <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hello all,
>>>>>>>>>>
>>>>>>>>>> Should I expect to be able to do a Hive JOIN between two tables
>>>>>>>>>> that have about 10 or 15GB of data each? What I'm noticing (for
>>>>>>>>>> a simple JOIN) is that all the map tasks complete, but the
>>>>>>>>>> reducers just hang at around 87% or so (for the first set of 4
>>>>>>>>>> reducers), and then they eventually get killed by the cluster
>>>>>>>>>> for failing to respond. I can do a JOIN between a large table
>>>>>>>>>> and a very small table of 10 or so records just fine.
>>>>>>>>>>
>>>>>>>>>> Any thoughts?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Ryan
>>>>>>>>
>>>>>>>> --
>>>>>>>> Yours,
>>>>>>>> Zheng
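Once the LEFT SEMI JOIN described at the top of the thread is implemented, the count query could be expressed directly, with no GROUP BY subquery and with early exit per key. A sketch against the proposed syntax (unverified against the implementation, which was still in progress at the time of this thread; note that in a semijoin the right-hand table may only be referenced in the ON clause):

  SELECT COUNT(DISTINCT UT.UserID)
  FROM usertracking UT
  LEFT SEMI JOIN streamtransfers ST ON (ST.usertrackingid = UT.usertrackingid)
  WHERE UT.UserID IS NOT NULL AND UT.UserID <> 0;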
