Ah, I see. And it's probably not worth trying to change the hive.exec.reducers.bytes.per.reducer to something smaller, right?
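(For reference, the setting can be overridden per session from the Hive CLI, either directly or by forcing a reducer count. A minimal sketch with illustrative values, assuming a 0.4-era build:

    -- roughly 256MB per reducer instead of the 1GB default
    set hive.exec.reducers.bytes.per.reducer=256000000;
    -- or simply force a fixed number of reducers
    set mapred.reduce.tasks=8;

On a 4-node cluster the extra reducers mostly run in additional waves rather than in parallel, which seems to be Ning's point below.)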
On Mon, Oct 26, 2009 at 2:12 PM, Ning Zhang <[email protected]> wrote:

> Yes, 4 nodes is probably too small for this. The default value of bytes
> per reducer is 1GB (see hive.exec.reducers.bytes.per.reducer in
> hive-default.xml).
>
> Ning
>
> On Oct 26, 2009, at 10:56 AM, Ryan LeCompte wrote:
>
> Ning,
>
> Thanks for the explanation. I was able to take your example semijoin
> query and adapt it to my use case. Hive builds 3 map/reduce jobs from the
> query. The first one completes, but the second one still times out on the
> reducer. Perhaps my cluster is too small for this? It's only 4 nodes.
>
> On Mon, Oct 26, 2009 at 1:47 PM, Ning Zhang <[email protected]> wrote:
>
>> This query is using a regular join to simulate a semijoin. The proposed
>> semijoin syntax is something like:
>>
>> select A.id from A left semi join B on A.id = B.id;
>>
>> Semantically a semijoin is the same as the simulated query using JOIN,
>> but the proposed semijoin operation is more efficient in that:
>> 1) the simulation has a group by, which requires a separate map-reduce
>> task. In the semijoin, the group by is map-side only, so we can
>> eliminate one map-reduce task.
>> 2) the semijoin implements early-exit semantics: whenever a match is
>> found in B, the rest of the tuples in B are skipped, while the JOIN
>> keeps matching even after a match is found.
>>
>> So the key difference between the semijoin implementation and the
>> simulation using a regular join is that the group by in the semijoin
>> happens on the map side.
>>
>> Thanks,
>> Ning
>>
>> On Oct 26, 2009, at 9:56 AM, Ryan LeCompte wrote:
>>
>> I got the query working (my bad syntax caused it to error out)...
>> However, can you please explain what Hive is doing in your example query
>> for the semijoin?
>>
>> Thanks,
>> Ryan
>>
>> On Mon, Oct 26, 2009 at 12:52 PM, Ryan LeCompte <[email protected]> wrote:
>>
>>> Also, if I try your query I get:
>>>
>>> FAILED: Parse Error: line 1:8 cannot recognize input 'DISTINCT' in
>>> expression specification
>>>
>>> Seems like it doesn't like SELECT (DISTINCT ...)
>>>
>>> On Mon, Oct 26, 2009 at 12:47 PM, Ryan LeCompte <[email protected]> wrote:
>>>
>>>> Ah, got it!
>>>>
>>>> Can you explain your semijoin query though? I see that you are using
>>>> JOIN, but I don't see the ON (...) part. What does Hive do in this
>>>> case when ON (..) doesn't exist?
>>>>
>>>> On Mon, Oct 26, 2009 at 12:41 PM, Ning Zhang <[email protected]> wrote:
>>>>
>>>>> I see. If you just want all userids in one table that also appear in
>>>>> another table, a semijoin would help (it is still under
>>>>> implementation though). You can simulate a semijoin using JOIN as
>>>>> follows:
>>>>>
>>>>> select (distinct ut.userid) from usertracking ut join (select
>>>>> usertrackingid from streamtransfers group by usertrackingid where
>>>>> usertrackingid is not null and usertrackingid <> 0) where userid is
>>>>> not null and userid <> 0;
>>>>>
>>>>> BTW, the rewritten query using UNION ALL is not semantically
>>>>> equivalent to the original query: it doesn't return the "common" user
>>>>> IDs in both tables, but just the "union" of them.
>>>>>
>>>>> Ning
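(As written, the simulated query above will not parse: the DISTINCT is wrapped in parentheses, the subquery puts WHERE after GROUP BY and has no alias, and the join has no ON clause, which is what Ryan's parse error and ON question above refer to. A minimal sketch of what was presumably intended, untested, with the subquery alias and the join condition filled in as assumptions:

    SELECT DISTINCT ut.userid
    FROM usertracking ut
    JOIN (
      SELECT usertrackingid
      FROM streamtransfers
      WHERE usertrackingid IS NOT NULL AND usertrackingid <> 0
      GROUP BY usertrackingid
    ) st ON (ut.usertrackingid = st.usertrackingid)
    WHERE ut.userid IS NOT NULL AND ut.userid <> 0;

The GROUP BY in the subquery leaves one row per usertrackingid, so the join cannot multiply matching rows; that deduplication step is the extra map-reduce stage a real LEFT SEMI JOIN would avoid.)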
>>>>> On Oct 26, 2009, at 9:33 AM, Ryan LeCompte wrote:
>>>>>
>>>>> Hi Ning,
>>>>>
>>>>> Basically, I have roughly 60GB of log data that is logically broken up
>>>>> into two Hive tables, the two tables I mentioned. Both tables share a
>>>>> common key, and that key appears in virtually every row of each table.
>>>>> Each table has just one partition (by date). My query is very similar
>>>>> (although with different data/columns) to Chris Bates's query:
>>>>>
>>>>> SELECT COUNT(DISTINCT UT.UserID) FROM usertracking UT JOIN
>>>>> streamtransfers ST ON (ST.usertrackingid = UT.usertrackingid) WHERE
>>>>> UT.UserID IS NOT NULL AND UT.UserID <> 0;
>>>>>
>>>>> However, in the above example imagine that the same users appear a lot
>>>>> in both tables, so there are lots of matches.
>>>>>
>>>>> On Mon, Oct 26, 2009 at 12:27 PM, Ning Zhang <[email protected]> wrote:
>>>>>
>>>>>> If it is really a Cartesian product, there is no better way other
>>>>>> than increasing the timeout for the reducers. You can do a
>>>>>> back-of-the-envelope calculation of how long it will take (e.g., your
>>>>>> log shows it takes about 19 sec. to produce 6 million rows out of the
>>>>>> join). The same calculation can be done if it is not a Cartesian
>>>>>> product, provided you have a good estimate of how many rows will be
>>>>>> produced.
>>>>>>
>>>>>> In general, joining two huge tables can be avoided by partitioning
>>>>>> the fact tables, so that you don't need to join the whole table.
>>>>>>
>>>>>> BTW, are you joining two fact tables, or is one dimension table just
>>>>>> huge?
>>>>>>
>>>>>> Thanks,
>>>>>> Ning
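(A rough sketch of the partitioning suggestion, using an illustrative table name and dt partition column that are not from the thread: with a date-partitioned fact table, the join only has to read the days named in the predicate rather than the whole table.

    CREATE TABLE streamtransfers_by_day (usertrackingid BIGINT)
    PARTITIONED BY (dt STRING);

    -- Only the partitions matching the dt predicate should be scanned,
    -- so the join processes one day of streamtransfers instead of all of it.
    SELECT COUNT(DISTINCT ut.userid)
    FROM usertracking ut
    JOIN streamtransfers_by_day st ON (st.usertrackingid = ut.usertrackingid)
    WHERE st.dt = '2009-10-25'
      AND ut.userid IS NOT NULL AND ut.userid <> 0;

How the daily partitions get loaded is left out here.)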
>>>>>> On Oct 26, 2009, at 5:16 AM, Ryan LeCompte wrote:
>>>>>>
>>>>>> So it turns out that the JOIN key of my query basically results in a
>>>>>> match/join on all rows of each table! There really is no extra
>>>>>> filtering I can do to exclude invalid rows, etc. The mappers fly by
>>>>>> and complete, but the reducers move extremely slowly (my guess is due
>>>>>> to what Zheng said about the Cartesian product of all rows getting
>>>>>> matched).
>>>>>>
>>>>>> Is there some other way I could rewrite the JOIN, or is my only
>>>>>> option to increase the timeout on the task trackers so that they
>>>>>> don't time out and kill the reducers? I've already upped their
>>>>>> timeouts to 30 minutes (as opposed to the default of 10), and it
>>>>>> doesn't seem to be sufficient... Again, this is joining a 33GB table
>>>>>> with a 13GB table where the join key is shared by virtually all rows
>>>>>> in both tables.
>>>>>>
>>>>>> Thanks,
>>>>>> Ryan
>>>>>>
>>>>>> On Mon, Oct 26, 2009 at 7:35 AM, Ryan LeCompte <[email protected]> wrote:
>>>>>>
>>>>>>> Thanks guys, very useful information. I will modify my query a bit
>>>>>>> and get back to you on whether it worked or not.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Ryan
>>>>>>>
>>>>>>> On Mon, Oct 26, 2009 at 4:34 AM, Chris Bates
>>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>>>> Ryan,
>>>>>>>>
>>>>>>>> I asked this question a couple of days ago, but in a slightly
>>>>>>>> different form. What you have to do is make sure the table you're
>>>>>>>> joining is smaller than the leftmost table. As an example:
>>>>>>>>
>>>>>>>> SELECT COUNT(DISTINCT UT.UserID) FROM usertracking UT JOIN
>>>>>>>> streamtransfers ST ON (ST.usertrackingid = UT.usertrackingid) WHERE
>>>>>>>> UT.UserID IS NOT NULL AND UT.UserID <> 0;
>>>>>>>>
>>>>>>>> In this query, usertracking is a table of about 8 or 9 gigs and
>>>>>>>> streamtransfers is a table of about 4 gigs. As per Zheng's comment,
>>>>>>>> I omitted UserIDs of null or zero, since there are many rows with
>>>>>>>> those keys, and the join worked as intended.
>>>>>>>>
>>>>>>>> Chris
>>>>>>>>
>>>>>>>> PS. As an aside, Hive is proving to be quite useful to all of our
>>>>>>>> database hackers here at Grooveshark. Thanks to everyone who has
>>>>>>>> committed... I hope to contribute soon.
>>>>>>>>
>>>>>>>> On Mon, Oct 26, 2009 at 2:08 AM, Zheng Shao <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> It's probably caused by the Cartesian product of many rows from
>>>>>>>>> the two tables with the same key.
>>>>>>>>>
>>>>>>>>> Zheng
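(To put a number on the Cartesian-product effect Zheng and Ning describe, one option is to count rows per join key in each table and sum the products. This is only a sketch, reusing the table and column names from the thread; nobody in the thread actually ran it:

    -- Estimated join output size: for each shared key, rows in one table
    -- times rows in the other, summed over all keys.
    SELECT SUM(a.cnt * b.cnt)
    FROM (SELECT usertrackingid, COUNT(1) AS cnt
          FROM usertracking
          GROUP BY usertrackingid) a
    JOIN (SELECT usertrackingid, COUNT(1) AS cnt
          FROM streamtransfers
          GROUP BY usertrackingid) b
    ON (a.usertrackingid = b.usertrackingid);

Dividing the result by the rate visible in the reducer logs below (roughly 6 million joined rows per 19 seconds per reducer) gives the back-of-the-envelope running time Ning mentions.)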
>>>>>>>>> On Sun, Oct 25, 2009 at 7:22 PM, Ryan LeCompte <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> It also looks like the reducers just never stop outputting things
>>>>>>>>>> like the following (see below), causing them to ultimately time
>>>>>>>>>> out and get killed by the system.
>>>>>>>>>>
>>>>>>>>>> 2009-10-25 22:21:18,879 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 100000000 rows
>>>>>>>>>> 2009-10-25 22:21:22,009 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 101000000 rows
>>>>>>>>>> 2009-10-25 22:21:22,010 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 101000000 rows
>>>>>>>>>> 2009-10-25 22:21:25,141 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 102000000 rows
>>>>>>>>>> 2009-10-25 22:21:25,142 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 102000000 rows
>>>>>>>>>> 2009-10-25 22:21:28,263 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 103000000 rows
>>>>>>>>>> 2009-10-25 22:21:28,263 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 103000000 rows
>>>>>>>>>> 2009-10-25 22:21:31,387 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 104000000 rows
>>>>>>>>>> 2009-10-25 22:21:31,387 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 104000000 rows
>>>>>>>>>> 2009-10-25 22:21:34,510 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 105000000 rows
>>>>>>>>>> 2009-10-25 22:21:34,510 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 105000000 rows
>>>>>>>>>> 2009-10-25 22:21:37,633 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 106000000 rows
>>>>>>>>>> 2009-10-25 22:21:37,633 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 106000000 rows
>>>>>>>>>>
>>>>>>>>>> On Sun, Oct 25, 2009 at 9:39 PM, Ryan LeCompte <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hello all,
>>>>>>>>>>>
>>>>>>>>>>> Should I expect to be able to do a Hive JOIN between two tables
>>>>>>>>>>> that have about 10 or 15GB of data each? What I'm noticing (for
>>>>>>>>>>> a simple JOIN) is that all the map tasks complete, but the
>>>>>>>>>>> reducers just hang at around 87% or so (for the first set of 4
>>>>>>>>>>> reducers), and then they eventually get killed by the cluster
>>>>>>>>>>> for failing to respond. I can do a JOIN between a large table
>>>>>>>>>>> and a very small table of 10 or so records just fine.
>>>>>>>>>>>
>>>>>>>>>>> Any thoughts?
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Ryan
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Yours,
>>>>>>>>> Zheng
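(If the join output really is that large and the reducers are doing legitimate work, the other knob is the task timeout Ryan mentions. A minimal sketch, assuming a Hadoop 0.19/0.20-era property name and that the cluster allows per-job overrides; the value is in milliseconds:

    -- 60 minutes instead of the 10-minute default, for this session's jobs only
    set mapred.task.timeout=3600000;

This only buys the reducers more time; it does not make the join output any smaller.)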
