Also, if I try your query I get:

    FAILED: Parse Error: line 1:8 cannot recognize input 'DISTINCT' in expression specification

Seems like it doesn't like SELECT (DISTINCT ...).
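A version of the simulated semijoin quoted below that should get past the parser (DISTINCT moved out of the parentheses, the subquery given an alias, WHERE placed before GROUP BY, and the missing ON clause added; a sketch using this thread's table and column names, not a tested query):

    SELECT DISTINCT ut.userid
    FROM usertracking ut
    JOIN (
      SELECT usertrackingid
      FROM streamtransfers
      WHERE usertrackingid IS NOT NULL AND usertrackingid <> 0
      GROUP BY usertrackingid
    ) st
    ON (ut.usertrackingid = st.usertrackingid)
    WHERE ut.userid IS NOT NULL AND ut.userid <> 0;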
On Mon, Oct 26, 2009 at 12:47 PM, Ryan LeCompte <[email protected]> wrote:

Ah, got it!

Can you explain your semijoin query though? I see that you are using JOIN,
but I don't see the ON (...) part. What does Hive do in this case when the
ON (...) doesn't exist?


On Mon, Oct 26, 2009 at 12:41 PM, Ning Zhang <[email protected]> wrote:

I see. If you just want all userids in one table that also appear in
another table, a semijoin would help (it is still under implementation,
though). You can simulate a semijoin using JOIN as follows:

    select (distinct ut.userid) from usertracking ut join (select
    usertrackingid from streamtransfers group by usertrackingid where
    usertrackingid is not null and usertrackingid <> 0) where userid
    is not null and userid <> 0;

BTW, the rewritten query using UNION ALL is not semantically equivalent to
the original query: it doesn't return the "common" user IDs in both
tables, but just the "union" of them.

Ning


On Oct 26, 2009, at 9:33 AM, Ryan LeCompte wrote:

Hi Ning,

Basically, I have roughly 60GB of log data that is logically broken up
into two Hive tables, which works out to be those two tables that I had
mentioned. Both of these tables share a common key, but this key appears
in virtually every row of each of the two tables. Each of these tables
has just one partition (by date). My query is very similar (although with
different data/columns) to Chris Bates' query:

    SELECT COUNT(DISTINCT UT.UserID) FROM usertracking UT JOIN
    streamtransfers ST ON (ST.usertrackingid = UT.usertrackingid)
    WHERE UT.UserID IS NOT NULL AND UT.UserID <> 0;

However, in the above example, imagine that the same users appear a lot
in both tables, so there are lots of matches.


On Mon, Oct 26, 2009 at 12:27 PM, Ning Zhang <[email protected]> wrote:

If it is really a Cartesian product, there is no better way other than
increasing the timeout for the reducers. You can do a back-of-the-envelope
calculation of how long it will take (e.g., your log shows it takes 19
sec. to get 6 million rows out of the join). This calculation can also be
done if it is not a Cartesian product and you have a good estimate of how
many rows will be produced.

In general, joining two huge tables can be avoided by partitioning the
fact tables, so that you don't need to join the whole table.

BTW, are you joining two fact tables, or is one dimension table just huge?

Thanks,
Ning


On Oct 26, 2009, at 5:16 AM, Ryan LeCompte wrote:

So it turns out that the JOIN key of my query basically results in a
match/join on all rows of each table! There really is no extra filtering
that I can do to exclude invalid rows, etc. The mappers fly by and
complete, but the reducers are moving extremely slowly (my guess, due to
what Zheng said about the Cartesian product of all rows getting matched).

Is there some other way that I could rewrite the JOIN, or is my only
option to increase the timeout on the task trackers so that they don't
time out and kill the reducers? I've already upped their timeouts to 30
minutes (as opposed to the default of 10), and it doesn't seem to be
sufficient... Again, this is joining a 33GB table with a 13GB table where
the join key is shared by virtually all rows in both tables.

Thanks,
Ryan
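The back-of-the-envelope row estimate Ning describes can be done in Hive itself. The sketch below (using this thread's table and column names; COUNT(1) rather than COUNT(*), which is the idiom in Hive of this vintage) counts how often each join key occurs in each table and multiplies the two counts; summing joined_rows over all keys gives the total number of rows the join will emit:

    SELECT a.usertrackingid, a.cnt * b.cnt AS joined_rows
    FROM (SELECT usertrackingid, COUNT(1) AS cnt
          FROM usertracking
          GROUP BY usertrackingid) a
    JOIN (SELECT usertrackingid, COUNT(1) AS cnt
          FROM streamtransfers
          GROUP BY usertrackingid) b
    ON (a.usertrackingid = b.usertrackingid);

If a single placeholder key (such as 0) has a large cnt in both tables, its product dominates the output, which is exactly the Cartesian-product blowup described in this thread.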
On Mon, Oct 26, 2009 at 7:35 AM, Ryan LeCompte <[email protected]> wrote:

Thanks guys, very useful information. I will modify my query a bit and
get back to you on whether it worked or not.

Thanks,
Ryan


On Mon, Oct 26, 2009 at 4:34 AM, Chris Bates <[email protected]> wrote:

Ryan,

I asked this question a couple of days ago, but in a slightly different
form. What you have to do is make sure the table you're joining is smaller
than the leftmost table. As an example:

    SELECT COUNT(DISTINCT UT.UserID) FROM usertracking UT JOIN
    streamtransfers ST ON (ST.usertrackingid = UT.usertrackingid)
    WHERE UT.UserID IS NOT NULL AND UT.UserID <> 0;

In this query, usertracking is a table that is about 8 or 9 gigs, and
streamtransfers is a table that is about 4 gigs. As per Zheng's comment,
I omitted UserIDs of NULL or zero, as there are many rows with those
keys, and the join worked as intended.

Chris

PS. As an aside, Hive is proving to be quite useful to all of our
database hackers here at Grooveshark. Thanks to everyone who has
committed... I hope to contribute soon.


On Mon, Oct 26, 2009 at 2:08 AM, Zheng Shao <[email protected]> wrote:

It's probably caused by the Cartesian product of many rows from the two
tables with the same key.

Zheng


On Sun, Oct 25, 2009 at 7:22 PM, Ryan LeCompte <[email protected]> wrote:

It also looks like the reducers just never stop outputting things like
the following, causing them to ultimately time out and get killed by the
system:
    2009-10-25 22:21:18,879 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 100000000 rows
    2009-10-25 22:21:22,009 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 101000000 rows
    2009-10-25 22:21:22,010 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 101000000 rows
    2009-10-25 22:21:25,141 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 102000000 rows
    2009-10-25 22:21:25,142 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 102000000 rows
    2009-10-25 22:21:28,263 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 103000000 rows
    2009-10-25 22:21:28,263 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 103000000 rows
    2009-10-25 22:21:31,387 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 104000000 rows
    2009-10-25 22:21:31,387 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 104000000 rows
    2009-10-25 22:21:34,510 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 105000000 rows
    2009-10-25 22:21:34,510 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 105000000 rows
    2009-10-25 22:21:37,633 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 106000000 rows
    2009-10-25 22:21:37,633 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 106000000 rows


On Sun, Oct 25, 2009 at 9:39 PM, Ryan LeCompte <[email protected]> wrote:

Hello all,

Should I expect to be able to do a Hive JOIN between two tables that have
about 10 or 15GB of data each? What I'm noticing (for a simple JOIN) is
that all the map tasks complete, but the reducers just hang at around 87%
or so (for the first set of 4 reducers), and then they eventually get
killed by the cluster for failing to respond. I can do a JOIN between a
large table and a very small table of 10 or so records just fine.

Any thoughts?

Thanks,
Ryan
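On the reducer-timeout question discussed above: the knob that governs how long a task may go without reporting progress is Hadoop's mapred.task.timeout, in milliseconds (its default of 600000 matches the 10 minutes mentioned in the thread). Since Hive's set command passes properties through to the job configuration, it can be raised per session before running the join. A sketch, not a tested recipe:

    -- allow reducers 60 minutes without progress before the tracker kills them
    set mapred.task.timeout=3600000;

    -- then run the join in the same session
    SELECT COUNT(DISTINCT UT.UserID)
    FROM usertracking UT
    JOIN streamtransfers ST ON (ST.usertrackingid = UT.usertrackingid)
    WHERE UT.UserID IS NOT NULL AND UT.UserID <> 0;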
