On Wed, Jul 21, 2010 at 3:38 PM, Renato Marroquín Mogrovejo <[email protected]> wrote:
> Does anybody else have any Hive experience that they would like to share? Please?
> Any kind of suggestion is highly appreciated.
> Thanks in advance.
>
> Renato M.
>
> 2010/7/19 Renato Marroquín Mogrovejo <[email protected]>
>>
>> Hi Ashish,
>>
>> I mean whether there are modeling best practices for obtaining better
>> performance (related to buckets, partitions, and tables), e.g. maybe creating
>> different partitions considering not just time frames but also partition
>> size, or, for example, the list partitioning from Hive's paper that the
>> compiler uses to know where to look for the data, or those kinds of
>> modeling-related things.
>> Or is it just a matter of choosing between the well-known Kimball and Inmon
>> approaches?
>> Thanks in advance.
>>
>> Renato M.
>>
>> 2010/7/16 Ashish Thusoo <[email protected]>
>>>
>>> Hi Renato,
>>>
>>> Can you expand more on what exactly you mean by modelling?
>>>
>>> On the append side, Hive does not really support appends, though you can
>>> create a new partition within the table for every run, and that can be used
>>> as a workaround for appends.
>>>
>>> Ashish
>>> ________________________________
>>> From: Renato Marroquín Mogrovejo [mailto:[email protected]]
>>> Sent: Thursday, July 15, 2010 2:53 PM
>>> To: [email protected]
>>> Subject: Hive Usage
>>>
>>> Hi there, I would like to know if there is anyone who has done some kind
>>> of modelling on Hive and is willing to share some experiences, please.
>>> Today is my first day with Hive, and I have several doubts regarding the
>>> modelling: whether I would have to do special modelling or regular DW
>>> modelling.
>>> Another thing I wanted to know is whether Hive already has the append
>>> option enabled, because I know there is a Hadoop branch with the append
>>> option enabled, and a Cloudera release has it as well (I think it is CDH3).
>>> Any kind of suggestions or opinions are highly appreciated.
>>>
>>> Renato M.
Most people partition data by time. For example, if your table is partitioned by day, you may not need "append": you can insert your data into a new partition (day=20101005). With this, a query that uses day anywhere in the WHERE clause (WHERE day=20101005) can use partition pruning and scan only the data for that day. Over-partitioning with a scheme like day/month/hour/minute is not good if you end up with numerous partitions, each holding less data than the smallest DFS block (128 MB).

Buckets are ideal if you have to join data. If you have two tables bucketed on the same column and you join on that column, or if you wish to sample, this is very efficient. Bucketing is implemented as multiple files in the table or partition; getting the data into the buckets usually requires a query with the number of reducers equal to the number of buckets. We do not do many joins or samples, so we do not need this as much.

Hive uses row-oriented storage (with the exception of the columnar SerDe), and I would guess that most schemas are very denormalized. This is probably because Hive does not have indexes or constraints, which makes keeping normalized data in sync difficult. While map-reduce joins are efficient for large data sets, a star schema is probably not where you want to be.

In general, I find that since Hive is great with raw data, you should not overthink your modelling. Just get your raw data in there. Reports are aggregations that come from querying your raw data and writing the output to other tables; that is where your thinking comes in. A common case is hourly or daily roll-ups of reports that matter to people:

    INSERT OVERWRITE TABLE pageviewreport PARTITION (day=20100101)
    SELECT url, count(1) FROM pageviews
    WHERE day=20100101
    GROUP BY url;

Hope this helps,
Edward

Hive FTW
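P.S. In case a concrete sketch of the partition-per-day idea helps, something like the following (table, column, and path names are made up for illustration):

    -- A day-partitioned table; writing a new partition per run
    -- stands in for "append".
    CREATE TABLE pageviews (
      url    STRING,
      userid STRING
    )
    PARTITIONED BY (day INT);

    -- Each run loads its own partition.
    LOAD DATA INPATH '/logs/20101005'
    INTO TABLE pageviews PARTITION (day=20101005);

    -- Partition pruning: only the files under day=20101005 are scanned.
    SELECT count(1) FROM pageviews WHERE day=20101005;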
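P.P.S. And a sketch of the bucketing recipe described above (again, the names are made up, and the settings depend on your Hive version, so treat them as an assumption):

    -- Table bucketed on the join/sample column.
    CREATE TABLE users (
      userid STRING,
      name   STRING
    )
    CLUSTERED BY (userid) INTO 32 BUCKETS;

    -- Populate it with number of reducers = number of buckets, as
    -- mentioned above, so each reducer writes one bucket file.
    set mapred.reduce.tasks=32;
    -- (newer releases can do this for you: set hive.enforce.bucketing=true;)
    INSERT OVERWRITE TABLE users
    SELECT userid, name FROM raw_users
    CLUSTER BY userid;

    -- Sampling then reads only 1 of the 32 bucket files.
    SELECT * FROM users TABLESAMPLE (BUCKET 1 OUT OF 32 ON userid);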
