On Wed, Jul 21, 2010 at 3:38 PM, Renato Marroquín Mogrovejo <[email protected]> wrote:
> Does anybody else have any Hive experience that they would like to share? Please?
> Any kind of suggestion is highly appreciated.
> Thanks in advance.
>
> Renato M.
>
> 2010/7/19 Renato Marroquín Mogrovejo <[email protected]>
>>
>> Hi Ashish,
>>
>> I mean whether there are modeling best practices for obtaining better
>> performance (related to buckets, partitions, and tables), e.g. maybe creating
>> different partitions considering not just time frames but also partition
>> size, or, for example, the list partitioning from Hive's paper that the
>> compiler uses to know where to look for the data, or those kinds of
>> modeling-related things.
>> Or is it just a matter of choosing between the well-known Kimball and Inmon
>> approaches?
>> Thanks in advance.
>>
>> Renato M.
>>
>> 2010/7/16 Ashish Thusoo <[email protected]>
>>>
>>> Hi Renato,
>>>
>>> Can you expand more on what exactly you mean by modelling?
>>>
>>> On the append side, Hive does not really support appends, though you can
>>> create a new partition within the table for every run, and that can be used
>>> as a workaround for appends.
>>>
>>> Ashish
>>> ________________________________
>>> From: Renato Marroquín Mogrovejo [mailto:[email protected]]
>>> Sent: Thursday, July 15, 2010 2:53 PM
>>> To: [email protected]
>>> Subject: Hive Usage
>>>
>>> Hi there, I would like to know if there is anyone who has done some kind
>>> of modelling on Hive and is willing to share some experiences, please.
>>> Today is my first day with Hive, and I have several doubts regarding the
>>> modelling: whether I would have to do special modelling or regular DW
>>> modelling.
>>> Another thing I wanted to know is whether Hive already has the append
>>> option enabled, because I know there is a Hadoop branch with the append
>>> option enabled, and a Cloudera release has it as well (I think it is CDH3).
>>> Any kind of suggestions or opinions are highly appreciated.
>>>
>>> Renato M.
Most people partition data by time. For example, if your table is partitioned by day, you may not need "append": you can insert your data into a new partition (day=20101005). With this, a query that uses day anywhere in the WHERE clause (WHERE day=20101005) can use partition pruning and scan only the data for that day. Over-partitioning with a scheme like day/month/hour/minute is not good if you end up with numerous partitions, each holding less data than the smallest DFS block (128 MB).

Buckets are ideal if you have to join data. If you have two tables bucketed on the same column and you join on that column, or if you wish to sample, this is very efficient. Bucketing is implemented as multiple files in the table or partition; getting the data into the buckets usually requires a query with the number of reducers equal to the number of buckets. We do not do many joins or samples, so we do not need this as much.

Hive uses row-oriented storage (with the exception of the columnar SerDe), and I would guess that most schemas are very denormalized. This is probably because Hive does not have indexes or constraints, which makes keeping normalized data in sync difficult. While map-reduce joins are efficient for large data sets, a star schema is probably not where you want to be.

In general, I find that since Hive is great with raw data, you should not overthink your modelling. Just get your raw data in there. Reports are aggregations that come from querying your raw data and writing the output to other tables; that is where your thinking comes in. A common case is hourly or daily roll-ups of reports that matter to people:

    INSERT OVERWRITE TABLE pageviewreport PARTITION (day=20100101)
    SELECT url, count(1) FROM pageviews
    WHERE day=20100101
    GROUP BY url;

Hope this helps,
Edward

Hive FTW
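P.S. In case a concrete sketch of the partition-per-day idea helps, something like the following (table, column, and path names are made up for illustration):

    -- A day-partitioned table; writing a new partition per run
    -- stands in for "append".
    CREATE TABLE pageviews (
      url    STRING,
      userid STRING
    )
    PARTITIONED BY (day INT);

    -- Each run loads its own partition.
    LOAD DATA INPATH '/logs/20101005'
    INTO TABLE pageviews PARTITION (day=20101005);

    -- Partition pruning: only the files under day=20101005 are scanned.
    SELECT count(1) FROM pageviews WHERE day=20101005;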
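P.P.S. And a sketch of the bucketing recipe described above (again, the names are made up, and the settings depend on your Hive version, so treat them as an assumption):

    -- Table bucketed on the join/sample column.
    CREATE TABLE users (
      userid STRING,
      name   STRING
    )
    CLUSTERED BY (userid) INTO 32 BUCKETS;

    -- Populate it with number of reducers = number of buckets, as
    -- mentioned above, so each reducer writes one bucket file.
    set mapred.reduce.tasks=32;
    -- (newer releases can do this for you: set hive.enforce.bucketing=true;)
    INSERT OVERWRITE TABLE users
    SELECT userid, name FROM raw_users
    CLUSTER BY userid;

    -- Sampling then reads only 1 of the 32 bucket files.
    SELECT * FROM users TABLESAMPLE (BUCKET 1 OUT OF 32 ON userid);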
