Comments in-line below

---
Jim Kellerman, Senior Engineer; Powerset


> -----Original Message-----
> From: Naama Kraus [mailto:[EMAIL PROTECTED]
> Sent: Sunday, May 18, 2008 4:01 AM
> To: [email protected]
> Subject: Scheme design questions
>
> Hi,
>
> I am trying to figure out how I should design HBase tables
> and I have a couple of questions. I'd appreciate some assistance.
>
> Say I have data about students consisting of - Student id and
> some basic information such as first name, last name, gender,
> address, date she started her studies, hobbies and some areas
> of interest.
> Additionally, for each student there is information on the
> course she has taken and the final grade.
>
> My Questions:
> 1. Should the basic attributes (first name, last name, gender
> ...) share a common column family or each should have a
> different family ?

This kind of depends on the access pattern. For example in the
Webtable example, one column contains page content which is usually
processed together and another column contains page attributes
such as encoding, mime-type, etc.

My guess is that your information should share a column family.

> If the second is the way to go, would it
> harm HBase flexibility characteristic which allows adding a
> new type of attribute that may pop up after I defined the
> table scheme? E.g. new data source comes in with the 'age'
> attribute, that was not known upon defining the scheme.

This is the disadvantage of the one column family per attribute
approach. It is expensive to add a new column family, but new
family members can be added at any time.

> 2. For attributes which may have multiple values, would it
> make sense to define a common column family and add a column
> for each value ?

It might make sense in this case to have a family for the
multi-valued attribute and just add a new member for each new
value.

> 2.1 For hobbies - I'd define a 'hobby' column family under
> which I put each hobby in a separate column. hobby_i (i being
> incremented by 1 for each new hobby being inserted in the
> row) as a column name and the actual hobby as a value ? Or
> I'd rather have the hobby name as a column name and some
> arbitrary value (e.g. 1) as cell value ?

I'd define a family, hobby, and use a new family member for
each value, for example:

hobby:video-games
hobby:tennis
hobby:floral-arranging
etc.

> 2.2 Similarly, for grades there could be a common grades
> family. For each course grade, I could put the course id as a
> column name and the course grade as a value. Does it make sense ?

Yes. For example:

Family course:

course:math101 (value) B
course:economics203 (value) C
etc.
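
To make the "member name carries the data" idea concrete, here is a
rough sketch in plain Python. It only models a row as a dictionary of
column name to bytes; it is not the HBase client API, the helper
members() is hypothetical, and the hobby cell value b"1" is an
arbitrary placeholder (the hobby name itself lives in the column name).

```python
# Plain-Python model of one row: "family:member" -> bytes.
# Not the HBase client API; all names here are made up for illustration.

def members(row, family):
    """Return {member: value} for every column in the given family."""
    prefix = family + ":"
    return {
        column[len(prefix):]: value
        for column, value in row.items()
        if column.startswith(prefix)
    }

student_row = {
    "hobby:video-games": b"1",       # placeholder value; the name is the data
    "hobby:tennis": b"1",
    "hobby:floral-arranging": b"1",
    "course:math101": b"B",          # course id in the name, grade in the value
    "course:economics203": b"C",
}

# Listing the hobbies means listing the members of the hobby family.
hobbies = sorted(members(student_row, "hobby"))

# Grades come back as bytes and have to be decoded by the application.
grades = {
    course: grade.decode("ascii")
    for course, grade in members(student_row, "course").items()
}
```

Adding a hobby or a course is then just writing one more member of an
existing family; no schema change is involved.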

> 3. Say there is the 'zipcode' attribute, and a student may
> have multiple zip codes associated with her. By now, it is a
> case similar to question 2. But what if for each zip I have
> the matching city and state information. Should I create a
> separate table with each row containing a zip and the
> corresponding city and state and use join at query time if
> needed ?

There is no join operation in HBase. However, you could run
a map/reduce job to do something like a join.

For zipcode, I might do something like:

Family zip:

zip:12345 (value) home
zip:09876 (value) school
etc.
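
The map/reduce "join" mentioned above could look roughly like the
following. This is a minimal sketch in plain Python that imitates the
map, shuffle and reduce steps in memory; it does not use Hadoop or the
HBase API. The separate zip code table, its info:city and info:state
columns, the student names and the city values are all invented for
the example.

```python
from collections import defaultdict

# Hypothetical inputs: (row key, columns) pairs standing in for two tables.
students = [
    ("alice", {"zip:12345": "home"}),
    ("bob", {"zip:12345": "home", "zip:09876": "school"}),
]
zipcodes = [
    ("12345", {"info:city": "Springfield", "info:state": "IL"}),
    ("09876", {"info:city": "Riverton", "info:state": "NJ"}),
]

def map_student(key, columns):
    """Emit (zipcode, tagged student name) for each zip: member."""
    for column in columns:
        if column.startswith("zip:"):
            yield column[len("zip:"):], ("student", key)

def map_zipcode(key, columns):
    """Emit (zipcode, tagged location) for each zip code row."""
    yield key, ("location", (columns["info:city"], columns["info:state"]))

def reduce_join(zipcode, tagged_values):
    """Pair every student seen for this zip code with its location."""
    locations = [v for tag, v in tagged_values if tag == "location"]
    names = [v for tag, v in tagged_values if tag == "student"]
    for name in names:
        for city, state in locations:
            yield name, zipcode, city, state

# "Shuffle": group all mapper output by zip code.
groups = defaultdict(list)
for key, columns in students:
    for zipcode, value in map_student(key, columns):
        groups[zipcode].append(value)
for key, columns in zipcodes:
    for zipcode, value in map_zipcode(key, columns):
        groups[zipcode].append(value)

joined = sorted(row for z in groups for row in reduce_join(z, groups[z]))
```

Both inputs are keyed by zip code in the map step, so the reduce step
sees the students and the location for one zip code together.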

> Or is there a way to de-normalize the data and
> somehow integrate the multiple zip-s plus the city and state
> of each within the original students table ?

It is a little tricky to store multi-value attributes in a
column that is multivalued.

For example if the row key is the student name, you could
have something like:

Family info:
info:id
info:address
info:zip1
info:zip2

or:

info:id
info:address
info:zip (value is a serialized map of zipcode, location)
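
As a sketch of the second layout, the info:zip cell could hold a map
serialized to bytes. JSON is just one possible encoding and is my
assumption here; the function names and the city/state values are
hypothetical.

```python
import json

def encode_zip_map(zip_to_location):
    """Serialize {zipcode: location} into bytes for a single cell."""
    return json.dumps(zip_to_location, sort_keys=True).encode("utf-8")

def decode_zip_map(cell_value):
    """Turn the stored bytes back into a {zipcode: location} dict."""
    return json.loads(cell_value.decode("utf-8"))

# Made-up locations, for illustration only.
zips = {
    "12345": {"city": "Springfield", "state": "IL"},
    "09876": {"city": "Riverton", "state": "NJ"},
}

cell = encode_zip_map(zips)       # what would be written to info:zip
restored = decode_zip_map(cell)   # what a reader would get back
```

The trade-off is that adding one zip code means reading, changing and
rewriting the whole cell, whereas info:zip1, info:zip2 (or one member
per zip code) lets each value be written on its own.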

> To what extent should I aspire to denormalize data ?

Again it depends on your access patterns. If the data is going
to be accessed together, it is probably better to put them
in the same family. If you know that some data will never (or
rarely) be accessed together, then put them in separate
column families.

> 4. Can columns of different types (numbers/text/date) share
> the same column family ?

There are no data types in HBase. All values are byte[].
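
In practice this means the application picks an encoding per column
and applies it on every read and write. A small sketch in plain
Python, assuming a big-endian 8-byte integer, UTF-8 text and ISO 8601
dates; these encodings, the helper names and the column names are my
choices for illustration, not anything HBase prescribes.

```python
import struct
from datetime import date

def encode_int(n):
    """Pack an integer as 8 big-endian bytes."""
    return struct.pack(">q", n)

def decode_int(raw):
    return struct.unpack(">q", raw)[0]

def encode_text(s):
    return s.encode("utf-8")

def decode_text(raw):
    return raw.decode("utf-8")

def encode_date(d):
    """Store a date as its ISO 8601 string, e.g. b'2008-05-18'."""
    return d.isoformat().encode("ascii")

def decode_date(raw):
    return date.fromisoformat(raw.decode("ascii"))

# Number, text and date side by side in one family: all just bytes.
row = {
    "info:age": encode_int(21),
    "info:first_name": encode_text("Ada"),
    "info:start_date": encode_date(date(2007, 9, 1)),
}
```

Nothing in the stored bytes says which decoder to use, so reader and
writer have to agree on the encoding for each column.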

> Thanks for any help, Naama
>
> --
> oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00
> oo 00 oo 00 oo 00 oo 00 oo "If you want your children to be
> intelligent, read them fairy tales. If you want them to be
> more intelligent, read them more fairy tales." (Albert
> Einstein)
