Thank you very much, Jim, for the useful information.
My further questions are inline (within <<< ... >>>).

Another question - what are the limits on the number of families and the number of
family members within one table ?
Are there any limits on the overall size of the data stored in a table ?

Naama

On Sun, May 18, 2008 at 8:08 PM, Jim Kellerman <[EMAIL PROTECTED]> wrote:

> Comments in-line below
>
> ---
> Jim Kellerman, Senior Engineer; Powerset
>
>
> > -----Original Message-----
> > From: Naama Kraus [mailto:[EMAIL PROTECTED]
> > Sent: Sunday, May 18, 2008 4:01 AM
> > To: [email protected]
> > Subject: Scheme design questions
> >
> > Hi,
> >
> > I am trying to figure out how I should design HBase tables,
> > and I have a couple of questions. I'd appreciate some assistance.
> >
> > Say I have data about students consisting of a student id and
> > some basic information such as first name, last name, gender,
> > address, date she started her studies, hobbies and some areas
> > of interest.
> > Additionally, for each student there is information on the
> > courses she has taken and the final grade for each.
> >
> > My Questions:
> > 1. Should the basic attributes (first name, last name, gender
> > ...) share a common column family or should each have a
> > different family ?
>
> This kind of depends on the access pattern. For example in the
> Webtable example, one column contains page content which is usually
> processed together and another column contains page attributes
> such as encoding, mime-type, etc.
>
> My guess is that your information should share a column family.


<<< So does this mean that a column family is stored together ? In the
documentation I read that regions are stored together, but I thought regions
are a bunch of rows, each containing all columns. So I am now confused: rows
or columns ? Could you please explain ? >>>

>
> > If the second is the way to go, would it
> > harm HBase's flexibility, which allows adding a
> > new type of attribute that may pop up after I defined the
> > table schema? E.g. a new data source comes in with an 'age'
> > attribute that was not known when the schema was defined.
>
> This is the disadvantage of the one column per attribute approach.
> It is expensive to add a new column, but new family members
> can be added at any time.


<<< Can a column be added to an existing table then, or only before the table
is created ? In what sense is it expensive to add a new column ? >>>

>
>
> > 2. For attributes which may have multiple values, would it
> > make sense to define a common column family and add a column
> > for each value ?
>
> It might make sense in this case to have a family for the
> multi-valued attribute and just add a new member for each new
> value.
>
> > 2.1 For hobbies - I'd define a 'hobby' column family under
> > which I put each hobby in a separate column: hobby_i (i being
> > incremented by 1 for each new hobby inserted in the
> > row) as the column name and the actual hobby as the value ? Or
> > should I rather use the hobby name as the column name and some
> > arbitrary value (e.g. 1) as the cell value ?
>
> I'd define a family, hobby, and use a new family member for
> each value, for example:
>
> hobby:video-games
> hobby:tennis
> hobby:floral-arranging
> etc.
>
> > 2.2 Similarly, for grades there could be a common grades
> > family. For each course grade, I could put the course id as a
> > column name and the course grade as a value. Does it make sense ?
>
> Yes. For example:
>
> Family course:
>
> course:math101 (value) B
> course:economics203 (value) C
> etc.
>
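
The family/member layout above can be pictured as a nested map: row key, then family, then member, then value. The following is a minimal plain-Python sketch of that picture, not the HBase client API; the helper names `new_row` and `put`, and the placeholder value "1" for hobbies, are assumptions made for illustration.

```python
from collections import defaultdict

def new_row():
    # Hypothetical model of one row: family name -> {member -> value as bytes}
    return defaultdict(dict)

def put(row, column, value):
    """Store a value under a 'family:member' column name."""
    family, member = column.split(":", 1)
    row[family][member] = value.encode("utf-8")

student = new_row()
# The hobby name is the member name; the value "1" is an arbitrary placeholder.
put(student, "hobby:tennis", "1")
put(student, "hobby:video-games", "1")
# The course id is the member name; the grade is the value.
put(student, "course:math101", "B")
put(student, "course:economics203", "C")

# Listing a family's members answers "which hobbies?" without a hobby_i counter.
hobbies = sorted(student["hobby"])
grades = {course: grade.decode("utf-8") for course, grade in student["course"].items()}
```

With the value-bearing name as the member, adding a hobby or a course is just one more `put`; no counter has to be read and incremented first.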
> > 3. Say there is the 'zipcode' attribute, and a student may
> > have multiple zip codes associated with her. So far, this is a
> > case similar to question 2. But what if for each zip I have
> > the matching city and state information ? Should I create a
> > separate table with each row containing a zip and the
> > corresponding city and state and use a join at query time if
> > needed ?
>
> There is no join operation in HBase. However, you could run
> a map/reduce job to do something like a join.


<<< Is there a code sample somewhere for doing a join-like map/reduce job on
top of HBase ? >>>
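
The thread itself does not include such a sample. As a rough illustration of what "something like a join" in map/reduce means, here is a plain-Python sketch of a reduce-side join. It does not use Hadoop or HBase APIs; the two input tables, their contents, and the function names are all made up for the example.

```python
from collections import defaultdict

# Made-up inputs: student -> zip codes, and zip code -> (city, state).
students = {"alice": ["12345", "09876"], "bob": ["12345"]}
zips = {"12345": ("Springfield", "IL"), "09876": ("Portland", "OR")}

def map_phase():
    """Emit (join key, tagged record) pairs from both inputs."""
    for name, codes in students.items():
        for code in codes:
            yield code, ("student", name)
    for code, location in zips.items():
        yield code, ("zip", location)

def reduce_phase(pairs):
    """Group records by zip code, then pair each student with the location."""
    groups = defaultdict(list)
    for key, record in pairs:
        groups[key].append(record)
    joined = []
    for code, records in groups.items():
        locations = [value for tag, value in records if tag == "zip"]
        names = [value for tag, value in records if tag == "student"]
        for name in names:
            for city, state in locations:
                joined.append((name, code, city, state))
    return sorted(joined)

result = reduce_phase(map_phase())
```

The map step tags each record with its source and keys it by zip code; the reduce step sees all records for one zip code together and can combine them, which is the effect a join would have.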

>
>
> For zipcode, I might do something like:
>
> Family zip:
>
> zip:12345 (value) home
> zip:09876 (value) school
> etc.
>
> > Or is there a way to de-normalize the data and
> > somehow integrate the multiple zips plus the city and state
> > of each within the original students table ?
>
> It is a little tricky to store multi-valued attributes in a
> column that is itself multi-valued.
>
> For example if the row key is the student name, you could
> have something like:
>
> Family info:
> info:id
> info:address
> info:zip1
> info:zip2
>
> or:
>
> info:id
> info:address
> info:zip (value is a serialized map of zipcode, location)
>
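
The second layout, a single cell holding a serialized map, could look roughly like the sketch below. JSON is only one possible serialization and is an assumption here, as are the function names and the sample city/state values.

```python
import json

def serialize_zip_map(zip_map):
    """Encode a {zipcode: location} dict as bytes for a single cell value."""
    return json.dumps(zip_map, sort_keys=True).encode("utf-8")

def deserialize_zip_map(cell):
    """Decode the bytes of a cell back into a {zipcode: location} dict."""
    return json.loads(cell.decode("utf-8"))

# Made-up sample data: each zip code maps to its city and state.
zip_map = {
    "12345": {"city": "Springfield", "state": "IL"},
    "09876": {"city": "Portland", "state": "OR"},
}
cell = serialize_zip_map(zip_map)
restored = deserialize_zip_map(cell)
```

The trade-off of this layout is that changing one zip code means reading, modifying, and rewriting the whole cell, whereas with one member per zip code each entry can be written on its own.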
> > To what extent should I aspire to denormalize data ?
>
> Again, it depends on your access patterns. If the data is going
> to be accessed together, it is probably better to put it
> in the same family. If you know that some data will never (or
> rarely) be accessed together, then put it in separate
> column families.
>
> > 4. Can columns of different types (numbers/text/date) share
> > the same column family ?
>
> There are no data types in HBase. All values are byte[].
>
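
Since every value is just bytes, the application decides how numbers, text, and dates are encoded and decoded. A minimal sketch of one such convention follows; the specific encodings (8-byte big-endian integers, UTF-8 text, ISO-8601 dates) and the function names are assumptions for illustration, not something HBase prescribes.

```python
import struct
from datetime import date

def encode_int(n):
    # 8-byte big-endian signed integer
    return struct.pack(">q", n)

def decode_int(raw):
    return struct.unpack(">q", raw)[0]

def encode_text(s):
    return s.encode("utf-8")

def decode_text(raw):
    return raw.decode("utf-8")

def encode_date(d):
    # ISO-8601 text, e.g. b"2008-05-18"; byte order matches date order
    return d.isoformat().encode("ascii")

def decode_date(raw):
    return date.fromisoformat(raw.decode("ascii"))

# Values of different logical types can sit side by side in one family,
# because the store only ever sees bytes.
info = {
    "info:id": encode_int(2008),
    "info:first_name": encode_text("Naama"),
    "info:start_date": encode_date(date(2008, 5, 18)),
}
```

Readers of the table have to know which decoder applies to which column, because nothing in the stored bytes records the type.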
> > Thanks for any help, Naama
> >
> > --
> > oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00
> > oo 00 oo 00 oo 00 oo 00 oo "If you want your children to be
> > intelligent, read them fairy tales. If you want them to be
> > more intelligent, read them more fairy tales." (Albert
> > Einstein)



