Thank you very much, Jim, for the useful information. My further questions are inlined (within <<< ... >>>).
Another question: what are the limits on the number of families and the number of family members within one table? Are there any limits on the overall size of the data stored in a table?

Naama

On Sun, May 18, 2008 at 8:08 PM, Jim Kellerman <[EMAIL PROTECTED]> wrote:

> Comments in-line below
>
> ---
> Jim Kellerman, Senior Engineer; Powerset
>
> > -----Original Message-----
> > From: Naama Kraus [mailto:[EMAIL PROTECTED]
> > Sent: Sunday, May 18, 2008 4:01 AM
> > To: [email protected]
> > Subject: Scheme design questions
> >
> > Hi,
> >
> > I am trying to figure out how I should design HBase tables
> > and I have a couple of questions. I'd appreciate some assistance.
> >
> > Say I have data about students consisting of a student id and
> > some basic information such as first name, last name, gender,
> > address, the date she started her studies, hobbies and some areas
> > of interest.
> > Additionally, for each student there is information on the
> > courses she has taken and the final grades.
> >
> > My Questions:
> > 1. Should the basic attributes (first name, last name, gender
> > ...) share a common column family, or should each have a
> > different family?
>
> This kind of depends on the access pattern. For example in the
> Webtable example, one column contains page content which is usually
> processed together and another column contains page attributes
> such as encoding, mime-type, etc.
>
> My guess is that your information should share a column family.

<<< So does this mean that a column family is stored together? In the documentation I read that regions are stored together, but I thought regions are a bunch of rows, each containing all columns. So I am now confused: rows or columns? Could you please explain? >>>

> > If the second is the way to go, would it
> > harm the HBase flexibility characteristic which allows adding a
> > new type of attribute that may pop up after I defined the
> > table scheme? E.g.
> > a new data source comes in with the 'age'
> > attribute, which was not known when the scheme was defined.
>
> This is the disadvantage of the one column per attribute approach.
> It is expensive to add a new column, but new family members
> can be added at any time.

<<< Can a column be added to an existing table then, or only prior to creation? In what sense is it expensive to add a new column? >>>

> > 2. For attributes which may have multiple values, would it
> > make sense to define a common column family and add a column
> > for each value?
>
> It might make sense in this case to have a family for the
> multi-valued attribute and just add a new member for each new
> value.
>
> > 2.1 For hobbies - I'd define a 'hobby' column family under
> > which I put each hobby in a separate column, with hobby_i (i being
> > incremented by 1 for each new hobby inserted in the
> > row) as the column name and the actual hobby as the value? Or
> > should I rather have the hobby name as the column name and some
> > arbitrary value (e.g. 1) as the cell value?
>
> I'd define a family, hobby, and use a new family member for
> each value, for example:
>
> hobby:video-games
> hobby:tennis
> hobby:floral-arranging
> etc.
>
> > 2.2 Similarly, for grades there could be a common grades
> > family. For each course grade, I could put the course id as a
> > column name and the course grade as a value. Does it make sense?
>
> Yes. For example:
>
> Family course:
>
> course:math101 (with value) B
> course:economics203 (value) C
> etc.
>
> > 3. Say there is the 'zipcode' attribute, and a student may
> > have multiple zip codes associated with her. So far, it is a
> > case similar to question 2. But what if for each zip I have
> > the matching city and state information? Should I create a
> > separate table with each row containing a zip and the
> > corresponding city and state and use a join at query time if
> > needed?
>
> There is no join operation in HBase.
> However, you could run
> a map/reduce job to do something like a join.

<<< Is there a code sample somewhere for doing a map/reduce join-like operation on top of HBase? >>>

> For zipcode, I might do something like:
>
> Family zip:
>
> zip:12345 (value) home
> zip:09876 (value) school
> etc.
>
> > Or is there a way to de-normalize the data and
> > somehow integrate the multiple zips plus the city and state
> > of each within the original students table?
>
> It is a little tricky to store multi-value attributes in a
> column that is multivalued.
>
> For example if the row key is the student name, you could
> have something like:
>
> Family info:
> info:id
> info:address
> info:zip1
> info:zip2
>
> or:
>
> info:id
> info:address
> info:zip (value is a serialized map of zipcode, location)
>
> > To what extent should I aspire to denormalize data?
>
> Again it depends on your access patterns. If the data is going
> to be accessed together, it is probably better to put them
> in the same family. If you know that some data will never (or
> rarely) be accessed together, then put them in separate
> column families.
>
> > 4. Can columns of different types (numbers/text/date) share
> > the same column family?
>
> There are no data types in HBase. All values are byte[].
>
> > Thanks for any help, Naama
> >
> > --
> > oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00
> > oo 00 oo 00 oo 00 oo 00 oo "If you want your children to be
> > intelligent, read them fairy tales. If you want them to be
> > more intelligent, read them more fairy tales." (Albert
> > Einstein)
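To make the family:member layout discussed in this thread concrete, here is a minimal in-memory sketch in plain Java. It is not the HBase client API: the class name StudentRowSketch and its put/get/members/example methods are hypothetical names chosen for illustration, and the sketch models only a single row. It assumes the student name (or id) is the row key, that the families hobby, course and zip hold one member per value as suggested in the quoted replies, and that every cell value is kept as raw bytes.

```java
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.NavigableMap;
import java.util.Set;
import java.util.TreeMap;

/** In-memory stand-in for one row; a hypothetical sketch, NOT the HBase API. */
final class StudentRowSketch {
    // family name -> (member name -> raw bytes); both levels are kept sorted.
    private final NavigableMap<String, NavigableMap<String, byte[]>> families =
            new TreeMap<>();

    /** Stores a value under "family:member"; the value is kept as raw bytes. */
    void put(String family, String member, String value) {
        families.computeIfAbsent(family, f -> new TreeMap<>())
                .put(member, value.getBytes(StandardCharsets.UTF_8));
    }

    /** Returns the decoded cell value, or null if the cell is absent. */
    String get(String family, String member) {
        NavigableMap<String, byte[]> cells = families.get(family);
        if (cells == null) {
            return null;
        }
        byte[] raw = cells.get(member);
        return raw == null ? null : new String(raw, StandardCharsets.UTF_8);
    }

    /** Lists the member names of one family, e.g. all hobbies of a student. */
    Set<String> members(String family) {
        NavigableMap<String, byte[]> cells = families.get(family);
        if (cells == null) {
            return Collections.emptySet();
        }
        return cells.keySet();
    }

    /** Builds a row using the example members mentioned in the thread. */
    static StudentRowSketch example() {
        StudentRowSketch row = new StudentRowSketch();
        // The member name carries the hobby; the value is an arbitrary marker.
        row.put("hobby", "video-games", "1");
        row.put("hobby", "tennis", "1");
        // The member name is the course id; the value is the grade.
        row.put("course", "math101", "B");
        row.put("course", "economics203", "C");
        // The member name is the zip code; the value is what it refers to.
        row.put("zip", "12345", "home");
        row.put("zip", "09876", "school");
        return row;
    }

    public static void main(String[] args) {
        StudentRowSketch row = example();
        System.out.println("hobbies: " + row.members("hobby"));
        System.out.println("math101 grade: " + row.get("course", "math101"));
    }
}
```

Adding a new hobby or course here is just another put with a new member name, which mirrors the point that new family members can be added at any time, while the set of families is decided up front.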
