Thanks again,
Naama

On Mon, May 19, 2008 at 9:32 AM, Jim Kellerman <[EMAIL PROTECTED]> wrote:
> Comments inline.
>
> ---
> Jim Kellerman, Senior Engineer; Powerset
>
>
> > -----Original Message-----
> > From: Naama Kraus [mailto:[EMAIL PROTECTED]
> > Sent: Sunday, May 18, 2008 9:03 PM
> > To: [email protected]
> > Subject: Re: Scheme design questions
> >
> > Thank you very much Jim for the useful information.
> > My further questions are inlined (within <<< ... >>>).
> >
> > Another question - what are the limits on the number of families
> > and the number of family members within one table?
>
> Currently, there are no limits to the number of column families you
> can create. However, Google's Bigtable paper says that you should expect
> some limit (in the hundreds, i.e., < 999) but neither Bigtable nor HBase
> limits the number of family members. See below for explanation.
>
> > Are there any limits to the overall size of the data stored in a table?
>
> There are no architectural limits to the size of a table.
>
> > More below
> >
> > Naama
> >
> > On Sun, May 18, 2008 at 8:08 PM, Jim Kellerman
> > <[EMAIL PROTECTED]> wrote:
> >
> > > Comments in-line below
> > >
> > > ---
> > > Jim Kellerman, Senior Engineer; Powerset
> > >
> > >
> > > > -----Original Message-----
> > > > From: Naama Kraus [mailto:[EMAIL PROTECTED]
> > > > Sent: Sunday, May 18, 2008 4:01 AM
> > > > To: [email protected]
> > > > Subject: Scheme design questions
> > > >
> > > > Hi,
> > > >
> > > > I am trying to figure out how I should design HBase tables and I have
> > > > a couple of questions. I'd appreciate some assistance.
> > > >
> > > > Say I have data about students consisting of a student id and some
> > > > basic information such as first name, last name, gender, address,
> > > > date she started her studies, hobbies and some areas of interest.
> > > > Additionally, for each student there is information on the courses
> > > > she has taken and the final grades.
> > > >
> > > > My Questions:
> > > > 1. Should the basic attributes (first name, last name, gender ...)
> > > > share a common column family, or should each have a different
> > > > family?
> > >
> > > This kind of depends on the access pattern. For example, in the
> > > Webtable example, one column contains page content, which is usually
> > > processed together, and another column contains page attributes such as
> > > encoding, mime-type, etc.
> > >
> > > My guess is that your information should share a column family.
> >
> > <<< So does this mean that a column family is stored together?
> > In the documentation I read that regions are stored together, but
> > I thought regions are a bunch of rows, each containing all columns.
> > So I am now confused: rows or columns? Could you please explain? >>>
>
> Yes, HBase is a column-oriented data store just like Bigtable.
> Adding new family members is cheap; adding new columns (that is,
> column families) is expensive.
>
> Regions are indeed a bunch of rows. A single region represents
> a row range from [low-key:high-key). For each region there is
> an HStore for each column family that has data in the region's
> row range.
>
> > > > If the second is the way to go, would it harm HBase's flexibility
> > > > characteristic, which allows adding a new type of attribute that may
> > > > pop up after I defined the table scheme? E.g. a new data source comes
> > > > in with the 'age' attribute, which was not known upon defining the
> > > > scheme.
> > >
> > > This is the disadvantage of the one column per attribute approach.
> > > It is expensive to add a new column, but new family members can be
> > > added at any time.
> >
> > <<< Can a column be added to an existing table then, or only
> > prior to creation? In what sense is it expensive to add a new
> > column? >>>
>
> You can add a new column to an existing table, but you must first
> 'disable' the table (take it offline). It is expensive, because adding
> a new column family means creating a new HStore for each existing region.
>
> > > > 2.
> > > > For attributes which may have multiple values, would it make
> > > > sense to define a common column family and add a column for each
> > > > value?
> > >
> > > It might make sense in this case to have a family for the multi-valued
> > > attribute and just add a new member for each new value.
> > >
> > > > 2.1 For hobbies - I'd define a 'hobby' column family under which I
> > > > put each hobby in a separate column: hobby_i (i being incremented by
> > > > 1 for each new hobby being inserted in the row) as a column name and
> > > > the actual hobby as a value? Or should I rather have the hobby name
> > > > as a column name and some arbitrary value (e.g. 1) as cell value?
> > >
> > > I'd define a family, hobby, and use a new family member for each value,
> > > for example:
> > >
> > > hobby:video-games
> > > hobby:tennis
> > > hobby:floral-arranging
> > > etc.
> > >
> > > > 2.2 Similarly, for grades there could be a common grades family. For
> > > > each course grade, I could put the course id as a column name and
> > > > the course grade as a value. Does it make sense?
> > >
> > > Yes. For example:
> > >
> > > Family course:
> > >
> > > course:math101 (with value) B
> > > course:economics203 (value) C
> > > etc.
> > >
> > > > 3. Say there is the 'zipcode' attribute, and a student may have
> > > > multiple zip codes associated with her. So far, it is a case similar
> > > > to question 2. But what if for each zip I have the matching city and
> > > > state information? Should I create a separate table with each row
> > > > containing a zip and the corresponding city and state and use a join
> > > > at query time if needed?
> > >
> > > There is no join operation in HBase. However, you could run a
> > > map/reduce job to do something like a join.
> >
> > <<< Is there a code sample somewhere for doing a map/reduce
> > join-like operation on top of HBase?
> > >>>
>
> The best examples we have available for using HBase with map/reduce are
> in the test cases (see org.apache.hadoop.hbase.mapred.*)
>
> > > For zipcode, I might do something like:
> > >
> > > Family zip:
> > >
> > > zip:12345 (value) home
> > > zip:09876 (value) school
> > > etc.
> > >
> > > > Or is there a way to de-normalize the data and somehow integrate the
> > > > multiple zips plus the city and state of each within the original
> > > > students table?
> > >
> > > It is a little tricky to store multi-value attributes in a column that
> > > is multivalued.
> > >
> > > For example, if the row key is the student name, you could have
> > > something like:
> > >
> > > Family info:
> > > info:id
> > > info:address
> > > info:zip1
> > > info:zip2
> > >
> > > or:
> > >
> > > info:id
> > > info:address
> > > info:zip (value is a serialized map of zipcode, location)
> > >
> > > > To what extent should I aspire to denormalize data?
> > >
> > > Again, it depends on your access patterns. If the data is going to be
> > > accessed together, it is probably better to put them in the same
> > > family. If you know that some data will never (or rarely) be accessed
> > > together, then put them in separate column families.
> > >
> > > > 4. Can columns of different types (numbers/text/date) share the same
> > > > column family?
> > >
> > > There are no data types in HBase. All values are byte[].
> > >
> > > > Thanks for any help, Naama
> > > >
> > > > --
> > > > oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo
> > > > "If you want your children to be intelligent, read them fairy tales.
> > > > If you want them to be more intelligent, read them more fairy tales."
> > > > (Albert Einstein)
--
oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo
"If you want your children to be intelligent, read them fairy tales.
If you want them to be more intelligent, read them more fairy tales."
(Albert Einstein)
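
To make the layout discussed in this thread concrete, here is a minimal sketch in Python. It is an in-memory model only, not the HBase client API: the names FAMILIES, table, put and family_members, the row key "student-001", the student name, and the use of JSON for the "serialized map" are all assumptions made for illustration. It shows a fixed set of column families, family members added freely per row, and every value stored as bytes.

```python
import json

# Hypothetical in-memory model of one table (NOT the HBase client API).
# The set of column families is fixed when the table is created.
FAMILIES = {"info", "hobby", "course", "zip"}

table = {}  # row key -> {"family:member": bytes}


def put(row_key, column, value):
    """Store one cell. The family must already exist; the member need not."""
    family, _, _member = column.partition(":")
    if family not in FAMILIES:
        raise KeyError(f"unknown column family: {family}")
    # There are no data types: every value is stored as bytes.
    table.setdefault(row_key, {})[column] = str(value).encode("utf-8")


def family_members(row_key, family):
    """Return {member: bytes} for one family of one row."""
    prefix = family + ":"
    row = table.get(row_key, {})
    return {c[len(prefix):]: v for c, v in row.items() if c.startswith(prefix)}


# Basic attributes share one family; 'age' shows up later with no schema change.
put("student-001", "info:first_name", "Dana")  # illustrative value
put("student-001", "info:age", 21)

# Multi-valued attributes: the member name carries the data.
put("student-001", "hobby:tennis", 1)
put("student-001", "hobby:video-games", 1)
put("student-001", "course:math101", "B")
put("student-001", "zip:12345", "home")

# Alternative for zip codes: one cell holding a serialized map.
# JSON is an assumption here; the thread does not name a format.
zips = {"12345": "home", "09876": "school"}
put("student-001", "info:zip", json.dumps(zips, sort_keys=True))

print(sorted(family_members("student-001", "hobby")))
print(family_members("student-001", "course"))
```

In this model, writing to a family that was not declared up front (for example "grades:math101") raises an error, which mirrors the point above that adding a family is a schema change while adding a member is not.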
