Table design question

Rick Hangartner Thu, 24 Jul 2008 18:57:19 -0700

Hi,

I did a quick search of the HBase-0.1.3 and HBase-0.2.0 code and couldn't find any constants that would answer the first of two questions. A Google search did turn up a comment in the HBase archive saying that there is (was) no limit on row-key length in HBase-0.1.x, and that no limit on cell size was enforced in HBase-0.1.x other than that a cell cannot be larger than the maximum region size.

So perhaps the first question, which was going to be about limits on row-key length, is better changed to this: does anyone have empirical knowledge to share about whether there is some upper bound on row-key length that ensures the best query performance?

This question actually arises out of a second question about best practices in table design. We have a table in which each item of interest could have a key with one primary component "K1" and two additional secondary components "K2", "K3". We will be keeping "many" versions of this item, differentiated by timestamp.

Is there any knowledge out there about which of these four options would be expected to perform best in general, in the absence of any empirical results or odd data patterns?

1) The row-key is the concatenation "K1::K2::K3" of the three keys, with inter-key separators chosen to make regular expression matching on the components easy. The multiple copies of the item with a specific set of values for the three key components, but different timestamps, are stored as versions of the item in the row.

2) The row-key "R" is sufficient to be unique for each timestamped version of an item, and the keys "K1", "K2" and "K3" for the item are columns in a single column family. In this case, each version of the item is stored in a single row, and "R" would be generated from "K1" and the timestamp in a way that gives good grouping to items with the same value of "K1" under lexicographic ordering.

3) Combining 1) and 2), where the row-key is "K1::K2::K3::T" and "T" is an externally generated timestamp, so that each version of each item is stored in a separate row.

4) Combining 1) and 2) differently, where the row-key "R" is generated as a 1-1 alias for "K1::K2::K3" and the keys "K1", "K2" and "K3" for the item are columns in a single column family. The multiple copies of the item with a specific set of values for the three key components, but different timestamps, are stored as versions of the item in the row.
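
To make the four key layouts concrete, here is a minimal Python sketch of how each row-key could be built. It is plain string handling, not HBase client code, and several things in it are assumptions for illustration only: the helper names, the rule that key components never contain ":", the 13-digit zero-padded millisecond timestamp for "T", and the use of a SHA-1 hex digest as the alias in 4) (which is 1-1 only up to hash collisions).

```python
import hashlib
import re

SEP = "::"  # inter-key separator from option 1)


def composite_key(k1, k2, k3):
    """Option 1): row-key is K1::K2::K3 (assumes no component contains ':')."""
    return SEP.join((k1, k2, k3))


def k1_time_key(k1, ts_millis):
    """Option 2): one possible "R" built from K1 and the timestamp.

    Zero-padding makes string order agree with numeric time order, so
    rows sharing K1 sort together and, within K1, by time.
    """
    return SEP.join((k1, f"{ts_millis:013d}"))


def timestamped_key(k1, k2, k3, ts_millis):
    """Option 3): row-key is K1::K2::K3::T with a zero-padded T."""
    return SEP.join((k1, k2, k3, f"{ts_millis:013d}"))


def alias_key(k1, k2, k3):
    """Option 4): a fixed-length alias for K1::K2::K3 (hypothetical choice: SHA-1)."""
    return hashlib.sha1(composite_key(k1, k2, k3).encode("utf-8")).hexdigest()


def k2_pattern(k2):
    """Regex matching option-1) row-keys whose second component equals k2."""
    return re.compile(r"^[^:]+::" + re.escape(k2) + r"::[^:]+$")


if __name__ == "__main__":
    keys = [
        timestamped_key("user1", "x", "y", t)
        for t in (1216951040000, 999, 1216951039000)
    ]
    # Lexicographic sort, which is how row-keys are ordered, lists these oldest first.
    for key in sorted(keys):
        print(key)
    print(bool(k2_pattern("x").match(composite_key("user1", "x", "y"))))
    print(alias_key("user1", "x", "y"))
```

One design point the sketch brings out: a hashed alias like the one above has a fixed length but does not keep rows with the same "K1" adjacent under lexicographic ordering, whereas the keys for 1), 2) and 3) do.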

Right now we are using 4). This is conceptually simple and works well, but before we lock this down we thought we ought to consider the other options. 1) has some attraction because it should take less space for key storage. But disk is cheap, right? 2) would seem to have some advantages for queries. The combination would seem to be the best of both, but at the same time it is not necessarily the best if 2) actually performs significantly worse than 1) for some reason related to storing only a single version per row.


Thanks,
Rick
