Hi,
I did a quick search of the Hbase-0.1.3 and Hbase-0.2.0 code and
couldn't find any constants that would provide an answer to the first
of two questions. In a Google search I did find in the Hbase archive
a comment that there is (was) no limit on row-key length in
Hbase-0.1.x and no limit on cell size was enforced in Hbase-0.1.x
other than a cell cannot be larger than the maximum region size.
So perhaps the first question, which was going to be about limits on
row-key length, is better changed to ask whether anyone has empirical
knowledge to share on whether there is some upper bound on row-key
length that ensures the best query performance?
This question actually arises out of a second question about best
practices in table design. We have a table in which each item of
interest could have a key with one primary component "K1" and two
additional secondary components "K2", "K3". We will be keeping "many"
versions of this item, differentiated by timestamp.
Is there any knowledge out there about which of these four options
would generally be expected to perform best, in the absence of any
empirical results or odd data patterns?
1) The row-key is the concatenation "K1::K2::K3" of the three keys
with inter-key separators chosen to make regular expression matching
on the components easy. The multiple copies of the item with a
specific set of values for the three key components, but different
timestamps, are stored as versions of the item in the row.
2) The row-key "R" is sufficient to be unique for each timestamped
version of an item and the keys "K1", "K2" and "K3" for the item are
columns in a single column family. In this case, each version of the
item is stored in a single row, and "R" would be generated from "K1"
and the timestamp in a way that gives good grouping to items with the
same value of "K1" under lexicographic ordering.
3) Combining 1) and 2), where the row key is "K1::K2::K3::T" and "T"
is an externally generated timestamp so that each version of each item
is stored in a separate row.
4) Combining 1) and 2) differently, where the row-key "R" is generated
as a 1-1 alias for "K1::K2::K3" and the keys "K1", "K2" and "K3" for
the item are columns in a single column family. The multiple copies
of the item with a specific set of values for the three key
components, but different timestamps, are stored as versions of the
item in the row.
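
To make the key layouts in options 1) to 3) concrete, here is a rough
Python sketch of how such row keys might be built and matched. It is
only an illustration, not HBase code: the function names, the "::"
separator, the 20-digit zero-padded timestamp, and the assumption that
key components never contain a colon are all hypothetical choices made
for this example.

```python
import re

SEP = "::"      # assumed separator; components must not contain ":"
TS_WIDTH = 20   # assumed fixed width, wide enough for a 64-bit timestamp


def composite_key(k1, k2, k3):
    """Option 1: the row key is the concatenation K1::K2::K3."""
    for part in (k1, k2, k3):
        if ":" in part:
            raise ValueError(f"component {part!r} contains a colon")
    return SEP.join((k1, k2, k3))


def padded_ts(ts):
    """Zero-pad a non-negative timestamp so string order matches numeric order."""
    return str(ts).zfill(TS_WIDTH)


def k1_time_key(k1, ts):
    """Option 2: a row key built from K1 and the timestamp only."""
    return k1 + SEP + padded_ts(ts)


def versioned_key(k1, k2, k3, ts):
    """Option 3: K1::K2::K3::T, one row per version."""
    return composite_key(k1, k2, k3) + SEP + padded_ts(ts)


def k2_pattern(k2):
    """Regex matching any option 1 key whose second component equals k2."""
    return re.compile(r"^[^:]+::" + re.escape(k2) + r"::[^:]+$")


if __name__ == "__main__":
    keys = [
        k1_time_key("beta", 5),
        k1_time_key("alpha", 1000),
        k1_time_key("alpha", 999),
    ]
    # Lexicographic sort groups rows by K1, then orders them by time.
    for key in sorted(keys):
        print(key)
```

The zero padding is what keeps rows with the same "K1" adjacent and in
time order under lexicographic comparison; without it, "1000" would
sort before "999".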
Right now we are using 4). This is conceptually simple and works
well, but before we lock this down we thought we ought to consider the
other options. 1) has some attraction because it should take less
space for key storage. But disk is cheap, right? 2) would seem to have
some advantages for queries. The combination would seem to be the
best of both, but at the same time is not necessarily the best if 2)
actually performs significantly worse than 1) for some reason related
to only storing a single version per row.
Thanks,
Rick