All the columns for any row key will be stored on one server, hosted by one
region.
The regions are split by row key, not by column.
So all the columns for rowX will be in only one region on one server.
A table is made up of regions: one to start with, and as more rows are added
the regions split by row.
Each region holds a range of the rows and all the columns for the rows in
its key range.
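Here is a rough Python sketch of that idea, in case it helps. It is a toy
model and not HBase code: the put() helper and the MAX_ROWS_PER_REGION
threshold are made up for illustration (real regions split on store size,
not row count). The point is only that a split moves whole rows, so every
column of a row, whatever its family, stays in one region.

```python
# Toy model, not HBase code: a table is a list of regions, and each region
# is a dict mapping row key -> all of that row's columns.
MAX_ROWS_PER_REGION = 4  # hypothetical threshold; HBase splits on size


def put(regions, row_key, columns):
    """Store a row in the region covering row_key, splitting if it grows too big."""
    # Regions are kept in key order; pick the last one whose first key <= row_key.
    idx = 0
    for i, region in enumerate(regions):
        if region and min(region) <= row_key:
            idx = i
    region = regions[idx]
    region.setdefault(row_key, {}).update(columns)

    if len(region) > MAX_ROWS_PER_REGION:
        # Split by row key only: the upper half of the rows moves to a new
        # region, and each moved row takes all of its columns with it.
        keys = sorted(region)
        mid = keys[len(keys) // 2]
        upper = {k: region.pop(k) for k in keys if k >= mid}
        regions.insert(idx + 1, upper)


regions = [{}]  # a new table starts with a single region
for i in range(10):
    put(regions, f"row{i:02d}", {"familyA:labelX": i, "familyB:labelY": -i})
```

After the loop the ten rows are spread over several regions, but any one
row, with both familyA and familyB values, is found in exactly one of them.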
Billy
"Ric Wang" <[email protected]> wrote in
message news:[email protected]...
Billy,
By saying "columns for key1 will not be on all the nodes but just one node
in the cluster", you really mean "columns of the SAME family for key1...",
right?
Please correct me if I am wrong, but I think for the row key "key1", the
data value of "familyA:labelX" and that of "familyB:labelY" can still be
stored on two different nodes because they are in two different families.
Is that correct?
Thanks in advance for your clarification.
-Ric
On Tue, Jun 9, 2009 at 8:35 PM, Billy Pearson
<[email protected]> wrote:
You should read over
http://wiki.apache.org/hadoop/Hbase/HbaseArchitecture
The data is sorted by row key, then column:label, then timestamp, in that
order. So if you have row key1, all the labels for columnval1 will be
stored together in the same file.
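As a small illustration of that ordering, here is a Python sketch. The cell
tuples and the sort_key() helper are invented for the example, and putting
newer timestamps first is an assumption of the sketch, not something stated
above.

```python
# Each cell is (row key, "column:label", timestamp, value). Made-up data.
cells = [
    ("key2", "columnval1:b", 5, "v4"),
    ("key1", "columnval2:a", 7, "v3"),
    ("key1", "columnval1:b", 3, "v2"),
    ("key1", "columnval1:a", 9, "v1"),
    ("key1", "columnval1:a", 4, "v0"),
]


def sort_key(cell):
    """Order by row key, then column:label, then timestamp."""
    row, column, timestamp, _value = cell
    # Assumption: newer timestamps sort first, hence the negation.
    return (row, column, -timestamp)


ordered = sorted(cells, key=sort_key)
for cell in ordered:
    print(cell)
```

All the cells for key1 come out next to each other, and within key1 all the
columnval1 labels are adjacent, which is what "stored together" refers to.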
We do flush more than one file to disk as data is added, so the values are
not always stored together until a major compaction merges all the store
files together.
But what we mean by stored together is that all of column1 will be stored
in one set of files and column2 in a separate set of files. So if you only
want data from column1, you only need to read the data from one set of
files, not all the columns for that row key.
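A minimal sketch of the flush and compaction behaviour, in Python. The
FamilyStore class and its method names are hypothetical and only model the
idea: one set of files per column, one new file per flush, and a major
compaction that merges them. It keeps just the latest value per label and
ignores timestamps and versions.

```python
from collections import defaultdict


class FamilyStore:
    """Hypothetical stand-in for the files belonging to one column."""

    def __init__(self):
        self.memory = {}  # recent writes not yet flushed
        self.files = []   # flushed files, oldest first

    def put(self, row, label, value):
        self.memory[(row, label)] = value

    def flush(self):
        # Each flush writes one new sorted file; older files are untouched.
        if self.memory:
            self.files.append(dict(sorted(self.memory.items())))
            self.memory = {}

    def major_compact(self):
        # Merge every file into one; values from newer files win.
        merged = {}
        for f in self.files:
            merged.update(f)
        self.files = [dict(sorted(merged.items()))]

    def get(self, row, label):
        # Check memory first, then files from newest to oldest.
        for source in [self.memory] + self.files[::-1]:
            if (row, label) in source:
                return source[(row, label)]
        return None


stores = defaultdict(FamilyStore)  # one store per column

stores["column1"].put("key1", "a", "old")
stores["column1"].flush()
stores["column1"].put("key1", "a", "new")
stores["column1"].put("key1", "b", "x")
stores["column1"].flush()
stores["column2"].put("key1", "c", "y")
stores["column2"].flush()

files_before = len(stores["column1"].files)  # files written by the flushes
value = stores["column1"].get("key1", "a")   # never touches column2's files
stores["column1"].major_compact()
files_after = len(stores["column1"].files)   # files left after the merge
```

Reading column1 for key1 only looks at the column1 store. Before the major
compaction that can mean checking more than one file; afterwards there is a
single merged file.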
Also, the columns for key1 will not be on all the nodes, but on just one
node in the cluster. The table is split by key value, so keys 1-100 would
be one region and keys 101-200 would be another region, all in the same
table.
When a region gets too large it splits and becomes two regions, and so on.
So when we look up a key we only have to look at one server.
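To make the lookup concrete, here is a rough Python sketch. The region
boundaries follow the 1-100 / 101-200 example above; the server names and
the locate() helper are made up, and integer keys are used only to keep it
short.

```python
import bisect

# Hypothetical layout: each region is known by its start key and is hosted
# on exactly one server. Keys 1-100 are one region, keys 101-200 another.
region_start_keys = [1, 101]
region_servers = ["serverA", "serverB"]


def locate(key):
    """Return the single server hosting the region that covers key."""
    # Find the last region whose start key is <= key.
    idx = bisect.bisect_right(region_start_keys, key) - 1
    if idx < 0:
        raise KeyError(f"no region covers key {key}")
    return region_servers[idx]


print(locate(42))
print(locate(150))
```

Finding a key is a search over the region boundaries followed by a request
to one server, so there is no need to ask every node in the cluster.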
Billy
"Ric Wang" <[email protected]> wrote in
message
news:[email protected]...
Hi,
I am very new to Hadoop and HBase, so sorry about the rudimentary question.
I store my artifacts as rows in an HBase table, and the attributes of each
artifact as labels within one single column family (e.g. myFamily). I may
have tens of thousands of labels, and millions and millions of rows. Now,
as the data size grows, some documentation says that the values of one
family will be "stored together". I wonder what that really means.
For example, for a given row key (my.key.123), will HBase guarantee that
ALL its attributes (i.e. the values of ALL the labels in "myFamily") are
stored on one physical/grid node? In other words, if I want to find the ONE
matching row key "my.key.123" based on its attributes (column values), at
the implementation level, will HBase be
1. traversing all the distributed nodes and interrogating the column
values; aggregating the results coming from all the nodes; and finally
finding the matching row key
or
2. doing atomic operations in parallel on each node locally; and finally,
only one node will return the matching row key (if there is a match).
My guess is that the answer depends on whether all attributes (in myFamily)
of a given row are stored on one and only one node.
Hope I didn't make my question too confusing. I am very new to column-based
databases; please help and bear with me.
Thanks!
Ric
--
Ric Wang
[email protected]