Regarding 1. below: HBase storage is closer to the A format you
describe.
A table has column families. Each column family is written to its own
HStore. An HStore has HStoreFiles (~SSTables in bigtable-speak).
HStoreFiles are hadoop MapFiles where the key is
row/columnname/timestamp (see the HStoreKey class) and the value is
the cell content ("<html>...." in your example) as bytes.
Regarding 2., how will you be accessing the data subsequently?
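
To make the answer to 1. a bit more concrete, here is a rough Python
sketch of the layout. It is not HBase code: StoreKey is a hypothetical
stand-in for HStoreKey, "contents:" is a made-up column name, the
integers stand in for t3..t8, and newest-first ordering of timestamps
is an assumption on my part.

```python
from typing import NamedTuple


class StoreKey(NamedTuple):
    """Hypothetical stand-in for HStoreKey: row / column name / timestamp."""
    row: str
    column: str
    timestamp: int


def sort_order(key: StoreKey):
    # Assumption: within one row and column, newer timestamps sort first.
    return (key.row, key.column, -key.timestamp)


# The cells from the question; "contents:" is a made-up column name and
# the integers stand in for t3..t8.
cells = {
    StoreKey("com.sohu.www", "contents:", 7): b"<html>...",
    StoreKey("com.cnn.www", "contents:", 3): b"<html>...",
    StoreKey("com.cnn.www", "contents:", 6): b"<html>...",
    StoreKey("com.sohu.www", "contents:", 8): b"<html>...",
    StoreKey("com.cnn.www", "contents:", 5): b"<html>...",
}

# A MapFile keeps its entries sorted by key; sorted() imitates that here.
sorted_keys = sorted(cells, key=sort_order)

for key in sorted_keys:
    # Every entry carries its full key, so the row key repeats (format A).
    print(key.row, key.column, key.timestamp, cells[key])
```

Each entry carries the complete key, so the row key is repeated for
every version of a cell, which is what your A format shows.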
St.Ack
Bin YANG wrote:
Hello,
I have several questions about the physical storage of HBase:
1. Does HBase store each table in A format:
"com.cnn.www", t6, "<html>...",
"com.cnn.www", t5, "<html>...",
"com.cnn.www", t3, "<html>...",
"com.sohu.www", t8, "<html>..."
"com.sohu.www", t7, "<html>..."
or B format:
"com.cnn.www", t6, "<html>...",
               t5, "<html>...",
               t3, "<html>...",
"com.sohu.www", t8, "<html>...",
                t7, "<html>..."
The A format treats RowKey and TimeStamp as the key, and wastes space by
repeating the RowKey "com.cnn.www" or "com.sohu.www" several times.
The B format treats RowKey as the key, and TimeStamp and Column as
attributes, so the rows do not all have the same format.
2. Another question: we may get several labels in the same
family at the same time. For example, we crawl a web page at time
t1, and the page contains 2 anchors, one is a.com, the other is b.com.
How should this be stored in HBase?
"com.cnn.www", t1, "anchor:a.com", "aaa",
"com.cnn.www", t1, "anchor:b.com", "bbb",
"com.cnn.www", t2, "anchor:c.com", "ccc"
or
"com.cnn.www", t1, "anchor:a.com", "aaa",
                   "anchor:b.com", "bbb",
"com.cnn.www", t2, "anchor:c.com", "ccc"
Thanks!
Bin YANG