Hello,
I have several questions about the physical storage of HBase:
1. Does HBase store each table in format A:
"com.cnn.www", t6, "<html>...",
"com.cnn.www", t5, "<html>...",
"com.cnn.www", t3, "<html>...",
"com.sohu.www", t8, "<html>...",
"com.sohu.www", t7, "<html>..."
or format B:
"com.cnn.www", t6, "<html>...",
               t5, "<html>...",
               t3, "<html>...",
"com.sohu.www", t8, "<html>...",
                t7, "<html>..."
Format A treats the RowKey and TimeStamp together as the key, and wastes
space by repeating the RowKey ("com.cnn.www" or "com.sohu.www") several times.
Format B treats only the RowKey as the key, with the TimeStamp and Column as
attributes, but then the rows do not all have the same format.
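
To make the space trade-off concrete, here is a small Python sketch of the two candidate layouts as I understand them. It only models my question and is not HBase's actual file format; the function names are my own, and I use the integers 6, 5, 3, 8, 7 to stand in for the timestamps t6, t5, t3, t8, t7.

```python
# Hypothetical model of formats A and B from the question above.
# This is NOT HBase's on-disk format; all names here are illustrative.

# (row key, timestamp, value); integer timestamps stand in for t6, t5, ...
cells = [
    ("com.cnn.www", 6, "<html>..."),
    ("com.cnn.www", 5, "<html>..."),
    ("com.cnn.www", 3, "<html>..."),
    ("com.sohu.www", 8, "<html>..."),
    ("com.sohu.www", 7, "<html>..."),
]


def layout_a(cells):
    """Format A: a flat list where every entry repeats the full row key."""
    # Sort by row key, then by newest timestamp first.
    return sorted(cells, key=lambda c: (c[0], -c[1]))


def layout_b(cells):
    """Format B: each row key is stored once, with its versions nested under it."""
    nested = {}
    for row, ts, value in layout_a(cells):
        nested.setdefault(row, []).append((ts, value))
    return nested


def row_key_chars_a(cells):
    """Characters spent on row keys when each entry repeats its key."""
    return sum(len(row) for row, _, _ in layout_a(cells))


def row_key_chars_b(cells):
    """Characters spent on row keys when each key is stored only once."""
    return sum(len(row) for row in layout_b(cells))
```

With the five example cells, `row_key_chars_a(cells)` counts every row key five times in total, while `row_key_chars_b(cells)` counts each of the two keys once, which is the waste I am asking about.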
2. Another question: we may get several labels in the same family at the
same time. For example, suppose we crawl a web page at time t1 and the page
contains two anchors, one is a.com and the other is b.com.
How should this be stored in HBase?
"com.cnn.www", t1, "anchor:a.com", "aaa",
"com.cnn.www", t1, "anchor:b.com", "bbb",
"com.cnn.www", t2, "anchor:c.com", "ccc"
or
"com.cnn.www", t1, "anchor:a.com", "aaa",
                   "anchor:b.com", "bbb",
"com.cnn.www", t2, "anchor:c.com", "ccc"
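
The two options above can be sketched in Python as follows. Again this is only a model of my question, not a claim about how HBase stores anything; the function names are hypothetical and the integers 1 and 2 stand in for t1 and t2.

```python
# Hypothetical model of the two options in question 2; names are illustrative.

# (row key, timestamp, column, value); integers 1 and 2 stand in for t1 and t2.
anchor_cells = [
    ("com.cnn.www", 1, "anchor:a.com", "aaa"),
    ("com.cnn.www", 1, "anchor:b.com", "bbb"),
    ("com.cnn.www", 2, "anchor:c.com", "ccc"),
]


def one_entry_per_label(cells):
    """First option: each label is its own entry, repeating row key and timestamp."""
    return list(cells)


def labels_grouped(cells):
    """Second option: labels with the same row key and timestamp share one entry."""
    grouped = {}
    for row, ts, column, value in cells:
        grouped.setdefault((row, ts), {})[column] = value
    return grouped
```

In the first option the example produces three entries; in the second it produces two, with both t1 anchors held under the single key `("com.cnn.www", 1)`.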
Thanks!
Bin YANG
--
Bin YANG
Department of Computer Science and Engineering
Fudan University
Shanghai, P. R. China
EMail: [EMAIL PROTECTED]