Hello.
I ran some tests with HBase 0.2.0 (RC2), focused on insertion and
timestamp behaviour. I got some surprising results, and I was wondering
whether other HBase users have already tried this kind of usage, and what
they concluded.
First of all, I created a table with the default column attributes, using
the hbase shell:
## TABLE
hbase(main):008:0> describe 'proxy-0.2'
{NAME => 'proxy-0.2', IS_ROOT => 'false', IS_META => 'false', FAMILIES => [
 {NAME => 'status', BLOOMFILTER => 'false', IN_MEMORY => 'false',
  LENGTH => '2147483647', BLOCKCACHE => 'false', VERSIONS => '3',
  TTL => '-1', COMPRESSION => 'NONE'},
 {NAME => 'header', BLOOMFILTER => 'false', IN_MEMORY => 'false',
  LENGTH => '2147483647', BLOCKCACHE => 'false', VERSIONS => '3',
  TTL => '-1', COMPRESSION => 'NONE'},
 {NAME => 'bytes', BLOOMFILTER => 'false', IN_MEMORY => 'false',
  LENGTH => '2147483647', BLOCKCACHE => 'false', VERSIONS => '3',
  TTL => '-1', COMPRESSION => 'NONE'},
 {NAME => 'info', BLOOMFILTER => 'false', IN_MEMORY => 'false',
  LENGTH => '2147483647', BLOCKCACHE => 'false', VERSIONS => '3',
  TTL => '-1', COMPRESSION => 'NONE'}]}
Test 1
I run a loop that inserts the same row with different values at different
timestamps, starting arbitrarily at 1000 and incrementing by 10. I have a
method for dumping the row history: it queries for the latest version, then
queries for each older version using the current version's timestamp minus
1. Note that my table object is created once for the entire life cycle of
the program.
## GLOBAL CODE
// somewhere in constructor
t = new HTable(conf, TABLE_NAME);

/**
 * Dump the reversed history of an HBase row, querying for older versions
 * using the max timestamp of all cells minus 1 until no cell is returned.
 * @param rowKey key of the row to dump
 */
private void dumpRowVersions(String rowKey) {
    Logger.log.info("Versions or row : " + rowKey);
    try {
        // first query: the newest version of the row
        RowResult rr = t.getRow(rowKey);
        int version = 1;
        long maxTs;
        do {
            maxTs = -1;
            String line = "";
            // go through all cells of the row
            for (Map.Entry<byte[], Cell> en : rr.entrySet()) {
                long ts = en.getValue().getTimestamp();
                maxTs = Math.max(maxTs, ts);
                line += new String(en.getKey());
                line += " => " + new String(en.getValue().getValue());
                line += " [" + ts + "], ";
            }
            // remove the trailing comma and space for cleaner output
            if (line.length() > 0) {
                line = line.substring(0, line.length() - 2);
            }
            // prefix result with a version counter and the max timestamp
            // found in the cells
            line = "#" + version + " MXTS[" + maxTs + "] " + line;
            if (maxTs != -1) {
                // there was a resulting cell: continue the iteration
                Logger.log.info(line);
                // get previous version
                version++;
                rr = t.getRow(rowKey, maxTs - 1);
            }
        } while (maxTs != -1);
    } catch (IOException ex) {
        throw new IllegalStateException(
                "Cannot fetch history of row " + rowKey, ex);
    }
}
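The backward walk that dumpRowVersions() performs (read the newest version,
then read again at that version's timestamp minus 1) can be tried without a
cluster. Below is a minimal sketch on a plain in-memory map.
VersionWalkSketch and its methods are made-up names, not part of the HBase
API, and the map only imitates a read that returns the newest cell at or
before a given timestamp.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy stand-in for one versioned cell. Hypothetical class, NOT the HBase API. */
final class VersionWalkSketch {
    // timestamp -> value, kept sorted by ascending timestamp
    private final TreeMap<Long, String> versions = new TreeMap<>();

    public void put(long ts, String value) {
        versions.put(ts, value);
    }

    /** Newest entry whose timestamp is <= ts, or null if there is none. */
    public Map.Entry<Long, String> getAtOrBefore(long ts) {
        return versions.floorEntry(ts);
    }

    /** Walks the history newest-first by re-querying at (last timestamp - 1). */
    public List<String> history() {
        List<String> out = new ArrayList<>();
        Map.Entry<Long, String> e = getAtOrBefore(Long.MAX_VALUE);
        while (e != null) {
            out.add(e.getValue() + " [" + e.getKey() + "]");
            // step just below the version we have just seen
            e = getAtOrBefore(e.getKey() - 1);
        }
        return out;
    }

    public static void main(String[] args) {
        VersionWalkSketch cell = new VersionWalkSketch();
        for (long ts = 1000; ts <= 1020; ts += 10) {
            cell.put(ts, "valbytes ts " + ts);
        }
        cell.history().forEach(System.out::println);
    }
}
```

With timestamps 1000, 1010 and 1020, main prints the three versions newest
first, which is the shape of output that Test 1 produces.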
## LOOP CODE
long ts = 1000;
do {
    // insert the testrow with a new timestamp
    BatchUpdate bu = new BatchUpdate("testrow", ts);
    bu.put("bytes:", ("valbytes ts " + ts).getBytes());
    bu.put("status:", ("valstat ts" + ts).getBytes());
    t.commit(bu);
    Logger.log.info("-- Inserted ts " + ts);
    // dump row history
    Thread.sleep(70);
    dumpRowVersions("testrow");
    // next iteration in two seconds
    ts += 10;
    Thread.sleep(2000);
} while (true);
## OUTPUT
> Connecting to hbase master...
> -- Inserted ts 1000
> Versions or row : testrow
> #1 MXTS[1000] bytes: => valbytes ts 1000 [1000], status: => valstat
ts1000 [1000]
> -- Inserted ts 1010
> Versions or row : testrow
> #1 MXTS[1010] bytes: => valbytes ts 1010 [1010], status: => valstat
ts1010 [1010]
> #2 MXTS[1000] bytes: => valbytes ts 1000 [1000], status: => valstat
ts1000 [1000]
> -- Inserted ts 1020
> Versions or row : testrow
> #1 MXTS[1020] bytes: => valbytes ts 1020 [1020], status: => valstat
ts1020 [1020]
> #2 MXTS[1010] bytes: => valbytes ts 1010 [1010], status: => valstat
ts1010 [1010]
> #3 MXTS[1000] bytes: => valbytes ts 1000 [1000], status: => valstat
ts1000 [1000]
> -- Inserted ts 1030
> Versions or row : testrow
> #1 MXTS[1030] bytes: => valbytes ts 1030 [1030], status: => valstat
ts1030 [1030]
> #2 MXTS[1020] bytes: => valbytes ts 1020 [1020], status: => valstat
ts1020 [1020]
> #3 MXTS[1010] bytes: => valbytes ts 1010 [1010], status: => valstat
ts1010 [1010]
> #4 MXTS[1000] bytes: => valbytes ts 1000 [1000], status: => valstat
ts1000 [1000]
> -- Inserted ts 1040
> Versions or row : testrow
> #1 MXTS[1040] bytes: => valbytes ts 1040 [1040], status: => valstat
ts1040 [1040]
> #2 MXTS[1030] bytes: => valbytes ts 1030 [1030], status: => valstat
ts1030 [1030]
> #3 MXTS[1020] bytes: => valbytes ts 1020 [1020], status: => valstat
ts1020 [1020]
> #4 MXTS[1010] bytes: => valbytes ts 1010 [1010], status: => valstat
ts1010 [1010]
> #5 MXTS[1000] bytes: => valbytes ts 1000 [1000], status: => valstat
ts1000 [1000]
> -- Inserted ts 1050
> Versions or row : testrow
> #1 MXTS[1050] bytes: => valbytes ts 1050 [1050], status: => valstat
ts1050 [1050]
> #2 MXTS[1040] bytes: => valbytes ts 1040 [1040], status: => valstat
ts1040 [1040]
> #3 MXTS[1030] bytes: => valbytes ts 1030 [1030], status: => valstat
ts1030 [1030]
> #4 MXTS[1020] bytes: => valbytes ts 1020 [1020], status: => valstat
ts1020 [1020]
> #5 MXTS[1010] bytes: => valbytes ts 1010 [1010], status: => valstat
ts1010 [1010]
> #6 MXTS[1000] bytes: => valbytes ts 1000 [1000], status: => valstat
ts1000 [1000]
> -- Inserted ts 1060
> Versions or row : testrow
> #1 MXTS[1060] bytes: => valbytes ts 1060 [1060], status: => valstat
ts1060 [1060]
> #2 MXTS[1050] bytes: => valbytes ts 1050 [1050], status: => valstat
ts1050 [1050]
> #3 MXTS[1040] bytes: => valbytes ts 1040 [1040], status: => valstat
ts1040 [1040]
> #4 MXTS[1030] bytes: => valbytes ts 1030 [1030], status: => valstat
ts1030 [1030]
> #5 MXTS[1020] bytes: => valbytes ts 1020 [1020], status: => valstat
ts1020 [1020]
> #6 MXTS[1010] bytes: => valbytes ts 1010 [1010], status: => valstat
ts1010 [1010]
> #7 MXTS[1000] bytes: => valbytes ts 1000 [1000], status: => valstat
ts1000 [1000]
> -- Inserted ts 1070
> Versions or row : testrow
> #1 MXTS[1070] bytes: => valbytes ts 1070 [1070], status: => valstat
ts1070 [1070]
> #2 MXTS[1060] bytes: => valbytes ts 1060 [1060], status: => valstat
ts1060 [1060]
> #3 MXTS[1050] bytes: => valbytes ts 1050 [1050], status: => valstat
ts1050 [1050]
> #4 MXTS[1040] bytes: => valbytes ts 1040 [1040], status: => valstat
ts1040 [1040]
> #5 MXTS[1030] bytes: => valbytes ts 1030 [1030], status: => valstat
ts1030 [1030]
> #6 MXTS[1020] bytes: => valbytes ts 1020 [1020], status: => valstat
ts1020 [1020]
> #7 MXTS[1010] bytes: => valbytes ts 1010 [1010], status: => valstat
ts1010 [1010]
> #8 MXTS[1000] bytes: => valbytes ts 1000 [1000], status: => valstat
ts1000 [1000]
> -- Inserted ts 1080
> Versions or row : testrow
> #1 MXTS[1080] bytes: => valbytes ts 1080 [1080], status: => valstat
ts1080 [1080]
> #2 MXTS[1070] bytes: => valbytes ts 1070 [1070], status: => valstat
ts1070 [1070]
> #3 MXTS[1060] bytes: => valbytes ts 1060 [1060], status: => valstat
ts1060 [1060]
> #4 MXTS[1050] bytes: => valbytes ts 1050 [1050], status: => valstat
ts1050 [1050]
> #5 MXTS[1040] bytes: => valbytes ts 1040 [1040], status: => valstat
ts1040 [1040]
> #6 MXTS[1030] bytes: => valbytes ts 1030 [1030], status: => valstat
ts1030 [1030]
> #7 MXTS[1020] bytes: => valbytes ts 1020 [1020], status: => valstat
ts1020 [1020]
> #8 MXTS[1010] bytes: => valbytes ts 1010 [1010], status: => valstat
ts1010 [1010]
> #9 MXTS[1000] bytes: => valbytes ts 1000 [1000], status: => valstat
ts1000 [1000]
> -- Inserted ts 1090
> Versions or row : testrow
> #1 MXTS[1090] bytes: => valbytes ts 1090 [1090], status: => valstat
ts1090 [1090]
> #2 MXTS[1080] bytes: => valbytes ts 1080 [1080], status: => valstat
ts1080 [1080]
> #3 MXTS[1070] bytes: => valbytes ts 1070 [1070], status: => valstat
ts1070 [1070]
> #4 MXTS[1060] bytes: => valbytes ts 1060 [1060], status: => valstat
ts1060 [1060]
> #5 MXTS[1050] bytes: => valbytes ts 1050 [1050], status: => valstat
ts1050 [1050]
> #6 MXTS[1040] bytes: => valbytes ts 1040 [1040], status: => valstat
ts1040 [1040]
> #7 MXTS[1030] bytes: => valbytes ts 1030 [1030], status: => valstat
ts1030 [1030]
> #8 MXTS[1020] bytes: => valbytes ts 1020 [1020], status: => valstat
ts1020 [1020]
> #9 MXTS[1010] bytes: => valbytes ts 1010 [1010], status: => valstat
ts1010 [1010]
> #10 MXTS[1000] bytes: => valbytes ts 1000 [1000], status: => valstat
ts1000 [1000]
> -- Inserted ts 1100
> Versions or row : testrow
> #1 MXTS[1100] bytes: => valbytes ts 1100 [1100], status: => valstat
ts1100 [1100]
> #2 MXTS[1090] bytes: => valbytes ts 1090 [1090], status: => valstat
ts1090 [1090]
> #3 MXTS[1080] bytes: => valbytes ts 1080 [1080], status: => valstat
ts1080 [1080]
> #4 MXTS[1070] bytes: => valbytes ts 1070 [1070], status: => valstat
ts1070 [1070]
> #5 MXTS[1060] bytes: => valbytes ts 1060 [1060], status: => valstat
ts1060 [1060]
> #6 MXTS[1050] bytes: => valbytes ts 1050 [1050], status: => valstat
ts1050 [1050]
> #7 MXTS[1040] bytes: => valbytes ts 1040 [1040], status: => valstat
ts1040 [1040]
> #8 MXTS[1030] bytes: => valbytes ts 1030 [1030], status: => valstat
ts1030 [1030]
> #9 MXTS[1020] bytes: => valbytes ts 1020 [1020], status: => valstat
ts1020 [1020]
> #10 MXTS[1010] bytes: => valbytes ts 1010 [1010], status: => valstat
ts1010 [1010]
> #11 MXTS[1000] bytes: => valbytes ts 1000 [1000], status: => valstat
ts1000 [1000]
Despite the VERSIONS parameter of the column families (3), it seems that all
versions are stored.
Question: is there some garbage-collection process that removes the old
versions? If so, when does it run?
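The expectation behind this question is that a column family declared with
VERSIONS => '3' would keep only the three newest timestamps per cell. A toy
in-memory sketch of that expectation follows. The class and method names are
made up for illustration, and trimming eagerly on every put is an assumption
of the sketch, not a statement about when or how HBase actually prunes old
versions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Toy model of a cell that retains at most maxVersions timestamps. NOT HBase code. */
final class BoundedVersionsSketch {
    private final int maxVersions;
    // timestamp -> value, kept sorted by ascending timestamp
    private final TreeMap<Long, String> versions = new TreeMap<>();

    public BoundedVersionsSketch(int maxVersions) {
        this.maxVersions = maxVersions;
    }

    /** Stores the value, then drops the oldest timestamps beyond the limit. */
    public void put(long ts, String value) {
        versions.put(ts, value);
        while (versions.size() > maxVersions) {
            versions.pollFirstEntry(); // smallest key = oldest version
        }
    }

    /** Retained timestamps, newest first. */
    public List<Long> timestamps() {
        return new ArrayList<>(versions.descendingKeySet());
    }

    public static void main(String[] args) {
        BoundedVersionsSketch cell = new BoundedVersionsSketch(3);
        for (long ts = 1000; ts <= 1100; ts += 10) {
            cell.put(ts, "valbytes ts " + ts);
        }
        System.out.println(cell.timestamps()); // prints [1100, 1090, 1080]
    }
}
```

Under this model the history dump after the insert at 1100 would stop at
version #3, whereas the output above goes back to #11.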
Test 2
A bit more surprising: I delete my row, using the deleteall command in the
shell:
# SHELL
hbase(main):001:0> scan 'proxy-0.2'
ROW COLUMN+CELL
testrow column=bytes:, timestamp=1100, value=valbytes ts 1100
testrow column=status:, timestamp=1100, value=valstat ts1100
2 row(s) in 0.3560 seconds
hbase(main):002:0> deleteall 'proxy-0.2', 'testrow'
0 row(s) in 0.1050 seconds
hbase(main):003:0> scan 'proxy-0.2'
ROW COLUMN+CELL
0 row(s) in 0.2540 seconds
The table is now empty, and running my dumpRowVersions() method confirms
it. OK. Now I launch Test 1 again, restarting from timestamp 1000:
# OUTPUT
> Connecting to hbase master...
> -- Inserted ts 1000
> Versions or row : testrow
> -- Inserted ts 1010
> Versions or row : testrow
> -- Inserted ts 1020
> Versions or row : testrow
> -- Inserted ts 1030
> Versions or row : testrow
> -- Inserted ts 1040
> Versions or row : testrow
> -- Inserted ts 1050
> Versions or row : testrow
> -- Inserted ts 1060
> Versions or row : testrow
> -- Inserted ts 1070
> Versions or row : testrow
It seems that the rows are not inserted. Querying from the shell:
# SHELL
hbase(main):004:0> scan 'proxy-0.2'
ROW COLUMN+CELL
0 row(s) in 0.2030 seconds
But if I allow the program to run more iterations than the first time
(ts > 1100), the newer timestamps are taken into account, as if the table
remembered the previous maximum value of the timestamp.
Relaunching the code of Test 1:
# OUTPUT
> Connecting to hbase master...
> -- Inserted ts 1000
> Versions or row : testrow
> -- Inserted ts 1010
> Versions or row : testrow
> -- Inserted ts 1020
> Versions or row : testrow
> -- Inserted ts 1030
> Versions or row : testrow
> -- Inserted ts 1040
> Versions or row : testrow
> -- Inserted ts 1050
> Versions or row : testrow
> -- Inserted ts 1060
> Versions or row : testrow
> -- Inserted ts 1070
> Versions or row : testrow
> -- Inserted ts 1080
> Versions or row : testrow
> -- Inserted ts 1090
> Versions or row : testrow
> -- Inserted ts 1100
> Versions or row : testrow
> #1 MXTS[1100] bytes: => valbytes ts 1100 [1100]
> #2 MXTS[1090] bytes: => valbytes ts 1090 [1090], status: => valstat
ts1090 [1090]
> #3 MXTS[1080] bytes: => valbytes ts 1080 [1080], status: => valstat
ts1080 [1080]
> #4 MXTS[1070] bytes: => valbytes ts 1070 [1070], status: => valstat
ts1070 [1070]
> #5 MXTS[1060] bytes: => valbytes ts 1060 [1060], status: => valstat
ts1060 [1060]
> #6 MXTS[1050] bytes: => valbytes ts 1050 [1050], status: => valstat
ts1050 [1050]
> #7 MXTS[1040] bytes: => valbytes ts 1040 [1040], status: => valstat
ts1040 [1040]
> #8 MXTS[1030] bytes: => valbytes ts 1030 [1030], status: => valstat
ts1030 [1030]
> #9 MXTS[1020] bytes: => valbytes ts 1020 [1020], status: => valstat
ts1020 [1020]
> #10 MXTS[1010] bytes: => valbytes ts 1010 [1010], status: => valstat
ts1010 [1010]
> #11 MXTS[1000] bytes: => valbytes ts 1000 [1000], status: => valstat
ts1000 [1000]
> -- Inserted ts 1110
> Versions or row : testrow
> #1 MXTS[1110] bytes: => valbytes ts 1110 [1110]
> #2 MXTS[1100] bytes: => valbytes ts 1100 [1100], status: => valstat
ts1100 [1100]
> #3 MXTS[1090] bytes: => valbytes ts 1090 [1090], status: => valstat
ts1090 [1090]
> #4 MXTS[1080] bytes: => valbytes ts 1080 [1080], status: => valstat
ts1080 [1080]
> #5 MXTS[1070] bytes: => valbytes ts 1070 [1070], status: => valstat
ts1070 [1070]
> #6 MXTS[1060] bytes: => valbytes ts 1060 [1060], status: => valstat
ts1060 [1060]
> #7 MXTS[1050] bytes: => valbytes ts 1050 [1050], status: => valstat
ts1050 [1050]
> #8 MXTS[1040] bytes: => valbytes ts 1040 [1040], status: => valstat
ts1040 [1040]
> #9 MXTS[1030] bytes: => valbytes ts 1030 [1030], status: => valstat
ts1030 [1030]
> #10 MXTS[1020] bytes: => valbytes ts 1020 [1020], status: => valstat
ts1020 [1020]
> #11 MXTS[1010] bytes: => valbytes ts 1010 [1010], status: => valstat
ts1010 [1010]
> #12 MXTS[1000] bytes: => valbytes ts 1000 [1000], status: => valstat
ts1000 [1000]
> -- Inserted ts 1120
> Versions or row : testrow
> #1 MXTS[1120] bytes: => valbytes ts 1120 [1120]
> #2 MXTS[1110] bytes: => valbytes ts 1110 [1110], status: => valstat
ts1110 [1110]
> #3 MXTS[1100] bytes: => valbytes ts 1100 [1100], status: => valstat
ts1100 [1100]
> #4 MXTS[1090] bytes: => valbytes ts 1090 [1090], status: => valstat
ts1090 [1090]
> #5 MXTS[1080] bytes: => valbytes ts 1080 [1080], status: => valstat
ts1080 [1080]
> #6 MXTS[1070] bytes: => valbytes ts 1070 [1070], status: => valstat
ts1070 [1070]
> #7 MXTS[1060] bytes: => valbytes ts 1060 [1060], status: => valstat
ts1060 [1060]
> #8 MXTS[1050] bytes: => valbytes ts 1050 [1050], status: => valstat
ts1050 [1050]
> #9 MXTS[1040] bytes: => valbytes ts 1040 [1040], status: => valstat
ts1040 [1040]
> #10 MXTS[1030] bytes: => valbytes ts 1030 [1030], status: => valstat
ts1030 [1030]
> #11 MXTS[1020] bytes: => valbytes ts 1020 [1020], status: => valstat
ts1020 [1020]
> #12 MXTS[1010] bytes: => valbytes ts 1010 [1010], status: => valstat
ts1010 [1010]
> #13 MXTS[1000] bytes: => valbytes ts 1000 [1000], status: => valstat
ts1000 [1000]
> -- Inserted ts 1130
> Versions or row : testrow
> #1 MXTS[1130] bytes: => valbytes ts 1130 [1130]
> #2 MXTS[1120] bytes: => valbytes ts 1120 [1120], status: => valstat
ts1120 [1120]
> #3 MXTS[1110] bytes: => valbytes ts 1110 [1110], status: => valstat
ts1110 [1110]
> #4 MXTS[1100] bytes: => valbytes ts 1100 [1100], status: => valstat
ts1100 [1100]
> #5 MXTS[1090] bytes: => valbytes ts 1090 [1090], status: => valstat
ts1090 [1090]
> #6 MXTS[1080] bytes: => valbytes ts 1080 [1080], status: => valstat
ts1080 [1080]
> #7 MXTS[1070] bytes: => valbytes ts 1070 [1070], status: => valstat
ts1070 [1070]
> #8 MXTS[1060] bytes: => valbytes ts 1060 [1060], status: => valstat
ts1060 [1060]
> #9 MXTS[1050] bytes: => valbytes ts 1050 [1050], status: => valstat
ts1050 [1050]
> #10 MXTS[1040] bytes: => valbytes ts 1040 [1040], status: => valstat
ts1040 [1040]
> #11 MXTS[1030] bytes: => valbytes ts 1030 [1030], status: => valstat
ts1030 [1030]
> #12 MXTS[1020] bytes: => valbytes ts 1020 [1020], status: => valstat
ts1020 [1020]
> #13 MXTS[1010] bytes: => valbytes ts 1010 [1010], status: => valstat
ts1010 [1010]
> #14 MXTS[1000] bytes: => valbytes ts 1000 [1000], status: => valstat
ts1000 [1000]
Once the timestamp reaches a newer value, the row is inserted. Moreover,
the previous insertions appear!
Notice another problem: the latest insertion is missing one cell, the
'status:' column.
Scanning the table from the shell gives the same result:
# SHELL
hbase(main):003:0> scan 'proxy-0.2'
ROW COLUMN+CELL
testrow column=bytes:, timestamp=1130, value=valbytes ts 1130
Restarting HBase with the stop-hbase.sh / start-hbase.sh scripts leads to
another unexpected behaviour.
When I run the scan command in the shell, I get the same result as above:
# SHELL
hbase(main):001:0> scan 'proxy-0.2'
ROW COLUMN+CELL
testrow column=bytes:, timestamp=1130, value=valbytes ts 1130
But if I run the dumpRowVersions() method, it appears that most of the
history of the status: column is lost.
Note that I tried many times and never got the same behaviour twice here:
sometimes the other column is missing, or the row is entirely lost, giving
no result at all.
# OUTPUT
> #1 MXTS[1130] bytes: => valbytes ts 1130 [1130]
> #2 MXTS[1120] bytes: => valbytes ts 1120 [1120], status: => valstat
ts1120 [1120]
> #3 MXTS[1110] bytes: => valbytes ts 1110 [1110]
> #4 MXTS[1100] bytes: => valbytes ts 1100 [1100]
> #5 MXTS[1090] bytes: => valbytes ts 1090 [1090]
> #6 MXTS[1080] bytes: => valbytes ts 1080 [1080]
> #7 MXTS[1070] bytes: => valbytes ts 1070 [1070]
> #8 MXTS[1060] bytes: => valbytes ts 1060 [1060]
> #9 MXTS[1050] bytes: => valbytes ts 1050 [1050]
> #10 MXTS[1040] bytes: => valbytes ts 1040 [1040]
> #11 MXTS[1030] bytes: => valbytes ts 1030 [1030]
> #12 MXTS[1020] bytes: => valbytes ts 1020 [1020]
> #13 MXTS[1010] bytes: => valbytes ts 1010 [1010]
> #14 MXTS[1000] bytes: => valbytes ts 1000 [1000]
I tried other tests: replacing only one column, using an existing timestamp
to modify a single value, inserting past values, and so on. My conclusion
is that either I don't understand the general behaviour, or I am misusing
the API.
However, normal insertion and normal querying (I mean without any explicit
timestamp) give me coherent and predictable results, as does normal
insertion combined with querying at past timestamps.
Thanks for your work. If anyone has more information about timestamps and
the intended behaviour, I'm very interested.
Have a nice day.
--
-- Jean-Adrien
--
View this message in context:
http://www.nabble.com/Insertion-and-timestamps-test-tp18890143p18890143.html
Sent from the HBase User mailing list archive at Nabble.com.