Lars George wrote:
Hi,
I was wondering if there is a low-cost (as in memory) and fast way to
check if a certain cell already exists? I need to insert a cell and,
depending on whether it was already there, increase a counter (the
total number of entries in a table).
Does the count of elements have to be up-to-date? Why not just scan the
table every hour or so to get a count? (Scans are fast in 0.19.0. Seven
times faster than they were in 0.17.x and probably 100 times faster than
what they are in 0.1.3 -- smile).
I see that HTable.get(...) returns the byte array, which means memory,
reading and network streaming are all involved.
Yes.
So if I do a
if (table.get(row, col) == null) { incr(counter); }
table.put(...);
this seems like a waste of resources and may not be as fast as a true
if (!table.exists(row, col)) { incr(counter); }
table.put(...);
It's tough. Ideal would be a bloom filter on the column. You'd check
for presence of a Cell in bloom filter. It'd come back yes/no. Would
be an in-memory test but would involve a network trip (Maybe have a
client-side bloomfilter too? So, if exists, would save the network trip?).
The hard part about a bloom filter though is that you would have to
specify exact coordinates, as in exact row/column/timestamp. The
row/column part is easy but the timestamp less so. When you insert, you
probably do not specify a timestamp, letting the system set the
timestamp to now. If you then want to test existence in a bloomfilter,
how are you going to do it if you don't have the exact timestamp? So,
you end up using the hbase get(row, column) because it will return the
latest insert if no timestamp is specified.
Otherwise, looks like you would be happy with a bloomfilter that just
recorded the row and column and not timestamp. That'd work. I think
this is how bloomfilters work now in latest hbase. We need to check.
They used to be row/column/timestamp (They are broken till we release
0.19.0 though -- in about a month).
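
To make the row/column-only idea concrete, here is a minimal sketch in
plain Java. RowColumnBloomFilter is a hypothetical class written for
this mail, not the hbase implementation; the bit count, hash count and
hash function are arbitrary choices. The point is only that the key
leaves the timestamp out, so you can probe without knowing it. Note
that a "no" from the filter is definite, while a "yes" only means
"maybe".

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

/** Hypothetical Bloom filter keyed on row + column only (no timestamp). */
class RowColumnBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    RowColumnBloomFilter(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    /** Records that some cell exists at row/column, whatever its timestamp. */
    void add(String row, String column) {
        byte[] key = key(row, column);
        for (int i = 0; i < numHashes; i++) {
            bits.set(index(key, i));
        }
    }

    /** false = definitely absent; true = possibly present (false positives happen). */
    boolean mightContain(String row, String column) {
        byte[] key = key(row, column);
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(key, i))) {
                return false;
            }
        }
        return true;
    }

    private static byte[] key(String row, String column) {
        // NUL separator so ("ab", "c") and ("a", "bc") give different keys.
        return (row + '\u0000' + column).getBytes(StandardCharsets.UTF_8);
    }

    private int index(byte[] key, int seed) {
        // Seeded FNV-1a style hash; an illustrative choice, nothing more.
        int h = 0x811C9DC5 ^ (seed * 0x9E3779B9);
        for (byte b : key) {
            h ^= (b & 0xFF);
            h *= 0x01000193;
        }
        return Math.floorMod(h, numBits);
    }

    public static void main(String[] args) {
        RowColumnBloomFilter filter = new RowColumnBloomFilter(1 << 16, 3);
        filter.add("row1", "info:name");
        System.out.println(filter.mightContain("row1", "info:name"));
    }
}
```

Because the timestamp is not part of the key, an insert that lets the
system pick "now" can still be found later by row and column alone.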
It looks like this is easily doable since get() also delegates to the
region servers.
Am I missing something? Assuming HTable is sort of a Set
implementation I am confused as to why this check is missing.
Well, it's not that straightforward. The only place to check presence
of a column is by actually asking hbase and letting it check its
memcache and then all of its storefiles. This is the only way to see if
a row/column combination exists. There is no short-circuit, say, a Set
that holds all row/column combinations, because it could be massive if
a row had millions of columns (nothing to prevent this happening).
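
A rough picture of that lookup, as a toy model in plain Java. ToyStore
and everything in it are made up for illustration and are not hbase
classes; deletes, timestamps and on-disk formats are ignored. It only
shows why an existence check has to consult the memcache and then each
storefile in turn, and why a miss is the most expensive answer.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of one store: an in-memory buffer plus flushed, immutable files. */
class ToyStore {
    private Set<String> memcache = new HashSet<>();
    private final List<Set<String>> storeFiles = new ArrayList<>(); // newest first
    int probes; // number of structures consulted by the last exists() call

    void put(String row, String column) {
        memcache.add(key(row, column));
    }

    /** Moves the current memcache contents into a new "storefile". */
    void flush() {
        storeFiles.add(0, memcache);
        memcache = new HashSet<>();
    }

    /** Checks the memcache first, then every storefile until a hit. */
    boolean exists(String row, String column) {
        String k = key(row, column);
        probes = 1;
        if (memcache.contains(k)) {
            return true;
        }
        for (Set<String> file : storeFiles) {
            probes++;
            if (file.contains(k)) {
                return true;
            }
        }
        return false; // a miss had to look everywhere
    }

    private static String key(String row, String column) {
        return row + '\u0000' + column;
    }

    public static void main(String[] args) {
        ToyStore store = new ToyStore();
        store.put("r1", "a");
        store.flush();
        store.put("r2", "b");
        System.out.println(store.exists("r1", "a") + " after " + store.probes + " probes");
    }
}
```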
Is the lookup taking too long? In 0.19.0, the speeds are all up. There
is a cache of file blocks maintained in the server. If you can hit the
cache, then you can see lookup rates double and even quadruple.
Would a bloom filter in your client help?
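
Something along these lines, as a sketch only. CountingClient is a
hypothetical class; the HashMap stands in for the remote table where a
real client would call hbase, and the tiny two-hash filter is just for
illustration. It assumes this client is the only writer and that the
filter has seen every key already in the table (for example, the table
started empty). Under those assumptions a "no" from the filter means
the cell is new and the network trip can be skipped; a "maybe" still
needs a real lookup.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical single-writer client that counts distinct row/column cells. */
class CountingClient {
    private static final int NUM_BITS = 1 << 16;

    // Stand-in for the remote table; a real client would talk to hbase here.
    private final Map<String, byte[]> remoteTable = new HashMap<>();
    private final BitSet filter = new BitSet(NUM_BITS);

    long counter = 0;       // number of distinct cells inserted
    int remoteLookups = 0;  // how often we had to ask the "server"

    void insert(String row, String column, byte[] value) {
        String key = row + '\u0000' + column;
        boolean isNew;
        if (!mightContain(key)) {
            // Filter says definitely not seen before: no lookup needed.
            isNew = true;
        } else {
            // Possibly seen (or a false positive): fall back to a real lookup.
            remoteLookups++;
            isNew = !remoteTable.containsKey(key);
        }
        if (isNew) {
            counter++;
        }
        remoteTable.put(key, value);
        filter.set(h1(key));
        filter.set(h2(key));
    }

    private boolean mightContain(String key) {
        return filter.get(h1(key)) && filter.get(h2(key));
    }

    private static int h1(String key) {
        return Math.floorMod(key.hashCode(), NUM_BITS);
    }

    private static int h2(String key) {
        return Math.floorMod(key.hashCode() * 31 + 17, NUM_BITS);
    }

    public static void main(String[] args) {
        CountingClient client = new CountingClient();
        client.insert("r1", "c1", new byte[] {1});
        client.insert("r1", "c1", new byte[] {2}); // overwrite, not a new cell
        System.out.println(client.counter + " cells, " + client.remoteLookups + " lookups");
    }
}
```

If other clients write to the same table, the filter cannot be trusted
for "no" answers and you are back to asking the server.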
St.Ack