[
https://issues.apache.org/jira/browse/HBASE-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12704271#action_12704271
]
Jonathan Gray commented on HBASE-1249:
--------------------------------------
(this was written together with erik holstad)
@Stack
We don't have any profiling stats or end-to-end numbers at the moment, but we
do have are some early naive timing test that compare the 2 implementations and
already for very few inserts like 10 to 15 entries with no deletes in 2
storefiles + memcache I saw around 2x improvment. These test were only done on
the server and I have not tested for bigger tables and from the client, since I
only got the rpc to work the other day and we've done some reworks since then,
but it looks promising.
When we looked at the get code and how to make it faster we started to look
what structures were used and how many compares you have to do to get your
result. Because to me the way to speed things up is to cut down the total
number of compares that you need to do to get your result. Of course the total
fetch time is made up by a number of causes, but on the server, outside of
HFile/HDFS, where I think most time is spent is comparing entries to see if
they should be included in the result or not (this decision made by looking at
the query to see if it was asked for, comparing for deletes, checking for ttl,
maxVersions requested, timeRange, etc).
So what I saw when looking at the old code is that for every new entry from the
storefile you compare it to all entries in the get set and then for every match
between those you have to check to see it is has been deleted.
Let {{*k*}} be the number KeyValues we will look at. Let {{*l*}} be the number
of columns asked for by the client.
Old Way: We compare every entry from the storefile (k) with every entry from
the getset (l) which will give you {{*k * l*}} compares in total.
If the getset is sorted (storefile/memcache always is), then you are just doing
a sorted merge between two sorted lists.
New Way: We do a sorted merge between the entries in the storefile (k) with the
entries in the sorted getset (l) which will give you {{*k + l*}} compares in
total.
The difference between {{*k * l*}} and {{*k + l*}} can be significant if you
are matching more than one column.
Notes:
- current plan is to build the sortedset behind the scenes on the client in the
Get
- this insertion sort will have a negligible impact on clients and no
additional work will be required server-side
- we will send the sortedset over the wire as a list. in that case, it will
not have to be re-sorted.
(this last optimization is important. right now we do a LOT of serializing a
sorted tree, sending it over the wire, and then deserializing it. the
deserializing of trees actually rebuilds the entire tree, thus reperforming the
sort. using lists (which are sorted) will give us time and memory savings.)
On the delete part, we have two things to deal with. Checking current KVs
against the current DeleteSet, and adding newly seen deletes into the DeleteSet.
Let {{*k*}} be the number of KeyValues we will look at. Let {{*m*}} be the
number of deletes we have already seen. Let {{*n*}} be the number of new
deletes we will see.
Old way of checking current KVs agains the current DeleteSet:
- DeleteSet is a sortedSet, so it has {{*log(m)*}}
inserts/gets/deletes/contains operations.
- For each KV seen (k), we have a {{*log(m)*}} operation to see if it is
deleted.
- This will give you {{*k * log(m)*}} total comparisons.
Old way of inserting newly seen delete KVs into current DeleteSet:
- When we see a delete, we insert it to the deleteset in {{*log(m)*}}. This
has downsides.
- We're adding deletes which we know will not be relevant in this storefile
(increasing m increases subsequent delete checks so we have log(m + n)
eventually).
- We're doing work which we may not potentially ever have to do (merging newly
seen delete with previous set). If we are going to early-out of this storefile
for some reason, there would never have been a reason to pay anything but
constant-time cost for deletes related to older storefiles which will not end
up being opened.
- Trees are bad! Their allocation of lots of small nodes is a disaster for GC.
We should use lists everywhere we can.
- This gives you {{*n * log(m)*}} total comparisons.
The new approach is similiar to how we now handle the get list. Deletes will
be stored in two lists. A currentDeleteList and a newDeleteList.
To start, the memcache has no deletes to look at. You will, however, add any
deletes seen to the newDeleteList (in constant time, no significant extra
work). Since memcache (and storefiles) are sorted, newDeleteList is sorted.
When moving to the next storefile, you will merge newDeleteList with
currentDeleteList.
New way of handling deletes:
- currentDeleteList is a sorted ArrayList of deletes. there is a DeleteFamily
special-case that just stores a stamp (which implies it's existence).
- We do a sorted merge of KVs seen (k) with the currentDeleteList (m), for {{*k
+ m*}} total comparisons.
- Each seen new delete is a O(1) append to the newDeleteList.
- At the end of the storefile, you will then need to merge the delete lists,
for {{*m + n*}} total comparisons.
- This will give you {{*(k + m) + (m + n)*}} total comparisons for each
storefile.
The difference in total is: Old = {{*(k + n) * log(m)*}} vs New = {{*(k + n) +
2m*}}
The algorithmic improvements seem clear. The replacement of Trees/SortedSets
with (Sorted)Lists/ArrayLists will also be beneficial for GC/heap usage.
Additional complexity in things like merging delete lists are not a big deal if
we make thorough unit tests (which we have already done in that case).
> Rearchitecting of server, client, API, key format, etc for 0.20
> ---------------------------------------------------------------
>
> Key: HBASE-1249
> URL: https://issues.apache.org/jira/browse/HBASE-1249
> Project: Hadoop HBase
> Issue Type: Improvement
> Reporter: Jonathan Gray
> Priority: Blocker
> Fix For: 0.20.0
>
> Attachments: HBASE-1249-Example-v1.pdf, HBASE-1249-Example-v2.pdf,
> HBASE-1249-GetQuery-v1.pdf, HBASE-1249-GetQuery-v2.pdf,
> HBASE-1249-GetQuery-v3.pdf, HBASE-1249-GetQuery-v4.pdf,
> HBASE-1249-StoreFile-v1.pdf, HBASE-1249-StoreFile-v4.pdf
>
>
> To discuss all the new and potential issues coming out of the change in key
> format (HBASE-1234): zero-copy reads, client binary protocol, update of API
> (HBASE-880), server optimizations, etc...
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.