[ 
https://issues.apache.org/jira/browse/HBASE-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12704271#action_12704271
 ] 

Jonathan Gray commented on HBASE-1249:
--------------------------------------

(this was written together with erik holstad)

@Stack

We don't have any profiling stats or end-to-end numbers at the moment, but we 
do have are some early naive timing test that compare the 2 implementations and 
already for very few inserts like 10 to 15 entries with no deletes in 2 
storefiles + memcache I saw around 2x improvment. These test were only done on 
the server and I have not tested for bigger tables and from the client, since I 
only got the rpc to work the other day and we've done some reworks since then, 
but it looks promising.

When we looked at the get code and how to make it faster we started to look 
what structures were used and how many compares you have to do to get your 
result. Because to me the way to speed things up is to cut down the total 
number of compares that you need to do to get your result. Of course the total 
fetch time is made up by a number of causes, but on the server, outside of 
HFile/HDFS, where I think most time is spent is comparing entries to see if 
they should be included in the result or not (this decision made by looking at 
the query to see if it was asked for, comparing for deletes, checking for ttl, 
maxVersions requested, timeRange, etc).


So what I saw when looking at the old code is that for every new entry from the 
storefile you compare it to all entries in the get set and then for every match 
between those you have to check to see it is has been deleted.

Let {{*k*}} be the number KeyValues we will look at.  Let {{*l*}} be the number 
of columns asked for by the client.

Old Way: We compare every entry from the storefile (k) with every entry from 
the getset (l) which will give you {{*k * l*}} compares in total.

If the getset is sorted (storefile/memcache always is), then you are just doing 
a sorted merge between two sorted lists.

New Way: We do a sorted merge between the entries in the storefile (k) with the 
entries in the sorted getset (l) which will give you {{*k + l*}} compares in 
total.

The difference between {{*k * l*}} and {{*k + l*}} can be significant if you 
are matching more than one column.

Notes:
- current plan is to build the sortedset behind the scenes on the client in the 
Get
- this insertion sort will have a negligible impact on clients and no 
additional work will be required server-side
- we will send the sortedset over the wire as a list.  in that case, it will 
not have to be re-sorted.
(this last optimization is important.  right now we do a LOT of serializing a 
sorted tree, sending it over the wire, and then deserializing it.  the 
deserializing of trees actually rebuilds the entire tree, thus reperforming the 
sort.  using lists (which are sorted) will give us time and memory savings.)


On the delete part, we have two things to deal with.  Checking current KVs 
against the current DeleteSet, and adding newly seen deletes into the DeleteSet.

Let {{*k*}} be the number of KeyValues we will look at.  Let {{*m*}} be the 
number of deletes we have already seen.  Let {{*n*}} be the number of new 
deletes we will see.

Old way of checking current KVs agains the current DeleteSet:
- DeleteSet is a sortedSet, so it has {{*log(m)*}} 
inserts/gets/deletes/contains operations.
- For each KV seen (k), we have a {{*log(m)*}} operation to see if it is 
deleted.
- This will give you {{*k * log(m)*}} total comparisons.

Old way of inserting newly seen delete KVs into current DeleteSet:
- When we see a delete, we insert it to the deleteset in {{*log(m)*}}.  This 
has downsides.
- We're adding deletes which we know will not be relevant in this storefile 
(increasing m increases subsequent delete checks so we have log(m + n) 
eventually).
- We're doing work which we may not potentially ever have to do (merging newly 
seen delete with previous set).  If we are going to early-out of this storefile 
for some reason, there would never have been a reason to pay anything but 
constant-time cost for deletes related to older storefiles which will not end 
up being opened.
- Trees are bad!  Their allocation of lots of small nodes is a disaster for GC. 
 We should use lists everywhere we can.
- This gives you {{*n * log(m)*}} total comparisons.

The new approach is similiar to how we now handle the get list.  Deletes will 
be stored in two lists.  A currentDeleteList and a newDeleteList.

To start, the memcache has no deletes to look at.  You will, however, add any 
deletes seen to the newDeleteList (in constant time, no significant extra 
work).  Since memcache (and storefiles) are sorted, newDeleteList is sorted.

When moving to the next storefile, you will merge newDeleteList with 
currentDeleteList.

New way of handling deletes:
- currentDeleteList is a sorted ArrayList of deletes.  there is a DeleteFamily 
special-case that just stores a stamp (which implies it's existence).
- We do a sorted merge of KVs seen (k) with the currentDeleteList (m), for {{*k 
+ m*}} total comparisons.
- Each seen new delete is a O(1) append to the newDeleteList.
- At the end of the storefile, you will then need to merge the delete lists, 
for {{*m + n*}} total comparisons.
- This will give you {{*(k + m) + (m + n)*}} total comparisons for each 
storefile.

The difference in total is: Old = {{*(k + n) * log(m)*}} vs New = {{*(k + n) + 
2m*}}


The algorithmic improvements seem clear.  The replacement of Trees/SortedSets 
with (Sorted)Lists/ArrayLists will also be beneficial for GC/heap usage.

Additional complexity in things like merging delete lists are not a big deal if 
we make thorough unit tests (which we have already done in that case).

> Rearchitecting of server, client, API, key format, etc for 0.20
> ---------------------------------------------------------------
>
>                 Key: HBASE-1249
>                 URL: https://issues.apache.org/jira/browse/HBASE-1249
>             Project: Hadoop HBase
>          Issue Type: Improvement
>            Reporter: Jonathan Gray
>            Priority: Blocker
>             Fix For: 0.20.0
>
>         Attachments: HBASE-1249-Example-v1.pdf, HBASE-1249-Example-v2.pdf, 
> HBASE-1249-GetQuery-v1.pdf, HBASE-1249-GetQuery-v2.pdf, 
> HBASE-1249-GetQuery-v3.pdf, HBASE-1249-GetQuery-v4.pdf, 
> HBASE-1249-StoreFile-v1.pdf, HBASE-1249-StoreFile-v4.pdf
>
>
> To discuss all the new and potential issues coming out of the change in key 
> format (HBASE-1234): zero-copy reads, client binary protocol, update of API 
> (HBASE-880), server optimizations, etc...

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to