[ https://issues.apache.org/jira/browse/HBASE-2248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837281#action_12837281 ]

Yoram Kulbak commented on HBASE-2248:
-------------------------------------

Ryan:
The 4K quote is my mistake, based on a non-typical HBase usage (small memstore, 
large KVs).
Cloning is definitely bad. Its only benefit is that it allows the scan to be 
isolated from ongoing writes; HRegion#newScannerLock keeps writes from coming 
in while the scanner is created, so 0.20.3, unlike 0.20.2, does provide 
protection from 'partial puts', if that is what you meant by 'atomic 
protection'. There is also a test added to TestHRegion which verifies this.
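
Roughly the pattern, for clarity (a minimal sketch with made-up names; only 
the newScannerLock field name is real, and I'm assuming puts share the read 
lock while scanner creation takes the write lock):

    import java.util.concurrent.locks.ReentrantReadWriteLock;

    class RegionSketch {
        // Same idea as HRegion#newScannerLock; the rest is illustrative.
        private final ReentrantReadWriteLock newScannerLock =
            new ReentrantReadWriteLock();

        void put(Object edits) {
            newScannerLock.readLock().lock();  // puts may run concurrently
            try {
                // ... apply every KV of the put to the memstore ...
            } finally {
                newScannerLock.readLock().unlock();
            }
        }

        Object getScanner() {
            newScannerLock.writeLock().lock(); // no put is mid-flight here
            try {
                // ... clone/snapshot the memstore; a scan created here
                // cannot observe half of a multi-KV put ...
                return new Object();
            } finally {
                newScannerLock.writeLock().unlock();
            }
        }
    }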

I'm not sure that rollback is a viable option:
The 0.20.2 MemStore was using ConcurrentSkipListMap#tailMap for every row. 
tailMap incurs an O(log(n)) overhead when called on a ConcurrentSkipListMap, 
so in some cases the total overhead of scanning the whole memstore may come 
very close to the overhead of a complete sort of the KVs in the memstore.
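
To illustrate the cost (a sketch, with String keys standing in for KVs): 
re-seeking via tailMap once per key pays an O(log(n)) skip-list search each 
time, so visiting all n keys this way is O(n*log(n)), the same order as 
sorting the whole memstore:

    import java.util.concurrent.ConcurrentNavigableMap;
    import java.util.concurrent.ConcurrentSkipListMap;

    class TailMapCost {
        public static void main(String[] args) {
            ConcurrentSkipListMap<String, String> memstore =
                new ConcurrentSkipListMap<>();
            for (int i = 0; i < 100000; i++) {
                memstore.put(String.format("row-%06d", i), "v");
            }
            // 0.20.2-style traversal: a fresh tailMap per row, each one
            // an O(log(n)) search starting from the skip-list root.
            String key = memstore.firstKey();
            int visited = 0;
            while (key != null) {
                ConcurrentNavigableMap<String, String> tail =
                    memstore.tailMap(key, false);
                visited++;
                key = tail.isEmpty() ? null : tail.firstKey();
            }
            System.out.println(visited + " keys, one tailMap seek each");
        }
    }
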
The 0.20.2 MemStore and MemStoreScanner are also functionally incorrect, since:
- The scanner may observe a 'partial put' (not atomically protected)
- The scanner scans incorrectly when a snapshot exists

Since we observed a considerable 'single scan' performance improvement with 
the new MemStore implementation, could the performance hit stem from increased 
GC overhead under multiple concurrent scans?
Note that with 0.20.2 we observed MemStoreScanner running slower than 
StoreFileScanner.

Is it possible to avoid both 'partial puts' and cloning by 'timestamping' 
memstore records? E.g. each new KV in the memstore gets a 'memstore timestamp', 
and when a scanner is created it grabs the current timestamp so that it knows 
to ignore KVs which entered the store after its creation. It should probably 
use a counter rather than currentTimeMillis to ensure a clear cut.
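
Something like this, as a rough sketch (all names made up; a real version 
would need care with overlapping puts, which is why the put below is simply 
synchronized):

    import java.util.List;
    import java.util.concurrent.atomic.AtomicLong;

    class MemStoreTsSketch {
        static class KV {
            long memstoreTs;  // the proposed extra tag
            // ... row / family / qualifier / timestamp / value ...
        }

        private final AtomicLong nextTs = new AtomicLong(1);
        private final AtomicLong readPoint = new AtomicLong(0);

        // Tag all KVs of a put with one counter value and advance the
        // read point only once the whole put is in, so a scanner can
        // never observe half a put. Synchronized to keep the sketch
        // simple; concurrent puts need a smarter publish step.
        synchronized void put(List<KV> kvs) {
            long ts = nextTs.getAndIncrement();
            for (KV kv : kvs) {
                kv.memstoreTs = ts;
                // ... insert kv into the skip list, no clone needed ...
            }
            readPoint.set(ts);
        }

        // A scanner grabs the current counter at creation ...
        long openScanner() {
            return readPoint.get();
        }

        // ... and ignores KVs which entered the store after that.
        boolean isVisible(KV kv, long scannerReadPoint) {
            return kv.memstoreTs <= scannerReadPoint;
        }
    }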

------------
About those ~50 byte KVs, according to my calcs:
KeyLength: 4 bytes
ValueLength: 4 bytes
RowLength: 2 bytes
FamilyLength: 1 byte
TimeStamp: 8 bytes
Type: 1 byte

There are 20 bytes of overhead to start with.
Adding an average of 10 bytes each for the column family and qualifier brings 
it to 40 bytes.
This leaves 10 bytes (out of 50) for the row + value, meaning 80% of the 
storage is overhead.
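
Back-of-envelope, the same numbers in code (the 10-byte averages for family 
and qualifier are my assumption above):

    class KVOverhead {
        public static void main(String[] args) {
            int fixed = 4 /*KeyLength*/ + 4 /*ValueLength*/ + 2 /*RowLength*/
                      + 1 /*FamilyLength*/ + 8 /*TimeStamp*/ + 1 /*Type*/; // 20
            int familyPlusQualifier = 10 + 10;  // assumed 10-byte averages
            int overhead = fixed + familyPlusQualifier;               // 40
            System.out.println(overhead + " of 50 bytes = "
                + (100 * overhead / 50) + "% overhead");              // 80%
        }
    }
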
My point is that if ~50b KVs are the common use-case, some optimization needs 
to be made to the way things are stored.
Perhaps you meant 50b for row+value?



> New MemStoreScanner copies memstore for each scan, makes short scans slow
> -------------------------------------------------------------------------
>
>                 Key: HBASE-2248
>                 URL: https://issues.apache.org/jira/browse/HBASE-2248
>             Project: Hadoop HBase
>          Issue Type: Bug
>    Affects Versions: 0.20.3
>            Reporter: Dave Latham
>             Fix For: 0.20.4
>
>         Attachments: threads.txt
>
>
> HBASE-2037 introduced a new MemStoreScanner which triggers a 
> ConcurrentSkipListMap.buildFromSorted clone of the memstore and snapshot when 
> starting a scan.
> After upgrading to 0.20.3, we noticed a big slowdown in our use of short 
> scans.  Some of our data represents a time series.  The data is stored in time 
> series order, MR jobs often insert/update new data at the end of the series, 
> and queries usually have to pick up some or all of the series.  These are 
> often scans of 0-100 rows at a time.  To load one page, we'll observe about 
> 20 such scans being triggered concurrently, and they take 2 seconds to 
> complete.  Doing a thread dump of a region server shows many threads in 
> ConcurrentSkipListMap.buildFromSorted, which traverses the entire map of key 
> values to copy it.  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
