[ 
https://issues.apache.org/jira/browse/ACCUMULO-4164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15193889#comment-15193889
 ] 

ASF GitHub Bot commented on ACCUMULO-4164:
------------------------------------------

GitHub user keith-turner opened a pull request:

    https://github.com/apache/accumulo/pull/80

    ACCUMULO-4164 Avoid copying rfile index when in cache.  Avoid sync wh…

    …en deserializing index.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/keith-turner/accumulo rfile-no-index-copy

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/accumulo/pull/80.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #80
    
----
commit c86ec0b2627a7660d13a3a01a73573b12423fd9c
Author: Keith Turner <ktur...@apache.org>
Date:   2016-03-12T00:38:38Z

    ACCUMULO-4164 Avoid copying rfile index when in cache.  Avoid sync when deserializing index.

----


> Avoid copy of RFile Index blocks when in cache
> ----------------------------------------------
>
>                 Key: ACCUMULO-4164
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-4164
>             Project: Accumulo
>          Issue Type: Improvement
>    Affects Versions: 1.6.5, 1.7.1
>            Reporter: Keith Turner
>            Assignee: Keith Turner
>             Fix For: 1.6.6, 1.7.2, 1.8.0
>
>
> I have been doing performance experiments with RFile.  During the course of
> these experiments I noticed that RFile is not as fast as it should be in the
> case where index blocks are in cache and the RFile is not already open.  The
> reason is that the RFile code copies and deserializes the index data even
> though it's already in memory.
>
> I made the following changes to RFile in a branch.
>  * Avoid copy of index data when it's in cache
>  * Deserialize offsets lazily (instead of upfront) during binary search
>  * Stopped calling lots of synchronized methods during deserialization of
> index info.  The existing code uses ByteArrayInputStream, which results in
> lots of fine-grained synchronization.  Switching to an input stream that
> offers the same functionality w/o sync showed a measurable performance
> difference.
>
> These changes lead to performance improvements in the following two
> situations:
>  * When an RFile's data is in cache, but it's not open on the tserver.
>  * For RFiles with multilevel indexes with index data in cache.  Currently
> an open RFile only keeps the root node in memory.  Lower level index nodes
> are always read from the cache or DFS.  The changes I made would always
> avoid the copy and deserialization of lower level index nodes when in cache.
>
> I have seen significant performance improvements testing with the two cases
> above.  My tests are currently based on a new API I am creating for RFile, so
> I cannot easily share them until I get that pushed.
>
> For the case where a tserver has all files frequently in use already open and 
> those files have a single level index, these changes should not make a 
> significant performance difference.
>
> These changes should result in less memory use for opening the same RFile
> multiple times for different scans (when data is in cache).  In this case all
> of the RFiles would share the same byte array holding the serialized index
> data.
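
The ideas in the description above (no copy of the cached bytes, lazy
deserialization of offsets during binary search, and reads that avoid
synchronized stream methods) can be illustrated with a minimal sketch.
This is not the RFile index format and not the code in the pull request.
The class name LazyIndexSketch, the block layout (entry count, then an
offset table, then length-prefixed keys), and the serialize() helper are
all assumptions made up for illustration.  It relies on absolute
ByteBuffer reads, which do not go through the synchronized read methods
of ByteArrayInputStream.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch; layout and names are assumptions, not Accumulo's. */
public class LazyIndexSketch {
    // Wraps the cached byte array directly; ByteBuffer.wrap does not copy.
    private final ByteBuffer buf;

    public LazyIndexSketch(byte[] cachedBlock) {
        this.buf = ByteBuffer.wrap(cachedBlock);
    }

    /** Number of entries, read from the first four bytes of the block. */
    public int size() {
        return buf.getInt(0);
    }

    // Decode the offset of entry i only when the search probes it.
    private int offsetOf(int i) {
        return buf.getInt(4 + 4 * i);
    }

    // Compare entry i with key as unsigned bytes, in place, without copying.
    private int compareEntry(int i, byte[] key) {
        int pos = offsetOf(i);
        int len = buf.getInt(pos);
        int start = pos + 4;
        int n = Math.min(len, key.length);
        for (int k = 0; k < n; k++) {
            int a = buf.get(start + k) & 0xff;
            int b = key[k] & 0xff;
            if (a != b) {
                return a - b;
            }
        }
        return len - key.length;
    }

    /** Index of the first entry >= key, or size() if there is none. */
    public int search(byte[] key) {
        int lo = 0;
        int hi = size();
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (compareEntry(mid, key) < 0) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }

    /** Builds a block in the assumed layout from already sorted keys. */
    public static byte[] serialize(String... sortedKeys) {
        byte[][] encoded = new byte[sortedKeys.length][];
        int total = 4 + 4 * sortedKeys.length;
        for (int i = 0; i < sortedKeys.length; i++) {
            encoded[i] = sortedKeys[i].getBytes(StandardCharsets.UTF_8);
            total += 4 + encoded[i].length;
        }
        ByteBuffer out = ByteBuffer.allocate(total);
        out.putInt(sortedKeys.length);
        int pos = 4 + 4 * sortedKeys.length;
        for (byte[] e : encoded) {
            out.putInt(pos);          // offset table entry
            pos += 4 + e.length;
        }
        for (byte[] e : encoded) {
            out.putInt(e.length);     // length-prefixed key bytes
            out.put(e);
        }
        return out.array();
    }
}
```

Two LazyIndexSketch instances constructed over the same byte array share
it rather than each holding a deserialized copy, which is the memory
effect the last paragraph of the description refers to.  A binary search
over n entries decodes only the offsets it probes (about log2(n) of
them) instead of all n upfront.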



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
