[
https://issues.apache.org/jira/browse/ACCUMULO-4164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15193889#comment-15193889
]
ASF GitHub Bot commented on ACCUMULO-4164:
------------------------------------------
GitHub user keith-turner opened a pull request:
https://github.com/apache/accumulo/pull/80
ACCUMULO-4164 Avoid copying rfile index when in cache. Avoid sync wh…
…en deserializing index.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/keith-turner/accumulo rfile-no-index-copy
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/accumulo/pull/80.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #80
----
commit c86ec0b2627a7660d13a3a01a73573b12423fd9c
Author: Keith Turner <[email protected]>
Date: 2016-03-12T00:38:38Z
ACCUMULO-4164 Avoid copying rfile index when in cache. Avoid sync when
deserializing index.
----
> Avoid copy of RFile Index blocks when in cache
> ----------------------------------------------
>
> Key: ACCUMULO-4164
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4164
> Project: Accumulo
> Issue Type: Improvement
> Affects Versions: 1.6.5, 1.7.1
> Reporter: Keith Turner
> Assignee: Keith Turner
> Fix For: 1.6.6, 1.7.2, 1.8.0
>
>
> I have been doing performance experiments with RFile. During the course of
> these experiments I noticed that RFile is not as fast as it should be in the
> case where index blocks are in cache and the RFile is not already open. The
> reason is that the RFile code copies and deserializes the index data even
> though it's already in memory.
> I made the following changes to RFile in a branch.
> * Avoid copy of index data when it's in cache
> * Deserialize offsets lazily (instead of upfront) during binary search
> * Stopped calling lots of synchronized methods during deserialization of
> index info. The existing code uses ByteArrayInputStream, which results in lots
> of fine grained synchronization. Switching to an input stream that offers the
> same functionality w/o sync showed a measurable performance difference.
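As an illustration of the last bullet: java.io.ByteArrayInputStream declares read(), read(byte[], int, int), skip() and available() as synchronized, so every small read pays for a monitor acquire. The sketch below shows one way an unsynchronized equivalent could look. The class name UnsyncByteArrayInputStream is hypothetical and this is not the code from the pull request; it is only safe when a single thread owns the stream.

```java
import java.io.InputStream;

// Hypothetical single-threaded stand-in for java.io.ByteArrayInputStream.
// It reads directly from the caller's array (no copy) and takes no locks.
class UnsyncByteArrayInputStream extends InputStream {
  private final byte[] buf; // backing array, shared with the caller
  private int pos;          // index of the next byte to read
  private final int end;    // one past the last readable byte

  UnsyncByteArrayInputStream(byte[] buf, int offset, int length) {
    this.buf = buf;
    this.pos = offset;
    this.end = Math.min(buf.length, offset + length);
  }

  @Override
  public int read() {
    // Mask to return the byte as an unsigned value in 0..255, or -1 at end.
    return pos < end ? (buf[pos++] & 0xff) : -1;
  }

  @Override
  public int read(byte[] dest, int off, int len) {
    if (len == 0) {
      return 0;
    }
    if (pos >= end) {
      return -1;
    }
    int n = Math.min(len, end - pos);
    System.arraycopy(buf, pos, dest, off, n);
    pos += n;
    return n;
  }

  @Override
  public long skip(long n) {
    long k = Math.max(0, Math.min(n, end - pos));
    pos += (int) k;
    return k;
  }

  @Override
  public int available() {
    return end - pos;
  }
}
```

Such a stream can be wrapped in a java.io.DataInputStream in the usual way, so existing deserialization code that reads ints and byte arrays would not need to change.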
> These changes lead to better performance in the following two situations:
> * When an RFile's data is in cache, but it's not open on the tserver.
> * For RFiles with multilevel indexes with index data in cache. Currently
> an open RFile only keeps the root node in memory. Lower level index nodes
> are always read from the cache or DFS. The changes I made would always
> avoid the copy and deserialization of lower level index nodes when in cache.
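To make the idea of skipping the copy and deserializing offsets lazily more concrete, here is a minimal sketch of a binary search that runs directly over a cached, serialized index node. The on-disk layout used here (an entry count, a table of int offsets, then length-prefixed keys) is an assumption made up for the example and is not RFile's actual format; the class and method names are hypothetical as well. Only the offsets the search actually probes are decoded, so a lookup touches O(log n) entries instead of materializing the whole table.

```java
import java.nio.ByteBuffer;

// Hypothetical layout, for illustration only:
//   [int count][count x int entryOffset][entries: int keyLen, key bytes]
// Keys are sorted as unsigned bytes; offsets are absolute within the block.
class LazyIndexSearch {
  private final ByteBuffer block; // wraps the cached array, no copy
  private final int count;

  LazyIndexSearch(byte[] cachedBlock) {
    this.block = ByteBuffer.wrap(cachedBlock);
    this.count = block.getInt(0);
  }

  // Decode a single offset on demand instead of building an int[] upfront.
  private int entryOffset(int i) {
    return block.getInt(4 + 4 * i);
  }

  // Compare the serialized key of entry i with the search key, in place.
  private int compareKeyAt(int i, byte[] key) {
    int off = entryOffset(i);
    int len = block.getInt(off);
    int start = off + 4;
    int n = Math.min(len, key.length);
    for (int j = 0; j < n; j++) {
      int a = block.get(start + j) & 0xff;
      int b = key[j] & 0xff;
      if (a != b) {
        return a - b;
      }
    }
    return len - key.length;
  }

  // Index of the first entry whose key is >= the search key, or count if none.
  int lowerBound(byte[] key) {
    int lo = 0;
    int hi = count;
    while (lo < hi) {
      int mid = (lo + hi) >>> 1;
      if (compareKeyAt(mid, key) < 0) {
        lo = mid + 1;
      } else {
        hi = mid;
      }
    }
    return lo;
  }

  // Helper that builds a block in the assumed layout from already sorted keys.
  static byte[] serialize(byte[][] sortedKeys) {
    int size = 4 + 4 * sortedKeys.length;
    for (byte[] k : sortedKeys) {
      size += 4 + k.length;
    }
    ByteBuffer out = ByteBuffer.allocate(size);
    out.putInt(sortedKeys.length);
    int off = 4 + 4 * sortedKeys.length;
    for (byte[] k : sortedKeys) {
      out.putInt(off);
      off += 4 + k.length;
    }
    for (byte[] k : sortedKeys) {
      out.putInt(k.length);
      out.put(k);
    }
    return out.array();
  }
}
```

The absolute getInt(int) and get(int) calls never move the buffer's position, which keeps the search free of per-lookup state.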
> I have seen significant performance improvements testing with the two cases
> above. My tests are currently based on a new API I am creating for RFile, so
> I cannot easily share them until I get that pushed.
> For the case where a tserver has all files frequently in use already open and
> those files have a single level index, these changes should not make a
> significant performance difference.
> These changes should result in less memory use when opening the same rfile
> multiple times for different scans (when data is in cache). In this case all
> of the RFiles would share the same byte array holding the serialized index
> data.
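The memory saving described in the last paragraph can be pictured with a small sketch in which every reader gets its own read-only view over one cached byte array. The class SharedIndexCache, its methods, and the string block ids are hypothetical and do not correspond to Accumulo's block cache API; the sketch only shows that views have independent positions while sharing the same bytes.

```java
import java.nio.ByteBuffer;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical cache: one byte[] per serialized index block, shared by all readers.
class SharedIndexCache {
  private final Map<String, byte[]> blocks = new ConcurrentHashMap<>();

  void put(String blockId, byte[] serializedIndex) {
    blocks.put(blockId, serializedIndex);
  }

  // Returns a read-only view over the cached array, or null if it is not cached.
  // Each view has its own position and limit, but no bytes are copied.
  ByteBuffer openView(String blockId) {
    byte[] data = blocks.get(blockId);
    return data == null ? null : ByteBuffer.wrap(data).asReadOnlyBuffer();
  }
}
```

With this shape, opening the same file for N scans costs N small buffer objects rather than N copies of the index.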
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)