[ https://issues.apache.org/jira/browse/ACCUMULO-4164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15193889#comment-15193889 ]
ASF GitHub Bot commented on ACCUMULO-4164:
------------------------------------------

GitHub user keith-turner opened a pull request:

    https://github.com/apache/accumulo/pull/80

    ACCUMULO-4164 Avoid copying rfile index when in cache.  Avoid sync wh…

    …en deserializing index.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/keith-turner/accumulo rfile-no-index-copy

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/accumulo/pull/80.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #80

----
commit c86ec0b2627a7660d13a3a01a73573b12423fd9c
Author: Keith Turner <ktur...@apache.org>
Date:   2016-03-12T00:38:38Z

    ACCUMULO-4164 Avoid copying rfile index when in cache.  Avoid sync when deserializing index.

----

> Avoid copy of RFile Index blocks when in cache
> ----------------------------------------------
>
>                 Key: ACCUMULO-4164
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-4164
>             Project: Accumulo
>          Issue Type: Improvement
>    Affects Versions: 1.6.5, 1.7.1
>            Reporter: Keith Turner
>            Assignee: Keith Turner
>             Fix For: 1.6.6, 1.7.2, 1.8.0
>
>
> I have been doing performance experiments with RFile.  During the course of
> these experiments I noticed that RFile is not as fast as it should be in the
> case where index blocks are in cache and the RFile is not already open.  The
> reason is that the RFile code copies and deserializes the index data even
> though it's already in memory.
> I made the following changes to RFile in a branch.
>  * Avoid copy of index data when it's in cache
>  * Deserialize offsets lazily (instead of upfront) during binary search
>  * Stopped calling lots of synchronized methods during deserialization of
>    index info.  The existing code uses ByteArrayInputStream, which results in
>    lots of fine grained synchronization.  Switching to an input stream that
>    offers the same functionality w/o sync showed a measurable performance
>    difference.
> These changes improve performance in the following two situations:
>  * When an RFile's data is in cache, but it's not open on the tserver.
>  * For RFiles with multilevel indexes with index data in cache.  Currently
>    an open RFile only keeps the root node in memory.  Lower level index nodes
>    are always read from the cache or DFS.  The changes I made would always
>    avoid the copy and deserialization of lower level index nodes when in
>    cache.
> I have seen significant performance improvements testing with the two cases
> above.  My tests are currently based on a new API I am creating for RFile, so
> I cannot easily share them until I get that pushed.
> For the case where a tserver has all files frequently in use already open and
> those files have a single level index, these changes should not make a
> significant performance difference.
> These changes should result in less memory use for opening the same rfile
> multiple times for different scans (when data is in cache).  In this case all
> of the RFiles would share the same byte array holding the serialized index
> data.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
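
For readers who want a concrete picture of the three changes listed in the issue, here is a minimal sketch in Java. It is not the code from the pull request: the class LazyIndexSketch, the stream UnsyncByteArrayInputStream, and the serialized layout (a count, a table of offsets, then UTF-encoded keys) are all hypothetical and chosen only for illustration. It shows a search that works directly on a shared byte array without copying it, decodes an offset only when the binary search probes that slot, and reads entries through an input stream whose methods are not synchronized.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;

// Hypothetical illustration only; not Accumulo's RFile index implementation.
class LazyIndexSketch {

  /** Unsynchronized reader over a shared byte[]; never copies the backing array. */
  static final class UnsyncByteArrayInputStream extends InputStream {
    private final byte[] buf;
    private final int end;
    private int pos;

    UnsyncByteArrayInputStream(byte[] buf, int offset, int length) {
      this.buf = buf;
      this.pos = offset;
      this.end = offset + length;
    }

    @Override
    public int read() {
      return pos < end ? (buf[pos++] & 0xff) : -1;
    }

    @Override
    public int read(byte[] dst, int off, int len) {
      if (len == 0) return 0;
      if (pos >= end) return -1;
      int n = Math.min(len, end - pos);
      System.arraycopy(buf, pos, dst, off, n);
      pos += n;
      return n;
    }
  }

  /** Assumed layout: [int count][count int offsets][UTF-encoded keys]. Keys must be sorted. */
  static byte[] build(String[] sortedKeys) {
    try {
      ByteArrayOutputStream entries = new ByteArrayOutputStream();
      DataOutputStream entryOut = new DataOutputStream(entries);
      int[] offsets = new int[sortedKeys.length];
      for (int i = 0; i < sortedKeys.length; i++) {
        offsets[i] = entryOut.size(); // byte position of entry i within the entries region
        entryOut.writeUTF(sortedKeys[i]);
      }
      entryOut.flush();
      ByteArrayOutputStream all = new ByteArrayOutputStream();
      DataOutputStream out = new DataOutputStream(all);
      out.writeInt(sortedKeys.length);
      for (int offset : offsets) out.writeInt(offset);
      entries.writeTo(out);
      out.flush();
      return all.toByteArray();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Binary search on the serialized form; returns the entry position or -1. */
  static int search(byte[] index, String key) {
    ByteBuffer view = ByteBuffer.wrap(index); // wraps the array, does not copy it
    int count = view.getInt(0);
    int entriesStart = 4 + 4 * count;
    int lo = 0;
    int hi = count - 1;
    try {
      while (lo <= hi) {
        int mid = (lo + hi) >>> 1;
        // The offset is decoded only for the slot being probed.
        int start = entriesStart + view.getInt(4 + 4 * mid);
        DataInputStream in = new DataInputStream(
            new UnsyncByteArrayInputStream(index, start, index.length - start));
        int cmp = in.readUTF().compareTo(key);
        if (cmp == 0) return mid;
        if (cmp < 0) lo = mid + 1; else hi = mid - 1;
      }
      return -1;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    byte[] index = build(new String[] {"apple", "kiwi", "pear"});
    System.out.println(search(index, "kiwi"));
    System.out.println(search(index, "fig"));
  }
}
```

Because search only reads from the array, several readers could share one cached byte[] in this sketch, which is the memory saving the issue describes for opening the same rfile multiple times.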