[ https://issues.apache.org/jira/browse/ACCUMULO-4164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15202925#comment-15202925 ]
ASF GitHub Bot commented on ACCUMULO-4164:
------------------------------------------

Github user joshelser commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/80#discussion_r56753657

    --- Diff: core/src/main/java/org/apache/accumulo/core/file/blockfile/impl/SeekableByteArrayInputStream.java ---
    @@ -0,0 +1,132 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements. See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License. You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.accumulo.core.file.blockfile.impl;
    +
    +import java.io.IOException;
    +import java.io.InputStream;
    +
    +/**
    + * This class is like byte array input stream with two differences. It supports seeking and avoids synchronization.
    + */
    +public class SeekableByteArrayInputStream extends InputStream {
    +
    +  // make this volatile to ensure data set by one thread can be seen by another
    +  private volatile byte buffer[];
    +  private int cur;
    +  private int max;
    +
    +  @Override
    +  public int read() {
    +    if (cur < max) {
    +      return buffer[cur++] & 0xff;
    +    } else {
    +      return -1;
    +    }
    +  }
    +
    +  @Override
    +  public int read(byte b[], int offset, int length) {
    +    if (b == null) {
    +      throw new NullPointerException();
    +    }
    +
    +    if (length < 0 || offset < 0 || length > b.length - offset) {
    +      throw new IndexOutOfBoundsException();
    +    }
    +
    +    if (length == 0) {
    +      return 0;
    +    }
    +
    +    int avail = max - cur;
    +
    +    if (avail <= 0) {
    +      return -1;
    +    }
    +
    +    if (length > avail) {
    +      length = avail;
    +    }
    +
    +    System.arraycopy(buffer, cur, b, offset, length);
    +    cur += length;
    +    return length;
    +  }
    +
    +  @Override
    +  public long skip(long requestedSkip) {
    +    long actualSkip = max - cur;
    +    if (requestedSkip < actualSkip)
    +      if (requestedSkip < 0)
    +        actualSkip = 0;
    +      else
    +        actualSkip = requestedSkip;
    +
    +    cur += actualSkip;
    +    return actualSkip;
    +  }
    +
    +  @Override
    +  public int available() {
    +    return max - cur;
    +  }
    +
    +  @Override
    +  public boolean markSupported() {
    +    return false;
    +  }
    +
    +  @Override
    +  public void mark(int readAheadLimit) {
    +    throw new UnsupportedOperationException();
    +  }
    +
    +  @Override
    +  public void reset() {
    +    throw new UnsupportedOperationException();
    +  }
    +
    +  @Override
    +  public void close() throws IOException {}
    +
    +  public SeekableByteArrayInputStream(byte[] buf) {
    +    this.buffer = buf;
    +    this.cur = 0;
    +    this.max = buf.length;
    +  }
    +
    +  public SeekableByteArrayInputStream(byte[] buf, int maxOffset) {
    +    this.buffer = buf;
    --- End diff --

    `Objects.requireNonNull(buf)`


> Avoid copy of RFile Index blocks when in cache
> ----------------------------------------------
>
>                 Key: ACCUMULO-4164
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-4164
>             Project: Accumulo
>          Issue Type: Improvement
>    Affects Versions: 1.6.5, 1.7.1
>            Reporter: Keith Turner
>            Assignee: Keith Turner
>             Fix For: 1.6.6, 1.7.2, 1.8.0
>
>
> I have been doing performance experiments with RFile. During the course of these experiments I noticed that RFile is not as fast as it should be in the case where index blocks are in cache and the RFile is not already open. The reason is that the RFile code copies and deserializes the index data even though it is already in memory.
> I made the following changes to RFile in a branch:
>  * Avoid copying index data when it is in cache.
>  * Deserialize offsets lazily (instead of upfront) during binary search.
>  * Stop calling lots of synchronized methods during deserialization of index info. The existing code uses ByteArrayInputStream, which results in lots of fine-grained synchronization. Switching to an input stream that offers the same functionality without synchronization showed a measurable performance difference.
> These changes improve performance in the following two situations:
>  * When an RFile's data is in cache, but the file is not open on the tserver.
>  * For RFiles with multilevel indexes whose index data is in cache. Currently an open RFile only keeps the root node in memory; lower-level index nodes are always read from the cache or DFS. The changes I made always avoid the copy and deserialization of lower-level index nodes when they are in cache.
> I have seen significant performance improvements testing the two cases above. My tests are currently based on a new API I am creating for RFile, so I can not easily share them until I get that pushed.
> For the case where a tserver already has all frequently used files open and those files have a single-level index, these changes should not make a significant performance difference.
> These changes should also result in less memory use when the same RFile is opened multiple times for different scans (when the data is in cache), since all of the open RFiles would share the same byte array holding the serialized index data.
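The description above hinges on two ideas: many streams can share one cached byte array without copying it, and reads stay lock-free because nothing is synchronized. The sketch below illustrates that usage pattern with a stripped-down stand-in for the class in the diff. The `seek(int)` method and the class/field names here are assumptions for illustration only; the quoted diff is truncated before the real class's seeking logic, so this is not the actual Accumulo implementation.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

// Minimal unsynchronized byte-array stream in the spirit of the diff above.
// Unlike java.io.ByteArrayInputStream, no method here is synchronized.
class UnsyncByteArrayInputStream extends InputStream {
  private final byte[] buffer; // shared with the cache; never copied
  private int cur;
  private final int max;

  UnsyncByteArrayInputStream(byte[] buf) {
    this.buffer = buf;
    this.max = buf.length;
  }

  // Hypothetical seek; the real method is outside the quoted portion of the diff.
  void seek(int position) {
    if (position < 0 || position > max)
      throw new IllegalArgumentException("bad position: " + position);
    cur = position;
  }

  @Override
  public int read() {
    return cur < max ? buffer[cur++] & 0xff : -1;
  }

  @Override
  public int read(byte[] b, int offset, int length) {
    int avail = max - cur;
    if (avail <= 0)
      return -1;
    length = Math.min(length, avail);
    System.arraycopy(buffer, cur, b, offset, length);
    cur += length;
    return length;
  }
}

public class IndexReadSketch {
  public static void main(String[] args) throws IOException {
    // Pretend this array is a cached, serialized index block. Each scan can
    // wrap its own stream around it without copying bytes or taking locks.
    byte[] cached = {0, 0, 0, 7, 0, 0, 0, 42};

    UnsyncByteArrayInputStream in = new UnsyncByteArrayInputStream(cached);
    DataInputStream data = new DataInputStream(in);
    System.out.println(data.readInt()); // first entry: 7

    in.seek(4); // jump directly to the second entry, as a lazy binary search would
    System.out.println(data.readInt()); // second entry: 42
  }
}
```

Wrapping the stream in `DataInputStream` mirrors how serialized index entries would be decoded on demand: only the entries a binary search actually touches get deserialized, while the backing array stays shared.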
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)