why are we still reusing RFile.Readers?
---------------------------------------
Key: ACCUMULO-543
URL: https://issues.apache.org/jira/browse/ACCUMULO-543
Project: Accumulo
Issue Type: Improvement
Components: tserver
Reporter: Adam Fuchs
Assignee: Adam Fuchs
The TabletServer includes a bunch of code to manage reuse of RFile.Reader
objects across scans. There are a few reasons for doing this that certainly
made sense in the past, including:
* Reducing index reads. If we have already found a location in the RFile index,
and we can easily determine that this is the location that we need for the next
read, then we can skip doing a binary search in the RFile index. Index caching
and hierarchical indexes may improve this aspect to where we don't need this
optimization.
* Reducing re-reading of the same data block. Re-reading a block used to mean
that HDFS would create a new socket rather than reusing the current socket.
Seeking to the same spot HDFS now reuses connections, so this is not as big a
deal.
* Reducing namenode operations. Re-opening an RFile requires knowledge of the
RFile's size. Even if we have the block locations cached, the file size still
includes an HDFS namenode operation. This is something we should probably be
caching within the TabletServer, since our RFiles are immutable.
There are also a few problems with reusing RFiles, including:
* Increased TabletServer complexity. We need to manage the resource pool among
many scans. Not reusing readers along with ACCUMULO-416 may make it possible to
eliminate coordination of RFile.Readers at the TabletServer level altogether.
* Risk of information leakage. RFile.Readers hold state, and we don't want that
state passing between scan sessions. Sharing RFile.Readers makes it more likely
that a bug in this class exists.
* Risk of interference. RFile.Readers are not thread safe now, but even if they
were we wouldn't want separate scan sessions (or maybe even multiple threads in
the same scan session) affecting each other through the RFile.Reader. Sharing
RFile.Readers also makes it more likely that a bug in this class exists.
The relevant question is what would be the effect on creating a new
RFile.Reader anytime we want to read an RFile instead of grabbing one from the
pool? ACCUMULO-416 is certainly a prerequisite for this ticket, and we know we
need to cache the RFile length. Are there any other prerequisites?
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira