[
https://issues.apache.org/jira/browse/ACCUMULO-543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Christopher Tubbs resolved ACCUMULO-543.
----------------------------------------
Resolution: Abandoned
Closing this stale issue. If this is still a problem, please create a new issue
or PR at https://github.com/apache/accumulo
> why are we still reusing RFile.Readers?
> ---------------------------------------
>
> Key: ACCUMULO-543
> URL: https://issues.apache.org/jira/browse/ACCUMULO-543
> Project: Accumulo
> Issue Type: Improvement
> Components: tserver
> Reporter: Adam Fuchs
> Assignee: Adam Fuchs
> Priority: Major
>
> The TabletServer includes a bunch of code to manage reuse of RFile.Reader
> objects across scans. There are a few reasons for doing this that certainly
> made sense in the past, including:
> * Reducing index reads. If we have already found a location in the RFile
> index, and we can easily determine that this is the location that we need for
> the next read, then we can skip doing a binary search in the RFile index.
> Index caching and hierarchical indexes may improve this aspect to where we
> don't need this optimization.
> * Reducing re-reading of the same data block. Re-reading a block used to mean
> that HDFS would create a new socket rather than reusing the current socket.
> Seeking to the same spot HDFS now reuses connections, so this is not as big a
> deal.
> * Reducing namenode operations. Re-opening an RFile requires knowledge of the
> RFile's size. Even if we have the block locations cached, the file size still
> includes an HDFS namenode operation. This is something we should probably be
> caching within the TabletServer, since our RFiles are immutable.
> There are also a few problems with reusing RFiles, including:
> * Increased TabletServer complexity. We need to manage the resource pool
> among many scans. Not reusing readers along with ACCUMULO-416 may make it
> possible to eliminate coordination of RFile.Readers at the TabletServer level
> altogether.
> * Risk of information leakage. RFile.Readers hold state, and we don't want
> that state passing between scan sessions. Sharing RFile.Readers makes it more
> likely that a bug in this class exists.
> * Risk of interference. RFile.Readers are not thread safe now, but even if
> they were we wouldn't want separate scan sessions (or maybe even multiple
> threads in the same scan session) affecting each other through the
> RFile.Reader. Sharing RFile.Readers also makes it more likely that a bug in
> this class exists.
> The relevant question is what would be the effect on creating a new
> RFile.Reader anytime we want to read an RFile instead of grabbing one from
> the pool? ACCUMULO-416 is certainly a prerequisite for this ticket, and we
> know we need to cache the RFile length. Are there any other prerequisites?
--
This message was sent by Atlassian Jira
(v8.20.10#820010)