why are we still reusing RFile.Readers?
---------------------------------------

                 Key: ACCUMULO-543
                 URL: https://issues.apache.org/jira/browse/ACCUMULO-543
             Project: Accumulo
          Issue Type: Improvement
          Components: tserver
            Reporter: Adam Fuchs
            Assignee: Adam Fuchs


The TabletServer includes a bunch of code to manage reuse of RFile.Reader 
objects across scans. There are a few reasons for doing this that certainly 
made sense in the past, including:
* Reducing index reads. If we have already found a location in the RFile index, 
and we can easily determine that this is the location that we need for the next 
read, then we can skip doing a binary search in the RFile index. Index caching 
and hierarchical indexes may improve this aspect to where we don't need this 
optimization.
* Reducing re-reading of the same data block. Re-reading a block used to mean 
that HDFS would create a new socket rather than reusing the current socket. 
HDFS now reuses connections when seeking to the same spot, so this is not as 
big a deal.
* Reducing namenode operations. Re-opening an RFile requires knowledge of the 
RFile's size. Even if we have the block locations cached, looking up the file 
size still requires an HDFS namenode operation. This is something we should 
probably be caching within the TabletServer, since our RFiles are immutable.
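
As a rough illustration of the file-size caching idea in the last bullet, here 
is a minimal sketch. The class name FileLengthCache, the forget method, and the 
lookup function are all hypothetical and not existing Accumulo code; the lookup 
stands in for whatever call currently asks the namenode for the file length. 
Because RFiles are immutable, a cached length never goes stale while the file 
exists, so the only invalidation needed is when a file is deleted.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.ToLongFunction;

/** Hypothetical cache of RFile lengths, keyed by file path. */
final class FileLengthCache {
  private final ConcurrentMap<String, Long> lengths = new ConcurrentHashMap<>();
  // Assumed to wrap the namenode operation that returns a file's length.
  private final ToLongFunction<String> lookup;

  FileLengthCache(ToLongFunction<String> lookup) {
    this.lookup = lookup;
  }

  /** Returns the cached length, calling the lookup only on the first request. */
  long lengthOf(String path) {
    return lengths.computeIfAbsent(path, p -> lookup.applyAsLong(p));
  }

  /** Drops the entry, e.g. after the file is removed by compaction. */
  void forget(String path) {
    lengths.remove(path);
  }
}
```

With something like this in place, re-opening an RFile whose length is already 
cached would not need to contact the namenode for the size.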

There are also a few problems with reusing RFile.Readers, including:
* Increased TabletServer complexity. We need to manage the resource pool among 
many scans. Not reusing readers along with ACCUMULO-416 may make it possible to 
eliminate coordination of RFile.Readers at the TabletServer level altogether.
* Risk of information leakage. RFile.Readers hold state, and we don't want that 
state passing between scan sessions. Sharing RFile.Readers makes it more likely 
that a bug of this kind exists.
* Risk of interference. RFile.Readers are not thread safe now, but even if they 
were we wouldn't want separate scan sessions (or maybe even multiple threads in 
the same scan session) affecting each other through the RFile.Reader. Sharing 
RFile.Readers also makes it more likely that a bug of this kind exists.

The relevant question is: what would be the effect of creating a new 
RFile.Reader anytime we want to read an RFile, instead of grabbing one from the 
pool? ACCUMULO-416 is certainly a prerequisite for this ticket, and we know we 
need to cache the RFile length. Are there any other prerequisites?
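
To make the proposal concrete, here is a minimal sketch of the per-scan 
pattern. SketchReader is a hypothetical stand-in for RFile.Reader (it is not 
the real class and models only a seek position over an in-memory list), and 
PerScanReaders.scan is an invented helper. The point is only that each scan 
constructs its own reader and closes it when finished, so no reader state can 
carry over between scans and no pool needs coordinating.

```java
import java.util.ArrayList;
import java.util.List;

final class PerScanReaders {
  /** Hypothetical stand-in for RFile.Reader: its seek position is mutable state. */
  static final class SketchReader implements AutoCloseable {
    private final List<String> entries;
    private int position = 0;

    SketchReader(List<String> entries) {
      this.entries = entries;
    }

    void seek(int index) {
      position = index;
    }

    boolean hasNext() {
      return position < entries.size();
    }

    String next() {
      return entries.get(position++);
    }

    @Override
    public void close() {
      // A real reader would release its file handle and buffers here.
    }
  }

  /** Opens a fresh reader for this scan only; nothing is borrowed from a pool. */
  static List<String> scan(List<String> file, int start) {
    List<String> out = new ArrayList<>();
    try (SketchReader reader = new SketchReader(file)) {
      reader.seek(start);
      while (reader.hasNext()) {
        out.add(reader.next());
      }
    }
    return out;
  }
}
```

The cost of this pattern is whatever it takes to construct a reader each time, 
which is the effect this ticket asks about.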
