[jira] [Resolved] (ACCUMULO-543) why are we still reusing RFile.Readers?

Christopher Tubbs (Jira) Wed, 02 Nov 2022 12:07:07 -0700


     [ 
https://issues.apache.org/jira/browse/ACCUMULO-543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Christopher Tubbs resolved ACCUMULO-543.
----------------------------------------
    Resolution: Abandoned

Closing this stale issue. If this is still a problem, please create a new issue 
or PR at https://github.com/apache/accumulo

> why are we still reusing RFile.Readers?
> ---------------------------------------
>
>                 Key: ACCUMULO-543
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-543
>             Project: Accumulo
>          Issue Type: Improvement
>          Components: tserver
>            Reporter: Adam Fuchs
>            Assignee: Adam Fuchs
>            Priority: Major
>
> The TabletServer includes a bunch of code to manage reuse of RFile.Reader 
> objects across scans. There are a few reasons for doing this that certainly 
> made sense in the past, including:
> * Reducing index reads. If we have already found a location in the RFile 
> index, and we can easily determine that this is the location that we need for 
> the next read, then we can skip doing a binary search in the RFile index. 
> Index caching and hierarchical indexes may improve this aspect to where we 
> don't need this optimization.
> * Reducing re-reading of the same data block. Re-reading a block used to mean 
> that HDFS would create a new socket rather than reusing the current socket. 
> HDFS now reuses connections when seeking to the same spot, so this is not as 
> big a deal.
> * Reducing namenode operations. Re-opening an RFile requires knowledge of the 
> RFile's size. Even if we have the block locations cached, looking up the file 
> size still requires an HDFS namenode operation. This is something we should 
> probably be caching within the TabletServer, since our RFiles are immutable.
> There are also a few problems with reusing RFile.Readers, including:
> * Increased TabletServer complexity. We need to manage the resource pool 
> among many scans. Not reusing readers along with ACCUMULO-416 may make it 
> possible to eliminate coordination of RFile.Readers at the TabletServer level 
> altogether.
> * Risk of information leakage. RFile.Readers hold state, and we don't want 
> that state passing between scan sessions. Sharing RFile.Readers makes it more 
> likely that a bug of this kind exists.
> * Risk of interference. RFile.Readers are not thread safe now, but even if 
> they were, we wouldn't want separate scan sessions (or maybe even multiple 
> threads in the same scan session) affecting each other through the 
> RFile.Reader. Sharing RFile.Readers also makes it more likely that a bug of 
> this kind exists.
> The relevant question is: what would be the effect of creating a new 
> RFile.Reader any time we want to read an RFile, instead of grabbing one from 
> the pool? ACCUMULO-416 is certainly a prerequisite for this ticket, and we 
> know we need to cache the RFile length. Are there any other prerequisites?
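
For illustration only, the file-length caching mentioned in the description could look roughly like the sketch below. It is not Accumulo code: the class name FileLengthCache, its methods, and the pluggable lookup function are all hypothetical, and it assumes file lengths never change once written, as the description says of RFiles.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.ToLongFunction;

/**
 * Hypothetical sketch: remembers the length of immutable files so that
 * re-opening a file does not need another length lookup.
 */
final class FileLengthCache {
    // Path string -> cached length in bytes.
    private final ConcurrentHashMap<String, Long> lengths = new ConcurrentHashMap<>();
    // Assumed expensive lookup (for example, a call that hits the namenode).
    private final ToLongFunction<String> lookup;

    FileLengthCache(ToLongFunction<String> lookup) {
        this.lookup = lookup;
    }

    /** Returns the cached length, calling the lookup only on the first request. */
    long length(String path) {
        return lengths.computeIfAbsent(path, p -> lookup.applyAsLong(p));
    }

    /** Drops an entry, e.g. after the file has been deleted by a compaction. */
    void invalidate(String path) {
        lengths.remove(path);
    }
}
```

In a real deployment the lookup might wrap something like Hadoop's FileSystem.getFileStatus(path).getLen(); that wiring, and any eviction policy for deleted files, is left open here.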



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
