cshannon opened a new pull request, #3418:
URL: https://github.com/apache/accumulo/pull/3418

   This allows treating each fenced range of an RFile as a separate TabletFile 
for reading purposes. This PR is part of #1327 and the latest attempt to add 
fencing to an RFile. The changes here build off of the changes in #3401 by 
adding a Range to `AbstractTabletFile`. This allows `RFileOperations` to easily 
access the Range and wrap the Reader inside a FencedReader.
   
   The idea here is to associate a range/fence with a TabletFile so that we can 
easily treat the combination of an RFile and a fence as a unique file. This 
means fewer changes to the rest of the code base when we have multiple ranges 
for a single file, as the code just thinks they are unique files. For more 
information see the comments 
[here](https://github.com/apache/accumulo/issues/1327#issuecomment-1509174746) 
and 
[here](https://github.com/apache/accumulo/issues/1327#issuecomment-1509338370).
   
   So, for example, if we had 5 ranges defined for an RFile we'd load up 5 
"files" that were fenced off by each range and the rest of the code would just 
get a list of 5 readers and wouldn't know that they were actually the same file 
and wouldn't care when iterating. The 5 fenced files (that are really just 
subsets of the same file) are treated everywhere else in the code exactly as 
if they were 5 unique files.
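
   To make the "N ranges become N readers" idea concrete, here is a minimal 
sketch that does not use any Accumulo classes. A `TreeMap` stands in for the 
sorted contents of one RFile, and the `Fence` class, the `openFenced` and 
`openAll` methods, and the exclusive-start/inclusive-end bounds are all 
assumptions made for illustration, not the code in this PR.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

public class FencedFilesSketch {

  // Hypothetical stand-in for a fence: rows in (startExclusive, endInclusive].
  static final class Fence {
    final String startExclusive;
    final String endInclusive;

    Fence(String startExclusive, String endInclusive) {
      this.startExclusive = startExclusive;
      this.endInclusive = endInclusive;
    }
  }

  // Opens one "fenced file": a view over the shared data, limited to the fence.
  static NavigableMap<String,String> openFenced(NavigableMap<String,String> file,
      Fence fence) {
    return file.subMap(fence.startExclusive, false, fence.endInclusive, true);
  }

  // One underlying file plus N fences yields N independent readers.
  static List<NavigableMap<String,String>> openAll(NavigableMap<String,String> file,
      List<Fence> fences) {
    List<NavigableMap<String,String>> readers = new ArrayList<>();
    for (Fence fence : fences) {
      readers.add(openFenced(file, fence));
    }
    return readers;
  }

  public static void main(String[] args) {
    // Stand-in for the sorted rows of a single RFile: a..j
    NavigableMap<String,String> file = new TreeMap<>();
    for (char c = 'a'; c <= 'j'; c++) {
      file.put(String.valueOf(c), "value-" + c);
    }
    List<Fence> fences = List.of(new Fence("a", "c"), new Fence("e", "g"));
    // The caller just sees a list of readers and iterates each one.
    for (NavigableMap<String,String> reader : openAll(file, fences)) {
      System.out.println(reader.keySet()); // prints [b, c] then [f, g]
    }
  }
}
```

   The caller only ever sees a list of readers, which is the property the 
description above relies on: nothing downstream needs to know the readers 
share one underlying file.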
   
   One thing to note is that inside FileManager we track reserved readers by 
TabletFile so each unique range for the same file would get its own reader in 
the cache. This should be fine as we want to treat them as unique, and the 
actual file data on disk is still cached by the block cache and won't be 
duplicated when multiple ranges refer to the same file. We still want to limit 
the number of iterators/scans at one time even if it's the same file. In fact, 
this isn't new: FileManager already supported multiple readers for the same 
file in case there are multiple concurrent reads. This change just adds another 
way to have a reference to the same file.
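
   The keying behavior described above can be illustrated with a small 
sketch. Everything here is hypothetical (`FileKey`, `reserve`, and the two 
maps are not FileManager's real fields or methods); it only shows that a key 
made of path plus fence gives each range its own reader entry while a 
path-keyed cache still holds the file once.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

public class ReaderCacheSketch {

  // Hypothetical cache key: file path plus a simplified textual fence.
  static final class FileKey {
    final String path;
    final String fence;

    FileKey(String path, String fence) {
      this.path = path;
      this.fence = fence;
    }

    @Override
    public boolean equals(Object o) {
      if (!(o instanceof FileKey)) {
        return false;
      }
      FileKey other = (FileKey) o;
      return path.equals(other.path) && fence.equals(other.fence);
    }

    @Override
    public int hashCode() {
      return Objects.hash(path, fence);
    }
  }

  // Counts reserved readers per (path, fence) key instead of holding real readers.
  private final Map<FileKey,Integer> reservedReaders = new HashMap<>();
  // Stand-in for a block cache keyed by path only, so file data is not duplicated.
  private final Set<String> cachedPaths = new HashSet<>();

  void reserve(FileKey key) {
    reservedReaders.merge(key, 1, Integer::sum);
    cachedPaths.add(key.path);
  }

  int readersFor(FileKey key) {
    return reservedReaders.getOrDefault(key, 0);
  }

  int distinctReaderKeys() {
    return reservedReaders.size();
  }

  int cachedFiles() {
    return cachedPaths.size();
  }

  public static void main(String[] args) {
    ReaderCacheSketch cache = new ReaderCacheSketch();
    cache.reserve(new FileKey("F0001.rf", "(a,c]"));
    cache.reserve(new FileKey("F0001.rf", "(e,g]"));
    cache.reserve(new FileKey("F0001.rf", "(a,c]")); // concurrent read, same fence
    System.out.println(cache.distinctReaderKeys()); // prints 2
    System.out.println(cache.readersFor(new FileKey("F0001.rf", "(a,c]"))); // prints 2
    System.out.println(cache.cachedFiles()); // prints 1
  }
}
```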
   
   I marked this as a work in progress for now as I wasn't sure how much to 
update in this PR vs future PRs. The main purpose of this PR is just to add the 
fencing iterator but I also updated FileOperations and RFileScanner to use it 
just to demonstrate it works.
   
   PR includes the following:
   
   1. An iterator to fence off an RFile by range
   2. An iterator to also fence off an RFile index
   3. A test class, FencedRFileTest, that demonstrates the fencing
   4. RFileScanner was updated so clients can also pass a range for an RFile. 
The matching classes (RFileScannerBuilder, etc) were updated as well. Two tests 
were added to demonstrate fencing in RFileClientTest. One demonstrates using 
the client Scanner and the other uses FileOperations. The FileOperations test 
probably belongs somewhere else but this was mostly just to demonstrate it 
works. Note that an RFileScanner for clients already takes a range, but that 
range is an overall range across multiple files, whereas this shows passing a 
range per RFile. Ultimately we may decide we don't need to fence RFileScanner 
but it demonstrates we can if we want to.
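
   The difference between the overall scan range and a per-file fence can be 
sketched as a range intersection: whatever range the scan asks for, the fenced 
file only returns rows that also fall inside its fence. The names below 
(`RowRange`, `clip`, `scan`) and the closed `[start, end]` bounds are 
assumptions for illustration and are not the iterator added by this PR.

```java
import java.util.NavigableSet;
import java.util.SortedSet;
import java.util.TreeSet;

public class FenceClipSketch {

  // Hypothetical closed row range [start, end]; both bounds are required here.
  static final class RowRange {
    final String start;
    final String end;

    RowRange(String start, String end) {
      this.start = start;
      this.end = end;
    }
  }

  // Intersects the requested range with the fence; null means no overlap.
  static RowRange clip(RowRange requested, RowRange fence) {
    String start = requested.start.compareTo(fence.start) >= 0 ? requested.start : fence.start;
    String end = requested.end.compareTo(fence.end) <= 0 ? requested.end : fence.end;
    return start.compareTo(end) <= 0 ? new RowRange(start, end) : null;
  }

  // Scans one file: the overall scan range is narrowed by that file's own fence.
  static SortedSet<String> scan(NavigableSet<String> fileRows, RowRange requested,
      RowRange fence) {
    RowRange clipped = clip(requested, fence);
    if (clipped == null) {
      return new TreeSet<>();
    }
    return fileRows.subSet(clipped.start, true, clipped.end, true);
  }

  public static void main(String[] args) {
    // Stand-in for the sorted rows of a single RFile: a..j
    NavigableSet<String> rows = new TreeSet<>();
    for (char c = 'a'; c <= 'j'; c++) {
      rows.add(String.valueOf(c));
    }
    RowRange fence = new RowRange("c", "f");
    System.out.println(scan(rows, new RowRange("e", "h"), fence)); // prints [e, f]
    System.out.println(scan(rows, new RowRange("g", "h"), fence)); // prints []
  }
}
```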
   
   There is more work to do in this PR and/or follow on PRs:
   1. Make sure all places that read an RFile and need to fence it also pass 
in a range for the fenced iterator
   2. Add some tests to verify the changes in RFileOperations and FileManager 
work when opening a ranged file
   3. I will create a separate PR to handle writing the new DataFileValue 
metadata which will include adding a range to the CQ and storing a separate DFV 
for each combination of file and range. 
   4. After we can persist the ranges we need to update everywhere that uses a 
reader to be able to pass in ranges (compaction, scanners, etc). It would also 
be good to have some ITs to show that metadata table entries can contain 
ranges and that those ranges are read back and used to fence off files
   5. Actually update the merge code to use all the changes


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
