cshannon commented on issue #1327: URL: https://github.com/apache/accumulo/issues/1327#issuecomment-1427102074
@ctubbsii, @keith-turner, @dlmarion - As a first step on this issue, I have been looking into what it would take to create an iterator that reads an RFile fenced off by a range or ranges. I wanted to get some feedback here on which approach to proceed with, as I have come across issues/concerns with each approach I've looked at. Below are the 2 main ideas I've looked into so far.

#### 1. We could create a new RFile reader/iterator (for the purposes of this discussion, call it FencedRFileReader) that can handle multiple ranges to fence what's returned.

The idea here is that the new FencedRFileReader iterator would take an existing RFile reader as the source along with a list of 1 or more ranges (or no ranges to mean the whole file) and then transparently handle iterating, seeking, etc. over the file, skipping rows not in a range. There are a couple of ways I thought of to do this:

- One way is to have FencedRFileReader extend [SeekingFilter](https://github.com/apache/accumulo/blob/b74c97c260d6867f242945ff65e718e5819be229/core/src/main/java/org/apache/accumulo/core/iterators/user/SeekingFilter.java) or something similar, and then internally the iterator can handle advancing between rows and ranges by overriding `getNextKeyHint()`. Essentially it would keep track of the current range and transparently seek to the next range when the current one is exhausted during `next()`. It would of course need to handle the other methods appropriately as well.
- Another option is for FencedRFileReader to extend [HeapIterator](https://github.com/apache/accumulo/blob/b74c97c260d6867f242945ff65e718e5819be229/core/src/main/java/org/apache/accumulo/core/iteratorsImpl/system/HeapIterator.java) like the original RFile [reader](https://github.com/apache/accumulo/blob/b74c97c260d6867f242945ff65e718e5819be229/core/src/main/java/org/apache/accumulo/core/file/rfile/RFile.java#L1160) does.
  The idea I had here is to first create a separate RFile reader for each range, each fencing off the RFile by a single range (i.e. RangedRFileReader). Then the new FencedRFileReader could add each RangedRFileReader as a source, and since it's a HeapIterator it should handle things automatically across multiple sources.

A problem I see with this approach of an iterator that handles more than one range is that I'm not sure it would work easily, because of having to implement methods in [FileSKVIterator](https://github.com/apache/accumulo/blob/b74c97c260d6867f242945ff65e718e5819be229/core/src/main/java/org/apache/accumulo/core/file/FileSKVIterator.java) such as `getFirstKey()` and `getLastKey()`. There are places in the code like [CompactableUtils](https://github.com/apache/accumulo/blob/b74c97c260d6867f242945ff65e718e5819be229/server/tserver/src/main/java/org/apache/accumulo/tserver/tablet/CompactableUtils.java#L111) that use `getFirstKey()`/`getLastKey()` and rely on them for decisions, and I am not sure if this would break now that there would be multiple ranges. I would need to dig into this more, but maybe someone who knows more about the use cases for those methods can comment. I also wonder how things like [Sampling](https://github.com/apache/accumulo/blob/b74c97c260d6867f242945ff65e718e5819be229/core/src/main/java/org/apache/accumulo/core/file/FileSKVIterator.java) would work.

A plus side is that this approach lends itself easily to still using a single file entry per file in the metadata table, and we can just extend the value to also contain a list of ranges for the file.

#### 2. A second approach could be to only create an RFile reader/iterator that handles a single range and return a new reader for each range when using FileManager to [open](https://github.com/apache/accumulo/blob/b74c97c260d6867f242945ff65e718e5819be229/server/base/src/main/java/org/apache/accumulo/server/fs/FileManager.java#L503) the list of RFiles.
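
To make the single-range idea a bit more concrete, here is a rough sketch of what fencing by one range could look like. It is not Accumulo code: `SingleRangeReaderSketch` is a hypothetical name, plain `String` keys stand in for Accumulo `Key`s, a `NavigableMap` stands in for the underlying RFile reader, and the inclusive-start/exclusive-end convention is just an assumption for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.Optional;

// Hypothetical sketch only: String keys stand in for Accumulo Keys, and a
// NavigableMap stands in for the underlying (sorted) RFile reader.
final class SingleRangeReaderSketch {
  private final NavigableMap<String, String> file;
  private final String start; // inclusive fence start (assumption)
  private final String end;   // exclusive fence end (assumption)

  SingleRangeReaderSketch(NavigableMap<String, String> file, String start, String end) {
    if (start.compareTo(end) > 0) {
      throw new IllegalArgumentException("start must not be after end");
    }
    this.file = file;
    this.start = start;
    this.end = end;
  }

  // First key of the file that falls inside the fence, if any.
  Optional<String> firstKey() {
    String k = file.ceilingKey(start);
    return (k != null && k.compareTo(end) < 0) ? Optional.of(k) : Optional.empty();
  }

  // Last key of the file that falls inside the fence, if any.
  Optional<String> lastKey() {
    String k = file.lowerKey(end);
    return (k != null && k.compareTo(start) >= 0) ? Optional.of(k) : Optional.empty();
  }

  // Keys at or after seekStart, clipped to the fence.
  List<String> scanFrom(String seekStart) {
    String from = seekStart.compareTo(start) < 0 ? start : seekStart;
    if (from.compareTo(end) >= 0) {
      return new ArrayList<>(); // seek landed past the fence
    }
    return new ArrayList<>(file.subMap(from, true, end, false).keySet());
  }
}
```

With one contiguous range, the first/last key and any seek are just the file's answers clipped to the fence, which is what makes the single-range reader simple to reason about.
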
So instead of having a single iterator per RFile that handles multiple ranges as described in approach 1, we would just return multiple RFile readers, one for every range specified.

The problem I see here is that in all of the places where we use a list of files, each file is uniquely identified only by [TabletFile](https://github.com/apache/accumulo/blob/b74c97c260d6867f242945ff65e718e5819be229/core/src/main/java/org/apache/accumulo/core/metadata/TabletFile.java#L141)/StoredTabletFile, and the key is just the Path of the file. This would obviously not work in this case because the path would now be duplicated, as there would be multiple readers for the same file. We'd need to update those classes, or add a new class, so that a Range as well as the Path uniquely identifies the file. Updating TabletFile to optionally take a Range in addition to the Path and use that for comparison/equality may be good enough here.

A plus side is that methods in FileSKVIterator such as `getFirstKey()`/`getLastKey()` may be easy to implement, as each reader covers just a subset of the file. Since each reader handles a single continuous range, hopefully code that calls those methods would work without modification (or without as much modification).

Also, I was thinking this approach may be easier if there were a separate file entry in the metadata table for each range associated with an RFile, instead of just a single entry like today, with StoredTabletFile updated to take a Range as part of the comparison/equality. We could probably still make it work with a single entry, though.

Thoughts/Comments?
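
For reference, the identity concern in approach 2 (the same path appearing once per range) can be illustrated with a small sketch. This is not the real TabletFile/StoredTabletFile code: `FencedFileId`, its fields, and the use of plain row strings with `null` meaning unbounded are all hypothetical stand-ins chosen for the example.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Objects;
import java.util.Set;

// Hypothetical identifier: a file path plus an optional row range.
// A null startRow or endRow means "unbounded" (assumption for this sketch).
final class FencedFileId {
  final String path;
  final String startRow;
  final String endRow;

  FencedFileId(String path, String startRow, String endRow) {
    this.path = Objects.requireNonNull(path);
    this.startRow = startRow;
    this.endRow = endRow;
  }

  // Equality includes the range, so two fences over one file stay distinct.
  @Override
  public boolean equals(Object o) {
    if (this == o) {
      return true;
    }
    if (!(o instanceof FencedFileId)) {
      return false;
    }
    FencedFileId other = (FencedFileId) o;
    return path.equals(other.path)
        && Objects.equals(startRow, other.startRow)
        && Objects.equals(endRow, other.endRow);
  }

  @Override
  public int hashCode() {
    return Objects.hash(path, startRow, endRow);
  }
}

final class FencedFileIdDemo {
  // Keying by path alone collapses every range of the same file into one entry.
  static int distinctPaths(List<FencedFileId> ids) {
    Set<String> paths = new HashSet<>();
    for (FencedFileId id : ids) {
      paths.add(id.path);
    }
    return paths.size();
  }

  // Keying by path plus range keeps one entry per fenced reader.
  static int distinctIds(List<FencedFileId> ids) {
    return new HashSet<>(ids).size();
  }
}
```

Two fences over the same file count as one entry when keyed by path and as two when keyed by path plus range, which is the behavior any updated comparison/equality would need.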
