[GitHub] [accumulo] cshannon commented on issue #1327: No-chop merges

via GitHub Sun, 12 Feb 2023 10:37:43 -0800


cshannon commented on issue #1327:
URL: https://github.com/apache/accumulo/issues/1327#issuecomment-1427102074


   @ctubbsii, @keith-turner, @dlmarion - As a first step on this issue, I have 
been looking into what it would take to create an iterator that reads an RFile 
fenced off by one or more ranges. I wanted to get some feedback here on which 
approach to proceed with, as I have come across issues/concerns with each 
approach I've looked at. Below are the two main ideas I've looked into so far.
   
   #### 1. We could create a new RFile reader/iterator (for the purposes of 
this discussion, call it FencedRFileReader) that can handle multiple ranges to 
fence what's returned.
   
   The idea here is that the new FencedRFileReader iterator would take an 
existing RFile reader as the source and also a list of 1 or more ranges (or no 
ranges to mean the whole file) and then transparently handle iterating, 
seeking, etc. over the file while skipping rows not in a range. There are a 
couple of ways that I thought of to do this:
   
   - One way is to have FencedRFileReader extend 
[SeekingFilter](https://github.com/apache/accumulo/blob/b74c97c260d6867f242945ff65e718e5819be229/core/src/main/java/org/apache/accumulo/core/iterators/user/SeekingFilter.java)
 or something similar and then internally the iterator can just handle 
advancing between rows and ranges by overriding `getNextKeyHint()`. Essentially 
it would keep track of the current range and then transparently seek to the 
next range when each range is exhausted while calling `next()`. It would of 
course need to handle the other methods appropriately as well.
   
   - Another option is for FencedRFileReader to extend 
[HeapIterator](https://github.com/apache/accumulo/blob/b74c97c260d6867f242945ff65e718e5819be229/core/src/main/java/org/apache/accumulo/core/iteratorsImpl/system/HeapIterator.java)
 like the original RFile 
[reader](https://github.com/apache/accumulo/blob/b74c97c260d6867f242945ff65e718e5819be229/core/src/main/java/org/apache/accumulo/core/file/rfile/RFile.java#L1160)
 does. The idea I had here is to first create a separate RFile reader for each 
range, each fencing off the RFile by a single range (i.e. RangedRFileReader). 
Then the new FencedRFileReader could add each RangedRFileReader as a source, 
and since it's a HeapIterator it should handle things automatically across 
multiple sources.
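
   To make the first idea more concrete, here is a minimal sketch of the 
fencing behavior only. It is not written against the Accumulo iterator API: a 
`NavigableMap` stands in for the sorted RFile, rows are plain strings, and 
`RowRange` and `FencedReaderSketch` are hypothetical names. It assumes the 
ranges are sorted, non-overlapping and inclusive on both ends.

```java
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.NoSuchElementException;

// Hypothetical stand-in for an Accumulo Range: [startRow, endRow], both inclusive.
final class RowRange {
  final String startRow;
  final String endRow;

  RowRange(String startRow, String endRow) {
    this.startRow = startRow;
    this.endRow = endRow;
  }
}

// Simplified model of the proposed FencedRFileReader. The NavigableMap plays
// the role of the sorted RFile; an empty fence list means "whole file".
class FencedReaderSketch implements Iterator<Map.Entry<String, String>> {
  private final NavigableMap<String, String> file;
  private final List<RowRange> fences; // assumed sorted and non-overlapping
  private int fenceIdx = 0;
  private Iterator<Map.Entry<String, String>> current;

  FencedReaderSketch(NavigableMap<String, String> file, List<RowRange> fences) {
    this.file = file;
    this.fences = fences;
    this.current = fences.isEmpty() ? file.entrySet().iterator() : seekToFence(0);
  }

  // "Seek" the underlying source to the start of the given fence.
  private Iterator<Map.Entry<String, String>> seekToFence(int idx) {
    RowRange r = fences.get(idx);
    return file.subMap(r.startRow, true, r.endRow, true).entrySet().iterator();
  }

  @Override
  public boolean hasNext() {
    // When the current fence is exhausted, move on to the next one.
    while (!current.hasNext() && fenceIdx + 1 < fences.size()) {
      current = seekToFence(++fenceIdx);
    }
    return current.hasNext();
  }

  @Override
  public Map.Entry<String, String> next() {
    if (!hasNext()) {
      throw new NoSuchElementException();
    }
    return current.next();
  }
}
```

   A real implementation would also have to intersect each fence with the 
range passed to `seek()` and honor column families, which this sketch ignores.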
   
   A problem I see with a single iterator that handles more than one range is 
that I'm not sure it would work easily, because it would have to implement 
methods in 
[FileSKVIterator](https://github.com/apache/accumulo/blob/b74c97c260d6867f242945ff65e718e5819be229/core/src/main/java/org/apache/accumulo/core/file/FileSKVIterator.java)
 such as `getFirstKey()` and `getLastKey()`. There are places in the code like 
[CompactableUtils](https://github.com/apache/accumulo/blob/b74c97c260d6867f242945ff65e718e5819be229/server/tserver/src/main/java/org/apache/accumulo/tserver/tablet/CompactableUtils.java#L111)
 that use getFirstKey()/getLastKey() and rely on them for decisions and I am 
not sure if this would break now that there would be multiple ranges. I would 
need to dig into this more but maybe someone else can comment on this who knows 
more about the use cases for those methods. I also wonder how things like 
[Sampling](https://github.com/apache/accumulo/blob/b74c97c260d6867f242945ff65e718e5819be229/core/src/main/java/org/apache/accumulo/core/file/FileSKVIterator.java)
 would work.
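
   The concern with `getFirstKey()`/`getLastKey()` can be illustrated with a 
small sketch. Again this is a simplified model rather than Accumulo code: a 
`NavigableMap` stands in for the file, each fence is a hypothetical 
`{startRow, endRow}` pair (inclusive, sorted, non-overlapping), and 
`FencedKeyBounds` is an invented helper name.

```java
import java.util.List;
import java.util.NavigableMap;

class FencedKeyBounds {
  // Smallest key in the file that falls inside any fence, or null if none does.
  static String firstKey(NavigableMap<String, String> file, List<String[]> fences) {
    for (String[] f : fences) {
      NavigableMap<String, String> sub = file.subMap(f[0], true, f[1], true);
      if (!sub.isEmpty()) {
        return sub.firstKey();
      }
    }
    return null;
  }

  // Largest key in the file that falls inside any fence, or null if none does.
  static String lastKey(NavigableMap<String, String> file, List<String[]> fences) {
    for (int i = fences.size() - 1; i >= 0; i--) {
      String[] f = fences.get(i);
      NavigableMap<String, String> sub = file.subMap(f[0], true, f[1], true);
      if (!sub.isEmpty()) {
        return sub.lastKey();
      }
    }
    return null;
  }

  // True if the given row is visible through at least one fence.
  static boolean visible(List<String[]> fences, String row) {
    for (String[] f : fences) {
      if (row.compareTo(f[0]) >= 0 && row.compareTo(f[1]) <= 0) {
        return true;
      }
    }
    return false;
  }
}
```

   With rows a through f and fences [b, c] and [e, e], the first key is b and 
the last key is e, yet row d lies between them and is not visible. Any caller 
that treats [first key, last key] as a contiguous extent of the file's data 
would need to be checked against that case.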
   
   A plus side is this approach lends itself easily to still just using a 
single file entry per file in the metadata table and we can just extend the 
value to also contain a list of ranges for the file.
   
   #### 2. A second approach could be to only create an RFile reader/iterator 
that handles a single range and return a new reader for each range when using 
FileManager to 
[open](https://github.com/apache/accumulo/blob/b74c97c260d6867f242945ff65e718e5819be229/server/base/src/main/java/org/apache/accumulo/server/fs/FileManager.java#L503)
 the list of RFiles. 
   
   So instead of having a single iterator per RFile that handles multiple 
ranges as described in approach 1, we would just return multiple RFile 
readers, one for every range specified.
   
   The problem I see here is that everywhere we use a list of files, each file 
is uniquely identified only by 
[TabletFile](https://github.com/apache/accumulo/blob/b74c97c260d6867f242945ff65e718e5819be229/core/src/main/java/org/apache/accumulo/core/metadata/TabletFile.java#L141)/StoredTabletFile
 and the key is just the Path of the file. This would obviously not work in 
this case, because the path would now be duplicated across multiple readers 
for the same file. We'd need to update those classes, or add a new class, so 
that a range as well as the Path uniquely identifies the file. Updating 
TabletFile to optionally take a Range in addition to the Path and use that for 
comparison/equality may be good enough here.
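
   As a rough sketch of what "Path plus optional Range" identity could look 
like: the class below is hypothetical and is not the real TabletFile. The path 
and row bounds are plain strings, and a null bound is assumed to mean 
unbounded, so two null bounds represent the whole file.

```java
import java.util.Objects;

// Hypothetical identifier for a fenced file; equality covers path AND range.
final class FencedFileId {
  final String path;
  final String startRow; // null = unbounded start (assumption for this sketch)
  final String endRow;   // null = unbounded end (assumption for this sketch)

  FencedFileId(String path, String startRow, String endRow) {
    this.path = Objects.requireNonNull(path);
    this.startRow = startRow;
    this.endRow = endRow;
  }

  // Convenience for today's behavior: one entry covering the whole file.
  static FencedFileId wholeFile(String path) {
    return new FencedFileId(path, null, null);
  }

  @Override
  public boolean equals(Object o) {
    if (this == o) {
      return true;
    }
    if (!(o instanceof FencedFileId)) {
      return false;
    }
    FencedFileId other = (FencedFileId) o;
    return path.equals(other.path)
        && Objects.equals(startRow, other.startRow)
        && Objects.equals(endRow, other.endRow);
  }

  @Override
  public int hashCode() {
    return Objects.hash(path, startRow, endRow);
  }
}
```

   With this, two readers over the same path but different ranges can coexist 
as distinct keys in a map or set, while unfenced files keep a single identity.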
   
   A plus side is that methods in FileSKVIterator such as 
getFirstKey()/getLastKey() may be easy to implement, as each reader covers 
just a subset of the file. Since each reader handles a single continuous 
range, hopefully code that calls those methods would work without 
modification (or without as much modification).
   
   Also, I was thinking this approach may be easier if there were a separate 
file entry in the metadata table for each range associated with an RFile, 
instead of just a single entry like today, and we just update StoredTabletFile 
to take a Range as part of the comparison/equality. We could probably still 
make it work with a single entry, though.
   
   Thoughts/Comments?


