[GitHub] [accumulo] cshannon commented on issue #1327: No-chop merges

via GitHub Fri, 14 Apr 2023 13:05:34 -0700


cshannon commented on issue #1327:
URL: https://github.com/apache/accumulo/issues/1327#issuecomment-1509174746


   I talked to @keith-turner about this quite a bit today and we came up with a 
bit of an alternative strategy to what I was trying with my original two draft 
PRs (#3246 and #3286) where I was trying to handle multiple ranges per file 
with fencing and still just storing a single file metadata entry per RFile. 
   
   After talking through everything I am going to try the following in one or 
more new PRs to handle both the reading/fencing case and then the storing of 
metadata ranges.
   
   1. After going through the scenarios with how fencing off RFiles might be 
used with merges, splits, scans, etc., we think it might be better to go with 
treating each range as its own file. (Basically a variation of option 1 I 
detailed in my post 
[here](https://github.com/apache/accumulo/issues/1327#issuecomment-1427102074)).
 The idea being that if we can treat each range as its own file the rest of the 
code wouldn't need as much modification as it's still just dealing with file 
abstractions.
   2. We would only need to create a fenced RFile iterator to handle a single 
range (wouldn't need an iterator to handle multiple ranges anymore). It's to be 
determined if the fencing iterator can just implement SortedKeyValueIterator or 
needs to also implement FileSKVIterator.  We may also need to fence the index 
as well.
   3. For storing files and ranges in the metadata table (DataFileValue) we 
realized that it may be better to associate a file metadata entry per range and 
not try to store multiple ranges for a single file entry.  This should work 
better because, after thinking about how the metadata is used for splits, 
etc., we realized that the current DataFileValue fields of size, numEntries, and 
time really should be associated per range and not per file. To accomplish this 
we think it could work to change the DataFile column qualifier 
(StoredTabletFile) to also include an optional range instead of just the URI, 
making it unique per range, so you'd end up with one or more entries per file 
stored (still just 1 entry if there is no range or the range covers the 
entire file).
   4. The code (file operations, scans, etc.) that deals with StoredTabletFiles 
would hopefully not need a lot of modification if we can encapsulate the 
fencing and range handling in the iterator and encapsulate the range in 
StoredTabletFile so that they are just treated like normal files. In other 
words (for example) if we do it right hopefully the code that iterates/scans 
over 10 unique files vs 10 "files" that are really just 10 unique ranges of 1 
file would be identical as the code scanning wouldn't know or care about the 
difference.
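   
   A minimal sketch of the "each range is its own file" idea from points 2-4, 
using plain Java collections rather than Accumulo classes. Everything here is 
an assumption for illustration: `RowRange`, `FencedFileRef`, `FencedReader`, 
the `|`-separated qualifier encoding, and the example path are all made up and 
are not the real StoredTabletFile, DataFileValue, or iterator APIs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Hypothetical row range (startExclusive, endInclusive]; null means unbounded. */
final class RowRange {
  final String startExclusive;
  final String endInclusive;

  RowRange(String startExclusive, String endInclusive) {
    this.startExclusive = startExclusive;
    this.endInclusive = endInclusive;
  }

  boolean contains(String row) {
    boolean afterStart = startExclusive == null || row.compareTo(startExclusive) > 0;
    boolean beforeEnd = endInclusive == null || row.compareTo(endInclusive) <= 0;
    return afterStart && beforeEnd;
  }
}

/** Hypothetical stand-in for StoredTabletFile: a path plus an optional fence. */
final class FencedFileRef {
  final String path;
  final RowRange range; // null means the entry covers the whole file

  FencedFileRef(String path, RowRange range) {
    this.path = path;
    this.range = range;
  }

  /** Made-up qualifier encoding, distinct for each (path, range) pair. */
  String qualifier() {
    if (range == null) {
      return path;
    }
    return path + "|" + range.startExclusive + "|" + range.endInclusive;
  }
}

/** Toy fenced reader: the caller only ever sees rows inside the fence. */
final class FencedReader {
  static List<String> rows(NavigableMap<String, String> fileData, FencedFileRef ref) {
    List<String> visible = new ArrayList<>();
    for (Map.Entry<String, String> entry : fileData.entrySet()) {
      if (ref.range == null || ref.range.contains(entry.getKey())) {
        visible.add(entry.getKey());
      }
    }
    return visible;
  }
}

final class FencedFileSketch {
  public static void main(String[] args) {
    // One underlying "file" holding four rows.
    NavigableMap<String, String> rfile = new TreeMap<>();
    for (String row : new String[] {"a", "c", "e", "g"}) {
      rfile.put(row, "v");
    }
    // Two metadata entries pointing at the same file with different fences.
    String path = "hdfs://example/t1/F0001.rf"; // hypothetical path
    FencedFileRef left = new FencedFileRef(path, new RowRange(null, "c"));
    FencedFileRef right = new FencedFileRef(path, new RowRange("c", null));
    // The scanning loop is the same whether these are two files or two ranges.
    for (FencedFileRef ref : List.of(left, right)) {
      System.out.println(ref.qualifier() + " -> " + FencedReader.rows(rfile, ref));
    }
  }
}
```

   The loop in `main` never checks whether an entry is a whole file or a range 
of one, which is the property point 4 is hoping for. Per-range size and 
numEntries values would hang off each entry in the same way.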
   
   Anyways, I'm going to work on it and see how it goes. It may not work as 
well in practice or could run into some roadblocks but if it works it could 
make things a lot cleaner. As I said, I'll do the work in a new PR(s) and keep 
the current ones open so we can compare the difference.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
