cshannon opened a new pull request, #3401:
URL: https://github.com/apache/accumulo/pull/3401

   This PR replaces the now closed #3385. The goal of this PR is to start to 
move away from using a String/Path representation everywhere for the file path 
reference when referring to data files in order to support easier changes in 
the future when adding more information besides just a path. This is to support 
the future changes #1327 I am making this a draft PR for nor was before this is 
merged I need to write new Javadocs for the new classes in the API and document 
changes better but before I did that I wanted to get feedback first.
   
   The current plan for supporting no chop merges is to start associating a 
range with a file and treating each file that is fenced off by a range as 
unique. This is going to lead to storing more than one metadata entry for the 
same file if there is multiple ranges so we will need to start handling the 
combination of the path and Range in the code when reading a file and not just 
the path like today.
   
   This PR is an incremental step by using TabletFile/StoredTabletFile and the 
new `AbstractTabletFile` and `UnassignedTabletFile` classes instead of String 
in places like `FileManager` and `FileOperations`. The idea here is that by 
always using those TabletFile classes instead of a Path/String it will be 
easier to read/write a fenced file as we can add a Range in the future to 
AbstractTabletFile and a getRange() method.
   
   I originally tried to just use TabletFile but ended up creating two new 
classes called AbstractTabletFile and UnassignedTabletFile because TabletFile 
does a lot of 
[validation](https://github.com/apache/accumulo/blob/2f27dbd188ea5cee32f486599b97181b136fa787/core/src/main/java/org/apache/accumulo/core/metadata/TabletFile.java#L61)
 that caused several places to break in the code. There are a lot of areas 
where we write to RFiles or read from RFiles but the file wouldn't pass the 
validation checks inside TabletFile as TabletFile requires the path to contain 
the right structure so it can be 
[parsed](https://github.com/apache/accumulo/blob/2f27dbd188ea5cee32f486599b97181b136fa787/core/src/main/java/org/apache/accumulo/core/metadata/TabletFile.java#L78)
 into a TabletDirectory object. This is problematic in several spots: Writing 
files, Tests, WAL, Bulk import, etc. Essentially any place we would be doing 
operations on files that are not part of a Tablet/Tablet yet.
   
   I used the name `UnassignedTabletFile` but a different name can be picked of 
course. I chose that name as that seemed to be the best name for it is in all 
cases it's essentially a data file that hasn't actually been assigned to a 
Tablet or Tablet which is why we can't parse the TabletDirectory. When we add a 
Range to the API we could still fence off the new unassigned file if we wanted 
to (if there's a use case), there wouldn't be anything stopping us from 
assigning a range but I am going to assume the majority of cases (if not all) 
there wouldn't be an actual range, it would just be infinite.
   
   Also note that even with tracking a TabletFile we will still always need to 
get the metadata string representation for serialization when dealing with the 
metadata table. This is unavoidable of course and the format will change in the 
future (likely to Json instead of just a file path as it will have a serialized 
range) so we will need to handle that. A good example of this is 
ExternalCompactionMetadata. This class already tracks a StoredTabletFile but 
when data is serialized it calls to getMetaUpdateDelete() and that will be a 
new format. The format won't change in this first PR but it will later so just 
noting that we will need to be aware of that and make sure we can handle the 
change when the Range is added.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to