cshannon opened a new pull request, #3401: URL: https://github.com/apache/accumulo/pull/3401
This PR replaces the now closed #3385. The goal of this PR is to start to move away from using a String/Path representation everywhere for the file path reference when referring to data files in order to support easier changes in the future when adding more information besides just a path. This is to support the future changes #1327 I am making this a draft PR for nor was before this is merged I need to write new Javadocs for the new classes in the API and document changes better but before I did that I wanted to get feedback first. The current plan for supporting no chop merges is to start associating a range with a file and treating each file that is fenced off by a range as unique. This is going to lead to storing more than one metadata entry for the same file if there is multiple ranges so we will need to start handling the combination of the path and Range in the code when reading a file and not just the path like today. This PR is an incremental step by using TabletFile/StoredTabletFile and the new `AbstractTabletFile` and `UnassignedTabletFile` classes instead of String in places like `FileManager` and `FileOperations`. The idea here is that by always using those TabletFile classes instead of a Path/String it will be easier to read/write a fenced file as we can add a Range in the future to AbstractTabletFile and a getRange() method. I originally tried to just use TabletFile but ended up creating two new classes called AbstractTabletFile and UnassignedTabletFile because TabletFile does a lot of [validation](https://github.com/apache/accumulo/blob/2f27dbd188ea5cee32f486599b97181b136fa787/core/src/main/java/org/apache/accumulo/core/metadata/TabletFile.java#L61) that caused several places to break in the code. There are a lot of areas where we write to RFiles or read from RFiles but the file wouldn't pass the validation checks inside TabletFile as TabletFile requires the path to contain the right structure so it can be [parsed](https://github.com/apache/accumulo/blob/2f27dbd188ea5cee32f486599b97181b136fa787/core/src/main/java/org/apache/accumulo/core/metadata/TabletFile.java#L78) into a TabletDirectory object. This is problematic in several spots: Writing files, Tests, WAL, Bulk import, etc. Essentially any place we would be doing operations on files that are not part of a Tablet/Tablet yet. I used the name `UnassignedTabletFile` but a different name can be picked of course. I chose that name as that seemed to be the best name for it is in all cases it's essentially a data file that hasn't actually been assigned to a Tablet or Tablet which is why we can't parse the TabletDirectory. When we add a Range to the API we could still fence off the new unassigned file if we wanted to (if there's a use case), there wouldn't be anything stopping us from assigning a range but I am going to assume the majority of cases (if not all) there wouldn't be an actual range, it would just be infinite. Also note that even with tracking a TabletFile we will still always need to get the metadata string representation for serialization when dealing with the metadata table. This is unavoidable of course and the format will change in the future (likely to Json instead of just a file path as it will have a serialized range) so we will need to handle that. A good example of this is ExternalCompactionMetadata. This class already tracks a StoredTabletFile but when data is serialized it calls to getMetaUpdateDelete() and that will be a new format. The format won't change in this first PR but it will later so just noting that we will need to be aware of that and make sure we can handle the change when the Range is added. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
