[GitHub] [accumulo] cshannon opened a new pull request, #3401: Start using TabletFile instead of String for referencing data files

via GitHub Sun, 14 May 2023 14:23:12 -0700


cshannon opened a new pull request, #3401:
URL: https://github.com/apache/accumulo/pull/3401

This PR replaces the now closed #3385. The goal of this PR is to start to
move away from using a String/Path representation everywhere for the file path
reference when referring to data files in order to support easier changes in
the future when adding more information besides just a path. This is to support
the future changes #1327 I am making this a draft PR for nor was before this is
merged I need to write new Javadocs for the new classes in the API and document
changes better but before I did that I wanted to get feedback first.

The current plan for supporting no chop merges is to start associating a
range with a file and treating each file that is fenced off by a range as
unique. This is going to lead to storing more than one metadata entry for the
same file if there is multiple ranges so we will need to start handling the
combination of the path and Range in the code when reading a file and not just
the path like today.

This PR is an incremental step by using TabletFile/StoredTabletFile and the
new `AbstractTabletFile` and `UnassignedTabletFile` classes instead of String
in places like `FileManager` and `FileOperations`. The idea here is that by
always using those TabletFile classes instead of a Path/String it will be
easier to read/write a fenced file as we can add a Range in the future to
AbstractTabletFile and a getRange() method.

I originally tried to just use TabletFile but ended up creating two new
classes called AbstractTabletFile and UnassignedTabletFile because TabletFile
does a lot of
[validation](https://github.com/apache/accumulo/blob/2f27dbd188ea5cee32f486599b97181b136fa787/core/src/main/java/org/apache/accumulo/core/metadata/TabletFile.java#L61)
that caused several places to break in the code. There are a lot of areas
where we write to RFiles or read from RFiles but the file wouldn't pass the
validation checks inside TabletFile as TabletFile requires the path to contain
the right structure so it can be
[parsed](https://github.com/apache/accumulo/blob/2f27dbd188ea5cee32f486599b97181b136fa787/core/src/main/java/org/apache/accumulo/core/metadata/TabletFile.java#L78)
into a TabletDirectory object. This is problematic in several spots: Writing
files, Tests, WAL, Bulk import, etc. Essentially any place we would be doing
operations on files that are not part of a Tablet/Tablet yet.

I used the name `UnassignedTabletFile` but a different name can be picked of
course. I chose that name as that seemed to be the best name for it is in all
cases it's essentially a data file that hasn't actually been assigned to a
Tablet or Tablet which is why we can't parse the TabletDirectory. When we add a
Range to the API we could still fence off the new unassigned file if we wanted
to (if there's a use case), there wouldn't be anything stopping us from
assigning a range but I am going to assume the majority of cases (if not all)
there wouldn't be an actual range, it would just be infinite.

Also note that even with tracking a TabletFile we will still always need to
get the metadata string representation for serialization when dealing with the
metadata table. This is unavoidable of course and the format will change in the
future (likely to Json instead of just a file path as it will have a serialized
range) so we will need to handle that. A good example of this is
ExternalCompactionMetadata. This class already tracks a StoredTabletFile but
when data is serialized it calls to getMetaUpdateDelete() and that will be a
new format. The format won't change in this first PR but it will later so just
noting that we will need to be aware of that and make sure we can handle the
change when the Range is added.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [accumulo] cshannon opened a new pull request, #3401: Start using TabletFile instead of String for referencing data files

Reply via email to