This is an automated email from the ASF dual-hosted git repository.
yihua pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/master by this push:
new 849b217edb5 [HUDI-3625] Update RFC-60 (#9462)
849b217edb5 is described below
commit 849b217edb563369720e528ad4487df4b57a2308
Author: Shawn Chang <[email protected]>
AuthorDate: Sun Mar 3 16:12:36 2024 -0800
[HUDI-3625] Update RFC-60 (#9462)
Co-authored-by: Shawn Chang <[email protected]>
---
rfc/rfc-60/read_flow.png | Bin 0 -> 176856 bytes
rfc/rfc-60/rfc-60.md | 99 ++++++++++++++++++++++++++++++++++++++--------
rfc/rfc-60/wrapper_fs.png | Bin 0 -> 148392 bytes
3 files changed, 83 insertions(+), 16 deletions(-)
diff --git a/rfc/rfc-60/read_flow.png b/rfc/rfc-60/read_flow.png
new file mode 100644
index 00000000000..4ef464f41e7
Binary files /dev/null and b/rfc/rfc-60/read_flow.png differ
diff --git a/rfc/rfc-60/rfc-60.md b/rfc/rfc-60/rfc-60.md
index d509aec1f20..bdfaa58b899 100644
--- a/rfc/rfc-60/rfc-60.md
+++ b/rfc/rfc-60/rfc-60.md
@@ -15,7 +15,7 @@
limitations under the License.
-->
-# RFC-60: Federated Storage Layer
+# RFC-60: Federated Storage Layout
## Proposers
- @umehrot2
@@ -52,7 +52,10 @@ but there can be a 30 - 60 minute wait time before new
partitions are created. T
same table path prefix could result in these request limits being hit for the
table prefix, specially as workloads
scale, and there are several thousands of files being written/updated
concurrently. This hurts performance due to
re-trying of failed requests affecting throughput, and result in occasional
failures if the retries are not able to
-succeed either and continue to be throttled.
+succeed either and continue to be throttled. Note an exception would be
non-partitioned tables
+reside directly under S3 buckets (using S3 buckets as their table paths), and
those tables would be free
+from the throttling problem. However, this exception cannot invalidate the
necessity of addressing the throttling
+problem for partitioned tables.
The traditional storage layout also tightly couples the partitions as folders
under the table path. However,
some users want flexibility to be able to distribute files/partitions under
multiple different paths across cloud stores,
@@ -97,22 +100,21 @@ public interface HoodieStorageStrategy extends
Serializable {
}
```
-### Generating file paths for object store optimized layout
+### Generating File Paths for Object Store Optimized Layout
We want to distribute files evenly across multiple random prefixes, instead of
following the traditional Hive storage
layout of keeping them under a common table path/prefix. In addition to the
`Table Path`, for this new layout user will
configure another `Table Storage Path` under which the actual data files will
be distributed. The original `Table Path` will
be used to maintain the table/partitions Hudi metadata.
-For the purpose of this documentation lets assume:
+For the purpose of this documentation let's assume:
```
Table Path => s3://<table_bucket>/<hudi_table_name>/
Table Storage Path => s3://<table_storage_bucket>/
```
-Note: `Table Storage Path` can be a path in the same Amazon S3 bucket or a
different bucket. For best results,
-`Table Storage Path` should be a top-level bucket instead of a prefix under
the bucket to avoid multiple
-tables sharing the prefix.
+`Table Storage Path` should be a top-level bucket instead of a prefix under
the bucket for the best results.
+So that we can avoid multiple tables sharing the prefix causing throttling.
We will use a Hashing function on the `Partition Path/File ID` to map them to
a prefix generated under `Table Storage Path`:
```
@@ -148,7 +150,7 @@
s3://<table_storage_bucket>/0bfb3d6e/<hudi_table_name>/.075f3295-def8-4a42-a927-
...
```
-Note: Storage strategy would only return a storage location instead of a full
path. In the above example,
+Storage strategy would only return a storage location instead of a full path.
In the above example,
the storage location is `s3://<table_storage_bucket>/0bfb3d6e/`, and the
lower-level folder structure would be appended
later automatically to get the actual file path. In another word,
users would only be able to customize upper-level folder structure (storage
location).
@@ -176,7 +178,7 @@ The hashing function should be made user configurable for
use cases like bucketi
sub-partitioning/re-hash to reduce the number of hash prefixes. Having too
many unique hash prefixes
would make files too dispersed, and affect performance on other operations
such as listing.
-### Maintain mapping to files
+### Maintaining Mapping to Files with Metadata Table
In
[RFC-15](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=147427331),
we introduced an internal
Metadata Table with a `files` partition that maintains mapping from partitions
to list of files in the partition stored
@@ -196,13 +198,75 @@ for metadata table to be populated.
4. If there is an error reading from Metadata table, we will not fall back
listing from file system.
-5. In case of metadata table getting corrupted or lost, we need to have a
solution here to reconstruct metadata table
-from the files which distributed using federated storage. We will likely have
to implement a file system listing
-logic, that can get all the partition to files mapping by listing all the
prefixes under the `Table Storage Path`.
-Following the folder structure of adding table name/partitions under the
prefix will help in getting the listing and
-identifying the table/partition they belong to.
+### Integration
+This section mainly describes how storage strategy is integrated with other
components and how read/write
+would look like from Hudi side with object storage layout.
+
+We propose integrating the storage strategy at the filesystem level,
specifically within `HoodieWrapperFileSystem`.
+This way, only file read/write operations undergo path conversion and we can
limit the usage of
+storage strategy to only filesystem level so other upper-level components
don't need to be aware of physical paths.
+
+This also mandates that `HoodieWrapperFileSystem` is the filesystem of choice
for all upper-level Hudi components.
+Getting filesystem from `Path` or such won't be allowed anymore as using raw
filesystem may not reach
+to physical locations without storage strategy. Hudi components can simply
call `HoodieMetaClient#getFs`
+to get `HoodieWrapperFileSystem`, and this needs to be the only allowed way
for any filesystem-related operation.
+The only exception is when we need to interact with metadata that's still
stored under the original table path,
+and we should call `HoodieMetaClient#getRawFs` in this case so
`HoodieMetaClient` can still be the single entry
+for getting filesystem.
+
+
+
+When conducting a read operation, Hudi would:
+1. Access filesystem view, `HoodieMetadataFileSystemView` specifically
+2. Scan metadata table via filesystem view to compose `HoodieMetadataPayload`
+3. Call `HoodieMetadataPayload#getFileStatuses` and employ
`HoodieWrapperFileSystem` to get
+file statuses with physical locations
+
+This flow can be concluded in the chart below.
+
+
+
+#### Considerations
+- Path conversion happens on the fly when reading/writing files. This saves
Hudi from storing physical locations,
+and adds the cost of hashing, but the performance burden should be negligible.
+- Since table path and data path will most likely have different top-level
folders/authorities,
+`HoodieWrapperFileSystem` should maintain at least two `FileSystem` objects:
one to access table path and another
+to access storage path. `HoodieWrapperFileSystem` should intelligently tell if
it needs
+to convert the path by checking the path on the fly.
+- When using Hudi file reader/writer implementation, we will need to pass
`HoodieWrapperFileSystem` down
+to parent reader. For instance, when using `HoodieAvroHFileReader`, we will
need to pass `HoodieWrapperFileSystem`
+to `HFile.Reader` so it can have access to storage strategy. If reader/writer
doesn't take filesystem
+directly (e.g. `ParquetFileReader` only takes `Configuration` and `Path` for
reading), then we will
+need to register `HoodieWrapperFileSystem` to `Configuration` so it can be
initialized/used later.
+
+### Repair Tool
+In case of metadata table getting corrupted or lost, we need to have a
solution here to reconstruct metadata table
+from the files that are distributed using federated storage. We will need a
repair tool
+to get all the partition to files mapping by listing all the prefixes under
the `Table Storage Path`
+and then reconstruct metadata table.
+
+In Hudi we already have `HoodieBackedTableMetadataWriter` to list existing
data files to initialize/construct
+metadata table. We can extract the logic of listing files and get partition
info to a new method `getPartitionInfo`,
+and then extend `HoodieBackedTableMetadataWriter` and override
`getPartitionInfo` so
+for repair tool it can list data files stored under storage path instead of
table path.
-### Query Side Integration
+```java
+ public class StorageRepairMetadataWriter extends
SparkHoodieBackedTableMetadataWriter {
+ <T extends SpecificRecordBase> StorageRepairMetadataWriter(Configuration
hadoopConf,
+
HoodieWriteConfig writeConfig,
+
HoodieEngineContext engineContext,
+ Option<String>
inflightInstantTimestamp) {
+ super(hadoopConf, writeConfig, HoodieFailedWritesCleaningPolicy.EAGER,
engineContext, inflightInstantTimestamp);
+ }
+
+ @Override
+ protected Map<String, Map<String, Long>> getPartitionToFilesMap() {
+ return listFilesUnderStoragePath();
+ }
+ }
+```
+
+### Query Engine Side Integration
Spark, Hive,
[Presto](https://github.com/prestodb/presto/commit/ef1fd25c582631513ccdd097e0a654cda44ec3dc),
and [Trino](https://github.com/trinodb/trino/pull/10228) are already
integrated to use metadata based listing.
@@ -224,4 +288,7 @@ should not be user's responsibility to enable metadata
listing from query engine
- We need a tool to bootstrap existing Hudi table to switch to another storage
strategy.
- Partition-level storage strategy: Each partition can have its own storage
strategy for users to have
finer grasp on how data is stored. It would also make new storage strategies
more accessible for
-existing Hudi tables as they would only need to re-construct the metadata
table.
\ No newline at end of file
+existing Hudi tables as they would only need to re-construct the metadata
table.
+- For the first cut, we would only have 2 `FileSystem` objects in
`HoodieWrapperFileSystem`, and this
+prevents users from distributing their data across multiple different buckets.
We'll need to support
+this in the future.
\ No newline at end of file
diff --git a/rfc/rfc-60/wrapper_fs.png b/rfc/rfc-60/wrapper_fs.png
new file mode 100644
index 00000000000..179d41b9c29
Binary files /dev/null and b/rfc/rfc-60/wrapper_fs.png differ