This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch branch-0.x
in repository https://gitbox.apache.org/repos/asf/hudi.git
commit e2ec366b82005b0eab533a2ca3ba6c8a0b5850cc
Author: Shawn Chang <[email protected]>
AuthorDate: Sun Mar 3 16:12:36 2024 -0800

    [HUDI-3625] Update RFC-60 (#9462)

    Co-authored-by: Shawn Chang <[email protected]>
---
 rfc/rfc-60/read_flow.png  | Bin 0 -> 176856 bytes
 rfc/rfc-60/rfc-60.md      | 99 ++++++++++++++++++++++++++++++++++++++--------
 rfc/rfc-60/wrapper_fs.png | Bin 0 -> 148392 bytes
 3 files changed, 83 insertions(+), 16 deletions(-)

diff --git a/rfc/rfc-60/read_flow.png b/rfc/rfc-60/read_flow.png
new file mode 100644
index 00000000000..4ef464f41e7
Binary files /dev/null and b/rfc/rfc-60/read_flow.png differ

diff --git a/rfc/rfc-60/rfc-60.md b/rfc/rfc-60/rfc-60.md
index d509aec1f20..bdfaa58b899 100644
--- a/rfc/rfc-60/rfc-60.md
+++ b/rfc/rfc-60/rfc-60.md
@@ -15,7 +15,7 @@ limitations under the License.
 -->
 
-# RFC-60: Federated Storage Layer
+# RFC-60: Federated Storage Layout
 
 ## Proposers
 - @umehrot2
@@ -52,7 +52,10 @@ but there can be a 30 - 60 minute wait time before new partitions are created. Thus, a large number of requests against the
 same table path prefix could result in these request limits being hit for the table prefix, especially as workloads
 scale, and there are several thousands of files being written/updated concurrently. This hurts performance due to
 re-trying of failed requests affecting throughput, and results in occasional failures if the retries are not able to
-succeed either and continue to be throttled.
+succeed either and continue to be throttled. Note that an exception would be non-partitioned tables
+residing directly under S3 buckets (i.e. using the S3 buckets as their table paths); those tables would be free
+from the throttling problem. However, this exception does not remove the need to address the throttling
+problem for partitioned tables.
 
 The traditional storage layout also tightly couples the partitions as folders under the table path.
 However, some users want flexibility to be able to distribute files/partitions under multiple different paths
 across cloud stores,
@@ -97,22 +100,21 @@ public interface HoodieStorageStrategy extends Serializable {
 }
 ```
 
-### Generating file paths for object store optimized layout
+### Generating File Paths for Object Store Optimized Layout
 
 We want to distribute files evenly across multiple random prefixes, instead of following the traditional Hive
 storage layout of keeping them under a common table path/prefix. In addition to the `Table Path`, for this new
 layout users will configure another `Table Storage Path` under which the actual data files will be distributed.
 The original `Table Path` will be used to maintain the table/partitions Hudi metadata.
-For the purpose of this documentation lets assume:
+For the purpose of this documentation let's assume:
 ```
 Table Path => s3://<table_bucket>/<hudi_table_name>/
 Table Storage Path => s3://<table_storage_bucket>/
 ```
-Note: `Table Storage Path` can be a path in the same Amazon S3 bucket or a different bucket. For best results,
-`Table Storage Path` should be a top-level bucket instead of a prefix under the bucket to avoid multiple
-tables sharing the prefix.
+For best results, `Table Storage Path` should be a top-level bucket rather than a prefix under a bucket,
+so that multiple tables do not share a prefix and cause throttling.
 
 We will use a Hashing function on the `Partition Path/File ID` to map them to a prefix generated under
 `Table Storage Path`:
 ```
@@ -148,7 +150,7 @@ s3://<table_storage_bucket>/0bfb3d6e/<hudi_table_name>/.075f3295-def8-4a42-a927-
 ...
 ```
-Note: Storage strategy would only return a storage location instead of a full path. In the above example,
+Storage strategy would only return a storage location instead of a full path.
 In the above example, the storage location is `s3://<table_storage_bucket>/0bfb3d6e/`, and the lower-level
 folder structure would be appended later automatically to get the actual file path. In other words, users would
 only be able to customize the upper-level folder structure (storage location).
@@ -176,7 +178,7 @@ The hashing function should be made user configurable for use cases like bucketing
 sub-partitioning/re-hash to reduce the number of hash prefixes. Having too many unique hash prefixes would
 make files too dispersed, and affect performance on other operations such as listing.
 
-### Maintain mapping to files
+### Maintaining Mapping to Files with Metadata Table
 
 In [RFC-15](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=147427331), we introduced an internal
 Metadata Table with a `files` partition that maintains mapping from partitions to list of files in the partition stored
@@ -196,13 +198,75 @@ for metadata table to be populated.
 
 4. If there is an error reading from the Metadata table, we will not fall back to listing from the file system.
 
-5. In case of metadata table getting corrupted or lost, we need to have a solution here to reconstruct metadata table
-from the files which distributed using federated storage. We will likely have to implement a file system listing
-logic, that can get all the partition to files mapping by listing all the prefixes under the `Table Storage Path`.
-Following the folder structure of adding table name/partitions under the prefix will help in getting the listing and
-identifying the table/partition they belong to.
+### Integration
+This section mainly describes how the storage strategy is integrated with other components and what reads/writes
+would look like from the Hudi side with the object storage layout.
+
+We propose integrating the storage strategy at the filesystem level, specifically within `HoodieWrapperFileSystem`.
+This way, only file read/write operations undergo path conversion, and we can limit the usage of the
+storage strategy to the filesystem level so that other upper-level components don't need to be aware of physical paths.
+
+This also mandates that `HoodieWrapperFileSystem` is the filesystem of choice for all upper-level Hudi components.
+Getting a filesystem from `Path` or the like won't be allowed anymore, as using a raw filesystem may not reach
+the physical locations without the storage strategy. Hudi components can simply call `HoodieMetaClient#getFs`
+to get `HoodieWrapperFileSystem`, and this needs to be the only allowed way for any filesystem-related operation.
+The only exception is when we need to interact with metadata that's still stored under the original table path;
+in this case we should call `HoodieMetaClient#getRawFs` so `HoodieMetaClient` can still be the single entry point
+for getting filesystems.
+
+![wrapper_fs](wrapper_fs.png)
+
+When conducting a read operation, Hudi would:
+1. Access the filesystem view, `HoodieMetadataFileSystemView` specifically
+2. Scan the metadata table via the filesystem view to compose `HoodieMetadataPayload`
+3. Call `HoodieMetadataPayload#getFileStatuses` and employ `HoodieWrapperFileSystem` to get
+file statuses with physical locations
+
+This flow is summarized in the chart below.
+
+![read_flow](read_flow.png)
+
+#### Considerations
+- Path conversion happens on the fly when reading/writing files. This saves Hudi from storing physical locations
+and adds the cost of hashing, but the performance burden should be negligible.
+- Since the table path and data path will most likely have different top-level folders/authorities,
+`HoodieWrapperFileSystem` should maintain at least two `FileSystem` objects: one to access the table path and another
+to access the storage path. `HoodieWrapperFileSystem` should intelligently tell whether it needs
+to convert a given path by checking it on the fly.
+- When using a Hudi file reader/writer implementation, we will need to pass `HoodieWrapperFileSystem` down
+to the underlying reader. For instance, when using `HoodieAvroHFileReader`, we will need to pass `HoodieWrapperFileSystem`
For instance, when using `HoodieAvroHFileReader`, we will need to pass `HoodieWrapperFileSystem` +to `HFile.Reader` so it can have access to storage strategy. If reader/writer doesn't take filesystem +directly (e.g. `ParquetFileReader` only takes `Configuration` and `Path` for reading), then we will +need to register `HoodieWrapperFileSystem` to `Configuration` so it can be initialized/used later. + +### Repair Tool +In case of metadata table getting corrupted or lost, we need to have a solution here to reconstruct metadata table +from the files that are distributed using federated storage. We will need a repair tool +to get all the partition to files mapping by listing all the prefixes under the `Table Storage Path` +and then reconstruct metadata table. + +In Hudi we already have `HoodieBackedTableMetadataWriter` to list existing data files to initialize/construct +metadata table. We can extract the logic of listing files and get partition info to a new method `getPartitionInfo`, +and then extend `HoodieBackedTableMetadataWriter` and override `getPartitionInfo` so +for repair tool it can list data files stored under storage path instead of table path. 
-### Query Side Integration
+```java
+  public class StorageRepairMetadataWriter extends SparkHoodieBackedTableMetadataWriter {
+    StorageRepairMetadataWriter(Configuration hadoopConf,
+                                HoodieWriteConfig writeConfig,
+                                HoodieEngineContext engineContext,
+                                Option<String> inflightInstantTimestamp) {
+      super(hadoopConf, writeConfig, HoodieFailedWritesCleaningPolicy.EAGER, engineContext, inflightInstantTimestamp);
+    }
+
+    @Override
+    protected Map<String, Map<String, Long>> getPartitionToFilesMap() {
+      // List data files under the storage path instead of the table path.
+      return listFilesUnderStoragePath();
+    }
+  }
+```
+
+### Query Engine Side Integration
 
 Spark, Hive, [Presto](https://github.com/prestodb/presto/commit/ef1fd25c582631513ccdd097e0a654cda44ec3dc), and
 [Trino](https://github.com/trinodb/trino/pull/10228) are already integrated to use metadata-based listing.
@@ -224,4 +288,7 @@ should not be the user's responsibility to enable metadata listing from query engines
 
 - We need a tool to bootstrap an existing Hudi table to switch to another storage strategy.
 - Partition-level storage strategy: Each partition can have its own storage strategy for users to have
 finer-grained control over how data is stored. It would also make new storage strategies more accessible for
-existing Hudi tables as they would only need to re-construct the metadata table.
\ No newline at end of file
+existing Hudi tables as they would only need to re-construct the metadata table.
+- For the first cut, we would only have two `FileSystem` objects in `HoodieWrapperFileSystem`, and this
+prevents users from distributing their data across multiple different buckets. We'll need to support
+this in the future.
\ No newline at end of file

diff --git a/rfc/rfc-60/wrapper_fs.png b/rfc/rfc-60/wrapper_fs.png
new file mode 100644
index 00000000000..179d41b9c29
Binary files /dev/null and b/rfc/rfc-60/wrapper_fs.png differ
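For readers, the hashed-prefix path conversion described in the rfc-60.md changes above could be sketched roughly as follows. This is an illustrative sketch under assumed names, not the actual Hudi implementation: `HashedPathConverter`, `toPhysicalPath`, and the 8-hex-character MD5 prefix are hypothetical, and in the proposal itself this logic lives inside `HoodieWrapperFileSystem` behind a user-configurable `HoodieStorageStrategy`.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Illustrative sketch only -- NOT the actual Hudi implementation. In the RFC,
// the conversion happens in HoodieWrapperFileSystem and delegates to a
// user-configurable HoodieStorageStrategy; all names here are hypothetical.
class HashedPathConverter {
  private final String tablePath;        // e.g. "s3://<table_bucket>/<hudi_table_name>/"
  private final String tableStoragePath; // e.g. "s3://<table_storage_bucket>/"

  HashedPathConverter(String tablePath, String tableStoragePath) {
    this.tablePath = tablePath;
    this.tableStoragePath = tableStoragePath;
  }

  /**
   * Converts a logical path under the table path to its physical location.
   * Paths outside the table path (e.g. metadata kept under the original
   * table path) are returned unchanged.
   */
  String toPhysicalPath(String logicalPath) {
    if (!logicalPath.startsWith(tablePath)) {
      return logicalPath;
    }
    String relative = logicalPath.substring(tablePath.length());
    int slash = relative.lastIndexOf('/');
    String partitionPath = slash > 0 ? relative.substring(0, slash) : "";
    // The storage strategy returns only the storage location (hashed prefix);
    // the lower-level folder structure is appended automatically. The RFC's
    // example also nests <hudi_table_name> under the prefix; omitted here.
    return tableStoragePath + hashPrefix(partitionPath) + "/" + relative;
  }

  // First four bytes of an MD5 digest rendered as 8 hex chars (like the
  // "0bfb3d6e" example above); the RFC makes the hash user-configurable.
  private static String hashPrefix(String key) {
    try {
      byte[] d = MessageDigest.getInstance("MD5").digest(key.getBytes(StandardCharsets.UTF_8));
      StringBuilder hex = new StringBuilder();
      for (int i = 0; i < 4; i++) {
        hex.append(String.format("%02x", d[i] & 0xff));
      }
      return hex.toString();
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException("MD5 not available", e);
    }
  }
}
```

Because the prefix is a pure function of the partition path, the conversion is deterministic: every file of a partition lands under the same hashed prefix, while different partitions spread across prefixes (and thus S3 request-rate buckets).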
