GitHub user yihua added a comment to the discussion: Presto/Trino support for files produced by metadata bootstrap

Hi @vamshipasunuru

Thanks for raising the feature request. Currently, Hudi support in Presto and Trino is as follows:

- The Trino Hudi connector (https://github.com/apache/hudi/tree/master/hudi-trino-plugin) supports COW and MOR snapshot read with MDT and data skipping, with the integration of the File Group Reader in Hudi 1.x.
- The Presto Hudi connector (https://github.com/prestodb/presto/tree/master/presto-hudi) supports Hudi 1.0.2 for COW and MOR snapshot read with MDT. We previously planned to upgrade the reader support to use the new File Group Reader in Hudi 1.x to make the integration easier, similar to the implementation in the Trino Hudi connector.

Bootstrap read is already supported by the new File Group Reader implementation. For the File Group Reader to work on bootstrap file groups, the connector has to pass in the right information and a reader context implementation that handles bootstrap merging.

To support bootstrap read in the Hudi connector, the following needs to be done:

1. Extend the data model to carry bootstrap base file information (path and size) through splits.
   a. Extend `HudiBaseFile` to carry the bootstrap base file path and size
   b. Update the `HudiBaseFile.of(HoodieBaseFile)` factory methods
   c. Update `HudiUtil.convertToFileSlice()`
2. Support bootstrap read in the File Group Reader integration for Presto and Trino
   a. Implement `mergeBootstrapReaders()` in `HudiTrinoReaderContext` or the Presto counterpart
   b. Verify that `HudiTrinoFileReaderFactory.newBootstrapFileReader()` works for bootstrap read
   c. Handle the `newParquetFileReader()` call for bootstrap data files
3. Handle COW with bootstrap
   a. Route bootstrap COW splits through `HoodieFileGroupReader` (instead of using `HudiBaseFileOnlyPageSource`)
   b. Check whether this has any performance impact
4. Handle bootstrap file sizes
   a. Base split sizes on the bootstrap base file size (right now the size is based on the skeleton file)
   b. Avoid splitting bootstrap base files

In general, the steps are:

- Presto: (1) upgrade the Hudi connector implementation to use the File Group Reader with Hudi 1.1/1.2, (2) bridge the gaps for bootstrap read in the connector implementation.
- Trino: bridge the gaps for bootstrap read in the connector implementation.

All of these should happen on the latest OSS releases.

@vamshipasunuru let us know if this makes sense. This requires upgrading Trino and Presto, as the feature will be implemented on top of the latest master (backporting to older releases might be possible but may take more time).

cc @vamsikarnika @voonhous @bhasudha

GitHub link: https://github.com/apache/hudi/discussions/18137#discussioncomment-15863951

----
This is an automatically sent email for [email protected].
To unsubscribe, please send an email to: [email protected]
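
For illustration only, steps 1 and 4 of the comment above could be sketched roughly as follows. This is a minimal, self-contained sketch and not the connector's code: `BaseFileInfo`, `Split`, `planSplits`, and `weightBytes` are hypothetical names standing in for the real `HudiBaseFile` and split classes, and the choice to emit one unsplit split per bootstrap file group, weighted by the bootstrap base file size, is an assumption drawn from items 4a and 4b.

```java
import java.util.ArrayList;
import java.util.List;

final class BootstrapSplitSketch {

    /** Hypothetical descriptor; stands in for a HudiBaseFile extended per step 1. */
    static final class BaseFileInfo {
        final String path;            // Hudi base file (the skeleton file for bootstrap)
        final long fileSize;          // size of that file in bytes
        final String bootstrapPath;   // original data file; null if not bootstrapped
        final long bootstrapFileSize; // size of the original data file; 0 if none

        private BaseFileInfo(String path, long fileSize, String bootstrapPath, long bootstrapFileSize) {
            this.path = path;
            this.fileSize = fileSize;
            this.bootstrapPath = bootstrapPath;
            this.bootstrapFileSize = bootstrapFileSize;
        }

        static BaseFileInfo regular(String path, long fileSize) {
            return new BaseFileInfo(path, fileSize, null, 0L);
        }

        static BaseFileInfo bootstrapped(String skeletonPath, long skeletonSize,
                                         String bootstrapPath, long bootstrapFileSize) {
            return new BaseFileInfo(skeletonPath, skeletonSize, bootstrapPath, bootstrapFileSize);
        }

        boolean isBootstrap() {
            return bootstrapPath != null;
        }
    }

    /** Hypothetical split: a byte range of the base file plus a weight used for scheduling. */
    static final class Split {
        final BaseFileInfo file;
        final long start;
        final long length;
        final long weightBytes;

        Split(BaseFileInfo file, long start, long length, long weightBytes) {
            this.file = file;
            this.start = start;
            this.length = length;
            this.weightBytes = weightBytes;
        }
    }

    /** Plans splits for one base file; bootstrap files are never divided (assumption from 4b). */
    static List<Split> planSplits(BaseFileInfo file, long targetSplitSize) {
        if (targetSplitSize <= 0) {
            throw new IllegalArgumentException("targetSplitSize must be positive");
        }
        List<Split> splits = new ArrayList<>();
        if (file.isBootstrap()) {
            // One split over the whole skeleton file, weighted by the bootstrap data file size (4a).
            splits.add(new Split(file, 0L, file.fileSize, file.bootstrapFileSize));
            return splits;
        }
        // Regular base files are cut into ranges of at most targetSplitSize bytes.
        for (long start = 0; start < file.fileSize; start += targetSplitSize) {
            long length = Math.min(targetSplitSize, file.fileSize - start);
            splits.add(new Split(file, start, length, length));
        }
        return splits;
    }

    public static void main(String[] args) {
        BaseFileInfo regular = BaseFileInfo.regular("base.parquet", 250L);
        BaseFileInfo bootstrap = BaseFileInfo.bootstrapped("skeleton.parquet", 10L, "source.parquet", 1000L);
        System.out.println("regular splits: " + planSplits(regular, 100L).size());
        System.out.println("bootstrap splits: " + planSplits(bootstrap, 100L).size());
    }
}
```

With these made-up sizes, the regular 250-byte file is cut into three ranges, while the bootstrap file group stays in a single split whose weight reflects the 1000-byte source file rather than the 10-byte skeleton. How the real connector represents split weight, and whether a row-group-aligned split of bootstrap files would be feasible, are open questions this sketch does not address.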
