GitHub user yihua added a comment to the discussion: Presto/Trino support for 
files produced by metadata bootstrap

Hi @vamshipasunuru 

Thanks for raising the feature request.

Here is the current state of Hudi support in Presto and Trino:
- The Trino Hudi connector 
(https://github.com/apache/hudi/tree/master/hudi-trino-plugin) supports COW and 
MOR snapshot read with MDT and data skipping, with the integration of the File 
Group Reader in Hudi 1.x.
- The Presto Hudi connector 
(https://github.com/prestodb/presto/tree/master/presto-hudi) supports Hudi 
1.0.2 for COW and MOR snapshot read with MDT.  We previously planned to upgrade 
the reader to use the new File Group Reader in Hudi 1.x to make the 
integration easier, similar to the implementation in the Trino Hudi connector.

Bootstrap read is already supported by the new File Group Reader 
implementation.  For the File Group Reader to work on bootstrap file groups, 
the connector must pass in the bootstrap file information and provide a reader 
context implementation that handles bootstrap merging.
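
As a rough illustration of what "bootstrap merging" means here: with metadata bootstrap, the skeleton file holds the Hudi meta columns and the bootstrap base file holds the original data columns, and the reader stitches the two together row by row. The sketch below shows that idea with plain Java iterators. `BootstrapMergeSketch` and `mergeByPosition` are hypothetical names for illustration only and are not part of the Hudi, Trino, or Presto APIs; a real reader context would work on engine-specific pages or records.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical illustration only: not the Hudi, Trino, or Presto API.
final class BootstrapMergeSketch {
    // Stitches each skeleton row (Hudi meta columns) to the data row at the
    // same position in the bootstrap base file.
    public static List<Object[]> mergeByPosition(Iterator<Object[]> skeletonRows,
                                                 Iterator<Object[]> dataRows) {
        List<Object[]> merged = new ArrayList<>();
        while (skeletonRows.hasNext() && dataRows.hasNext()) {
            Object[] meta = skeletonRows.next();
            Object[] data = dataRows.next();
            // Meta columns first, then the original data columns.
            Object[] row = Arrays.copyOf(meta, meta.length + data.length);
            System.arraycopy(data, 0, row, meta.length, data.length);
            merged.add(row);
        }
        // Assumption of this sketch: both files must contain the same number of rows.
        if (skeletonRows.hasNext() || dataRows.hasNext()) {
            throw new IllegalStateException("Skeleton and bootstrap base file row counts differ");
        }
        return merged;
    }

    public static void main(String[] args) {
        List<Object[]> skeleton = List.of(
                new Object[] {"20240101", "key1"},
                new Object[] {"20240101", "key2"});
        List<Object[]> data = List.of(
                new Object[] {1L, "a"},
                new Object[] {2L, "b"});
        for (Object[] row : mergeByPosition(skeleton.iterator(), data.iterator())) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```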

To support bootstrap read in the Hudi connector, the following needs to be done:
1. Extend the data model to carry bootstrap base file information (path and 
size) through splits.
  a. Extend `HudiBaseFile` to carry bootstrap base file path/size
  b. Update `HudiBaseFile.of(HoodieBaseFile)` factory methods
  c. Update `HudiUtil.convertToFileSlice()`
2. Support bootstrap read in the File Group Reader integration for Presto and Trino
  a. Implement `mergeBootstrapReaders()` in `HudiTrinoReaderContext` or the 
Presto counterpart
  b. Verify `HudiTrinoFileReaderFactory.newBootstrapFileReader()` works for 
bootstrap read
  c. Handle the `newParquetFileReader()` call for bootstrap data files
3. Handle COW with bootstrap
  a. Route bootstrap COW splits through `HoodieFileGroupReader` (instead of 
using `HudiBaseFileOnlyPageSource`)
  b. Check if there's any performance implication
4. Handle bootstrap file sizes
  a. Split sizes should be based on the bootstrap base file size instead (right 
now the size is based on the skeleton file)
  b. Avoid splitting bootstrap base files
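
Steps 1, 3a, and 4 above could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the connector's code: `BootstrapAwareBaseFile` and all of its fields and methods are hypothetical stand-ins for whatever `HudiBaseFile` ends up carrying, and the routing check only covers a base-file-only (COW) slice.

```java
import java.util.Optional;

// Hypothetical sketch: class, field, and method names are illustrative
// assumptions, not the connector's actual HudiBaseFile API.
final class BootstrapAwareBaseFile {
    private final String path;                     // skeleton file for bootstrap file groups
    private final long sizeInBytes;
    private final Optional<String> bootstrapPath;  // original data file, if any
    private final long bootstrapSizeInBytes;

    private BootstrapAwareBaseFile(String path, long sizeInBytes,
                                   Optional<String> bootstrapPath, long bootstrapSizeInBytes) {
        this.path = path;
        this.sizeInBytes = sizeInBytes;
        this.bootstrapPath = bootstrapPath;
        this.bootstrapSizeInBytes = bootstrapSizeInBytes;
    }

    public static BootstrapAwareBaseFile regular(String path, long sizeInBytes) {
        return new BootstrapAwareBaseFile(path, sizeInBytes, Optional.empty(), 0L);
    }

    // Step 1: carry the bootstrap base file path and size alongside the skeleton file.
    public static BootstrapAwareBaseFile bootstrapped(String skeletonPath, long skeletonSize,
                                                      String bootstrapPath, long bootstrapSize) {
        return new BootstrapAwareBaseFile(skeletonPath, skeletonSize,
                Optional.of(bootstrapPath), bootstrapSize);
    }

    public boolean isBootstrapped() {
        return bootstrapPath.isPresent();
    }

    // Step 4a: size splits from the bootstrap base file, not the skeleton file.
    public long sizeForSplitPlanning() {
        return isBootstrapped() ? bootstrapSizeInBytes : sizeInBytes;
    }

    // Step 4b: keep a bootstrap base file in a single split.
    public boolean isSplittable() {
        return !isBootstrapped();
    }

    // Step 3a: a bootstrap COW slice needs the merging reader rather than a
    // base-file-only page source.
    public boolean requiresFileGroupReader() {
        return isBootstrapped();
    }

    public static void main(String[] args) {
        BootstrapAwareBaseFile file = bootstrapped("skeleton.parquet", 4_096L,
                "source.parquet", 128L * 1024 * 1024);
        System.out.println("size for planning: " + file.sizeForSplitPlanning());
        System.out.println("splittable: " + file.isSplittable());
        System.out.println("needs file group reader: " + file.requiresFileGroupReader());
    }
}
```

Keeping the bootstrap information optional on the same object means regular base files behave as they do today, and only bootstrap file groups take the new path.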

In general, the steps are:
- Presto: (1) upgrade the Hudi connector implementation to use the File Group 
Reader with Hudi 1.1/1.2, and (2) bridge the gaps for bootstrap read in the 
connector implementation.
- Trino: bridge the gaps for bootstrap read in the connector implementation.

All of these should happen on the latest OSS releases.

@vamshipasunuru let us know if this makes sense.  Using the feature will 
require upgrading Trino and Presto, as it will be implemented on top of the 
latest master (backporting to older releases might be possible but would take 
more time).

cc @vamsikarnika @voonhous @bhasudha 

GitHub link: 
https://github.com/apache/hudi/discussions/18137#discussioncomment-15863951
