Sounds good, thanks for the pointer.

On July 24, 2023 5:54:55 AM UTC, Y Ethan Guo <yi...@apache.org> wrote:
>Hi Nicolas,
>
>Thanks for bringing up the discussion. Spark's MOR snapshot relation
>provides different readers for different splits, such as a base-file-only
>split and a regular split with base and log files.
>
>https://github.com/apache/hudi/blob/master/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/MergeOnReadSnapshotRelation.scala#L124
>https://github.com/apache/hudi/blob/master/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieMergeOnReadRDD.scala#L93
>
>Jon is working on a new Hudi Spark integration relying on a new
>implementation of the ParquetFileFormat, so Spark optimizations can kick in
>for MOR; see the draft RFC here: https://github.com/apache/hudi/pull/9235.
>Feel free to give feedback there.
>
>Best,
>- Ethan
>
>On Sat, Jul 22, 2023 at 1:23 PM Nicolas Paris <nicolas.pa...@riseup.net>
>wrote:
>
>> Just to clarify: the read path described here is only about RT views,
>> not RO.
>>
>> On July 22, 2023 8:14:09 PM UTC, Nicolas Paris <nicolas.pa...@riseup.net>
>> wrote:
>> >I have been playing with the StarRocks MOR Hudi reader recently and it
>> >does an amazing job. It has two read paths:
>> >
>> >1. For partitions with log files, use the merging logic
>> >2. For partitions with only Parquet files, use the COW read logic
>> >
>> >As you know, the first path is slow because it has merging overhead and
>> >can't provide any Parquet benefit (pushdown, blooms...). In contrast, the
>> >second path is blazing fast.
>> >
>> >MOR comes with tons of compaction rules, and having such behavior makes
>> >hot/cold partition management possible.
>> >
>> >One particular case is GDPR, where old records are usually deleted/masked
>> >with a random distribution, while new partitions are free of changes.
>> >
>> >So far Spark does not distinguish between partitions with log files and
>> >log-free partitions, and I suspect adding such an improvement would make
>> >MOR tables more performant.
>> >
>> >I would be glad to work on such a feature, so please give early feedback
>> >if there is some blocker.
>>
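
For readers following the thread, the two read paths discussed above can be sketched roughly as below. This is only an illustration under stated assumptions: `FileSlice`, `read_path_for_slice` and `read_path_per_partition` are hypothetical names made up for this sketch, not Hudi, Spark or StarRocks APIs. It shows the per-split choice (as in the Spark MOR snapshot relation) and the coarser per-partition choice (as described for the StarRocks reader).

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FileSlice:
    """Hypothetical stand-in for a MOR file slice: one base file plus zero or more log files."""
    partition: str
    base_file: str
    log_files: List[str] = field(default_factory=list)


def read_path_for_slice(file_slice: FileSlice) -> str:
    """Per-split choice: merge only when the slice actually has log files."""
    return "merge" if file_slice.log_files else "base_only"


def read_path_per_partition(slices: List[FileSlice]) -> Dict[str, str]:
    """Per-partition choice: a single slice with log files forces the merge path."""
    paths: Dict[str, str] = {}
    for s in slices:
        if read_path_for_slice(s) == "merge":
            paths[s.partition] = "merge"
        else:
            # Keep an earlier "merge" decision for this partition if there is one.
            paths.setdefault(s.partition, "base_only")
    return paths


# Made-up example data: one "hot" partition with a log file, one "cold" partition without.
slices = [
    FileSlice("2023-01", "a.parquet", [".a.log.1"]),
    FileSlice("2023-01", "b.parquet"),
    FileSlice("2023-07", "c.parquet"),
]
plan = read_path_per_partition(slices)
```

Deciding per slice rather than per partition lets log-free slices inside a "hot" partition still take the faster base-file-only path.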