Re: Improved MOR spark reader

Nicolas Paris Mon, 24 Jul 2023 12:40:46 -0700

>Jon is working on new Hudi Spark integration relying on a new
>implementation of the ParquetFileFormat


Sounds good, thanks for the pointer


On July 24, 2023 5:54:55 AM UTC, Y Ethan Guo <yi...@apache.org> wrote:
>Hi Nicolas,
>
>Thanks for bringing up the discussion. Spark's MOR snapshot relation
>already provides different readers for different splits, such as
>base-file-only splits and regular splits with both base and log files.
>
>https://github.com/apache/hudi/blob/master/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/MergeOnReadSnapshotRelation.scala#L124
>https://github.com/apache/hudi/blob/master/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieMergeOnReadRDD.scala#L93
>
>Jon is working on new Hudi Spark integration relying on a new
>implementation of the ParquetFileFormat, so Spark optimizations can kick in
>for MOR; see draft RFC here: https://github.com/apache/hudi/pull/9235.
>Feel free to give feedback there.
>
>Best,
>- Ethan
>
>On Sat, Jul 22, 2023 at 1:23 PM Nicolas Paris <nicolas.pa...@riseup.net>
>wrote:
>
>> Just to clarify: the read paths described here concern RT (real-time)
>> views only, not RO (read-optimized) views.
>>
>> On July 22, 2023 8:14:09 PM UTC, Nicolas Paris <nicolas.pa...@riseup.net>
>> wrote:
>> >I have been playing with the StarRocks MOR Hudi reader recently and it
>> >does an amazing job. It has two read paths:
>> >
>> >1. For partitions with log files, use the merging logic
>> >2. For partitions with only parquet files, use the cow read logic
>> >
>> >As you know, the first path is slow because it has merging overhead and
>> >can't provide any Parquet benefits (pushdown, bloom filters...). In
>> >contrast, the second path is blazing fast.
>> >
>> >MOR comes with tons of compaction rules, and having such behavior makes
>> >hot/cold partition management possible.
>> >
>> >One particular case is GDPR, where old records are usually deleted or
>> >masked in a randomly distributed way, while new partitions are free of
>> >changes.
>> >
>> >So far Spark does not distinguish between partitions with log files and
>> >log-free partitions, and I suspect adding such an improvement would make
>> >MOR tables more performant.
>> >
>> >I would be glad to work on such a feature, so please give early feedback
>> >if there are any blockers.
>>
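
For readers following the thread, the two read paths discussed above can be
sketched roughly as follows. This is only an illustration under stated
assumptions, not Hudi's or StarRocks' actual code: FileSlice,
read_path_for_slice and read_path_by_partition are hypothetical names, and
the sketch only shows the selection rule (merge when log files exist,
plain base-file read otherwise), not the readers themselves.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class FileSlice:
    """Hypothetical stand-in for one base file plus its log files."""
    partition: str
    base_file: str
    log_files: Tuple[str, ...] = ()


def read_path_for_slice(file_slice: FileSlice) -> str:
    """Pick the read path for a single slice (split-level decision)."""
    # Log files mean records must be merged at read time.
    return "merge" if file_slice.log_files else "base_only"


def read_path_by_partition(slices: Iterable[FileSlice]) -> Dict[str, str]:
    """Pick one read path per partition (partition-level decision)."""
    paths: Dict[str, str] = {}
    for file_slice in slices:
        if read_path_for_slice(file_slice) == "merge":
            # A single slice with logs forces merging for the partition.
            paths[file_slice.partition] = "merge"
        else:
            # Keep an earlier "merge" decision if one was already made.
            paths.setdefault(file_slice.partition, "base_only")
    return paths


# Made-up partitions and file names, for illustration only.
slices = [
    FileSlice("2023/07/01", "a.parquet", (".a.log.1",)),
    FileSlice("2023/07/01", "b.parquet"),
    FileSlice("2023/07/24", "c.parquet"),
]
print(read_path_by_partition(slices))
```

The slice-level function corresponds to the per-split choice Ethan
describes, while the partition-level function corresponds to the hot/cold
partition idea in the original message.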
