Improved MOR spark reader

Nicolas Paris Sat, 22 Jul 2023 13:14:28 -0700

I have been playing with the StarRocks MOR Hudi reader recently and it does an 
amazing job. It has two read paths:


1. For partitions with log files, use the merging logic
2. For partitions with only Parquet files, use the COW read logic

As you know, the first path is slow because it has merging overhead and can't 
use any of the Parquet optimizations (pushdown, bloom filters...). In contrast, 
the second path is blazing fast.
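
To make the two-path idea concrete, here is a rough sketch of how a reader 
could pick a path per partition from a file listing. This is not StarRocks or 
Hudi code: the names is_log_file and plan_read_paths are made up for 
illustration, and the test on ".log." in the file name is an assumption about 
how log files are named.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def is_log_file(name: str) -> bool:
    # Assumption: log file names contain ".log.", base files end in ".parquet".
    return ".log." in name


def plan_read_paths(files):
    """Map each partition path to "merge" (has log files) or "cow" (base files only)."""
    has_log = defaultdict(bool)
    for f in files:
        path = PurePosixPath(f)
        partition = str(path.parent)
        # A single log file is enough to force the merging path for the partition.
        has_log[partition] = has_log[partition] or is_log_file(path.name)
    return {part: ("merge" if flag else "cow") for part, flag in has_log.items()}


if __name__ == "__main__":
    listing = [
        "dt=2023-07-01/a.parquet",
        "dt=2023-07-01/.a_001.log.1_0-1-0",
        "dt=2023-07-22/b.parquet",
    ]
    for partition, read_path in plan_read_paths(listing).items():
        print(partition, read_path)
```

With a plan like this, only the partitions marked "merge" pay the merging 
cost; the others can go through the plain Parquet read.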

MOR comes with tons of compaction rules, and having such behavior makes 
hot/cold partition management possible.

One particular case is GDPR, where old records are usually deleted or masked 
in a random distribution, while new partitions are free of changes.

So far Spark does not distinguish between partitions with log files and 
log-free partitions, and I suspect adding such an improvement would make MOR 
tables more performant.

I would be glad to work on such a feature, so please give early feedback if 
there are any blockers.
