Hi all,

You might have heard this mentioned repeatedly across tickets, when we talk
about Hudi paying a "tax" at write time to ensure query performance
stays good.

These are conscious decisions we made when designing Uber's data lake for
scale, and sometimes they are not appreciated when optimizing
single Spark jobs, for example.

So, I decided to write a small demo (all running on a MacBook, on some 50GB
of data) to show how impactful these are. Hopefully you find it useful.

TL;DR:
- Keeping data sorted by time gives temporal queries a 2-3x speedup.
- A 20x reduction in file size can cause up to 3-4x degradation in query
performance.
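To build intuition for the first point: a toy sketch (not Hudi code, and not from the gists) of why a time-sorted layout helps. Query engines skip Parquet files/row groups using min/max column statistics; when data is sorted by time, each file covers a tight timestamp range and a temporal filter prunes most files, whereas with random layout every file spans nearly the whole time range and nothing can be skipped. All names here are made up for illustration.

```python
# Toy simulation of file-level min/max pruning for a temporal query.
# Not Hudi code -- just illustrates why time-sorted layout helps.
import random

random.seed(42)
timestamps = [random.randint(0, 1_000_000) for _ in range(100_000)]
ROWS_PER_FILE = 1_000  # rows per simulated data file

def build_files(rows):
    """Split rows into 'files' and record min/max timestamp stats per file."""
    chunks = [rows[i:i + ROWS_PER_FILE] for i in range(0, len(rows), ROWS_PER_FILE)]
    return [(min(c), max(c), c) for c in chunks]

def files_scanned(files, lo, hi):
    """Count files whose [min, max] stat range overlaps the query window,
    i.e. files the engine cannot prune."""
    return sum(1 for mn, mx, _ in files if mx >= lo and mn <= hi)

unsorted_files = build_files(timestamps)
sorted_files = build_files(sorted(timestamps))

lo, hi = 400_000, 410_000  # a ~1% time window
print("unsorted files scanned:", files_scanned(unsorted_files, lo, hi))
print("sorted files scanned:  ", files_scanned(sorted_files, lo, hi))
```

With random layout essentially every file overlaps the window, so all 100 get scanned; with sorted layout only a couple do. The second TL;DR point is the flip side: shrink files 20x and you multiply per-file open/plan overhead the same way.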

https://gist.github.com/vinothchandar/5544a92e616094c049f58c152faf0a53
https://gist.github.com/vinothchandar/d7fa1338cddfae68390afcdfe310f94e


Now, is anyone interested in turning these into blogs on hudi.apache.org?
:) We could reference the right config names and show our users how to nail
this.

Thanks
Vinoth
