Hi all,

You might have heard this mentioned repeatedly on tickets when we talk about Hudi paying a "tax" at write time to keep query performance good.
These are conscious decisions we made while designing Uber's data lake for scale, and they are sometimes not appreciated when optimizing a single Spark job, for example. So I decided to write a small demo (all running on a MacBook, on about 50GB of data) to show how impactful these are. Hopefully you find it useful.

TL;DR:
- Keeping data sorted by time gives temporal queries a 2-3x speedup.
- A 20x reduction in file size can cause up to a 3-4x degradation in query performance.

https://gist.github.com/vinothchandar/5544a92e616094c049f58c152faf0a53
https://gist.github.com/vinothchandar/d7fa1338cddfae68390afcdfe310f94e

Now, is anyone interested in turning these into blog posts on hudi.apache.org? :) Referencing the right config names and showing our users how to nail this.

Thanks
Vinoth
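For intuition on the first point, here is a toy Python sketch (not Hudi or Spark code; the file layout and statistics are made up purely for illustration) of why a time-sorted layout helps: if each file tracks min/max timestamp stats, a time-range query only needs to open files whose range overlaps the query, so sorted data lets the engine skip almost everything.

```python
# Toy sketch: time-sorted vs. random file layout under min/max data skipping.
# Each "file" is represented only by its (min_ts, max_ts) statistics.
import random

random.seed(42)
N, FILE_ROWS = 100_000, 1_000
timestamps = list(range(N))

def make_files(ts):
    # Chunk rows into files of FILE_ROWS each, keeping min/max timestamp stats.
    return [(min(chunk), max(chunk)) for chunk in
            (ts[i:i + FILE_ROWS] for i in range(0, len(ts), FILE_ROWS))]

sorted_files = make_files(sorted(timestamps))
shuffled = timestamps[:]
random.shuffle(shuffled)
random_files = make_files(shuffled)

def files_scanned(files, lo, hi):
    # A file must be read only if its [min, max] range overlaps [lo, hi].
    return sum(1 for mn, mx in files if mx >= lo and mn <= hi)

lo, hi = 10_000, 12_000  # a narrow time-range query
print(files_scanned(sorted_files, lo, hi))  # prints 3: only 3 of 100 files overlap
print(files_scanned(random_files, lo, hi))  # nearly all 100 files overlap
```

With sorted data the query touches 3 of 100 files; with randomly laid-out data virtually every file contains at least one row in the range, so min/max pruning buys nothing. This is the same effect the first gist demonstrates at 50GB scale.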