Hello all,

The clustering feature has landed on the master branch (<https://github.com/apache/hudi/pull/2263>) and is available in beta. It can be used to do the following:
1) Stitch small files into larger files
2) Change the data layout on disk by sorting data on different columns (for query/storage optimization)

If you are interested in these use cases, I would appreciate it if you could try out the feature. The commands to schedule and run clustering are in this section of the RFC, along with caveats, since the feature is still in beta: <https://cwiki.apache.org/confluence/display/HUDI/RFC+-+19+Clustering+data+for+speed+and+query+performance#RFC19Clusteringdataforspeedandqueryperformance-Commandstoscheduleandrunclustering>

Any feedback is welcome. I'm also in the #general channel on Slack, so please feel free to ping me there with any questions or comments.

Thanks,
Satish
