Hi folks,

I am looking for a way to write skewed bucketed data without performance degradation. The case is simple: we want to shuffle rows by a skewed key, colocate rows with the same key, and write them to storage. Because of the skew, the overall performance is dominated by the slowest reducer.
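
To make the problem concrete, here is a toy sketch in plain Python (not Tez or Hive code). The key distribution, the reducer count, the helper name `partition_sizes`, and the use of CRC32 as the hash function are all assumptions chosen for illustration. It only shows that plain hash partitioning sends every row of a hot key to a single reducer, so that reducer determines the stage's finish time.

```python
from collections import Counter
import zlib


def partition_sizes(keys, num_reducers):
    """Count rows per reducer under plain hash partitioning."""
    sizes = Counter()
    for key in keys:
        # crc32 is used only as a stable, illustrative hash function.
        reducer = zlib.crc32(key.encode("utf-8")) % num_reducers
        sizes[reducer] += 1
    return [sizes[r] for r in range(num_reducers)]


# Hypothetical skewed input: one hot key holds 90% of 10,000 rows.
keys = ["hot"] * 9000 + [f"key-{i}" for i in range(1000)]
sizes = partition_sizes(keys, num_reducers=4)

# The stage finishes only when the largest partition has been processed.
slowest = max(sizes)
ideal = len(keys) / len(sizes)
print("rows per reducer:", sizes)
print("slowest / ideal:", slowest / ideal)
```

Adding more reducers does not help in this sketch, because all rows of the hot key still hash to the same partition.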

I found that TEZ-3209 <https://issues.apache.org/jira/browse/TEZ-3209> could work for us. It is very well designed and seems almost ready for our use case. When I tried it quickly, it worked fine.

I have two questions about this feature.

1. What real-world use cases of FairShuffleVertexManager do we have? I guess it is not widely used. The only one I found is Twitter's. Do we know of any others?

2. Has anyone considered using it in Hive? The application I am developing is actually a highly customized Hive. I think FairShuffleVertexManager can't fully replace ShuffleVertexManager in Hive because TEZ-3500 <https://issues.apache.org/jira/browse/TEZ-3500> is required for JOIN. However, it could be applicable to some vertices such as aggregations, window functions, and file sinks. (We would need to remove this validation <https://github.com/apache/tez/blob/rel/release-0.10.2/tez-runtime-library/src/main/java/org/apache/tez/dag/library/vertexmanager/FairShuffleVertexManager.java#L198-L204> to support UNION.)

I am still a newbie to this feature and I may be missing some points. I'd like to hear any information on FairShuffleVertexManager.

Regards,
Okumin