Hi folks,

I am looking for a way to write skewed bucketed data without performance
degradation. The use case is simple: we want to shuffle rows by a skewed
key, colocate rows with the same key, and write them to storage. Due to the
skew, the overall performance is dominated by the slowest reducer.
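
To make the problem concrete, here is a toy sketch in Python (not Tez code;
the key distribution and the reducer count are made up for illustration) of
how one hot key lands on a single reducer under plain hash partitioning:

```python
from collections import Counter

def reducer_loads(keys, num_reducers):
    """Rows per reducer when each key is routed by hash(key) % num_reducers."""
    loads = [0] * num_reducers
    for key, rows in Counter(keys).items():
        # All rows of one key go to the same reducer (colocation).
        loads[hash(key) % num_reducers] += rows
    return loads

# Hypothetical data: one hot key (0) with 900 rows, 100 other keys with 1 row each.
keys = [0] * 900 + list(range(1, 101))
loads = reducer_loads(keys, num_reducers=4)
mean = sum(loads) / len(loads)
print(loads, "slowest/mean =", max(loads) / mean)
```

With this input, the reducer that receives the hot key processes far more
rows than the others, so adding reducers alone does not help.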

I found that TEZ-3209 <https://issues.apache.org/jira/browse/TEZ-3209> could
work for us. It is well designed and seems almost ready for our use case. In
a quick trial, it worked fine. I have two questions about this feature.

1. What real-world use cases of FairShuffleVertexManager are there?

I guess it is not widely used. The only use case I found is Twitter's. Do we
know of any others?

2. Has anyone considered using it in Hive?

Actually, the application I am developing is a highly customized Hive. I
think FairShuffleVertexManager can't fully replace ShuffleVertexManager in
Hive because TEZ-3500 <https://issues.apache.org/jira/browse/TEZ-3500> is
required for JOIN. However, it could be applicable to some vertices, such as
aggregations, window functions, and file sinks. (We would need to remove this
validation
<https://github.com/apache/tez/blob/rel/release-0.10.2/tez-runtime-library/src/main/java/org/apache/tez/dag/library/vertexmanager/FairShuffleVertexManager.java#L198-L204>
to support UNION.)

I am still new to this feature and may be missing some points. I'd
appreciate any information on FairShuffleVertexManager.

Regards,
Okumin
