TL;DR: Why do we send all empty partitions via DME if the destination only checks whether sourceIndex is empty?
While investigating TEZ-3115, I ran a job to check the reduction in memory needed for MapHost strings. Here is the job I ran (LEGACY sorter enabled):

HADOOP_CLASSPATH="$TEZ_CONF_DIR:$TEZ_HOME/*:$TEZ_HOME/lib/*" yarn jar $TEZ_HOME/tez-examples-*.jar orderedwordcount -Dtez.shuffle-vertex-manager.enable.auto-parallel=true Gutenberg2 owc2 200000

Due to auto-parallelism, the number of reducers is reduced to 20 and there is a large number of empty partitions. The DMEs cause the downstream tasks to OOM. Each DME contains 22 KB of empty partition information.

When I check ShuffleInputEventHandlerImpl.processDataMovementEvent and ShuffleInputEventHandlerOrderedGrouped.processDataMovementEvent, they deserialize the empty partitions and mark the input done if the sourceIndex is in the empty partition list.

My questions are:
1) Is the transfer of all the empty partitions necessary?
2) Is the logic correct?

Regards,
jeagles
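
For illustration, the receiver-side check described above can be sketched roughly as follows. This is not the Tez source: the class and method names are hypothetical, and java.util.BitSet stands in for whatever encoding the real event payload uses. It only shows the pattern in question, where the full set of empty partitions is shipped and decoded but a single bit is consulted.

```java
import java.util.BitSet;

public class EmptyPartitionCheck {

    // Hypothetical stand-in for the sender side:
    // one bit per partition, set when that partition is empty.
    public static byte[] encodeEmptyPartitions(BitSet emptyPartitions) {
        return emptyPartitions.toByteArray();
    }

    // Hypothetical stand-in for the receiver side:
    // decodes the whole bit set, then reads only the bit for sourceIndex.
    public static boolean isSourceEmpty(byte[] payload, int sourceIndex) {
        BitSet emptyPartitions = BitSet.valueOf(payload);
        return emptyPartitions.get(sourceIndex);
    }

    public static void main(String[] args) {
        int numPartitions = 200000;
        BitSet empty = new BitSet(numPartitions);
        empty.set(0, numPartitions); // mark every partition empty
        empty.clear(7);              // except partition 7, which has data

        byte[] payload = encodeEmptyPartitions(empty);
        System.out.println("payload bytes: " + payload.length);
        System.out.println("partition 7 empty? " + isSourceEmpty(payload, 7));
        System.out.println("partition 8 empty? " + isSourceEmpty(payload, 8));
    }
}
```

In this sketch the payload grows with the total number of partitions, while each receiver needs only one boolean from it.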
