The VertexManager plugin is set on the vertex via the DAG API. Since it is a logical, user-level concept, it must be set by the user. Currently we internally set the ShuffleVertexManager plugin whenever no plugin is set and the vertex has a scatter-gather edge. This is going to change, and it will become necessary to set it explicitly.
So to turn off this behavior (say, when doing range partitioning), the ShuffleVertexManager should not be set on that vertex (you would probably set a different manager). There is a payload associated with each vertex manager when it is specified on the DAG API, so each ShuffleVertexManager can be configured differently using its own payload.

Not much is planned other than improvements to the heuristic as and when something shows up.

Hive does its own parallelism calculation using its stats during compilation. It should be enabling ARP, but AFAIK it has not yet done so.

Bikas

-----Original Message-----
From: Rohini Palaniswamy [mailto:[email protected]]
Sent: Sunday, March 16, 2014 12:10 PM
To: [email protected]
Subject: Automatic Reducer Parallelism

Hi,

I was looking at configuring ARP for Pig on Tez. My understanding of what is available currently is:

ShuffleVertexManager is the one that currently supports auto parallelism. If TEZ_AM_SHUFFLE_VERTEX_MANAGER_ENABLE_AUTO_PARALLEL is set to true, then based on TEZ_AM_SHUFFLE_VERTEX_MANAGER_DESIRED_TASK_INPUT_SIZE and TEZ_AM_SHUFFLE_VERTEX_MANAGER_MIN_TASK_PARALLELISM, parallelism is computed from the stats of some of the completed map tasks, after the slow start threshold for reducers kicks in and reducer tasks are started.

Questions:

1) Since it is an AM-level setting, it looks like it is possible to say "do not apply auto parallelism for this vertex". Is that correct? Pig has a PARALLEL clause which allows users to set parallelism for a particular operation like JOIN, GROUP BY or ORDER BY. We would like to honor that and use automatic parallelism only for operations where the user has not defined PARALLEL. Also, when there is a custom partitioner involved (like range partitioning in the case of ORDER BY), we do not want ARP to kick in. Is it possible to turn ARP on or off per vertex?

2) How is ARP used in Hive?

3) Any other things we need to know about ARP? Any new optimizations or changes planned?

Regards,
Rohini
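
For readers following the thread, the kind of calculation described in the question (deriving reducer parallelism from the output of the map tasks that have completed so far, a desired per-task input size, and a minimum parallelism) can be sketched roughly as below. This is only an illustrative approximation, not Tez's actual ShuffleVertexManager code: the class `ArpSketch`, the method `estimateParallelism`, the linear extrapolation from completed tasks, and the assumption that the result never exceeds the originally configured parallelism are all hypothetical choices made for the example.

```java
// Hypothetical sketch of an auto-reduce-parallelism heuristic.
// Not taken from Tez; names and the exact formula are assumptions.
final class ArpSketch {

    /**
     * Estimates how many reducer tasks to run.
     *
     * @param completedOutputBytes  output bytes reported by completed source tasks
     * @param completedSourceTasks  number of source (map) tasks completed so far
     * @param totalSourceTasks      total number of source (map) tasks
     * @param desiredTaskInputSize  target input bytes per reducer task
     * @param minTaskParallelism    lower bound on the number of reducer tasks
     * @param configuredParallelism parallelism the vertex was created with
     */
    static int estimateParallelism(long completedOutputBytes,
                                   int completedSourceTasks,
                                   int totalSourceTasks,
                                   long desiredTaskInputSize,
                                   int minTaskParallelism,
                                   int configuredParallelism) {
        if (completedSourceTasks <= 0 || desiredTaskInputSize <= 0) {
            throw new IllegalArgumentException(
                "need at least one completed source task and a positive desired size");
        }
        // Extrapolate the total shuffle size from the tasks seen so far.
        long expectedTotalBytes = (long) Math.ceil(
            (double) completedOutputBytes * totalSourceTasks / completedSourceTasks);

        // Ceiling division: tasks needed so that each gets about the desired input size.
        long needed = (expectedTotalBytes + desiredTaskInputSize - 1) / desiredTaskInputSize;

        // Respect the minimum, and (assumption) never go above the configured value.
        long clamped = Math.min(configuredParallelism, Math.max(minTaskParallelism, needed));
        return (int) clamped;
    }

    public static void main(String[] args) {
        // 10 of 40 map tasks done, 1 GiB of output so far, 512 MiB desired per reducer.
        int tasks = estimateParallelism(1L << 30, 10, 40, 1L << 29, 1, 20);
        System.out.println("estimated reducer tasks: " + tasks);
    }
}
```

In this sketch, a vertex whose parallelism was fixed by the user (for example via Pig's PARALLEL clause) would simply never call the estimator, which mirrors the per-vertex opt-out discussed in the reply.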
