[ 
https://issues.apache.org/jira/browse/HADOOP-4627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12647383#action_12647383
 ] 

Matei Zaharia commented on HADOOP-4627:
---------------------------------------

I'm not sure whether this is the right place for this, but I have run into 
several problems with gridmix that are not addressed by this JIRA:

1) Although gridmix contains small, medium and large jobs, all the small jobs 
run first, then the medium jobs, then the large jobs. In other words, it does 
not really test the kind of mix that arises when users submit jobs of various 
sizes at various times. I'm not sure whether this is intended, but I don't 
think it is representative of how most multi-user Hadoop clusters are used. 
One solution would be to submit the same number of jobs of each size, but in 
a randomly permuted order.

2) The small jobs in gridmix all use the same input files: parts 0, 1 and 2 of 
the data set. This creates a hotspot and leads to poor data locality. (The 
medium jobs also share the same input files, but there are 30 of those instead 
of 3, so they may have a better chance of being spread out.)

3) For at least some job sizes, gridmix does not submit jobs fast enough to 
keep the cluster fully utilized. The "sleep 10" call in sleep_if_too_busy is 
part of the problem.
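
A minimal sketch of the random permutation suggested in point 1, written in 
Python. It is illustrative only and is not taken from gridmix: the function 
name shuffled_job_order and the per-size job counts are hypothetical. The idea 
is to keep the number of jobs of each size unchanged and randomize only the 
submission order, with an optional seed so that a benchmark run can be 
repeated.

```python
import random


def shuffled_job_order(counts, seed=None):
    """Return job size labels in random order, preserving per-size counts.

    counts maps a size label (e.g. "small") to the number of jobs of that
    size. Both the labels and the counts are hypothetical examples.
    """
    # Expand the counts into one label per job, e.g. ["small", "small", ...].
    jobs = [size for size, n in counts.items() for _ in range(n)]
    # A dedicated Random instance keeps the order reproducible for a given seed.
    rng = random.Random(seed)
    rng.shuffle(jobs)
    return jobs


if __name__ == "__main__":
    # Assumed job counts, for illustration only.
    for size in shuffled_job_order({"small": 3, "medium": 2, "large": 1}, seed=42):
        print(size)
```

Each printed label could then drive whichever submission command corresponds 
to that job size.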

Do you want to fix these as part of this JIRA, or should I open another one? I 
have a fix for issue 2 that runs Ruby to select random input files for each 
job. In the long term it might be worthwhile to write other parts of the 
gridmix in a scripting language like Ruby or Python to make it easier to 
program.
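
For illustration, the random input selection described for issue 2 could look 
roughly like the following. This is a Python sketch, not the Ruby fix 
mentioned above. The function names, the total number of parts, and the 
"part-NNNNN" path layout are assumptions made for the example.

```python
import random


def pick_input_parts(total_parts, parts_per_job, rng=random):
    """Choose parts_per_job distinct part indices uniformly at random."""
    if parts_per_job > total_parts:
        raise ValueError("cannot pick more parts than the data set contains")
    # sample() draws without replacement, so a job never gets a part twice.
    return sorted(rng.sample(range(total_parts), parts_per_job))


def part_paths(base_dir, indices):
    """Build input paths for the chosen parts (the path layout is assumed)."""
    return [f"{base_dir}/part-{i:05d}" for i in indices]


if __name__ == "__main__":
    # Hypothetical data set with 100 parts; each small job reads 3 of them.
    chosen = pick_input_parts(total_parts=100, parts_per_job=3)
    print(",".join(part_paths("/gridmix/data", chosen)))
```

Drawing a fresh sample for each job spreads reads across the data set instead 
of concentrating every small job on parts 0, 1 and 2.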

> gridmix version 2
> -----------------
>
>                 Key: HADOOP-4627
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4627
>             Project: Hadoop Core
>          Issue Type: New Feature
>            Reporter: Runping Qi
>         Attachments: H-4627.txt
>
>
> The new gridmix differs from the original gridmix in the following ways:
> 1. Use an XML config file to specify the mix of job types and sizes in a 
> load. This provides finer-grained control.
> 2. Use JobControl to submit the gridmix load, instead of a shell script.
> 3. Include Pig jobs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
