Hey,

There are several things that are unclear to me about how the number of
tasks is set, and I don't think it currently works correctly.

Let's go through some scenarios (a rough code sketch of the combined logic
follows after the list):

1. User defines no input but a number of tasks: "vanilla" Hama behaviour ->
check whether the number of tasks fits into the cluster, then run.

2. User defines input, no number of tasks and no partitioner -> this should
set #bsptasks to the number of splits that were calculated. *What if this
exceeds the cluster capacity?*

3. User defines input, a number of tasks and a partitioner -> this should
partition the dataset via the partitioner into >number of tasks< files and
let the file input split assign the files to the tasks.

4. User defines already partitioned input (e.g. the output of a M/R job)
and nothing else -> what do you think this should do?
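
To make this concrete, here is a minimal sketch in Java of how the combined
logic for scenarios 1-3 could look. Every type and method name in it
(Cluster, JobSetup, resolveNumTasks, ...) is made up for this mail and is
NOT the actual Hama API:

// Rough sketch of the decision logic for scenarios 1-3.
// All type and method names are hypothetical, not the real Hama API.
public final class TaskCountSketch {

  interface Cluster { int maxTasks(); }
  interface Partitioner { void repartition(String inputPath, int numFiles); }
  interface JobSetup {
    String inputPath();        // null if the user set no input
    int numBspTasks();         // 0 if the user set no task count
    Partitioner partitioner(); // null if the user set no partitioner
    int calculatedSplits();    // what the input split calculated
  }

  static int resolveNumTasks(JobSetup job, Cluster cluster) {
    boolean hasInput = job.inputPath() != null;
    boolean hasTasks = job.numBspTasks() > 0;
    boolean hasPartitioner = job.partitioner() != null;

    if (!hasInput && hasTasks) {
      // Scenario 1: "vanilla" behaviour, just verify cluster capacity.
      checkCapacity(job.numBspTasks(), cluster);
      return job.numBspTasks();
    }
    if (hasInput && !hasTasks && !hasPartitioner) {
      // Scenario 2: #bsptasks = what the split calculated.
      int tasks = job.calculatedSplits();
      // Open question from above: what if this exceeds the capacity?
      checkCapacity(tasks, cluster);
      return tasks;
    }
    if (hasInput && hasTasks && hasPartitioner) {
      // Scenario 3: repartition into exactly #tasks files, then let
      // the file input split assign one file per task.
      job.partitioner().repartition(job.inputPath(), job.numBspTasks());
      return job.numBspTasks();
    }
    // Scenario 4 (already partitioned input) is the open question below.
    throw new IllegalStateException("unhandled combination");
  }

  static void checkCapacity(int tasks, Cluster cluster) {
    if (tasks > cluster.maxTasks()) {
      throw new IllegalStateException(tasks + " tasks requested, but the"
          + " cluster can only run " + cluster.maxTasks());
    }
  }
}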

Part 4 is the most important one, I guess, because a MapReduce job
partitions the data faster than our partitioner, especially for large
inputs; one possible interpretation is sketched below.
And I don't actually know whether all of these steps are the right way to
handle it. What do you think?
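
For part 4, one option to discuss (just an idea, not how it currently
behaves) would be: treat each file of the already partitioned directory as
one task's input, i.e. count the part-NNNNN files the M/R job wrote and
derive #bsptasks from that, skipping our partitioner entirely. A sketch
against the plain Hadoop FileSystem API; countPartFiles and the class name
are made up:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch for scenario 4: derive the task count from input that a
// M/R job already partitioned (part-00000, part-00001, ...).
public final class PrePartitionedInputSketch {

  static int countPartFiles(Configuration conf, Path inputDir)
      throws IOException {
    FileSystem fs = inputDir.getFileSystem(conf);
    int parts = 0;
    for (FileStatus file : fs.listStatus(inputDir)) {
      // each reducer wrote exactly one part-NNNNN file
      if (!file.isDir() && file.getPath().getName().startsWith("part-")) {
        parts++;
      }
    }
    return parts;
  }
}

The file input split would then again assign one file per task. Whether
that is the behaviour we want is exactly my question.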

-- 
Thomas Jungblut
Berlin <[email protected]>
