Hey, there are several points of unclarity for me in how the number of tasks gets set, and I don't think it currently runs correctly.
Let's go through some scenarios:

1. User defines no input and a number of tasks: "vanilla" Hama behaviour -> check if the number of tasks fits into the cluster and then run.

2. User defines input, but no number of tasks and no partitioner -> this should set the number of BSP tasks to whatever the input split calculated. *What if this exceeds the cluster capacity?*

3. User defines input, a number of tasks and a partitioner -> this should partition the dataset via the partitioner into <number of tasks> files and let the file input split assign the files to the tasks.

4. User defines already partitioned input (e.g. the output of a M/R job) and nothing else -> what do you think this should do?

Part 4 is the most important one I guess, because a MapReduce job partitions the data faster than our partitioner would, especially for large inputs. I've put a rough sketch of the decision logic for all four cases below.

And I don't actually know if all these steps are the right way we want it. What do you think?
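Just so we're talking about the same thing, here is a minimal sketch of how I picture the resolution. All the names here (TaskCountResolver, clusterCapacity, splitCount, ...) are made up for illustration, this is not actual Hama API:

// Hypothetical sketch of the task-count resolution, not real Hama code.
final class TaskCountResolver {

  static int resolve(boolean hasInput, Integer requestedTasks,
                     boolean hasPartitioner, int splitCount,
                     int clusterCapacity) {
    // Scenario 1: no input, explicit task count -> "vanilla" behaviour.
    if (!hasInput && requestedTasks != null) {
      if (requestedTasks > clusterCapacity) {
        throw new IllegalArgumentException(
            "Requested tasks exceed cluster capacity");
      }
      return requestedTasks;
    }
    // Scenario 2: input only -> task count follows the input splits.
    // Open question: what do we do if splitCount > clusterCapacity?
    if (hasInput && requestedTasks == null && !hasPartitioner) {
      return splitCount;
    }
    // Scenario 3: input + task count + partitioner -> the partitioner
    // rewrites the input into exactly requestedTasks files, one per task.
    if (hasInput && requestedTasks != null && hasPartitioner) {
      return requestedTasks;
    }
    // Scenario 4: pre-partitioned input (e.g. M/R output), nothing else
    // set. This is the open question of this mail; one file per task
    // would be a natural default.
    return splitCount;
  }
}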
-- 
Thomas Jungblut
Berlin <[email protected]>