Naama Kraus wrote:
..
What if the task were the following: for each course in the table,
calculate the average grade in that course. In that case both map and reduce
are required, is that correct? Map will emit a {course_name, grade} pair
for each row. Reduce will emit the average grade for each course,
{course_name, avg_grade}. Output can be put in a separate table (probably
one holding course information). Does this make sense?


That'll work.
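
A rough sketch of that job's logic, in plain Python rather than the real
Hadoop/hbase API. The row layout (dicts with "course" and "grade" fields)
and the function names map_row, reduce_course and run_job are made up for
illustration; the grouping done in run_job stands in for what the MR
framework does between map and reduce.

```python
from collections import defaultdict


def map_row(row):
    # Map: emit one (course_name, grade) pair per input row.
    yield row["course"], row["grade"]


def reduce_course(course, grades):
    # Reduce: receives every grade emitted for one course, emits the average.
    grades = list(grades)
    return course, sum(grades) / len(grades)


def run_job(rows):
    # Stand-in for the framework: group map output by key, then reduce
    # each group in key order.
    grouped = defaultdict(list)
    for row in rows:
        for course, grade in map_row(row):
            grouped[course].append(grade)
    return dict(reduce_course(c, g) for c, g in sorted(grouped.items()))


# Hypothetical sample rows; a real job would scan the table instead.
rows = [
    {"student": "s1", "course": "algebra", "grade": 90},
    {"student": "s2", "course": "algebra", "grade": 70},
    {"student": "s1", "course": "biology", "grade": 85},
]
averages = run_job(rows)
```

The reduce is needed here because no single row holds enough information
to compute a course average; the grades for a course have to be brought
together first.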

* At a higher level, I'd suggest a refactoring.  Do all of your work in
the map phase.  Have no reduce phase.  I suggest this because as is, all
rows emitted by the map are being sorted by the MR framework.  But hbase
will also do a sort on insert.  Avoid paying the price of the MR sort.  Do
your calculation in the map and then insert the result at map time.   Either
emit nothing or, emit a '1' for every row processed so the MR counters tell
a story about your MR job.*


That's an interesting point. So if both map and reduce are required, then
two sorts must take place. Is that correct?

Yes, but in your new example the two sorts serve different ends: the first
collects all the data for each course together, and the second orders the
courses lexicographically in hbase (presuming course is the primary key).
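
To make the two sorts concrete, here is a small plain-Python imitation.
It does not use the Hadoop or hbase APIs; the sample pairs and the names
map_output, shuffled and table are assumptions for illustration, and a
sorted list of (row key bytes, value) tuples stands in for an hbase table.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical map output, in whatever order the rows were read.
map_output = [("math", 80), ("art", 95), ("math", 60), ("art", 85)]

# Sort 1 (MR framework): order map output by key so that all grades
# for one course reach the same reduce call together.
shuffled = sorted(map_output, key=itemgetter(0))
grouped = {
    course: [grade for _, grade in pairs]
    for course, pairs in groupby(shuffled, key=itemgetter(0))
}
averages = {c: sum(g) / len(g) for c, g in grouped.items()}

# Sort 2 (the table): rows are kept ordered by row key bytes, whatever
# order the inserts arrive in.
inserts = [("math", averages["math"]), ("art", averages["art"])]
table = sorted((name.encode("utf-8"), avg) for name, avg in inserts)
row_keys = [key for key, _ in table]
```

The first sort exists to group values for the reduce; the second only
decides where each finished row lives in the table.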

St.Ack
