Sounds good. The map-only directly to DFS would be nice for efficient sampling of large datasets. I think this will require a special flag instructing the system to do this, but I think it's a great idea. The second suggestion is covered by JIRA 939. The third suggestion is good too, but a little more ugly. I would implement this by only allowing a single already aggregated input directory (i.e. not multiple disparately partitioned inputs). If you need to merge multiple already aggregated inputs, then oh well, you'll just have to take the extra disk hit. If you specify the IdentityMapper and you've set the inputAlreadyAggregated() bit, then it would do this efficient reduce. It could also just dynamically change the number of reducers to fit the input (and spit out a warning). Go ahead and file JIRA enhancement requests (I will, if you don't).
- Doug On 1/25/07, Arkady Borkovsky <[EMAIL PROTECTED]> wrote:
On Jan 25, 2007, at 11:58 AM, Doug Judd wrote: > Hi Arkady, > > I actually would like to see support for both map() and reduce(). > For example, if you know that the input is already sorted, > then it might be useful to have reduce() called on the aggregated > intermediate results. Fully agree. And the example bellow illustrates one of the reasons. However, there should be 3 different JIRA issues -- map should be able to write directly to DFS without sort and reduce -- when the mapper is identity, and the input is already grouped by key, the sort should be optimized to take advantage of this. Note that grouped-by-key (sorted) files is a typical, rather than exceptional case. -- it should be possible to feed grouped-by-key files directly to reduce bypassing partitioning and sorting. This can be only done when the number of reducers matches the number of input files. However, with sufficient meta-data, this can fly even when several input directories have different number of partitions -- as long as partitioning is compatible (e.g. if partitioning relies on min(j) <= key < min(j+1), rather than on (k % part_n) > For example, a "classic" usage would be to implement > a database join. You sort the first table on the Foreign key and then > do a > merging-only MapReduce with the second table (who's primary key is the > first > table's foreign key). Now you can see a joined view of the rows > inside the > reduce() function, whereas the map() function won't have all of the > pieces. > However, having map() is good too for projecting the input in different > ways. > > I'll go ahead and file a JIRA enhancement request. > > - Doug > > On 1/25/07, Arkady Borkovsky <[EMAIL PROTECTED]> wrote: >> >> "Disabling the sort" == "map without reduce" == "map writes the >> output >> into DFS" >> is indeed a very useful and desirable feature. >> File a JIRA issue. >> >> >> On Jan 24, 2007, at 5:32 PM, Doug Judd wrote: >> >> > After digging into this a bit, it looks like the use of >> > IdentityReducer does >> > not disable the sort. I wrote a simple Map/Reduce program that uses >> > /usr/share/dict/words as input and generates keys that are a Text >> > representation of the CRC of the word modulo 65536 and values that >> are >> > the >> > word itself. I set the reducer to be the IdentityReducer and the >> > output >> > came out sorted: >> > >> > 0 apperceptively >> > 0 Connarus >> > 1 overfold >> > 1 derationalization >> > 1 gymnasium >> > 10 respecting >> > 10 supperwards >> > 100 cellulofibrous >> > 100 drogherman >> > 100 heteroptics >> > 1000 bacao >> > 1000 Cumaean >> > 1000 didymate >> > 1000 disbelieving >> > 1001 polymer >> > 1001 salveline >> > 1001 workwomanly >> > 1002 sporty >> > 1002 bakal >> > 1003 preferentialist >> > >> > Also, after reviewing the Google paper, they make no mention of the >> > sort >> > being disabled by the Identity reducer. In fact, they describe >> their >> > Sort >> > implementation as using the identity reducer. >> > >> > Unless I'm missing something, I retract my previous statement. >> > Map-Reduce >> > is really just distributed sort. I do think that being able to >> > disable the >> > sort is a much needed enhancement, especially since quite a few >> > applications >> > don't need it. >> > >> > - Doug >> > >> > On 1/24/07, Andrzej Bialecki <[EMAIL PROTECTED]> wrote: >> >> >> >> Doug Judd wrote: >> >> > Part of the problem is that calling the paradigm "Map-Reduce" is >> >> somewhat >> >> > misleading. It is really just a distributed sort. The sort is >> >> where >> >> > all of >> >> > the complexity comes from. Invoking map() over the input is >> O(n), >> >> > invoking >> >> > reduce() over the intermediate results is O(n) as well. The >> sort is >> >> > O(nlogn). A more appropriate name for this algorithm would be >> >> > "Distributed >> >> > Sort with a Pre-map Phase and a Post-reduce Phase" Calling it >> >> Map-Reduce >> >> > and leaving out the word "sort" (the most important part) is a >> >> source of >> >> > confusion. >> >> > >> >> > If you think of it in these terms, I think it's easier to see >> where >> >> > and how >> >> > it applies. >> >> >> >> :) Sure, that's one point of view on this - however, in quite a few >> >> applications sort is definitely less important than the ability to >> >> split >> >> the processing load in map() and reduce() over many machines. >> >> Sometimes >> >> I don't care about the sorting at all (in all cases where >> >> IdentityReducer is used). >> >> >> >> -- >> >> Best regards, >> >> Andrzej Bialecki <>< >> >> ___. ___ ___ ___ _ _ __________________________________ >> >> [__ || __|__/|__||\/| Information Retrieval, Semantic Web >> >> ___|||__|| \| || | Embedded Unix, System Integration >> >> http://www.sigram.com Contact: info at sigram dot com >> >> >> >> >> >> >> >>
