[ https://issues.apache.org/jira/browse/HADOOP-485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tahir Hashmi updated HADOOP-485: -------------------------------- Attachment: 485.patch Sorry for the delayed turnaround. Had some painful issues with the build environment. Fixed javadoc warnings now. > allow a different comparator for grouping keys in calls to reduce > ----------------------------------------------------------------- > > Key: HADOOP-485 > URL: https://issues.apache.org/jira/browse/HADOOP-485 > Project: Hadoop > Issue Type: New Feature > Components: mapred > Affects Versions: 0.5.0 > Reporter: Owen O'Malley > Assigned To: Tahir Hashmi > Attachments: 485.patch, 485.patch, 485.patch, 485.patch, > Hadoop-485-pre.patch, TestUserValueGrouping.java.patch > > > Some algorithms require that the values to the reduce be sorted in a > particular order, but extending the key with the additional fields causes > them to be handled by different calls to reduce. (The user then collects the > values until they detect a "real" key change and then processes them.) > It would be much easier if the framework let you define a second comparator > that did the grouping of values for reduces. So your reduce inputs look like: > A1, V1 > A2, V2 > A3, V3 > B1, V4 > B2, V5 > instead of getting calls to reduce that look like: > reduce(A1, {V1}); reduce(A2, {V2}); reduce(A3, {V3}); reduce(B1, {V4}); > reduce(B2, {V5}); > you could define the grouping comparator to just compare the letters and end > up with: > reduce(A1, {V1,V2,V3}); reduce(B1, {V4,V5}); > which is the desired outcome. Note that this assumes that the "extra" part of > the key is just for sorting because the reduce will only see the first > representative of each equivalence class. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.