[ https://issues.apache.org/jira/browse/MAHOUT-317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Drew Farris updated MAHOUT-317:
-------------------------------

    Attachment: MAHOUT-317.patch

Oops. Attached is the fixed patch.

In the light of day it becomes clear that this is pretty inefficient. GramTuple isn't the right way to do this. The compound key used in the first pass doesn't need to hold the value at all, only the original key, which will be grouped upon in the reducer, and something else (a byte) to specify the secondary sort order. The compound key and group comparator should also implement binary comparable to avoid the need to unmarshal to compare.

I will work this up and submit another patch; I'm merely including this one for reference.

> Collocations: Eliminate in-memory frequency calculation
> -------------------------------------------------------
>
>                 Key: MAHOUT-317
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-317
>             Project: Mahout
>          Issue Type: Improvement
>    Affects Versions: 0.3
>            Reporter: Drew Farris
>             Fix For: 0.3
>
>         Attachments: MAHOUT-317.patch, MAHOUT-317.patch
>
>
> see: 
> http://www.lucidimagination.com/search/document/ae484d53e969250e/who_owns_mahout_bucket_on_s3
> The collocation code currently uses maps in the CollocCombiner and
> CollocReducer to perform frequency calculations, which can cause the process
> to exceed the heap space if a large number of ngrams exist for any given
> subgram.
> Convert the code to use a composite key / secondary sort to avoid the need
> for an in-memory map for frequency calculations.
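
For reference, the compound key idea from the comment above might look roughly like the sketch below. This is not the patch and not Mahout code: it is plain Java with no Hadoop dependency, and the class name CompoundKeySketch, the byte layout (original key bytes followed by one order byte), and the constants ORDER_FIRST and ORDER_SECOND are all assumptions made for illustration. In a real job the two comparators would presumably be wired in as the sort and grouping comparators, operating on the serialized key bytes.

```java
import java.nio.charset.StandardCharsets;
import java.util.Comparator;

/**
 * Illustrative only. Assumed layout of a serialized compound key:
 * [UTF-8 bytes of the original key][one byte giving the secondary sort order].
 * The value is deliberately not part of the key.
 */
final class CompoundKeySketch {
  // Hypothetical order markers; the source only says "a byte".
  static final byte ORDER_FIRST = 0;
  static final byte ORDER_SECOND = 1;

  private CompoundKeySketch() { }

  /** Serializes the original key plus the order byte into one array. */
  static byte[] encode(String originalKey, byte order) {
    byte[] primary = originalKey.getBytes(StandardCharsets.UTF_8);
    byte[] out = new byte[primary.length + 1];
    System.arraycopy(primary, 0, out, 0, primary.length);
    out[primary.length] = order;
    return out;
  }

  /** Compares only the original-key portion, directly on the raw bytes. */
  static int comparePrimary(byte[] a, byte[] b) {
    int lenA = a.length - 1; // exclude the trailing order byte
    int lenB = b.length - 1;
    int n = Math.min(lenA, lenB);
    for (int i = 0; i < n; i++) {
      int diff = (a[i] & 0xff) - (b[i] & 0xff); // unsigned byte comparison
      if (diff != 0) {
        return diff;
      }
    }
    return lenA - lenB;
  }

  /** Sort order: original key first, then the order byte (secondary sort). */
  static final Comparator<byte[]> SORT = (a, b) -> {
    int c = comparePrimary(a, b);
    return c != 0 ? c : Byte.compare(a[a.length - 1], b[b.length - 1]);
  };

  /** Grouping: keys with the same original key compare equal, whatever the order byte. */
  static final Comparator<byte[]> GROUP = CompoundKeySketch::comparePrimary;

  /** Human-readable form for debugging, e.g. "the:0". */
  static String describe(byte[] key) {
    return new String(key, 0, key.length - 1, StandardCharsets.UTF_8)
        + ":" + key[key.length - 1];
  }
}
```

Sorting with SORT places all keys for one original key next to each other, with ORDER_FIRST entries ahead of ORDER_SECOND entries, while GROUP treats them as one group. Neither comparator deserializes anything, which is the point of comparing in binary form.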