Thanks Sean, I will try this one out on my dataset and keep the list posted on how well it worked.
Regards
-Ankur

-----Original Message-----
From: Sean Owen [mailto:[email protected]]
Sent: Friday, January 23, 2009 5:55 AM
To: [email protected]
Subject: Re: RE: RE: [jira] Commented: (MAHOUT-19) Hierarchial clusterer

Here's a DataModel you could try out for your purposes; the rest
should be as I described earlier.

package org.apache.mahout.cf.taste.example;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.model.GenericPreference;
import org.apache.mahout.cf.taste.impl.model.BooleanPrefUser;
import org.apache.mahout.cf.taste.impl.common.FastSet;
import org.apache.mahout.cf.taste.model.Preference;
import org.apache.mahout.cf.taste.model.Item;
import org.apache.mahout.cf.taste.model.User;

import java.io.File;
import java.io.IOException;
import java.util.List;
import java.util.Map;
import java.util.ArrayList;

public final class AnkursDataModel extends FileDataModel {

  public AnkursDataModel(File ratingsFile) throws IOException {
    super(ratingsFile);
  }

  @Override
  protected void processLine(String line,
                             Map<String, List<Preference>> data,
                             Map<String, Item> itemCache) {
    // tokens[0] is the user ID; each remaining token is used as an item ID
    String[] tokens = line.split("\t");
    String userID = tokens[0];
    List<Preference> prefs = new ArrayList<Preference>(tokens.length - 1);
    for (int tokenNum = 1; tokenNum < tokens.length; tokenNum++) {
      String itemID = tokens[tokenNum];
      Item item = itemCache.get(itemID);
      if (item == null) {
        item = buildItem(itemID);
        itemCache.put(itemID, item);
      }
      // this is a little ugly but makes it easy to reuse FileDataModel --
      // pref values are tossed below
      prefs.add(new GenericPreference(null, item, 1.0));
    }
    data.put(userID, prefs);
  }

  @Override
  protected User buildUser(String id, List<Preference> prefs) {
    FastSet<Object> itemIDs = new FastSet<Object>();
    for (Preference pref : prefs) {
      itemIDs.add(pref.getItem().getID());
    }
    return new BooleanPrefUser(id, itemIDs);
  }

}

On Wed, Jan 21, 2009 at 7:57 AM, Goel, Ankur <[email protected]> wrote:
> The input data format is typically
> User-id \t item-id \t (other information)
>
> From here it can be transformed into either of the formats as they are just
> 1 map-red away. After transformation the input data set will contain
> lines only in 1 format and not both. The data format that I use has each
> line of the form
>
> User-id \t (Item-id1:other_info) \t ((Item-id1:other_info))...
>
> As for co-occurrence counting the way Ted mentioned, I implemented a
> map-red implementation for the same and I have found it to be pretty
> efficient, simple and effective too.
>
> A couple of tricks, like keeping only the top-X co-occurring items for an
> item by count and emitting only those item pairs that match certain
> criteria, have worked very well.
>
> I would like to contribute it to Mahout and filed a JIRA for the same
> https://issues.apache.org/jira/browse/MAHOUT-103
>
> I will have a patch coming soon.
>
> What I am looking for is a complementary technique that does not depend
> so much on co-occurrences and tries to do some sort of latent variable
> analysis to answer my query.
>
> Thanks
> -Ankur
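
For anyone following the thread, here is a rough single-machine sketch of the co-occurrence counting with the two pruning tricks mentioned in the quoted message (keep only the top-X co-occurring items per item, and drop pairs that fail a criterion). It is not the map-red job or the MAHOUT-103 patch, neither of which is shown in this thread. The class name, the method names, the "itemID:other_info" token parsing, and the use of a minimum count as the pair criterion are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Mahout.
final class CooccurrenceSketch {

  // Assumed line format: userID \t itemID:other_info \t itemID:other_info ...
  // Returns the distinct item IDs on the line, in order of first appearance.
  static List<String> parseItems(String line) {
    String[] tokens = line.split("\t");
    Set<String> items = new LinkedHashSet<String>();
    for (int i = 1; i < tokens.length; i++) {
      String token = tokens[i].trim();
      if (token.isEmpty()) {
        continue;
      }
      int colon = token.indexOf(':');
      items.add(colon >= 0 ? token.substring(0, colon) : token);
    }
    return new ArrayList<String>(items);
  }

  // For every ordered pair (a, b) of distinct items seen for the same user,
  // counts the number of users that have both.
  static Map<String, Map<String, Integer>> countCooccurrences(List<String> lines) {
    Map<String, Map<String, Integer>> counts = new HashMap<String, Map<String, Integer>>();
    for (String line : lines) {
      List<String> items = parseItems(line);
      for (String a : items) {
        for (String b : items) {
          if (!a.equals(b)) {
            counts.computeIfAbsent(a, k -> new HashMap<String, Integer>())
                  .merge(b, 1, Integer::sum);
          }
        }
      }
    }
    return counts;
  }

  // Keeps at most topX co-occurring items for the given item, ignoring pairs
  // seen fewer than minCount times (the assumed "criterion").
  // Ordered by descending count, ties broken by item ID.
  static List<String> topCooccurring(Map<String, Map<String, Integer>> counts,
                                     String item, int topX, int minCount) {
    Map<String, Integer> row =
        counts.getOrDefault(item, Collections.<String, Integer>emptyMap());
    List<Map.Entry<String, Integer>> entries =
        new ArrayList<Map.Entry<String, Integer>>(row.entrySet());
    entries.removeIf(e -> e.getValue() < minCount);
    entries.sort((x, y) -> {
      int byCount = Integer.compare(y.getValue(), x.getValue());
      return byCount != 0 ? byCount : x.getKey().compareTo(y.getKey());
    });
    List<String> result = new ArrayList<String>();
    for (int i = 0; i < entries.size() && i < topX; i++) {
      result.add(entries.get(i).getKey());
    }
    return result;
  }
}
```

In a map-red setting the pair counting would presumably be split across a mapper that emits item pairs and a reducer that sums them, with the top-X cut applied per item afterwards; the sketch only shows the counting and pruning logic itself.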
