The input data format is typically User-id \t item-id \t (other information)
From here it can be transformed into either of the formats, as they are just one map-reduce pass away. After transformation the input data set will contain lines in only one format, not both. The data format that I use has each line of the form

User-id \t (Item-id1:other_info) \t (Item-id2:other_info) ...

As for co-occurrence counting the way Ted mentioned, I wrote a map-reduce implementation of it and have found it to be pretty efficient, simple and effective too. A couple of tricks, like keeping only the top-X co-occurring items for an item by count and emitting only those item pairs that match certain criteria, have worked very well.

I would like to contribute it to Mahout and have filed a JIRA for it:
https://issues.apache.org/jira/browse/MAHOUT-103
I will have a patch coming soon.

What I am looking for is a complementary technique that does not depend so much on co-occurrences and instead does some sort of latent variable analysis to answer my query.

Thanks
-Ankur

-----Original Message-----
From: Ted Dunning [mailto:[email protected]]
Sent: Tuesday, January 20, 2009 11:33 PM
To: [email protected]
Subject: Re: RE: RE: [jira] Commented: (MAHOUT-19) Hierarchial clusterer

Surprisingly, both forms are about the same in difficulty for co-occurrence.

Firstly, both forms are a single map-reduce apart. Secondly, both forms are likely the output of a log analysis where the input form is actually more likely user/item pairs. From that form, co-occurrence counting is most easily done by reducing on user, emitting all pairs of items and then counting in the traditional way.

But with very large data sets, even before doing the actual co-occurrence, it is commonly advisable to reduce to item-major form and down-sample the users associated with the most common items. This is similar to the row and column normalization done in singular value techniques, but is applied to the original data.

Map-reduce is pretty impressive though; sampling is not necessary except for the largest data sets on the smallest clusters.

The biggest surprise I have had in using this sort of data reduction is that simply emitting all of the item pairs is pretty danged efficient. There are clever things to do to avoid so much data motion, but they save surprisingly little and are much more complex to implement (correctly).

On Tue, Jan 20, 2009 at 4:26 AM, Sean Owen <[email protected]> wrote:

> how can I tell when a line specifies the opposite, item
> followed by user IDs? the former is easier, BTW.
>

--
Ted Dunning, CTO
DeepDyve
4600 Bohannon Drive, Suite 220
Menlo Park, CA 94025
www.deepdyve.com
650-324-0110, ext. 738
858-414-0013 (m)
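
For readers following the thread, the counting scheme discussed above (group by user, emit all item pairs, count, then keep only the top-X pairs that pass a threshold) can be sketched in a few lines of single-process Python. This is only an illustration of the idea, not the MAHOUT-103 patch or any Mahout API: the function names, the `top_x` and `min_count` parameters, and the choice of a minimum count as the pair filter are all assumptions made for the example.

```python
from collections import Counter, defaultdict
from itertools import permutations


def group_items_by_user(lines):
    """Parse 'user \\t item \\t (other info)' lines into user -> set of items.

    Stands in for the map + shuffle step that keys records on user id.
    """
    items_by_user = defaultdict(set)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip malformed lines
        user_id, item_id = fields[0], fields[1]
        items_by_user[user_id].add(item_id)
    return items_by_user


def count_cooccurrences(items_by_user):
    """Reduce on user: emit every ordered item pair per user and count them."""
    pair_counts = Counter()
    for items in items_by_user.values():
        for a, b in permutations(sorted(items), 2):
            pair_counts[(a, b)] += 1
    return pair_counts


def top_cooccurring(pair_counts, top_x=10, min_count=1):
    """Keep, per item, the top_x co-occurring items with count >= min_count.

    Both parameters are hypothetical knobs for this sketch.
    """
    by_item = defaultdict(list)
    for (a, b), n in pair_counts.items():
        if n >= min_count:
            by_item[a].append((b, n))
    # Sort by descending count, breaking ties on item id for determinism.
    return {
        item: sorted(neighbours, key=lambda bn: (-bn[1], bn[0]))[:top_x]
        for item, neighbours in by_item.items()
    }
```

Calling `top_cooccurring(count_cooccurrences(group_items_by_user(lines)), top_x=5)` on an iterable of tab-separated log lines returns, for each item, its five most frequently co-occurring items. Emitting every pair is quadratic in the number of items per user, which is why down-sampling the heaviest users or items, as described above, matters on very large data sets.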
