The input data format is typically:
User-id \t item-id \t (other information)

From here it can be transformed into either of the formats, as they are
just one map-reduce away. After transformation the input data set will
contain lines in only one format, not both. The data format that I use
has each line of the form

User-id \t (Item-id1:other_info) \t (Item-id2:other_info)...
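
To make the transformation concrete, here is a minimal single-machine sketch in Python of the grouping step that turns the pair format into the user-major format above. The function name `to_user_major` and the choice to leave `other_info` empty when it is missing are assumptions for illustration; this is not the actual map-reduce job.

```python
from collections import defaultdict


def to_user_major(lines):
    """Group 'user \\t item \\t info' lines into one line per user.

    Hypothetical helper: mirrors what a single map-reduce pass keyed on
    user-id would produce, but runs in memory.
    """
    by_user = defaultdict(list)
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 2:
            continue  # skip malformed lines that lack an item-id
        user, item = parts[0], parts[1]
        info = parts[2] if len(parts) > 2 else ""  # assumption: info may be absent
        by_user[user].append(f"({item}:{info})")
    # One output line per user, users in sorted order for determinism.
    return ["\t".join([user] + items) for user, items in sorted(by_user.items())]
```

Swapping the roles of user and item in the same sketch gives the item-major format.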

As for co-occurrence counting the way Ted mentioned, I wrote a
map-reduce implementation of it and have found it to be pretty
efficient, simple and effective.

A couple of tricks have worked very well: keeping only the top-X
co-occurring items (by count) for each item, and emitting only those
item pairs that match certain criteria.
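
A rough Python sketch of these two tricks, applied to pair counts that have already been computed. The function name, the `top_x` parameter and the use of a minimum count as the filtering criterion are assumptions for illustration; the actual criteria are not spelled out here.

```python
import heapq
from collections import defaultdict


def prune_cooccurrences(pair_counts, top_x=2, min_count=2):
    """Keep, per item, only the top_x co-occurring items by count.

    pair_counts maps (item_a, item_b) -> co-occurrence count.
    min_count is an assumed stand-in for the pair-filtering criterion.
    """
    per_item = defaultdict(list)
    for (a, b), count in pair_counts.items():
        if count < min_count:
            continue  # drop pairs that fail the (assumed) criterion
        per_item[a].append((count, b))
        per_item[b].append((count, a))
    return {
        item: [(other, count) for count, other in heapq.nlargest(top_x, neighbours)]
        for item, neighbours in per_item.items()
    }
```

Pruning this way bounds the output per item, which keeps the result small even when a few items co-occur with almost everything.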

I would like to contribute it to Mahout and have filed a JIRA for it:
https://issues.apache.org/jira/browse/MAHOUT-103

I will have a patch coming soon.

What I am looking for is a complementary technique that does not depend
so much on co-occurrences and tries to do some sort of latent variable
analysis to answer my query.

Thanks
-Ankur

-----Original Message-----
From: Ted Dunning [mailto:[email protected]] 
Sent: Tuesday, January 20, 2009 11:33 PM
To: [email protected]
Subject: Re: RE: RE: [jira] Commented: (MAHOUT-19) Hierarchial clusterer

Surprisingly, both forms are about the same in difficulty for
co-occurrence. Firstly, both forms are a single map-reduce apart.
Secondly, both forms are likely the output of a log analysis where the
input form is actually more likely user/item pairs.  From that form,
co-occurrence counting is most easily done by reducing on user, emitting
all pairs of items and then counting in the traditional way.
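
The counting scheme just described (group by user, emit every item pair, then count) can be sketched in memory as follows. This is only an illustration in Python under the assumption that the input is an iterable of (user, item) tuples; the function name is hypothetical and a real job would run the two steps as map-reduce passes.

```python
from collections import Counter, defaultdict
from itertools import combinations


def cooccurrence_counts(user_item_pairs):
    """Count how many users each unordered item pair shares."""
    # "Reduce on user": collect the distinct items seen for each user.
    items_by_user = defaultdict(set)
    for user, item in user_item_pairs:
        items_by_user[user].add(item)

    # Emit all item pairs per user and count them.
    counts = Counter()
    for items in items_by_user.values():
        # Sorting makes (a, b) and (b, a) land on the same key.
        for a, b in combinations(sorted(items), 2):
            counts[(a, b)] += 1
    return counts
```

Note that a user with n items emits n*(n-1)/2 pairs, which is why very active users or very popular items dominate the data volume.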

But with very large data sets, even before doing the actual
co-occurrence, it is commonly advisable to reduce to item-major form and
down-sample the users associated with the most common items.  This is
similar to the row and column normalization done in singular value
techniques, but is applied to the original data.
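
A minimal sketch of that down-sampling step in Python, assuming the item-major data fits in a dict from item to its list of users. The cap `max_users`, the fixed seed and the function name are assumptions for illustration; uniform random sampling is one simple choice, not necessarily the one used in practice.

```python
import random


def downsample_item_major(users_by_item, max_users=1000, seed=0):
    """Cap the number of users kept for each item.

    Items with at most max_users users are left unchanged; more common
    items keep a uniform random sample of max_users users.
    """
    rng = random.Random(seed)  # seeded so repeated runs give the same sample
    sampled = {}
    for item, users in users_by_item.items():
        users = list(users)
        if len(users) > max_users:
            users = rng.sample(users, max_users)
        sampled[item] = users
    return sampled
```

Only the most common items are affected, so the rare items that carry most of the signal keep all of their users.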

Map-reduce is pretty impressive though; sampling is not necessary except
for the largest data sets on the smallest clusters.

The biggest surprise I have had in using this sort of data reduction is
that simply emitting all of the item pairs is pretty danged efficient.
There are clever things to do to avoid so much data motion, but they
save surprisingly little and are much more complex to implement
(correctly).

On Tue, Jan 20, 2009 at 4:26 AM, Sean Owen <[email protected]> wrote:

> how can I tell when a line specifies the opposite, item
> followed by user IDs? the former is easier, BTW.
>



-- 
Ted Dunning, CTO
DeepDyve
4600 Bohannon Drive, Suite 220
Menlo Park, CA 94025
www.deepdyve.com
650-324-0110, ext. 738
858-414-0013 (m)
