Hi, Ted!
      Thank you for recommending the book to me. I have looked through it. It
mainly describes the infrastructure of a recommendation system. But my problem
is how to generate a contiguous set of integer indexes for
the RowSimilarityJob. If I use a hash algorithm like MD5, many of the resulting
integer indexes would be repeated. That is a serious problem when there are
millions or even billions of users and items. I did not find a concrete
solution in the book. It just says:
      The logs used the artist, track, and album IDs from the Postgres copy
of the MusicBrainz data. These IDs are, however, not suitable directly for
use with the RowSimilarityJob from Mahout since that program requires that
all IDs be converted to a contiguous set of integer indexes. In our Music
Machine recommender, this conversion was done using a Pig program that
produced two outputs. The first output from Pig is input for the
RowSimilarityJob, and the second output is a dictionary that records the
mapping from the original and the Mahout versions of the IDs.
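
For illustration, the two outputs described in that passage (the re-indexed
data and the dictionary) can be sketched on a single machine as below. This is
Python rather than the Pig program the book mentions, and it is only a sketch
under assumptions: the names `build_index` and `encode` and the sample log are
hypothetical, not taken from the book or from Mahout.

```python
def build_index(raw_ids):
    """Give each distinct raw ID the next unused integer, starting at 0.

    Because a new index is only handed out the first time an ID is seen,
    the indexes are contiguous and no two IDs can share one.
    """
    index_of = {}
    for raw in raw_ids:
        if raw not in index_of:
            index_of[raw] = len(index_of)
    return index_of


def encode(pairs, user_index, item_index):
    """Rewrite (user, item) pairs using the integer indexes."""
    return [(user_index[u], item_index[i]) for u, i in pairs]


# Hypothetical sample log of (user ID, item ID) strings.
log = [
    ("alice", "track-9f3"),
    ("bob", "track-9f3"),
    ("alice", "track-a01"),
]

user_index = build_index(u for u, _ in log)
item_index = build_index(i for _, i in log)

# First output: the data with integer indexes, the shape a job that
# needs contiguous indexes would consume.
encoded = encode(log, user_index, item_index)

# Second output: the dictionary, inverted so results can be mapped back
# to the original IDs afterwards.
original_item = {idx: raw for raw, idx in item_index.items()}
```

Unlike hashing, this cannot produce repeated indexes, but an in-memory dict
will not hold billions of IDs. At that scale the same idea (take the distinct
IDs, number them, then join the numbers back onto the log) would have to run
as a distributed job, which is presumably what the Pig program does.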

2014-10-15 5:47 GMT+08:00 Ted Dunning <[email protected]>:

> You should move forward to version 0.9.
>
> Take a look at more recent methods in this book:
>
> https://www.mapr.com/practical-machine-learning
>
>
>
> On Tue, Oct 14, 2014 at 2:43 AM, 王建国 <[email protected]> wrote:
>
> > Hi, Owen and all:
> >     I am a developer from China. I am building a recommendation system
> > based on Mahout version 0.9. Since the user IDs and item IDs are strings,
> > I need to map them to long. But I found that there is a Long-to-Int mapping
> > provided by the function "int TasteHadoopUtils.idToIndex(long)".
> > Considering there may be millions or even billions of users, I wonder if it
> > is possible for many longs to be mapped to one int? If so, that does do much
> > harm.
> > This is quite confusing. What solution should I choose in this
> > situation? Meanwhile, I read your answer, quoted below. Could you please
> > tell me
> > which data structure indexed by long you use in Myrrix? Thanks in advance.
> > wangjiangwei
> >
> > Question:
> > I have read some code about item-based recommendation in version-0.6,
> > starting from "org.apache.mahout.cf.taste.hadoop.item.RecommenderJob".
> > I found that there is a Long-to-Int mapping
> > provided by the function "int TasteHadoopUtils.idToIndex(long)".
> > Long-to-Int is performed both on userId and itemId. I wonder if it is
> > possible to have two longs mapped into one int? If that is the case, then
> > we would likely merge vectors from different itemids/uids, right? This is
> > quite confusing.
> > Is it better to provide a RandomAccessSparseVector implemented by
> > OpenLongDoubleHashMap instead of OpenIntDoubleHashMap? Thanks in advance.
> > Wei Feng
> > Answer:
> >     That's right. It ought to be uncommon but can happen. For recommenders,
> > it "only" means that you start to treat two users or two items as the same
> > thing. That doesn't do much harm though. Maybe one user's recs are a little
> > funny.
> > I do think it would have been useful to index by long, but that would have
> > significantly increased memory requirements too.
> > (In developing Myrrix I have switched to use a data structure indexed by
> > long though, because it becomes more necessary to avoid the mapping.)
> > Sean Owen
> >
>
