Hi Owen and all,
I am a developer from China. I am building a recommendation system
based on Mahout 0.9. Since the user IDs and item IDs are strings,
I need to map them to longs. But I found that there is a long-to-int
mapping provided by the function "int TasteHadoopUtils.idToIndex(long)".
Considering there may be millions or even billions of users, I wonder
whether many longs can be mapped to the same int. If true, that would do
considerable harm.
This is quite confusing. What solution should I choose in this
situation? Meanwhile, I read your answer, quoted below. Could you please
tell me which data structure indexed by long you use in Myrrix? Thanks in
advance.
wangjiangwei
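
For the string-to-long step mentioned above, one common approach is to hash
each string ID and keep a reverse map so recommendations can be translated
back. The following is only a minimal sketch under that assumption: the class
StringIdMapper and its method names are hypothetical, it does not use any
Mahout API, and it needs Java 8 or later.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper (not part of Mahout): maps string IDs to long IDs.
final class StringIdMapper {
    // Reverse lookup so a long ID can be turned back into the original string.
    private final Map<Long, String> longToString = new HashMap<>();

    // Takes the first 8 bytes of the MD5 digest of the string as a long.
    static long toLongId(String id) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(id.getBytes(StandardCharsets.UTF_8));
            long hash = 0L;
            for (int i = 0; i < 8; i++) {
                hash = (hash << 8) | (digest[i] & 0xFFL);
            }
            return hash;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Records the mapping and fails loudly if two different strings collide.
    long register(String id) {
        long longId = toLongId(id);
        String previous = longToString.putIfAbsent(longId, id);
        if (previous != null && !previous.equals(id)) {
            throw new IllegalStateException(
                    "Hash collision between " + previous + " and " + id);
        }
        return longId;
    }

    // Returns the original string, or null if the long ID was never registered.
    String toStringId(long longId) {
        return longToString.get(longId);
    }

    public static void main(String[] args) {
        StringIdMapper mapper = new StringIdMapper();
        long longId = mapper.register("user-alice");
        System.out.println(longId + " -> " + mapper.toStringId(longId));
    }
}
```

With a 64-bit hash, accidental collisions between string IDs should be very
rare even for billions of IDs, but the check in register() makes any such
case visible instead of silently merging two users or items.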
Question:
I have read some code about item-based recommendation in version-0.6,
starting from "org.apache.mahout.cf.taste.hadoop.item.RecommenderJob".
I found that there is a Long-to-Int mapping provided by the function
"int TasteHadoopUtils.idToIndex(long)".
Long-to-Int is performed both on userId and itemId. I wonder if it is
possible for two longs to be mapped to one int? If that is the case, then
we would likely merge vectors from different itemids/uids, right? This is
quite confusing.
Is it better to provide a RandomAccessSparseVector implemented by
OpenLongDoubleHashMap instead of OpenIntDoubleHashMap? Thanks in advance.
Wei Feng
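
To make the collision concern in the question concrete, here is a small
sketch. It assumes that idToIndex folds the upper 32 bits of the long into
the lower 32 bits and then clears the sign bit; that is an assumption for
illustration, so please check the TasteHadoopUtils source for the version in
use. The class and method names below are hypothetical.

```java
// Hypothetical stand-in for the long-to-int mapping, for illustration only.
final class IdToIndexCollision {
    // Assumed behaviour: XOR the two 32-bit halves, then drop the sign bit
    // so the result is a non-negative int usable as a vector index.
    static int foldToIndex(long id) {
        int hash = (int) (id ^ (id >>> 32));
        return hash & 0x7FFFFFFF;
    }

    // Birthday-style estimate of how many pairs of IDs share an index when
    // n IDs are spread uniformly over the 2^31 possible indexes.
    static double expectedCollidingPairs(long n) {
        double slots = 2147483648.0; // 2^31
        return (double) n * (n - 1) / (2.0 * slots);
    }

    public static void main(String[] args) {
        long a = 1L;
        long b = 1L << 32; // differs from a only in the upper half
        System.out.println("index of a: " + foldToIndex(a));
        System.out.println("index of b: " + foldToIndex(b));
        System.out.println("estimated colliding pairs for 1 million IDs: "
                + expectedCollidingPairs(1_000_000L));
    }
}
```

Under this assumption both example IDs land on the same index, so their
vectors would be merged. The estimate only applies to IDs that are spread
roughly uniformly (hashed IDs, for example); small sequential IDs below 2^31
would not collide at all under this scheme.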
Answer:
That's right. It ought to be uncommon but can happen. For recommenders,
it "only" means that you start to treat two users or two items as the
same thing. That doesn't do much harm though. Maybe one user's recs are
a little funny.
I do think it would have been useful to index by long, but that would have
significantly increased memory requirements too.
(In developing Myrrix I have switched to use a data structure indexed by
long though, because it becomes more necessary to avoid the mapping.)
Sean Owen
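
As an illustration of what "indexed by long" could look like, here is a
minimal sparse vector keyed by long. It is only a sketch: it is not the
structure used in Myrrix and not a Mahout class, and the name
LongKeyedSparseVector is hypothetical. It uses java.util.HashMap with boxed
keys and values for brevity, which costs noticeably more memory than a
primitive open-addressing map would.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sparse vector whose indexes are full 64-bit IDs, so no
// long-to-int mapping (and no merging of distinct IDs) is needed.
final class LongKeyedSparseVector {
    private final Map<Long, Double> values = new HashMap<>();

    // Stores a value; zeros are removed so only non-zero entries use memory.
    void set(long index, double value) {
        if (value == 0.0) {
            values.remove(index);
        } else {
            values.put(index, value);
        }
    }

    // Returns the stored value, or 0.0 for an index that was never set.
    double get(long index) {
        return values.getOrDefault(index, 0.0);
    }

    // Dot product, iterating over whichever vector has fewer entries.
    double dot(LongKeyedSparseVector other) {
        LongKeyedSparseVector small =
                values.size() <= other.values.size() ? this : other;
        LongKeyedSparseVector large = (small == this) ? other : this;
        double sum = 0.0;
        for (Map.Entry<Long, Double> e : small.values.entrySet()) {
            sum += e.getValue() * large.get(e.getKey());
        }
        return sum;
    }

    public static void main(String[] args) {
        LongKeyedSparseVector v = new LongKeyedSparseVector();
        v.set(1L, 2.0);
        v.set(1L << 32, 3.0); // stays a separate entry from index 1
        System.out.println(v.get(1L) + " " + v.get(1L << 32));
    }
}
```

The trade-off is the one described in the answer above: long keys avoid
collisions between IDs, at the price of a larger memory footprint per entry.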