Hi,

I've finally fed Taste some real data (real in terms of volume, number of 
users, and item preference distribution) and quickly hit the memory limits 
of my development laptop. :)  Now I'm trying to see what, if anything, I 
can trim from the input set (the user,item,rating triplets) to lower memory 
consumption. N.B. I don't actually have rating information - my ratings are 
all just "1.0", indicating that the item has been seen/read/consumed.

I ran one of these to see the item popularity distribution:
$ cut -d, -f2 input.txt | sort | uniq -c | sort -rn | less

And I quickly saw the expected Zipfian distribution: a big head of a few 
very popular items and a loooong tail of items that have been 
seen/read/consumed only a few times.

So here are my questions:
- Is there any point in keeping and loading very unpopular items (e.g.
the ones read only once)?  I think keeping them might help a very few
people discover very obscure items, so removing them would hurt that
small subset of users a bit, but it wouldn't affect the majority.  Is
this thinking correct?  (There's a rough pruning sketch after these
questions.)

- I'm dealing with items whose freshness counts - I don't want to 
recommend items older than N days - think news stories.  Assume I have the 
age of each item.  I could certainly remove old items, since I never want 
to recommend them, but if I remove them, won't that hurt the quality of 
recommendations, simply because I'll lose the users' "item consumption 
history"?  (See the rescorer sketch below.)
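
For the first question, here is roughly the kind of pruning I had in mind - 
a minimal two-pass sketch, assuming the input is the same comma-separated 
user,item,rating file shown above.  TailPruner and the MIN_COUNT threshold 
are just names I made up for illustration, nothing from Taste itself:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.HashMap;
import java.util.Map;

// Drops triplets whose item occurs fewer than MIN_COUNT times in the input.
public class TailPruner {

  private static final int MIN_COUNT = 2; // hypothetical cutoff; tune per data set

  public static void main(String[] args) throws IOException {
    String in = args[0];   // e.g. input.txt
    String out = args[1];  // pruned copy to feed to FileDataModel

    // First pass: count how often each item occurs.
    Map<String, Integer> counts = new HashMap<String, Integer>();
    BufferedReader reader = new BufferedReader(new FileReader(in));
    String line;
    while ((line = reader.readLine()) != null) {
      String item = line.split(",")[1];
      Integer c = counts.get(item);
      counts.put(item, c == null ? 1 : c + 1);
    }
    reader.close();

    // Second pass: keep only triplets whose item clears the threshold.
    reader = new BufferedReader(new FileReader(in));
    PrintWriter writer = new PrintWriter(new FileWriter(out));
    while ((line = reader.readLine()) != null) {
      if (counts.get(line.split(",")[1]) >= MIN_COUNT) {
        writer.println(line);
      }
    }
    writer.close();
    reader.close();
  }
}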
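
For the second question, one alternative I'm considering instead of removing 
old items from the model entirely is filtering them only at recommendation 
time, so the old preferences still feed the similarity computation.  Untested 
sketch below; I'm assuming the IDRescorer interface from recent Mahout 
(older Taste releases had the analogous Rescorer hook instead), and 
FreshnessRescorer/itemAgeDays are my own made-up names:

import java.util.Map;

import org.apache.mahout.cf.taste.recommender.IDRescorer;

// Filters items older than maxAgeDays out of the recommendation results
// without removing their preferences from the underlying data model.
public class FreshnessRescorer implements IDRescorer {

  private final Map<Long, Integer> itemAgeDays; // itemID -> age in days
  private final int maxAgeDays;

  public FreshnessRescorer(Map<Long, Integer> itemAgeDays, int maxAgeDays) {
    this.itemAgeDays = itemAgeDays;
    this.maxAgeDays = maxAgeDays;
  }

  public double rescore(long itemID, double originalScore) {
    return originalScore; // scores stay as-is; we only filter
  }

  public boolean isFiltered(long itemID) {
    Integer age = itemAgeDays.get(itemID);
    return age == null || age > maxAgeDays; // drop unknown or stale items
  }
}

At query time something like recommender.recommend(userID, 10, new 
FreshnessRescorer(ages, 7)) would then keep the full consumption history in 
the model but never surface stale items - assuming the recommend(long, int, 
IDRescorer) overload is available in the version I end up on.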

Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
