On Sat, Dec 12, 2009 at 8:58 AM, Sean Owen <[email protected]> wrote:
> I've implemented this but it's still quite slow. Computing
> recommendations goes from a couple hundred ms to 10 seconds. Nothing
> wrong with this idea -- it's all the loading vectors and distributed
> stuff that's weighing it down.
You're not computing only one recommendation at a time, are you?

I really need to read through the hadoop.item code, but in general, what is
the procedure here? If you're doing work on HDFS as a M/R job, you're doing
a huge batch, right? Are you saying the aggregate performance is 10 seconds
per recommendation across millions of recommendations, or is this a one-shot
task? I feel like too much of this conversation went by and I missed some
crucial piece describing the task in a big-picture sense (and this notion is
backed up by the fact that we keep talking past each other when it comes to
which parts of this process are online and which are offline). Can you give
a quick review of which parts of this are supposed to run on Hadoop and
which parts are done live -- a big-picture description of what's going on?

> I think that's the culprit in fact, having to load all the column
> vectors, since they're not light.
>
> One approach is to make the user vectors more sparse by throwing out
> data, though I don't like it so much.
>
> One question -- in SparseVector, can't we internally remove entries
> when they are set to 0.0? since implicitly missing entries are 0?

We should certainly add a "compact" method to both versions of
SparseVector, which could be periodically called to remove any zeroes and
save on subsequent computational costs.

  -jake
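For concreteness, here is a rough sketch of what such a compact() could look
like. This uses a toy parallel-array sparse vector, not the actual Mahout
SparseVector internals; the class and field names below are illustrative
only.

// Illustrative toy class: a sparse vector backed by parallel index/value
// arrays, with a compact() that drops entries explicitly set to 0.0.
// The real SparseVector implementations may store entries differently.
public class ToySparseVector {

  private int[] indices;
  private double[] values;
  private int numEntries;

  public ToySparseVector(int[] indices, double[] values, int numEntries) {
    this.indices = indices;
    this.values = values;
    this.numEntries = numEntries;
  }

  /** Removes entries whose value is exactly 0.0 and trims the backing arrays. */
  public void compact() {
    int kept = 0;
    for (int i = 0; i < numEntries; i++) {
      if (values[i] != 0.0) {
        indices[kept] = indices[i];
        values[kept] = values[i];
        kept++;
      }
    }
    numEntries = kept;
    // Trim so the memory for removed entries is actually reclaimed.
    indices = java.util.Arrays.copyOf(indices, kept);
    values = java.util.Arrays.copyOf(values, kept);
  }

  public int getNumNondefaultElements() {
    return numEntries;
  }
}

Callers could invoke compact() periodically (e.g. after a batch of updates
that may have zeroed entries) so later iteration only touches genuinely
non-zero elements.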
