each row is : user, movie, rating, time. user,mv,rating,somedate 1,2,3,date 1,3,4,date, 2,3,4,date, 2,4,5,date, 3,5,1,date 4,4,1,date 4,4,3,date
so here for user 2, movie 4, the new column called avg disimilar rating should be 1. for user 4, movie 4, the new column should be 5. this column is a disimilar column. Dhruv -----Original Message----- From: "Matthew Dowle" [[email protected]] Date: 02/23/2012 05:22 AM To: "DS" <[email protected]> CC: [email protected] Subject: Re: converting sql to data table with subqueries and exists clauses It's an interesting problem. Thanks for keeping it onlist too. What I don't understand is the essence of the query. Are ratings 0-10? If so, if you find all the ratings more than 4 different, say there are 2 0's and 2 10's, then the average will be 5. Is that what it finds? Why find the average of the extremes? If there was great disagreement amongst users I think I'd want a measure of dispersion. So I must have misunderstood. As a movie renter myself, what I'd like to know is the average rating of movies I haven't watched, by the users who have watched the same movies as me, and, agreed with my ratings of those movies. That gives me an incentive to rate movies I have watched (because the movie ratings personal to me improve as the intersect of movies I've watched and rated intersects with other users). Looking at the code anyway, I'm confused by the very first line : # get all users for movie b<- data.table(dt[dt$itemid==a$itemid & dt$userid!=a$userid,list(userid)]) The comment doesn't seem close to what that does. The users for a movie is just : DT[,unique(userid),by=itemid] which I guess is the same as DT[,userid,by=itemid] unless, a user watches the same movie twice? If they watch it two or more times, can they rate it differently each time. So a preliminary step to filter the data to find the the last rating by user by film may be first step? Some example data would really help. Maybe 20-50 rows, 3 films and 5 users? Matthew > I took a crack at converting the following sql to data table. > > it is working off of the movielens rating data (userid,rating,itemid, > timestamp) > >> (select ave(rating) from data x >> where x.itemid=a.itemid and x.userid!=a.userid >> and exists (select 1 from >> (select * from data y where y.userid=x.userid) y, >> (select * from data z where z.userid=a.userid) z >> where y.itemid=z.itemid >> and y.itemid!=a.itemid >> and abs(y.rating-z.rating)>=4 ) >> ) pred_rating_disimilar >> from data1 a') >> > > my data table attempt is below but it takes 13 hours to run: > #create columns in dataframe to hold results > dt$meanDisimilar<-NA > dt$countDisimilar<-NA > > system.time( > for (i in 1:nrow(data1)) > { > > a<- dt[i,] > > # get all users for movie > b<- data.table(dt[dt$itemid==a$itemid & > dt$userid!=a$userid,list(userid)]) > > # get movies for users of this movie > c<-data.table(dt[dt$userid==a$userid,list(userid,itemid,rating)]) > > # get all movies for other users who rated this movie > d<- data.table(dt[dt$userid==b$userid ,list(userid,itemid,rating)]) > > setkey(c,itemid) > setkey(d,itemid) > > > e<-c[d,] > names(e)<-c("itemid", "userid0", "rating0", "userid", "rating") > > #get all users where share with thi user and where rating >=4 > f<-e[ abs(rating0-rating)>=4,list(userid)] > > # get avg rating for these users on this movie > dt[i,]$meanDisimilar <- dt[dt$userid==f$userid & > dt$itemid==a$itemid,mean(rating)] > dt[i,]$countDisimilar <- dt[dt$userid==f$userid & > dt$itemid==a$itemid,nrow(rating)] > #print(dt[dt$userid==f$userid & dt$itemid==a$itemid,mean(rating)]); > } > ) > > is there a more elegant compact way to do it. > > _______________________________________________ datatable-help mailing list [email protected] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
