I agree that sparse matrices need to be supported, since sparsity is one of the 
main properties of the user/item rating matrix in recommender systems. This 
sparsity is what has given rise to so much research in the field. We would have 
to take advantage of it, because otherwise the similarity calculations on these 
matrices become prohibitively expensive (there are ways to mitigate this, such 
as item-item CF techniques where the similarity calculation is done offline, 
but it is nevertheless still expensive).
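
A minimal sketch of the item-item similarity calculation on a sparse rating 
matrix, assuming scipy.sparse as the storage format and cosine similarity as 
the measure (neither is fixed by this thread). The function name 
item_item_cosine and the toy ratings are hypothetical, for illustration only.

```python
import numpy as np
from scipy import sparse

# Toy user/item rating matrix: 4 users x 3 items, 0 means "not rated".
ratings = sparse.csr_matrix(np.array([
    [5.0, 0.0, 1.0],
    [4.0, 0.0, 0.0],
    [0.0, 3.0, 0.0],
    [5.0, 0.0, 2.0],
]))


def item_item_cosine(R):
    """Cosine similarity between the item columns of a sparse matrix.

    Hypothetical helper, not part of sklearn.
    """
    R = sparse.csc_matrix(R, dtype=np.float64)
    # Column norms are computed from the stored (rated) entries only.
    norms = np.sqrt(np.asarray(R.multiply(R).sum(axis=0)).ravel())
    norms[norms == 0.0] = 1.0  # avoid dividing by zero for unrated items
    # Scale every column to unit length, then take sparse dot products.
    R_unit = R @ sparse.diags(1.0 / norms, 0)
    return (R_unit.T @ R_unit).toarray()


sim = item_item_cosine(ratings)
```

Only the stored ratings take part in the products, so the cost grows with the 
number of ratings rather than with users x items. The resulting item/item 
matrix is the part that could be computed offline.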

Possible solutions, in my opinion:
1> Support both dense and sparse matrices, but I am not sure whether such an 
implementation can be plugged directly into sklearn (because of the sparse 
matrix support).

2> Distributed recommender systems (just give people the ability to 
distribute their similarity calculations). This can be done using MRJob, a 
Hadoop Streaming wrapper for Python. This is also an active field of research, 
and I'm sure that if you look into it you will find quite a lot of literature on 
the topic.

3> I am also currently looking into a library called scikit-crab, which was 
started with a similar plan. I have heard that the developers are currently 
rewriting the library, so it might not be open to the community for active 
development at present (I am not sure about this, though). I mention it because 
looking at the code might give you some more ideas about what improvements 
could be made: 
https://github.com/muricoca/crab
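
To make the distributed idea in point 2 concrete, here is a rough sketch of how 
one building block of a similarity calculation (item co-occurrence counts) 
splits into a map step and a reduce step. It is plain Python and deliberately 
does not use the MRJob API; the names mapper, reducer and run_local and the toy 
data are hypothetical, and the shuffle between the two steps is simulated in 
memory.

```python
from collections import defaultdict
from itertools import combinations


def mapper(user, rated_items):
    """Emit ((item_a, item_b), 1) for every pair of items one user rated."""
    for a, b in combinations(sorted(rated_items), 2):
        yield (a, b), 1


def reducer(pair, counts):
    """Sum the partial counts collected for one item pair."""
    return pair, sum(counts)


def run_local(user_items):
    """Simulate the shuffle step in memory: group mapper output by key."""
    grouped = defaultdict(list)
    for user, items in user_items.items():
        for key, value in mapper(user, items):
            grouped[key].append(value)
    return dict(reducer(key, values) for key, values in grouped.items())


# Hypothetical toy data: which items each user has rated.
user_items = {
    "u1": ["itemA", "itemB", "itemC"],
    "u2": ["itemA", "itemB"],
    "u3": ["itemB", "itemC"],
}
cooccurrence = run_local(user_items)
```

Each mapper call only needs one user's ratings and each reducer call only needs 
the values for one item pair, which is what makes the work distributable.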

________________________________
From: Kyle Kastner [kastnerk...@gmail.com]
Sent: Wednesday, January 15, 2014 1:42 PM
To: scikit-learn-general@lists.sourceforge.net
Subject: Re: [Scikit-learn-general] Google Summer of Code 2014

I looked into this once upon a time, and one of the key problems (from talking 
to Jake IIRC) is how to handle the "missing values" in the input array. You 
would either need a mask, or some kind of indexing system for describing which 
value goes where in the input matrix. Either way, this extra argument would be 
a requirement for CF, but not for the existing algorithms in sklearn.

Maybe it would only operate on sparse arrays, and infer that the values which 
are missing are the ones to be imputed ("recommended")? But not supporting 
dense arrays would basically be the opposite of other modules in sklearn, where 
dense input is default. Maybe someone can comment on this?
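
The two options above (an explicit mask versus a sparse input whose unstored 
entries are the ones to impute) can be sketched side by side. This is only an 
illustration under stated assumptions: NaN as the "not rated" marker in the 
dense array is an assumed convention, and nothing here is an existing sklearn 
API.

```python
import numpy as np
from scipy import sparse

# Dense ratings with NaN marking "not rated" (an assumed convention).
dense = np.array([
    [5.0, np.nan, 1.0],
    [np.nan, 3.0, np.nan],
])

# Option 1: an explicit boolean mask passed alongside the dense array.
observed = ~np.isnan(dense)

# Option 2: a sparse matrix that stores only the observed ratings.
rows, cols = np.nonzero(observed)
R = sparse.coo_matrix(
    (dense[rows, cols], (rows, cols)), shape=dense.shape
).tocsr()

# The entries to impute ("recommend") are the ones that are not stored.
stored = np.zeros(dense.shape, dtype=bool)
stored[R.nonzero()] = True
to_recommend = np.argwhere(~stored)
```

One caveat with the sparse option: R.nonzero() skips stored zeros, so a genuine 
rating of 0 would be treated as missing here unless the ratings are shifted or 
the stored indices are read directly.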

I don't know how well this lines up with the existing API/functionality and the 
future directions there, but how to deal with the missing values is probably 
the primary concern for implementing CF algorithms in sklearn IMO.


On Wed, Jan 15, 2014 at 12:07 PM, Manoj Kumar 
<manojkumarsivaraj...@gmail.com> wrote:
Hello,

First of all, thanks to the scikit-learn community for guiding new developers. 
I'm thankful for all the help that I've got with my Pull Requests till now.

I hope that this is the right place to discuss GSoC-related ideas (I've idled 
in the scikit-learn IRC channel on quite a few occasions, but I could not meet 
any core developer). I was browsing through last year's threads when I found 
this idea related to collaborative filtering (CF), which I find quite 
interesting: http://sourceforge.net/mailarchive/message.php?msg_id=30725712 . 
Sadly, it was not accepted.

If the scikit-learn community is still enthusiastic about a recsys module with 
CF algorithms implemented, I would love this to be my GSoC proposal and we 
could discuss more about the algorithms, gelling with the present sklearn API, 
how much we could possibly fit in a 3 month period etc.

Awaiting a reply.

--
Regards,
Manoj Kumar,
Mech Undergrad
http://manojbits.wordpress.com

------------------------------------------------------------------------------
CenturyLink Cloud: The Leader in Enterprise Cloud Services.
Learn Why More Businesses Are Choosing CenturyLink Cloud For
Critical Workloads, Development Environments & Everything In Between.
Get a Quote or Start a Free Trial Today.
http://pubads.g.doubleclick.net/gampad/clk?id=119420431&iu=/4140/ostg.clktrk
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general

