@Manoj
The predict stage taking 2 parameters is what I was talking about. Are
there any other estimators that need more than a single matrix to make a
prediction? I do not recall any, so this would be something particular to
CF. Maybe you could recast the input as a matrix with alternating rows of
item,rating, but that is still particular to CF.
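
To make the single-matrix idea concrete, here is a minimal sketch, assuming
the input is packed as one (n_samples, 2) array of (user, item) index pairs
rather than the alternating-row layout. The class name
PairwiseMeanRecommender and the item-mean baseline are hypothetical
illustrations, not anything that exists in sklearn or crab.

```python
import numpy as np


class PairwiseMeanRecommender:
    """Hypothetical toy estimator (not part of scikit-learn).

    X is a single (n_samples, 2) integer array of (user, item) index
    pairs, so fit(X, y) and predict(X) keep the usual one-matrix shape.
    """

    def fit(self, X, y):
        X = np.asarray(X, dtype=int)
        y = np.asarray(y, dtype=float)
        self.global_mean_ = y.mean()
        n_items = X[:, 1].max() + 1
        sums = np.bincount(X[:, 1], weights=y, minlength=n_items)
        counts = np.bincount(X[:, 1], minlength=n_items)
        # Items without any rating fall back to the global mean.
        self.item_means_ = np.where(
            counts > 0, sums / np.maximum(counts, 1), self.global_mean_
        )
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=int)
        items = X[:, 1]
        known = items < len(self.item_means_)
        out = np.full(len(items), self.global_mean_)
        out[known] = self.item_means_[items[known]]
        return out


# Training data: (user, item) pairs and the ratings observed for them.
pairs = np.array([[0, 0], [0, 1], [1, 0], [2, 1]])
ratings = np.array([5.0, 3.0, 4.0, 1.0])

model = PairwiseMeanRecommender().fit(pairs, ratings)
# Item 5 was never rated, so its prediction is the global mean.
preds = model.predict(np.array([[2, 0], [1, 1], [0, 5]]))
```

The trade-off is that X no longer looks like a feature matrix: its columns
are indices, which is exactly the part that remains particular to CF.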

Whether that is OK as far as sklearn's API is concerned is not for me to
decide. I would also expect it to be closely tied to DictVectorizer or
something like it to get categorical labels, probably more so than most
other algorithms (though this is not a big deal IMO).
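
As a rough illustration of that tie-in, the sketch below one-hot encodes
categorical user and item labels with DictVectorizer. The records and the
"user"/"item" keys are made-up example data; how a CF estimator would
consume the result is left open.

```python
from sklearn.feature_extraction import DictVectorizer

# Made-up interaction records with string-valued (categorical) labels.
records = [
    {"user": "alice", "item": "book"},
    {"user": "alice", "item": "lamp"},
    {"user": "bob", "item": "book"},
]

vec = DictVectorizer(sparse=True)
X = vec.fit_transform(records)  # scipy.sparse matrix, one row per record

# One column per distinct "key=value" pair, in sorted order.
feature_names = vec.feature_names_
```

Each row has exactly two non-zero entries, one for the user and one for the
item, so the encoded matrix stays sparse as the number of labels grows.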

@nmuralid
I agree totally. The last number I saw was that the typical matrix for
something like Amazon is 99% sparse, though I don't remember where I read
it. Looking at crab, it seems like they are trying to build an sklearn-style
API specifically for collaborative filtering. Not sure where the name crab
comes from, but it is definitely worth looking at.
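
For a sense of what sparsity means in practice, here is a small sketch,
assuming ratings arrive as (user, item, rating) triples. The numbers are toy
values, not real data; only the stored entries count as observed ratings.

```python
import numpy as np
from scipy import sparse

# Toy (user, item, rating) triples: 4 observed ratings in a 4 x 5 matrix.
users = np.array([0, 0, 1, 3])
items = np.array([1, 4, 2, 0])
ratings = np.array([5.0, 3.0, 4.0, 2.0])

# Build in COO form, then convert to CSR for fast row (per-user) access.
R = sparse.coo_matrix((ratings, (users, items)), shape=(4, 5)).tocsr()

# Fraction of user/item cells with no stored rating.
n_cells = R.shape[0] * R.shape[1]
sparsity = 1.0 - R.nnz / n_cells
```

Here the sparsity comes out to 0.8. At something like 99%, a dense array
would spend almost all of its memory on cells that hold no rating.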

Kyle


On Thu, Jan 16, 2014 at 11:17 AM, nmura...@masonlive.gmu.edu <
nmura...@masonlive.gmu.edu> wrote:

>  I agree that sparse matrices need to be supported, as one of the main
> properties inherent to the user/item rating matrix in recommender systems
> is its sparsity. This sparsity is what has given rise to so much research
> in the field. This property has to be exploited, because otherwise the
> similarity calculations over these matrices would have complexity through
> the roof (there are ways to mitigate this, such as item-item CF techniques
> where the similarity calculation is done offline, but it is nevertheless
> still expensive).
>
>  Possible solutions, in my opinion:
>  1> Support dense and sparse matrices, but I am not sure whether such an
> implementation can be directly plugged into sklearn (because of the sparse
> matrix support).
>
>  2> Distributed recommender systems (just provide the ability for people
> to distribute their similarity calculations). This can be done using MRJob,
> a Hadoop Streaming wrapper for Python. This is also a current field of
> research, and I'm sure if you look into it you will find quite a lot of
> literature on the topic.
>
>  3> I am currently also trying to look into a library called
> scikit-crab, which was started with a similar plan, but I heard the
> developers are currently rewriting it and it might not be open to the
> community for active development at present (not sure about this though).
> I mention it in case looking at the code gives you more ideas about what
> improvements could be made.
> https://github.com/muricoca/crab
>
>   ------------------------------
> *From:* Kyle Kastner [kastnerk...@gmail.com]
> *Sent:* Wednesday, January 15, 2014 1:42 PM
> *To:* scikit-learn-general@lists.sourceforge.net
> *Subject:* Re: [Scikit-learn-general] Google Summer of Code 2014
>
>    I looked into this once upon a time, and one of the key problems (from
> talking to Jake IIRC) is how to handle the "missing values" in the input
> array. You would either need a mask, or some kind of indexing system for
> describing which value goes where in the input matrix. Either way, this
> extra argument would be a requirement for CF, but not for the existing
> algorithms in sklearn.
>
>  Maybe it would only operate on sparse arrays, and infer that the values
> which are missing are the ones to be imputed ("recommended")? But not
> supporting dense arrays would basically be the opposite of other modules in
> sklearn, where dense input is default. Maybe someone can comment on this?
>
>  I don't know how well this lines up with the existing API/functionality
> and the future directions there, but how to deal with the missing values is
> probably the primary concern for implementing CF algorithms in sklearn IMO.
>
>
> On Wed, Jan 15, 2014 at 12:07 PM, Manoj Kumar <
> manojkumarsivaraj...@gmail.com> wrote:
>
>>   Hello,
>>
>>  First of all, thanks to the scikit-learn community for guiding new
>> developers. I'm thankful for all the help that I've got with my Pull
>> Requests till now.
>>
>>  I hope that this is the right place to discuss GSoC-related ideas (I've
>> idled in the scikit-learn IRC channel on quite a few occasions, but I
>> could not meet any core developer). I was browsing through last year's
>> threads when I found this idea related to collaborative filtering (CF)
>> quite interesting,
>> http://sourceforge.net/mailarchive/message.php?msg_id=30725712 , though
>> it was sadly not accepted.
>>
>>  If the scikit-learn community is still enthusiastic about a recsys
>> module with CF algorithms implemented, I would love this to be my GSoC
>> proposal and we could discuss more about the algorithms, gelling with the
>> present sklearn API, how much we could possibly fit in a 3 month period etc.
>>
>>  Awaiting a reply.
>>
>> --
>> Regards,
>> Manoj Kumar,
>> Mech Undergrad
>> http://manojbits.wordpress.com
>>
>>
>> ------------------------------------------------------------------------------
>> CenturyLink Cloud: The Leader in Enterprise Cloud Services.
>> Learn Why More Businesses Are Choosing CenturyLink Cloud For
>> Critical Workloads, Development Environments & Everything In Between.
>> Get a Quote or Start a Free Trial Today.
>>
>> http://pubads.g.doubleclick.net/gampad/clk?id=119420431&iu=/4140/ostg.clktrk
>> _______________________________________________
>> Scikit-learn-general mailing list
>> Scikit-learn-general@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>
>>
>
>
>
>