The other thing to keep in mind is that an ideal solution would be compatible
with Pipeline() - it would be nice to be able to use it there, which is one of
the reasons a different signature for the predict() method is an issue.
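
To make the Pipeline() concern concrete, here is a minimal sketch, not a
proposed sklearn API. It assumes a hypothetical estimator, PairBaselineCF,
whose X is a two-column array of (user_index, item_index) pairs and whose y
holds the observed ratings, so predict() keeps the usual single-argument
signature and Pipeline can forward X unchanged. The class name, the pair
encoding, and the simple bias model are all assumptions made for illustration.

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.pipeline import Pipeline


class PairBaselineCF(RegressorMixin, BaseEstimator):
    """Hypothetical estimator: each row of X is (user_index, item_index)."""

    def fit(self, X, y):
        X = np.asarray(X, dtype=int)
        y = np.asarray(y, dtype=float)
        users, items = X[:, 0], X[:, 1]
        n_users, n_items = users.max() + 1, items.max() + 1

        # Global mean rating, then the mean residual per user.
        self.global_mean_ = y.mean()
        resid = y - self.global_mean_
        user_sum = np.bincount(users, weights=resid, minlength=n_users)
        user_cnt = np.bincount(users, minlength=n_users)
        self.user_bias_ = user_sum / np.maximum(user_cnt, 1)

        # Mean of what is left per item, after removing the user bias.
        resid = resid - self.user_bias_[users]
        item_sum = np.bincount(items, weights=resid, minlength=n_items)
        item_cnt = np.bincount(items, minlength=n_items)
        self.item_bias_ = item_sum / np.maximum(item_cnt, 1)
        return self

    def predict(self, X):
        # Only indices seen during fit are supported in this sketch.
        X = np.asarray(X, dtype=int)
        return (self.global_mean_
                + self.user_bias_[X[:, 0]]
                + self.item_bias_[X[:, 1]])


# Toy data: four observed (user, item) pairs and their ratings.
pairs = np.array([[0, 0], [0, 1], [1, 0], [2, 1]])
ratings = np.array([5.0, 3.0, 4.0, 1.0])

# predict() takes a single X, so the estimator drops into a Pipeline as-is.
model = Pipeline([("cf", PairBaselineCF())]).fit(pairs, ratings)
pred = model.predict(np.array([[1, 1], [2, 0]]))
```

The trade-off is that X no longer looks like a feature matrix, so generic
preprocessing steps placed before it in a Pipeline would not be meaningful
without knowing about the pair encoding.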

Hopefully something can be figured out, as there is a lot of interest in CF
algorithms, and a large majority of the algorithmic work (at least for the
CF algorithm I looked at) is already present in the NMF code.
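
As a rough illustration of how far the existing NMF code gets you, the sketch
below factorizes a small, made-up user/item rating matrix with
sklearn.decomposition.NMF and reconstructs it from the two factors. The
matrix values and the choice of two components are assumptions for the
example. Note the caveat: NMF as it stands treats the zeros as observed
ratings of 0 rather than as missing entries, which is exactly the gap a CF
estimator would have to close.

```python
import numpy as np
from sklearn.decomposition import NMF

# Made-up ratings: rows are users, columns are items, 0 means "not rated".
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)

# Factorize R ~ W @ H with non-negative factors.
nmf = NMF(n_components=2, init="random", random_state=0, max_iter=500)
W = nmf.fit_transform(R)   # user factors, shape (n_users, n_components)
H = nmf.components_        # item factors, shape (n_components, n_items)

# Dense reconstruction; entries where R == 0 can be read as predicted scores,
# keeping in mind that the fit also tried to reproduce those zeros.
R_hat = W @ H
```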


On Thu, Jan 16, 2014 at 1:09 PM, Manoj Kumar <manojkumarsivaraj...@gmail.com> wrote:

> Yes indeed, taking two parameters for predict would be specific to CF.
> That was the most obvious idea that came to my mind. I would also like to
> hear others' opinions regarding the API and the feasibility of such a
> project.
>
>
> On Thu, Jan 16, 2014 at 11:47 PM, Kyle Kastner <kastnerk...@gmail.com> wrote:
>
>> @Manoj
>> The predict stage taking 2 parameters is what I was talking about - are
>> there any other estimators that need anything more than a single matrix to
>> do a prediction? I do not recall any - this would be something particular
>> to CF. Maybe you could recast it as a matrix with alternating rows of
>> item, rating, but that is still particular to CF.
>>
>> Whether that is OK as far as sklearn's API is concerned is not for me to
>> decide. I would also expect it to be closely tied to DictVectorizer or
>> something like it to get categorical labels, probably more so than most
>> other algorithms (though this is not a big deal IMO).
>>
>> @nmuralid
>> I agree totally - last number I saw was that the typical matrix for
>> something like Amazon is 99% sparse? I don't remember where I read it
>> though. Looking at crab, it seems like they are trying to build a
>> sklearn-style API specifically for collaborative filtering. Not sure where
>> the name crab comes from, but it is definitely worth looking at.
>>
>> Kyle
>>
>>
>> On Thu, Jan 16, 2014 at 11:17 AM, nmura...@masonlive.gmu.edu <
>> nmura...@masonlive.gmu.edu> wrote:
>>
>>>  I agree that sparse matrices need to be supported, as one of the main
>>> properties inherent to the user/item rating matrix in recommender systems
>>> is its sparsity. This sparsity is what has given rise to such a large
>>> amount of research in the field. Hence this property would have to be
>>> taken advantage of; otherwise, since we have to deal with matrices,
>>> similarity calculations would have complexity through the roof (there are
>>> ways to mitigate this, such as item-item CF techniques where the
>>> similarity calculation is done offline, but it is nevertheless still
>>> expensive).
>>>
>>>  Possible solutions, in my opinion:
>>>    1> Support dense and sparse matrices, but I am not sure if such an
>>> implementation can be directly plugged into sklearn (because of the sparse
>>> matrix support).
>>>
>>>  2> Distributed recommender systems (just provide the ability for
>>> people to distribute their similarity calculations). This can be done using
>>> MRJob, a hadoop-streaming wrapper for python. This is also a current field
>>> of research, and I'm sure if you look into it you will find quite a lot of
>>> literature on the topic.
>>>
>>>  3> I am currently also trying to look into this library called
>>> scikit-crab which was started based upon a similar plan but I heard the
>>> developers are rewriting the library currently and it might not be open to
>>> the community for active development at present (not sure about this
>>> though). But I just mentioned it thinking maybe if you took a look at the
>>> code, you would get some more ideas about what improvements could be made.
>>> https://github.com/muricoca/crab
>>>
>>>   ------------------------------
>>> *From:* Kyle Kastner [kastnerk...@gmail.com]
>>> *Sent:* Wednesday, January 15, 2014 1:42 PM
>>> *To:* scikit-learn-general@lists.sourceforge.net
>>> *Subject:* Re: [Scikit-learn-general] Google Summer of Code 2014
>>>
>>>    I looked into this once upon a time, and one of the key problems
>>> (from talking to Jake IIRC) is how to handle the "missing values" in the
>>> input array. You would either need a mask, or some kind of indexing system
>>> for describing which value goes where in the input matrix. Either way, this
>>> extra argument would be a requirement for CF, but not for the existing
>>> algorithms in sklearn.
>>>
>>>  Maybe it would only operate on sparse arrays, and infer that the values
>>> which are missing are the ones to be imputed ("recommended")? But not
>>> supporting dense arrays would basically be the opposite of other modules in
>>> sklearn, where dense input is default. Maybe someone can comment on this?
>>>
>>>  I don't know how well this lines up with the existing API/functionality
>>> and the future directions there, but how to deal with the missing values is
>>> probably the primary concern for implementing CF algorithms in sklearn IMO.
>>>
>>>
>>> On Wed, Jan 15, 2014 at 12:07 PM, Manoj Kumar <
>>> manojkumarsivaraj...@gmail.com> wrote:
>>>
>>>>   Hello,
>>>>
>>>>  First of all, thanks to the scikit-learn community for guiding new
>>>> developers. I'm thankful for all the help that I've got with my Pull
>>>> Requests till now.
>>>>
>>>>  I hope that this is the right place to discuss GSoC related ideas
>>>> (I've idled in the scikit-learn irc channel on quite a few occasions, but
>>>> I could not meet any core developer). I was browsing through last year's
>>>> threads when I found this idea related to collaborative filtering (CF)
>>>> quite interesting,
>>>> http://sourceforge.net/mailarchive/message.php?msg_id=30725712 ,
>>>> though this was sadly not accepted.
>>>>
>>>>  If the scikit-learn community is still enthusiastic about a recsys
>>>> module with CF algorithms implemented, I would love this to be my GSoC
>>>> proposal, and we could discuss more about the algorithms, fitting in with
>>>> the present sklearn API, how much we could possibly fit in a 3 month
>>>> period, etc.
>>>>
>>>>  Awaiting a reply.
>>>>
>>>> --
>>>> Regards,
>>>> Manoj Kumar,
>>>> Mech Undergrad
>>>> http://manojbits.wordpress.com
>>>>
>>>>
>>>> ------------------------------------------------------------------------------
>>>> CenturyLink Cloud: The Leader in Enterprise Cloud Services.
>>>> Learn Why More Businesses Are Choosing CenturyLink Cloud For
>>>> Critical Workloads, Development Environments & Everything In Between.
>>>> Get a Quote or Start a Free Trial Today.
>>>>
>>>> http://pubads.g.doubleclick.net/gampad/clk?id=119420431&iu=/4140/ostg.clktrk
>>>> _______________________________________________
>>>> Scikit-learn-general mailing list
>>>> Scikit-learn-general@lists.sourceforge.net
>>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>>
>>>>
>>>
>>>
>>
>>
>
>
> --
> Regards,
> Manoj Kumar,
> Mech Undergrad
> http://manojbits.wordpress.com
>
>