Re: [Scikit-learn-general] Google Summer of Code 2014

Manoj Kumar Thu, 16 Jan 2014 11:16:28 -0800

Yes indeed, having predict take two parameters would be specific to CF.
That was the most obvious idea that came to my mind. I would also like to
hear others' opinions on the API and on the feasibility of such a
project.
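
To make the API question concrete, here is a minimal sketch of what a
two-argument predict could look like. Everything in it is an assumption for
illustration: the class name BaselineCF, the method predict_pairs, and the
choice of a simple mean-plus-offsets baseline are hypothetical and not part
of scikit-learn. The second method shows one way the same call could be
recast as a single (n_samples, 2) array of (user, item) index pairs, which
would keep the usual one-matrix predict signature.

```python
import numpy as np
from scipy import sparse


class BaselineCF:
    """Hypothetical estimator for illustration; not a scikit-learn class."""

    def fit(self, R):
        # R: sparse (n_users, n_items) matrix; stored entries are observed ratings.
        R = sparse.csr_matrix(R, dtype=np.float64)
        self.global_mean_ = R.data.mean()
        self.user_offset_ = self._offsets(R)            # one offset per user
        self.item_offset_ = self._offsets(R.T.tocsr())  # one offset per item
        return self

    def _offsets(self, M):
        # Mean of the stored entries in each row, minus the global mean.
        counts = np.diff(M.indptr)
        sums = np.asarray(M.sum(axis=1)).ravel()
        means = np.full(M.shape[0], self.global_mean_)
        np.divide(sums, counts, out=means, where=counts > 0)
        return means - self.global_mean_

    def predict(self, users, items):
        # Two parallel index arrays: this is the CF-specific signature.
        users = np.asarray(users)
        items = np.asarray(items)
        return self.global_mean_ + self.user_offset_[users] + self.item_offset_[items]

    def predict_pairs(self, X):
        # Same prediction from a single (n_samples, 2) array of index pairs.
        X = np.asarray(X)
        return self.predict(X[:, 0], X[:, 1])


# Toy data: 3 users, 3 items, 5 observed ratings.
rows = [0, 0, 1, 2, 2]
cols = [0, 1, 1, 0, 2]
vals = [5, 3, 4, 1, 2]
R = sparse.coo_matrix((vals, (rows, cols)), shape=(3, 3))

model = BaselineCF().fit(R)
two_arg = model.predict([1, 2], [0, 1])
one_arg = model.predict_pairs([[1, 0], [2, 1]])
```

Both calls return the same predictions; the only difference is whether the
user and item indices arrive as two arguments or as two columns of one array.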



On Thu, Jan 16, 2014 at 11:47 PM, Kyle Kastner <kastnerk...@gmail.com> wrote:

> @Manoj
> The predict stage taking 2 parameters is what I was talking about - are
> there any other estimators that need anything more than a single matrix to
> do a prediction? I do not recall any - this would be something particular
> to CF. Maybe you could recast it as a matrix with alternating rows of
> item,rating, but that would still be particular to CF.
>
> Whether that is OK as far as sklearn's API is concerned is not for me to
> decide. I would also expect it to be closely tied to DictVectorizer or
> something like it, probably more so than most other algorithms (though this
> is not a big deal IMO), to get categorical labels.
>
> @nmuralid
> I agree totally - the last number I saw was that the typical matrix for
> something like Amazon is 99% sparse, though I don't remember where I read
> it. Looking at crab, it seems like they are trying to do an sklearn-style
> API specifically for collaborative filtering. Not sure where the name crab
> comes from, but it is definitely worth looking at.
>
> Kyle
>
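As a rough illustration of the DictVectorizer remark above, the sketch below
turns per-user dictionaries of ratings into a sparse user/item matrix. The
user and item names are made up, and treating each user as one "sample" whose
features are item ratings is an assumption about how the pieces might fit
together, not an established scikit-learn recipe.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical ratings keyed by categorical user and item labels.
ratings_by_user = {
    "alice": {"item_a": 5.0, "item_b": 3.0},
    "bob": {"item_b": 4.0},
    "carol": {"item_a": 1.0, "item_c": 2.0},
}

users = sorted(ratings_by_user)  # fixes the row order
vec = DictVectorizer(sparse=True)
# One dict per user; numeric values become the matrix entries.
R = vec.fit_transform([ratings_by_user[u] for u in users])

# vocabulary_ maps each item label to its column index.
item_index = vec.vocabulary_
alice_item_a = R[users.index("alice"), item_index["item_a"]]
```

Items a user has not rated are simply absent from that user's dictionary, so
they are not stored in the resulting sparse matrix.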
>
> On Thu, Jan 16, 2014 at 11:17 AM, nmura...@masonlive.gmu.edu <
> nmura...@masonlive.gmu.edu> wrote:
>
>>  I agree that sparse matrices need to be supported, as sparsity is one of
>> the main properties inherent to the user/item rating matrix in recommender
>> systems. This sparsity is what has given rise to such a large body of
>> research in the field. It has to be taken advantage of: otherwise, since
>> we have to deal with matrices, the similarity calculations would have
>> complexity through the roof (there are ways to mitigate this, such as
>> item-item CF techniques where the similarity calculation is done offline,
>> but it is nevertheless still expensive).
>>
>>  Possible solutions, in my opinion:
>>
>>  1> Support both dense and sparse matrices, but I am not sure whether such
>> an implementation can be directly plugged into sklearn (because of the
>> sparse matrix support).
>>
>>  2> Distributed recommender systems (just provide the ability for people
>> to distribute their similarity calculations). This can be done using MRJob,
>> a Hadoop Streaming wrapper for Python. This is also a current field of
>> research, and I'm sure that if you look into it you will find quite a lot
>> of literature on the topic.
>>
>>  3> I am also currently looking into a library called scikit-crab, which
>> was started based on a similar plan, but I heard the developers are
>> currently rewriting the library and it might not be open to the community
>> for active development at present (not sure about this, though). I just
>> mention it thinking that if you took a look at the code, you might get
>> some more ideas about what improvements could be made.
>> https://github.com/muricoca/crab
>>
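The offline item-item similarity step mentioned above can be sketched with
scipy.sparse so that only observed ratings take part in the arithmetic. This
is a minimal sketch under stated assumptions: the toy data is invented, cosine
similarity is just one possible choice of measure, and unobserved ratings are
treated as zeros, which is a simplification.

```python
import numpy as np
from scipy import sparse

# Toy user/item ratings: 3 users, 3 items, 5 observed entries.
rows = np.array([0, 0, 1, 2, 2])
cols = np.array([0, 1, 1, 0, 2])
vals = np.array([5.0, 3.0, 4.0, 1.0, 2.0])
R = sparse.csr_matrix((vals, (rows, cols)), shape=(3, 3))

# Euclidean norm of each item column, computed from stored entries only.
norms = np.sqrt(np.asarray(R.multiply(R).sum(axis=0)).ravel())
norms[norms == 0] = 1.0  # avoid dividing by zero for items with no ratings

# Scale each column to unit length, then take all column dot products.
R_unit = R @ sparse.diags(1.0 / norms)
item_sim = (R_unit.T @ R_unit).tocsr()  # sparse (n_items, n_items) cosine matrix
```

Pairs of items with no user in common never produce a stored entry, so the
similarity matrix stays sparse as well.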
>>   ------------------------------
>> *From:* Kyle Kastner [kastnerk...@gmail.com]
>> *Sent:* Wednesday, January 15, 2014 1:42 PM
>> *To:* scikit-learn-general@lists.sourceforge.net
>> *Subject:* Re: [Scikit-learn-general] Google Summer of Code 2014
>>
>>    I looked into this once upon a time, and one of the key problems
>> (from talking to Jake IIRC) is how to handle the "missing values" in the
>> input array. You would either need a mask, or some kind of indexing system
>> for describing which value goes where in the input matrix. Either way, this
>> extra argument would be a requirement for CF, but not for the existing
>> algorithms in sklearn.
>>
>>  Maybe it would only operate on sparse arrays, and infer that the values
>> which are missing are the ones to be imputed ("recommended")? But not
>> supporting dense arrays would basically be the opposite of other modules in
>> sklearn, where dense input is default. Maybe someone can comment on this?
>>
>>  I don't know how well this lines up with the existing API/functionality
>> and the future directions there, but how to deal with the missing values is
>> probably the primary concern for implementing CF algorithms in sklearn IMO.
>>
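The two options discussed above (a mask alongside a dense array, or a sparse
matrix whose unstored entries are the ones to be recommended) can be put side
by side. This is only a sketch: using NaN as the missing-value marker in the
dense array is an assumption, and the variable names are hypothetical.

```python
import numpy as np
from scipy import sparse

# Dense input, with NaN marking ratings that were never observed.
dense = np.array([
    [5.0, 3.0, np.nan],
    [np.nan, 4.0, np.nan],
    [1.0, np.nan, 2.0],
])

# Option 1: a boolean mask, passed as an extra argument next to the data.
observed = ~np.isnan(dense)

# Option 2: a sparse matrix that stores only the observed ratings.
rows, cols = np.nonzero(observed)
R = sparse.csr_matrix((dense[observed], (rows, cols)), shape=dense.shape)

# The mask can be recovered from the sparse structure alone.
coo = R.tocoo()
stored = np.zeros(R.shape, dtype=bool)
stored[coo.row, coo.col] = True

# Positions that would be imputed ("recommended").
to_impute = np.argwhere(~stored)
```

One caveat with the sparse option is that a genuine rating of zero is hard to
tell apart from a missing one unless it is stored explicitly.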
>>
>> On Wed, Jan 15, 2014 at 12:07 PM, Manoj Kumar <
>> manojkumarsivaraj...@gmail.com> wrote:
>>
>>>   Hello,
>>>
>>>  First of all, thanks to the scikit-learn community for guiding new
>>> developers. I'm thankful for all the help that I've got with my Pull
>>> Requests so far.
>>>
>>>  I hope that this is the right place to discuss GSoC-related ideas (I've
>>> idled in the scikit-learn IRC channel on quite a few occasions, but I
>>> could not meet any core developer). I was browsing through last year's
>>> threads when I found this idea related to collaborative filtering (CF)
>>> quite interesting,
>>> http://sourceforge.net/mailarchive/message.php?msg_id=30725712 , though
>>> it was sadly not accepted.
>>>
>>>  If the scikit-learn community is still enthusiastic about a recsys
>>> module with CF algorithms implemented, I would love this to be my GSoC
>>> proposal, and we could discuss the algorithms further, how well they gel
>>> with the present sklearn API, how much we could possibly fit into a
>>> 3-month period, etc.
>>>
>>>  Awaiting a reply.
>>>
>>> --
>>> Regards,
>>> Manoj Kumar,
>>> Mech Undergrad
>>> http://manojbits.wordpress.com
>>>
>>>
>>> ------------------------------------------------------------------------------
>>> CenturyLink Cloud: The Leader in Enterprise Cloud Services.
>>> Learn Why More Businesses Are Choosing CenturyLink Cloud For
>>> Critical Workloads, Development Environments & Everything In Between.
>>> Get a Quote or Start a Free Trial Today.
>>>
>>> http://pubads.g.doubleclick.net/gampad/clk?id=119420431&iu=/4140/ostg.clktrk
>>> _______________________________________________
>>> Scikit-learn-general mailing list
>>> Scikit-learn-general@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>
>>>
>>
>>
>
>


-- 
Regards,
Manoj Kumar,
Mech Undergrad
http://manojbits.wordpress.com

