`y` is by definition hidden at prediction time for supervised learning, so
I don't think your representation makes sense. But I see this as a
completion problem, not a supervised learning problem: the same data is
observed at training and prediction time.
Isn't the following:
X = [["ham", "spam"], ["ram", "bam", "tam"]], and y = [[2, 3], [2, -3, 4]]
equivalent to [{'ham': 2, 'spam': 3}, {'ram': 2, 'bam': -3, 'tam': 4}]?
Via DictVectorizer, this becomes equivalent to a sparse COO matrix with:
col = [0, 1, 2, 3, 4]
row = [0, 0, 1, 1, 1]
data = [2, 3, 2, -3, 4]
As far as I can tell, this is a plain old sparse matrix, without a need for
an extra `y`. (Please convince me otherwise!)
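
To make that concrete, here is a minimal sketch of the conversion, assuming
the list-of-dicts form above (the variable names are mine, for illustration
only). DictVectorizer returns a scipy.sparse matrix by default and orders the
columns by sorted feature name, so the column indices come out as a
permutation of the hand-written `col` list rather than identical to it.

```python
from sklearn.feature_extraction import DictVectorizer

# One dict per row; keys are item names, values are the ratings.
ratings = [
    {"ham": 2, "spam": 3},
    {"ram": 2, "bam": -3, "tam": 4},
]

vectorizer = DictVectorizer(sparse=True)
# fit_transform gives a scipy.sparse matrix; convert to COO to read
# the (row, col, data) triples directly.
X = vectorizer.fit_transform(ratings).tocoo()

# Column order as chosen by the vectorizer (sorted by feature name).
feature_names = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# (row, col, value) triples for every stored entry.
triples = sorted(zip(X.row.tolist(), X.col.tolist(), X.data.tolist()))

print(feature_names)
print(triples)
```

Entries that are not stored are simply absent from the triples, which is the
sense in which no separate `y` is needed.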
There are still issues of whether this is in scikit-learn scope. For
example, does it make sense with sklearn's cross validation? Or will you
want to cross validate on both axes? Given that there is plenty of work to
be done that is well within scikit-learn's scope (prominent alternative
solutions and utilities for problems it already solves), I think this
extension of scope needs to be argued.
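
To illustrate the cross validation question: sklearn's splitters partition
samples, i.e. whole rows, whereas for a completion problem one would probably
hold out individual (row, col) entries instead. A rough sketch of the latter,
using only numpy and scipy; `holdout_entries` is a hypothetical helper of my
own, not anything that exists in sklearn.

```python
import numpy as np
from scipy.sparse import coo_matrix

# The example matrix from above, written out as COO triples.
row = np.array([0, 0, 1, 1, 1])
col = np.array([0, 1, 2, 3, 4])
data = np.array([2, 3, 2, -3, 4], dtype=float)
X = coo_matrix((data, (row, col)), shape=(2, 5))


def holdout_entries(X, test_fraction, seed=0):
    """Hypothetical helper: split the stored entries of X into two
    sparse matrices (train, test) with the same shape as X."""
    X = X.tocoo()
    rng = np.random.RandomState(seed)
    n_test = int(round(test_fraction * X.nnz))
    perm = rng.permutation(X.nnz)
    test_idx, train_idx = perm[:n_test], perm[n_test:]

    def subset(idx):
        return coo_matrix((X.data[idx], (X.row[idx], X.col[idx])),
                          shape=X.shape)

    return subset(train_idx), subset(test_idx)


X_train, X_test = holdout_entries(X, test_fraction=0.4)
print(X_train.nnz, X_test.nnz)
```

Both outputs keep the full shape, so a held-out entry can come from any row
and any column. That is different from what the existing row-wise splitters
do, which is part of why I think the scope question needs an argument.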
On 17 January 2014 09:24, Kyle Kastner <kastnerk...@gmail.com> wrote:
> The other thing to keep in mind is that an ideal solution would be compatible with
> Pipeline() - it would be nice to be able to use it there, which is one of
> the reasons a different signature for the predict() method is an issue.
>
> Hopefully something can be figured out, as there is a lot of interest in CF
> algorithms, and a large majority of the algorithmic work (at least for the
> CF algorithm I looked at) is already present in the NMF code.
>
>
> On Thu, Jan 16, 2014 at 1:09 PM, Manoj Kumar <
> manojkumarsivaraj...@gmail.com> wrote:
>
>> Yes indeed, getting two parameters for predict would be specific to CF.
>> That was the most obvious idea that came to my mind. I would like to hear
>> others' opinions also regarding the API, and the feasibility of such a
>> project.
>>
>>
>> On Thu, Jan 16, 2014 at 11:47 PM, Kyle Kastner <kastnerk...@gmail.com> wrote:
>>
>>> @Manoj
>>> The predict stage taking 2 parameters is what I was talking about - are
>>> there any other estimators that need anything more than a single matrix to
>>> do a prediction? I do not recall any - this would be something particular
>>> to CF. Maybe you could recast it as a matrix with alternating rows of
>>> item, rating, but that is still particular to CF.
>>>
>>> Whether that is OK as far as sklearn's API is concerned is not for me to
>>> decide. I would also expect it to be closely tied with DictVectorizer or
>>> something like it, probably more so than most other algorithms (though this
>>> is not a big deal IMO) to get categorical labels.
>>>
>>> @nmuralid
>>> I agree totally - last number I saw was that the typical matrix for
>>> something like Amazon is 99% sparse? I don't remember where I read it
>>> though. Looking at crab, it seems like they are trying to do sklearn-style
>>> API specifically for collaborative filtering. Not sure where the name crab
>>> comes in, but it is definitely worth looking at.
>>>
>>> Kyle
>>>
>>>
>>> On Thu, Jan 16, 2014 at 11:17 AM, nmura...@masonlive.gmu.edu <
>>> nmura...@masonlive.gmu.edu> wrote:
>>>
>>>> I agree that sparse matrices need to be supported, as one of the main
>>>> properties inherent to the user/item rating matrix in recommender systems
>>>> is its sparsity. This sparsity is what has given rise to such a large body
>>>> of research in the field. This property has to be exploited; otherwise,
>>>> since we are dealing with matrices, the cost of similarity calculations
>>>> goes through the roof (there are ways to mitigate this, such as item-item
>>>> CF techniques where the similarity calculation is done offline, but it is
>>>> still expensive).
>>>>
>>>> Possible solutions in my opinion:
>>>> 1> Support dense and sparse matrices but I am not sure if such an
>>>> implementation can be directly plugged into sklearn (because of the sparse
>>>> matrix support.)
>>>>
>>>> 2> Distributed recommender systems (just provide the ability for
>>>> people to distribute their similarity calculations.) This can be done using
>>>> MRJob, a Hadoop Streaming wrapper for Python. This is also a current field
>>>> of research and I'm sure if you look into it you will find quite a lot of
>>>> literature on the topic.
>>>>
>>>> 3> I am currently also trying to look into this library called
>>>> scikit-crab which was started based upon a similar plan but I heard the
>>>> developers are rewriting the library currently and it might not be open to
>>>> the community for active development at present (not sure about this
>>>> though). But I just mentioned it thinking maybe if you took a look at the
>>>> code, you would get some more ideas about what improvements could be made.
>>>> https://github.com/muricoca/crab
>>>>
>>>> ------------------------------
>>>> *From:* Kyle Kastner [kastnerk...@gmail.com]
>>>> *Sent:* Wednesday, January 15, 2014 1:42 PM
>>>> *To:* scikit-learn-general@lists.sourceforge.net
>>>> *Subject:* Re: [Scikit-learn-general] Google Summer of Code 2014
>>>>
>>>> I looked into this once upon a time, and one of the key problems
>>>> (from talking to Jake IIRC) is how to handle the "missing values" in the
>>>> input array. You would either need a mask, or some kind of indexing system
>>>> for describing which value goes where in the input matrix. Either way, this
>>>> extra argument would be a requirement for CF, but not for the existing
>>>> algorithms in sklearn.
>>>>
>>>> Maybe it would only operate on sparse arrays, and infer that the
>>>> values which are missing are the ones to be imputed ("recommended")? But
>>>> not supporting dense arrays would basically be the opposite of other
>>>> modules in sklearn, where dense input is default. Maybe someone can comment
>>>> on this?
>>>>
>>>> I don't know how well this lines up with the existing
>>>> API/functionality and the future directions there, but how to deal with the
>>>> missing values is probably the primary concern for implementing CF
>>>> algorithms in sklearn IMO.
>>>>
>>>>
>>>> On Wed, Jan 15, 2014 at 12:07 PM, Manoj Kumar <
>>>> manojkumarsivaraj...@gmail.com> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> First of all, thanks to the scikit-learn community for guiding new
>>>>> developers. I'm thankful for all the help that I've got with my Pull
>>>>> Requests till now.
>>>>>
>>>>> I hope that this is the right place to discuss GSoC related ideas
>>>>> (I've idled at the scikit-learn irc channel on quite a few occasions, but
>>>>> I could not meet any core developer). I was browsing through the threads
>>>>> of last year, when I found this idea related to collaborative filtering
>>>>> (CF) quite interesting,
>>>>> http://sourceforge.net/mailarchive/message.php?msg_id=30725712 ,
>>>>> though this was sadly not accepted.
>>>>>
>>>>> If the scikit-learn community is still enthusiastic about a recsys
>>>>> module with CF algorithms implemented, I would love this to be my GSoC
>>>>> proposal and we could discuss more about the algorithms, gelling with the
>>>>> present sklearn API, how much we could possibly fit in a 3 month period
>>>>> etc.
>>>>>
>>>>> Awaiting a reply.
>>>>>
>>>>> --
>>>>> Regards,
>>>>> Manoj Kumar,
>>>>> Mech Undergrad
>>>>> http://manojbits.wordpress.com
>>>>>
>>>>>
>>>>> ------------------------------------------------------------------------------
>>>>> CenturyLink Cloud: The Leader in Enterprise Cloud Services.
>>>>> Learn Why More Businesses Are Choosing CenturyLink Cloud For
>>>>> Critical Workloads, Development Environments & Everything In Between.
>>>>> Get a Quote or Start a Free Trial Today.
>>>>>
>>>>> http://pubads.g.doubleclick.net/gampad/clk?id=119420431&iu=/4140/ostg.clktrk
>>>>> _______________________________________________
>>>>> Scikit-learn-general mailing list
>>>>> Scikit-learn-general@lists.sourceforge.net
>>>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>>
>>
>>
>> --
>> Regards,
>> Manoj Kumar,
>> Mech Undergrad
>> http://manojbits.wordpress.com
>>
>>
>>
>>
>
>
>
>