Re: [Scikit-learn-general] Instance Reduction on scikit-learn

Mathieu Blondel Fri, 20 Jun 2014 15:37:27 -0700

+1 to starting a separate project in order to receive early feedback.

Besides popularity and number of citations, an issue is that our API
doesn't currently support instance reduction. We need to decide whether to
introduce a new method (e.g., "reduce" as you did) or use fit_transform (so
far fit_transform only affected n_features, not n_samples). We also need to
carefully consider how this would play with pipelines, grid search, etc.
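
To make the API question concrete, here is a minimal sketch of the first
option. Everything in it is an assumption for illustration: the class name
`CondensedNN`, the `reduce(X, y)` signature and the `sample_indices_`
attribute are hypothetical, not an existing scikit-learn API, and Hart's
condensed nearest neighbour rule is only a stand-in for whichever instance
reduction algorithm is used. The point is that `reduce` returns X and y
together, so the labels stay in sync when n_samples changes.

```python
import numpy as np


class CondensedNN:
    """Hypothetical reducer that follows scikit-learn naming conventions.

    Uses Hart's condensed nearest neighbour rule with 1-NN as a stand-in
    instance reduction algorithm.
    """

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y)
        keep = [0]  # start the store with the first sample
        changed = True
        while changed:
            changed = False
            for i in range(len(X)):
                if i in keep:
                    continue
                # Classify sample i with 1-NN over the current store.
                dist = np.linalg.norm(X[keep] - X[i], axis=1)
                nearest = keep[int(np.argmin(dist))]
                if y[nearest] != y[i]:
                    keep.append(i)  # misclassified, so it must be stored
                    changed = True
        self.sample_indices_ = np.array(sorted(keep))
        return self

    def reduce(self, X, y):
        """Return the retained rows of X with their matching labels."""
        self.fit(X, y)
        idx = self.sample_indices_
        return np.asarray(X)[idx], np.asarray(y)[idx]


# Two well-separated classes: one prototype per class is enough.
X = np.array([[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]])
y = np.array([0, 0, 0, 1, 1, 1])
X_r, y_r = CondensedNN().reduce(X, y)
print(X_r.ravel(), y_r)  # [0. 1.] [0 1]
```

By contrast, a fit_transform that returned only the reduced X would leave
the caller holding a y of the original length, which is the part that would
need care in pipelines and grid search.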


Mathieu


On Sat, Jun 21, 2014 at 3:12 AM, Joel Nothman <[email protected]>
wrote:

> Hi Dayvid,
>
> For now, a number of projects that follow the scikit-learn interface but,
> for one reason or another (often just being out of scope), are not part of
> scikit-learn are listed at
> https://github.com/scikit-learn/scikit-learn/wiki/Third-party-projects-and-code-snippets
>
> I would recommend against keeping everything in a scikit-learn fork. A
> separate project would be much cleaner to maintain. You want your users to
> download your package separately from PyPI, and use it with the current
> version of scikit-learn, not whatever you happen to have in your repository.
>
> The easiest way to do that is just to create a new github project and copy
> your files across.
>
> Whether or not these algorithms become included in the main scikit-learn
> repository depends to a large extent on how popular they are, e.g. whether
> their seminal works show strong citation growth (and perhaps I've already
> underestimated their importance). At a minimum, to show that the
> techniques are worth including, you will need to show evidence that they
> are popular in the community, and provide examples
> <http://scikit-learn.org/stable/auto_examples> to show the implementation
> is effective in practice, and explain why they do not depart from
> scikit-learn's functional scope (which can't afford to creep due to
> maintenance costs).
>
> Given that the technique seems to be KNN-focused, it could be incorporated
> as an extension to the KNN classifier/regressor classes, by providing an
> option to reduce the population when fitting, rather than a set of separate
> estimators.
>
>
> On 20 June 2014 09:30, Dayvid Victor <[email protected]> wrote:
>
>> Hi Joel,
>>
>> Thanks for your feedback. Let me see if I got this straight:
>> you think I should open a new repository and then add an entry
>> in the Wiki?
>>
>> Do you have an example of some other project that did the same?
>> How should I organize it: do I start a new project, or do I build a new
>> project inside my sklearn fork?
>>
>> Also, do you think it might be included in the scikit-learn
>> repository in the future?
>>
>> Thanks,
>>
>>
>>
>> On Thu, Jun 19, 2014 at 4:07 PM, Joel Nothman <[email protected]>
>> wrote:
>>
>>> PS: Kyle, from a brief look, I would summarise it as sampling a small
>>> set of KNN centroids.
>>>
>>>
>>> On 19 June 2014 12:05, Joel Nothman <[email protected]> wrote:
>>>
>>>> Hi Dayvid,
>>>>
>>>> Although it could potentially be included in scikit-learn, it looks
>>>> like your components do not require modifying the existing codebase, and
>>>> could be construed as an entirely independent project. This could be
>>>> referenced from the Scikit-learn Wiki or similar, without having to decide
>>>> whether these are within the scope (besides quality of code, testing,
>>>> documentation and exemplification) for inclusion in the main project. Given
>>>> the current burden on scikit-learn devs (e.g. 171 open pull requests),
>>>> including a somewhat orthogonal set of algorithms (which may not be seminal
>>>> enough for inclusion anyway) seems unlikely at present. I suggest you make
>>>> the instance reduction directory a separate repository, and invite comment
>>>> there.
>>>>
>>>> Cheers,
>>>>
>>>> - Joel
>>>>
>>>>
>>>> On 19 June 2014 11:46, Dayvid Victor <[email protected]> wrote:
>>>>
>>>>> Hi Kyle,
>>>>>
>>>>> (sorry for the long answer).
>>>>>
>>>>> Instance Reduction techniques aim to reduce the amount of data
>>>>> manipulated in order to perform a classification/prediction ...
>>>>>
>>>>> Depending on the approach, they can remove noisy data and outliers,
>>>>> remove redundant data, or generate new generalized data by combining
>>>>> existing data. They can also be used for undersampling, oversampling
>>>>> and hybrid sampling on imbalanced datasets.
>>>>>
>>>>> This figure is from a paper I co-authored 2 years ago, and I think
>>>>> it shows what Instance Reduction does:
>>>>>
>>>>> https://dl.dropboxusercontent.com/u/23695780/asgp.png
>>>>>
>>>>>
>>>>> These are two surveys that present a taxonomy of prototype selection
>>>>> and prototype generation. They do not have many citations (around 100),
>>>>> but they do reference the most important techniques and papers until 2012.
>>>>>
>>>>>    - S. García, J. Derrac, J. R. Cano, and F. Herrera, “Prototype
>>>>>    selection for nearest neighbor classification: Taxonomy and empirical
>>>>>    study,” Pattern Analysis and Machine Intelligence, IEEE Transactions
>>>>>    on, vol. 34, no. 3, pp. 417–435, 2012.
>>>>>    - I. Triguero, J. Derrac, S. García, and F. Herrera, “A taxonomy
>>>>>    and experimental study on prototype generation for nearest neighbor
>>>>>    classification,” Systems, Man, and Cybernetics, Part C: Applications
>>>>>    and Reviews, IEEE Transactions on, vol. 42, no. 1, pp. 86–100, 2012.
>>>>>
>>>>>
>>>>> Actually, the idea is to contribute the main techniques that handle
>>>>> highly noisy data and outliers (edition, condensation and hybrid
>>>>> approaches), and only the important techniques of each.
>>>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>>
>>>>> On Wed, Jun 18, 2014 at 2:45 PM, Kyle Kastner <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Do you have any references for this technique? What is it typically
>>>>>> used for?
>>>>>>
>>>>>>
>>>>>> On Wed, Jun 18, 2014 at 12:26 PM, Dayvid Victor <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> Hi there,
>>>>>>>
>>>>>>> Is anybody working on an Instance Reduction module for sklearn?
>>>>>>>
>>>>>>> I started working on one and I already have more than 10 IR (PS
>>>>>>> and PG) algorithms implemented (only 7 are in the repository right now).
>>>>>>>
>>>>>>> If anybody is working on this, I'd love to help out. If not, but
>>>>>>> anybody wants to contribute to this module, please reach me by email!
>>>>>>>
>>>>>>>
>>>>>>> https://github.com/dvro/scikit-learn/tree/instance_reduction/sklearn/instance_reduction
>>>>>>>
>>>>>>> Thanks,
>>>>>>> --
>>>>>>> Dayvid Victor R. de Oliveira
>>>>>>> PhD Candidate in Computer Science at Federal University of
>>>>>>> Pernambuco (UFPE)
>>>>>>> MSc in Computer Science at Federal University of Pernambuco (UFPE)
>>>>>>> BSc in Computer Engineering - Federal University of Pernambuco (UFPE)
>>>>>>>
>>>>>>>
>>>>>>> ------------------------------------------------------------------------------
>>>>>>> HPCC Systems Open Source Big Data Platform from LexisNexis Risk
>>>>>>> Solutions
>>>>>>> Find What Matters Most in Your Big Data with HPCC Systems
>>>>>>> Open Source. Fast. Scalable. Simple. Ideal for Dirty Data.
>>>>>>> Leverages Graph Analysis for Fast Processing & Easy Data Exploration
>>>>>>> http://p.sf.net/sfu/hpccsystems
>>>>>>> _______________________________________________
>>>>>>> Scikit-learn-general mailing list
>>>>>>> [email protected]
>>>>>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>>>>>
>>>>>>>