Re: [Scikit-learn-general] Instance Reduction on scikit-learn

Joel Nothman Fri, 20 Jun 2014 11:13:42 -0700

Hi Dayvid,

For now, a number of projects that follow the scikit-learn interface but are
not part of the main repository, for one reason or another (often just
because they are out of scope), are listed at
https://github.com/scikit-learn/scikit-learn/wiki/Third-party-projects-and-code-snippets


I would recommend against keeping everything in a scikit-learn fork. A
separate package would be much cleaner to maintain. You want your users to
download your package separately from PyPI, and use it with the current
version of scikit-learn, not whatever you happen to have in your repository.

The easiest way to do that is just to create a new GitHub project and copy
your files across.

Whether or not these algorithms are eventually included in the main
scikit-learn repository depends to a large extent on how popular they are,
e.g. if their seminal works show strong citation growth (and perhaps I've
underestimated their importance already). At a minimum, to show that the
techniques are worth including, you will need to show evidence that they
are popular in the community, and provide examples
<http://scikit-learn.org/stable/auto_examples> to show the implementation
is effective in practice, and explain why they do not depart from
scikit-learn's functional scope (which can't afford to creep due to
maintenance costs).

Given that the technique seems to be KNN-focused, it could be incorporated
as an extension to the KNN classifier/regressor classes, by providing an
option to reduce the population when fitting, rather than a set of separate
estimators.
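
As a rough illustration of that last suggestion, here is a sketch only.
`ReducingKNN`, its `reduce` option and the `condense` helper are hypothetical
names of mine, not scikit-learn API, and Hart's condensed nearest neighbour
rule merely stands in for whichever reduction method you prefer:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier


def condense(X, y, random_state=0):
    """Hart's Condensed Nearest Neighbour rule (illustrative helper).

    Returns indices of a subset ("store") such that every training point
    left out of the store is correctly classified by 1-NN over the store.
    """
    rng = np.random.RandomState(random_state)
    order = rng.permutation(len(X))
    keep = [order[0]]                        # seed the store with one instance
    in_store = np.zeros(len(X), dtype=bool)
    in_store[order[0]] = True
    changed = True
    while changed:                           # stop when a full pass adds nothing
        changed = False
        for i in order:
            if in_store[i]:
                continue
            dists = np.linalg.norm(X[keep] - X[i], axis=1)
            nearest = keep[int(np.argmin(dists))]
            if y[nearest] != y[i]:           # misclassified by the store: keep it
                keep.append(i)
                in_store[i] = True
                changed = True
    return np.sort(np.array(keep))


class ReducingKNN:
    """Hypothetical wrapper: KNN with an option to reduce the population in fit."""

    def __init__(self, n_neighbors=1, reduce=True):
        # n_neighbors must not exceed the number of instances kept
        self.n_neighbors = n_neighbors
        self.reduce = reduce

    def fit(self, X, y):
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        self.kept_ = condense(X, y) if self.reduce else np.arange(len(X))
        self.knn_ = KNeighborsClassifier(n_neighbors=self.n_neighbors)
        self.knn_.fit(X[self.kept_], y[self.kept_])
        return self

    def predict(self, X):
        return self.knn_.predict(X)
```

Something like `ReducingKNN().fit(X, y).predict(X_new)` would then behave like
an ordinary classifier, with `kept_` recording which training instances
survived the reduction.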


On 20 June 2014 09:30, Dayvid Victor <[email protected]> wrote:

> Hi Joel,
>
> Thanks for your feedback. Let me see if I got this straight:
> you think I should open a new repository and then add an entry
> in the Wiki?
>
> Do you have an example of some other project that did the same?
> How should I organize it: do I start a new project, or do I build it
> inside my sklearn fork?
>
> Also, do you think it might be included in the scikit-learn
> repository in the future?
>
> Thanks,
>
>
>
> On Thu, Jun 19, 2014 at 4:07 PM, Joel Nothman <[email protected]>
> wrote:
>
>> PS: Kyle, from a brief look, I would summarise it as sampling a small set
>> of KNN centroids.
>>
>>
>> On 19 June 2014 12:05, Joel Nothman <[email protected]> wrote:
>>
>>> Hi Dayvid,
>>>
>>> Although it could potentially be included in scikit-learn, it looks like
>>> your components do not require modifying the existing codebase, and could
>>> be construed as an entirely independent project. This could be referenced
>>> from the Scikit-learn Wiki or similar, without having to decide whether
>>> these are within the scope (besides quality of code, testing, documentation
>>> and exemplification) for inclusion in the main project. Given the current
>>> burden on scikit-learn devs (e.g. 171 open pull requests), including a
>>> somewhat orthogonal set of algorithms (which may not be seminal enough for
>>> inclusion anyway) seems unlikely at present. I suggest you make the
>>> instance reduction directory a separate repository, and invite comment
>>> there.
>>>
>>> Cheers,
>>>
>>> - Joel
>>>
>>>
>>> On 19 June 2014 11:46, Dayvid Victor <[email protected]> wrote:
>>>
>>>> Hi Kyle,
>>>>
>>>> (sorry for the long answer).
>>>>
>>>> Instance Reduction techniques aim to reduce the amount of data
>>>> manipulated in order to perform a classification/prediction ...
>>>>
>>>> Depending on the approach, they can remove noisy data and outliers,
>>>> remove redundant data, or generate new generalized data by combining
>>>> existing data. This includes undersampling, oversampling and hybrid
>>>> sampling for imbalanced datasets.
>>>>
>>>> This figure is from a paper I co-authored 2 years ago, and I think it
>>>> shows what Instance Reduction does:
>>>>
>>>> https://dl.dropboxusercontent.com/u/23695780/asgp.png
>>>>
>>>>
>>>> These two surveys present taxonomies of prototype selection and
>>>> prototype generation. They do not have many citations (around 100),
>>>> but they do reference the most important techniques and papers up to 2012.
>>>>
>>>>    - S. García, J. Derrac, J. R. Cano, and F. Herrera, “Prototype
>>>>    selection for nearest neighbor classification: Taxonomy and empirical
>>>>    study,” Pattern Analysis and Machine Intelligence, IEEE Transactions on,
>>>>    vol. 34, no. 3, pp. 417–435, 2012.
>>>>    - I. Triguero, J. Derrac, S. García, and F. Herrera, “A taxonomy
>>>>    and experimental study on prototype generation for nearest neighbor
>>>>    classification,” Systems, Man, and Cybernetics, Part C: Applications and
>>>>    Reviews, IEEE Transactions on, vol. 42, no. 1, pp. 86–100, 2012
>>>>
>>>>
>>>> Actually, the idea is to contribute the main techniques that handle
>>>> highly noisy data and outliers (edition, condensation and hybrid
>>>> approaches), and only the most important techniques of each kind.
>>>>
>>>>
>>>> Thanks,
>>>>
>>>>
>>>> On Wed, Jun 18, 2014 at 2:45 PM, Kyle Kastner <[email protected]>
>>>> wrote:
>>>>
>>>>> Do you have any references for this technique? What is it typically
>>>>> used for?
>>>>>
>>>>>
>>>>> On Wed, Jun 18, 2014 at 12:26 PM, Dayvid Victor <[email protected]> wrote:
>>>>>
>>>>>> Hi there,
>>>>>>
>>>>>> Is anybody working on an Instance Reduction module for sklearn?
>>>>>>
>>>>>> I started working on this and already have more than 10 IR (PS and PG)
>>>>>> algorithms implemented (only 7 are in the repository right now).
>>>>>>
>>>>>> If anybody is working on this, I'd love to help out. If not, but
>>>>>> anybody wants to contribute to this module, please reach me by email!
>>>>>>
>>>>>>
>>>>>> https://github.com/dvro/scikit-learn/tree/instance_reduction/sklearn/instance_reduction
>>>>>>
>>>>>> Thanks,
>>>>>> --
>>>>>> Dayvid Victor R. de Oliveira
>>>>>> PhD Candidate in Computer Science at Federal University of
>>>>>> Pernambuco (UFPE)
>>>>>> MSc in Computer Science at Federal University of Pernambuco (UFPE)
>>>>>> BSc in Computer Engineering - Federal University of Pernambuco (UFPE)
>>>>>>
>>>>>>
>>>>>> ------------------------------------------------------------------------------
>>>>>> HPCC Systems Open Source Big Data Platform from LexisNexis Risk
>>>>>> Solutions
>>>>>> Find What Matters Most in Your Big Data with HPCC Systems
>>>>>> Open Source. Fast. Scalable. Simple. Ideal for Dirty Data.
>>>>>> Leverages Graph Analysis for Fast Processing & Easy Data Exploration
>>>>>> http://p.sf.net/sfu/hpccsystems
>>>>>> _______________________________________________
>>>>>> Scikit-learn-general mailing list
>>>>>> [email protected]
>>>>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>>>>
>>>>>>
>>>>
>>>>
>>>> --
>>>> *Dayvid Victor R. de Oliveira*
>>>> PhD Candidate in Computer Science at Federal University of Pernambuco
>>>> (UFPE)
>>>> MSc in Computer Science at Federal University of Pernambuco (UFPE)
>>>> BSc in Computer Engineering - Federal University of Pernambuco (UFPE)
>>>>
>>>>
>>>
>>
>>
>
>
> --
> *Dayvid Victor R. de Oliveira*
> PhD Candidate in Computer Science at Federal University of Pernambuco
> (UFPE)
> MSc in Computer Science at Federal University of Pernambuco (UFPE)
> BSc in Computer Engineering - Federal University of Pernambuco (UFPE)
>
>
