Hi Dayvid,

Although it could potentially be included in scikit-learn, it looks like
your components do not require modifying the existing codebase, and could
be maintained as an entirely independent project. This could be referenced
from the scikit-learn wiki or similar, without having to decide whether
the algorithms are within scope for inclusion in the main project (on top
of the usual requirements for code quality, testing, documentation and
examples). Given the current burden on scikit-learn devs (e.g. 171 open
pull requests), including a somewhat orthogonal set of algorithms (which
may not be seminal enough for inclusion anyway) seems unlikely at present.
I suggest you make the instance reduction directory a separate repository,
and invite comment there.

Cheers,

- Joel


On 19 June 2014 11:46, Dayvid Victor <[email protected]> wrote:

> Hi Kyle,
>
> (sorry for the long answer).
>
> Instance Reduction techniques aim to reduce the amount of data
> manipulated in order
> to perform a classification/prediction ...
>
> Depending on the approach, they can remove noisy data and outliers, remove
> redundant data, or generate new generalized data by combining existing data.
> They also cover undersampling, oversampling and hybrid sampling for
> imbalanced datasets.
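
To make the noise-removal case described above concrete, here is a minimal sketch of Wilson-style Edited Nearest Neighbours in plain Python. It is only an illustration and is not taken from the repository discussed in this thread; the function name `edited_nearest_neighbours`, the parameter `k` and the toy data are assumptions made for the example.

```python
import math
from collections import Counter


def edited_nearest_neighbours(X, y, k=3):
    """Return the indices of the instances kept after editing.

    An instance is kept only if the majority label among its k nearest
    neighbours (excluding itself) matches its own label.
    """
    keep = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        # Euclidean distance from instance i to every other instance.
        dists = sorted(
            (math.dist(xi, xj), j) for j, xj in enumerate(X) if j != i
        )
        neighbour_labels = [y[j] for _, j in dists[:k]]
        majority = Counter(neighbour_labels).most_common(1)[0][0]
        if majority == yi:
            keep.append(i)
    return keep


# Toy data: two well-separated clusters plus one point (index 8) that
# sits inside the first cluster but carries the second cluster's label.
X = [(0, 0), (0, 1), (1, 0), (1, 1),
     (5, 5), (5, 6), (6, 5), (6, 6),
     (0.5, 0.5)]
y = [0, 0, 0, 0, 1, 1, 1, 1, 1]

kept = edited_nearest_neighbours(X, y, k=3)
X_reduced = [X[i] for i in kept]
y_reduced = [y[i] for i in kept]
```

On this toy data the mislabeled point at index 8 is dropped and the eight cluster points are kept, so the reduced set is smaller and cleaner than the original. A condensation or prototype generation method would follow the same pattern (training data in, smaller training data out) but with a different selection rule.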
>
> This figure is from a paper I co-authored 2 years ago, and I think it
> shows what Instance Reduction does:
>
> https://dl.dropboxusercontent.com/u/23695780/asgp.png
>
>
> These are two surveys that present a taxonomy of prototype selection and
> prototype generation. They do not have many citations (around 100), but
> they do reference the most important techniques and papers up to 2012.
>
>    - S. García, J. Derrac, J. R. Cano, and F. Herrera, “Prototype
>    selection for nearest neighbor classification: Taxonomy and empirical
>    study,” IEEE Transactions on Pattern Analysis and Machine Intelligence,
>    vol. 34, no. 3, pp. 417–435, 2012.
>    - I. Triguero, J. Derrac, S. García, and F. Herrera, “A taxonomy and
>    experimental study on prototype generation for nearest neighbor
>    classification,” IEEE Transactions on Systems, Man, and Cybernetics,
>    Part C: Applications and Reviews, vol. 42, no. 1, pp. 86–100, 2012.
>
>
> Actually, the idea is to contribute the main techniques that handle
> highly noisy data and outliers (edition, condensation and hybrid
> approaches), and only the most important techniques of each.
>
>
> Thanks,
>
>
> On Wed, Jun 18, 2014 at 2:45 PM, Kyle Kastner <[email protected]>
> wrote:
>
>> Do you have any references for this technique? What is it typically used
>> for?
>>
>>
>> On Wed, Jun 18, 2014 at 12:26 PM, Dayvid Victor <[email protected]>
>> wrote:
>>
>>> Hi there,
>>>
>>> Is anybody working on an Instance Reduction module for sklearn?
>>>
>>> I started working on this and I already have more than 10 IR (PS and PG)
>>> algorithms implemented (only 7 are in the repository right now).
>>>
>>> If anybody is working on this, I'd love to help you out. If not, but
>>> anybody wants to contribute to this module, please reach me by email!
>>>
>>>
>>> https://github.com/dvro/scikit-learn/tree/instance_reduction/sklearn/instance_reduction
>>>
>>> Thanks,
>>> --
>>> Dayvid Victor R. de Oliveira
>>> PhD Candidate in Computer Science at Federal University of Pernambuco
>>> (UFPE)
>>> MSc in Computer Science at Federal University of Pernambuco (UFPE)
>>> BSc in Computer Engineering - Federal University of Pernambuco (UFPE)
>>>
>>>
>>> ------------------------------------------------------------------------------
>>> HPCC Systems Open Source Big Data Platform from LexisNexis Risk Solutions
>>> Find What Matters Most in Your Big Data with HPCC Systems
>>> Open Source. Fast. Scalable. Simple. Ideal for Dirty Data.
>>> Leverages Graph Analysis for Fast Processing & Easy Data Exploration
>>> http://p.sf.net/sfu/hpccsystems
>>> _______________________________________________
>>> Scikit-learn-general mailing list
>>> [email protected]
>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>
>>>
>
>
> --
> *Dayvid Victor R. de Oliveira*
> PhD Candidate in Computer Science at Federal University of Pernambuco
> (UFPE)
> MSc in Computer Science at Federal University of Pernambuco (UFPE)
> BSc in Computer Engineering - Federal University of Pernambuco (UFPE)