Re: [Scikit-learn-general] Instance Reduction on scikit-learn

Dayvid Victor Thu, 19 Jun 2014 11:48:06 -0700

Hi Kyle,

(sorry for the long answer).


Instance Reduction techniques aim to reduce the amount of data that has to
be processed in order to perform a classification/prediction ...

Depending on the approach, they can remove noisy data and outliers, remove
redundant data, or generate new generalized data by combining existing data.
They also cover undersampling, oversampling and hybrid sampling for
imbalanced datasets.
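
To make the "remove noisy data" case concrete, here is a minimal sketch of
an edition-style filter in the spirit of Wilson's Edited Nearest Neighbors.
It is plain Python for illustration only, not code from the
instance_reduction branch and not a scikit-learn API; the function name
edited_nearest_neighbors and the parameter k are assumptions made for this
example.

```python
from collections import Counter
from math import dist


def edited_nearest_neighbors(X, y, k=3):
    """Return indices of samples whose label agrees with the majority
    label of their k nearest neighbors (Euclidean distance)."""
    keep = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        # Rank every other sample by its distance to sample i.
        others = sorted((j for j in range(len(X)) if j != i),
                        key=lambda j: dist(xi, X[j]))
        # Majority vote among the k closest neighbors.
        votes = Counter(y[j] for j in others[:k])
        majority_label = votes.most_common(1)[0][0]
        if majority_label == yi:
            keep.append(i)
    return keep


# Two well-separated clusters plus one mislabeled point inside cluster 0.
X = [(0, 0), (0, 1), (1, 0), (1, 1),
     (10, 10), (10, 11), (11, 10), (11, 11),
     (0.5, 0.5)]
y = [0, 0, 0, 0, 1, 1, 1, 1, 1]

kept = edited_nearest_neighbors(X, y, k=3)
print(kept)  # -> [0, 1, 2, 3, 4, 5, 6, 7]
```

The mislabeled point (index 8) is dropped because all of its nearest
neighbors belong to the other class, while every clean sample is kept.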

This figure is from a paper I co-authored 2 years ago, and I think it
shows what Instance Reduction does:

https://dl.dropboxusercontent.com/u/23695780/asgp.png


These are two surveys that present a taxonomy of prototype selection and
prototype generation. They do not have many citations (around 100), but
they do reference the most important techniques and papers up to 2012.

   - S. García, J. Derrac, J. R. Cano, and F. Herrera, “Prototype selection
   for nearest neighbor classification: Taxonomy and empirical study,” Pattern
   Analysis and Machine Intelligence, IEEE Transactions on, vol. 34, no. 3,
   pp. 417–435, 2012.
   - I. Triguero, J. Derrac, S. García, and F. Herrera, “A taxonomy and
   experimental study on prototype generation for nearest neighbor
   classification,” Systems, Man, and Cybernetics, Part C: Applications and
   Reviews, IEEE Transactions on, vol. 42, no. 1, pp. 86–100, 2012.


Actually, the idea is to contribute the main techniques that handle highly
noisy data and outliers (edition, condensation and hybrid approaches), and
only the most important techniques of each kind.
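
As a rough illustration of the condensation family mentioned above, the
sketch below follows the idea of Hart's Condensed Nearest Neighbor: grow a
small subset until a 1-NN rule on that subset classifies every training
sample correctly. It is plain Python for illustration only, not code from
the instance_reduction branch; the function name condensed_nearest_neighbor
and the choice of the first sample as the seed are assumptions.

```python
from math import dist


def condensed_nearest_neighbor(X, y):
    """Return indices of a subset on which a 1-NN rule classifies
    every sample in (X, y) correctly."""
    store = [0]  # assumption: seed the subset with the first sample
    changed = True
    while changed:
        changed = False
        for i in range(len(X)):
            if i in store:
                continue
            # Classify sample i with its nearest stored prototype.
            nearest = min(store, key=lambda j: dist(X[i], X[j]))
            if y[nearest] != y[i]:
                # Misclassified samples are absorbed into the subset.
                store.append(i)
                changed = True
    return store


# Two well-separated clusters: one prototype per cluster is enough.
X_clean = [(0, 0), (0, 1), (1, 0), (1, 1),
           (10, 10), (10, 11), (11, 10), (11, 11)]
y_clean = [0, 0, 0, 0, 1, 1, 1, 1]

prototypes = condensed_nearest_neighbor(X_clean, y_clean)
print(prototypes)  # -> [0, 4]
```

Edition removes samples that look wrong, condensation removes samples that
are redundant, and hybrid approaches combine both steps.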


Thanks,


On Wed, Jun 18, 2014 at 2:45 PM, Kyle Kastner <[email protected]> wrote:

> Do you have any references for this technique? What is it typically used
> for?
>
>
> On Wed, Jun 18, 2014 at 12:26 PM, Dayvid Victor <[email protected]>
> wrote:
>
>> Hi there,
>>
>> Is anybody working on an Instance Reduction module for sklearn?
>>
>> I started working on those and I already have more than 10 IR (PS and PG)
>> algorithms implemented (only 7 are in the repository right now).
>>
>> If anybody is working on this, I'd love to help out. If not, but
>> anybody wants to contribute to this module, please reach me by email!
>>
>>
>> https://github.com/dvro/scikit-learn/tree/instance_reduction/sklearn/instance_reduction
>>
>> Thanks,
>> --
>> Dayvid Victor R. de Oliveira
>> PhD Candidate in Computer Science at Federal University of Pernambuco
>> (UFPE)
>> MSc in Computer Science at Federal University of Pernambuco (UFPE)
>> BSc in Computer Engineering - Federal University of Pernambuco (UFPE)
>>
>>
>> ------------------------------------------------------------------------------
>> HPCC Systems Open Source Big Data Platform from LexisNexis Risk Solutions
>> Find What Matters Most in Your Big Data with HPCC Systems
>> Open Source. Fast. Scalable. Simple. Ideal for Dirty Data.
>> Leverages Graph Analysis for Fast Processing & Easy Data Exploration
>> http://p.sf.net/sfu/hpccsystems
>> _______________________________________________
>> Scikit-learn-general mailing list
>> [email protected]
>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>
>>
>
>


-- 
*Dayvid Victor R. de Oliveira*
PhD Candidate in Computer Science at Federal University of Pernambuco (UFPE)
MSc in Computer Science at Federal University of Pernambuco (UFPE)
BSc in Computer Engineering - Federal University of Pernambuco (UFPE)

