PS: Kyle, from a brief look, I would summarise it as sampling a small set
of KNN centroids.
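
To make that summary concrete, here is a minimal sketch of the simplest reduction I can think of: replace each class by a single centroid, then classify with 1-NN against those centroids. This is an assumption for illustration only, not code from Dayvid's repository, and the names class_centroids and predict_1nn are hypothetical. The real PS/PG algorithms are more selective about which prototypes they keep or generate.

```python
from collections import defaultdict
from math import dist


def class_centroids(X, y):
    """Reduce a labelled dataset to one prototype (the mean point) per class."""
    groups = defaultdict(list)
    for point, label in zip(X, y):
        groups[label].append(point)
    # zip(*points) iterates over coordinates, so each mean is per dimension.
    return {
        label: tuple(sum(coords) / len(points) for coords in zip(*points))
        for label, points in groups.items()
    }


def predict_1nn(prototypes, point):
    """Classify a point by its nearest prototype (1-NN on the reduced set)."""
    return min(prototypes, key=lambda label: dist(prototypes[label], point))


# Toy data (made up): six training points reduced to two prototypes.
X = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
y = ["a", "a", "a", "b", "b", "b"]
prototypes = class_centroids(X, y)
```

Prediction then only compares a query against the two stored prototypes instead of all six training points, which is the kind of saving instance reduction is after.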


On 19 June 2014 12:05, Joel Nothman <[email protected]> wrote:

> Hi Dayvid,
>
> Although it could potentially be included in scikit-learn, it looks like
> your components do not require modifying the existing codebase, and could
> be construed as an entirely independent project. This could be referenced
> from the Scikit-learn Wiki or similar, without having to decide whether
> these are within the scope (besides quality of code, testing, documentation
> and exemplification) for inclusion in the main project. Given the current
> burden on scikit-learn devs (e.g. 171 open pull requests), including a
> somewhat orthogonal set of algorithms (which may not be seminal enough for
> inclusion anyway) seems unlikely at present. I suggest you make the
> instance reduction directory a separate repository, and invite comment
> there.
>
> Cheers,
>
> - Joel
>
>
> On 19 June 2014 11:46, Dayvid Victor <[email protected]> wrote:
>
>> Hi Kyle,
>>
>> (sorry for the long answer).
>>
>> Instance Reduction techniques aim to reduce the amount of data that must
>> be handled in order to perform a classification/prediction ...
>>
>> Depending on the approach, they can remove noisy data and outliers,
>> remove redundant data, or generate new generalized data by combining
>> existing data. They also cover undersampling, oversampling and hybrid
>> sampling for imbalanced datasets.
>>
>> This figure is from a paper I co-authored two years ago, and I think it
>> shows what Instance Reduction does:
>>
>> https://dl.dropboxusercontent.com/u/23695780/asgp.png
>>
>>
>> These two surveys present taxonomies of prototype selection and
>> prototype generation. They do not have many citations (around 100), but
>> they do reference the most important techniques and papers up to 2012.
>>
>>    - S. García, J. Derrac, J. R. Cano, and F. Herrera, “Prototype
>>    selection for nearest neighbor classification: Taxonomy and empirical
>>    study,” Pattern Analysis and Machine Intelligence, IEEE Transactions on,
>>    vol. 34, no. 3, pp. 417–435, 2012.
>>    - I. Triguero, J. Derrac, S. García, and F. Herrera, “A taxonomy and
>>    experimental study on prototype generation for nearest neighbor
>>    classification,” Systems, Man, and Cybernetics, Part C: Applications and
>>    Reviews, IEEE Transactions on, vol. 42, no. 1, pp. 86–100, 2012
>>
>>
>> Actually, the idea is to contribute the main ideas that handle highly
>> noisy data and outliers (edition, condensation and hybrid approaches),
>> and only the important techniques of each.
>>
>>
>> Thanks,
>>
>>
>> On Wed, Jun 18, 2014 at 2:45 PM, Kyle Kastner <[email protected]>
>> wrote:
>>
>>> Do you have any references for this technique? What is it typically used
>>> for?
>>>
>>>
>>> On Wed, Jun 18, 2014 at 12:26 PM, Dayvid Victor <[email protected]>
>>> wrote:
>>>
>>>> Hi there,
>>>>
>>>> Is anybody working on an Instance Reduction module for sklearn?
>>>>
>>>> I started working on those and I already have more than 10 IR (PS and
>>>> PG) algorithms implemented (only 7 are in the repository right now).
>>>>
>>>> If anybody is working on this, I'd love to help out. If not, but
>>>> anybody wants to contribute to this module, please reach me by email!
>>>>
>>>>
>>>> https://github.com/dvro/scikit-learn/tree/instance_reduction/sklearn/instance_reduction
>>>>
>>>> Thanks,
>>>> --
>>>> Dayvid Victor R. de Oliveira
>>>> PhD Candidate in Computer Science at Federal University of Pernambuco
>>>> (UFPE)
>>>> MSc in Computer Science at Federal University of Pernambuco (UFPE)
>>>> BSc in Computer Engineering - Federal University of Pernambuco (UFPE)
>>>>
>>>>
>>>> ------------------------------------------------------------------------------
>>>> HPCC Systems Open Source Big Data Platform from LexisNexis Risk
>>>> Solutions
>>>> Find What Matters Most in Your Big Data with HPCC Systems
>>>> Open Source. Fast. Scalable. Simple. Ideal for Dirty Data.
>>>> Leverages Graph Analysis for Fast Processing & Easy Data Exploration
>>>> http://p.sf.net/sfu/hpccsystems
>>>> _______________________________________________
>>>> Scikit-learn-general mailing list
>>>> [email protected]
>>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>>
>>>>
>>>
>>>
>>>
>>>
>>
>>
>> --
>> *Dayvid Victor R. de Oliveira*
>> PhD Candidate in Computer Science at Federal University of Pernambuco
>> (UFPE)
>> MSc in Computer Science at Federal University of Pernambuco (UFPE)
>> BSc in Computer Engineering - Federal University of Pernambuco (UFPE)
>>
>>
>>
>>
>
