Hi Chathura,

Yes, ML will support clustering as well, and the output will be the
clustered points. Please refer to the email thread [1] and the ML workflow in
[2].

[1] [Architecture] Machine Learning Workflow
[2]
https://drive.google.com/a/wso2.com/file/d/0B1KWJ16Pv0V0U3BvcVZuZUR4WE0/view?usp=sharing

Regards,
Supun

On Mon, Nov 3, 2014 at 8:26 AM, Chathura Ekanayake <[email protected]>
wrote:

>
>
> On Thu, Oct 30, 2014 at 11:22 AM, Supun Sethunga <[email protected]> wrote:
>
>> Hi Chathura,
>>
>>> Can there be a requirement to maintain subsets of the initial dataset
>>> under a project?
>>> For example, certain methods of preprocessing or slicing the dataset in
>>> different dimensions could produce multiple subsets of data. Subsequently,
>>> we may want to apply various mining algorithms on these resulting subsets
>>> of data. As computing such subsets can be costly, it may be beneficial to
>>> store precomputed subsets of data under a project.
>>
>>
>> IMHO, I don't see any requirement for keeping subsets of data. If you look
>> at the ER diagram, having multiple "Processes" and "Executions" under a
>> single project satisfies this requirement. In a process, users can set up
>> pre-processing and things like dimensionality reduction, and select a
>> training set. This means a project can have multiple training sets. (The
>> actual pre-processing will NOT be done at this point, because otherwise we
>> would have to iterate through a fairly large dataset a few times, which is
>> an overhead. Hence only the configurations will be taken as user inputs, and
>> the actual pre-processing will be done later as MapReduce jobs.)
>>
>> Then in an "Execution", the user can reuse the previous pre-processing
>> configuration and run the model building with a desired algorithm. This
>> way, multiple algorithms can be applied to the same training set.
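
A minimal sketch of the idea described above, assuming a "Process" stores only pre-processing configuration and each "Execution" applies it when the model is built. All names here (`ProcessConfig`, `Execution`, `build_training_set`, `run`) and the algorithm labels are hypothetical illustrations, not WSO2 ML APIs, and the in-memory list stands in for what would be a MapReduce job.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProcessConfig:
    """Pre-processing choices stored as configuration only (hypothetical)."""
    selected_features: tuple
    train_fraction: float


@dataclass(frozen=True)
class Execution:
    """One model-building run: a stored config plus an algorithm name."""
    config: ProcessConfig
    algorithm: str


def build_training_set(config, rows):
    # Pre-processing is applied here, at execution time, not when the
    # configuration was recorded.
    cutoff = int(len(rows) * config.train_fraction)
    return [
        {name: row[name] for name in config.selected_features}
        for row in rows[:cutoff]
    ]


def run(execution, rows):
    training_set = build_training_set(execution.config, rows)
    # Placeholder for real model building: report what would be trained.
    return {"algorithm": execution.algorithm, "training_rows": len(training_set)}


rows = [{"a": i, "b": i * 2, "c": i * 3} for i in range(10)]
config = ProcessConfig(selected_features=("a", "b"), train_fraction=0.5)
# Two executions share one configuration, so no subset is persisted.
executions = [Execution(config, "kmeans"), Execution(config, "logistic_regression")]
results = [run(e, rows) for e in executions]
```

In this sketch the same configuration drives both executions, so the training set is derived on demand rather than stored as a separate subset.
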
>>
>> Apart from that, we may be talking about gigabytes or even terabytes of
>> data here. Thus keeping subsets is not the best approach in any case.
>>
>>> Another example is that once a clustering algorithm is applied, users
>>> may want to preserve data in certain clusters for further processing. In
>>> that case, users can select one or more clusters and generate subsets of
>>> data.
>>
>>
>> WSO2 ML will only produce the models, and would NOT facilitate prediction
>> using built models. A built model will be published and will be applied by
>> CEP/BAM/ESB.
>>
>
> Building models (mainly from classification algorithms) and using them in
> BAM, ESB, etc. is a very good use case of ML. However, are we only focusing
> on classification (and model generation)? I think clustering is also a
> major part of data analysis and thus becomes an essential component of a
> data analysis product. If we want to support clustering, it may be useful
> to persist (or export) clustered data, as it is the main result of a
> clustering algorithm.
>
> Clusters (or subsets) of data can be persisted by recording only the
> cluster assignments (i.e. data point ID -> cluster ID) without repeating
> whole data points. Still, there can be a large number of mappings for a large
> dataset. Therefore, this is a trade-off that we have to make carefully, as
> clustering may also take considerable time depending on the size and
> properties of the dataset.
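
A minimal sketch of the assignment-only persistence described above, assuming each data point already has a stable ID. The function names `record_assignments` and `select_clusters` are hypothetical, and a plain dictionary stands in for whatever store would actually hold the mappings.

```python
def record_assignments(point_ids, cluster_ids):
    """Store only point ID -> cluster ID, not the data points themselves."""
    return dict(zip(point_ids, cluster_ids))


def select_clusters(assignments, wanted_clusters):
    """Return the IDs of the points that fall in the chosen clusters."""
    wanted = set(wanted_clusters)
    return [pid for pid, cid in assignments.items() if cid in wanted]


assignments = record_assignments(["p1", "p2", "p3", "p4"], [0, 1, 0, 2])
# A user picks clusters 0 and 2 for further processing.
subset_ids = select_clusters(assignments, [0, 2])
```

The mapping still grows linearly with the number of data points, which is the trade-off noted above; only the per-point payload is avoided.
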
>
> Regards,
> Chathura
>
>



-- 
*Supun Sethunga*
Software Engineer
WSO2, Inc.
lean | enterprise | middleware
Mobile : +94 716546324
_______________________________________________
Architecture mailing list
[email protected]
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
