On Thu, Oct 30, 2014 at 11:22 AM, Supun Sethunga <[email protected]> wrote:

> Hi Chathura,
>
>> Can there be a requirement to maintain subsets of the initial dataset
>> under a project?
>> For example, certain methods of preprocessing or slicing the dataset in
>> different dimensions could produce multiple subsets of data. Subsequently,
>> we may want to apply various mining algorithms on these resulting subsets
>> of data. As computing such subsets can be costly, it may be beneficial to
>> store precomputed subsets of data under a project.
>
>
> IMHO, I don't see any requirement for keeping subsets of data. If you look
> at the ER diagram, this requirement is satisfied by having multiple
> "Processes" and "Executions" under a single project. In a process, users can
> do pre-processing and things like dimensionality reduction, and select a
> training set. This means a project can have multiple training sets. (The
> actual pre-processing will NOT be done at this point, because otherwise we
> would have to iterate through a fairly large dataset a few times, which is
> an overhead. Hence only the configurations will be taken as user inputs, and
> the actual pre-processing will be done later as map-reduce jobs.)
>
> Then in an "Execution", the user can take the previous pre-processing
> configuration and run the model building with a desired algorithm. This
> way, multiple algorithms can be applied to the same training set.
>
> Apart from that, we may be talking about gigabytes or even terabytes of
> data here. Thus keeping subsets is not the best option in any case.
>
>> Another example is that once a clustering algorithm is applied, users may
>> want to preserve data in certain clusters for further processing. In that
>> case, users can select one or more clusters and generate subsets of data.
>
>
> WSO2 ML will only produce the models, and would NOT facilitate prediction
> using built models. A built model will be published and will be applied by
> CEP/BAM/ESB.
>

Building models (mainly from classification algorithms) and using them in
BAM, ESB, etc. is a very good use case for ML. However, are we only focusing
on classification (and model generation)? I think clustering is also a
major part of data analysis and is thus an essential component of a
data analysis product. If we want to support clustering, it may be useful
to persist (or export) clustered data, as it is the main result of a
clustering algorithm.

Clusters (or subsets) of data can be persisted by recording only the
cluster assignments (i.e. data point ID -> cluster ID) without duplicating
the data points themselves. Still, there can be a large number of mappings
for a large dataset. Therefore, this is a trade-off that we have to make
carefully, as clustering may also take considerable time depending on the
size and properties of the dataset.
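
To make the idea concrete, here is a rough sketch in Python of what I mean by
persisting only the assignments. It is an illustration only, not a proposal for
the actual WSO2 ML design: the function names, the in-memory dict, and the
sample data are all hypothetical, and a real implementation would keep the
mapping in a proper store rather than in memory.

```python
from collections import defaultdict


def group_ids_by_cluster(assignments):
    """Invert {point_id: cluster_id} into {cluster_id: [point_id, ...]}."""
    clusters = defaultdict(list)
    for point_id, cluster_id in assignments.items():
        clusters[cluster_id].append(point_id)
    return dict(clusters)


def select_subset(dataset, assignments, wanted_clusters):
    """Lazily yield (point_id, row) pairs that belong to the wanted clusters.

    The dataset is streamed once and no copy of the subset is stored;
    only the small ID -> cluster mapping has to be persisted.
    """
    wanted = set(wanted_clusters)
    for point_id, row in dataset:
        if assignments.get(point_id) in wanted:
            yield point_id, row


# Hypothetical sample data: (point_id, feature_vector) pairs.
dataset = [
    (1, [0.1, 0.2]),
    (2, [5.0, 5.1]),
    (3, [0.0, 0.3]),
    (4, [9.0, 9.2]),
]

# Hypothetical output of a clustering run: point ID -> cluster ID.
assignments = {1: 0, 2: 1, 3: 0, 4: 2}

by_cluster = group_ids_by_cluster(assignments)
subset = list(select_subset(dataset, assignments, [0, 2]))
```

The mapping grows linearly with the number of data points, which is the
storage side of the trade-off mentioned above. The subset itself is only
materialized when a later step asks for it.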

Regards,
Chathura
_______________________________________________
Architecture mailing list
[email protected]
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
