Hi Chathura,

Yes, ML will support clustering as well, and the output will be the clustered points. Please refer to the email thread [1] and the ML workflow in [2].

[1] [Architecture] Machine Learning Workflow
[2] https://drive.google.com/a/wso2.com/file/d/0B1KWJ16Pv0V0U3BvcVZuZUR4WE0/view?usp=sharing

Regards,
Supun

On Mon, Nov 3, 2014 at 8:26 AM, Chathura Ekanayake <[email protected]> wrote:

> On Thu, Oct 30, 2014 at 11:22 AM, Supun Sethunga <[email protected]> wrote:
>
>> Hi Chathura,
>>
>>> Can there be a requirement to maintain subsets of the initial dataset
>>> under a project?
>>> For example, certain methods of preprocessing or slicing the dataset in
>>> different dimensions could produce multiple subsets of data. Subsequently,
>>> we may want to apply various mining algorithms on these resulting subsets
>>> of data. As computing such subsets can be costly, it may be beneficial to
>>> store precomputed subsets of data under a project.
>>
>> IMHO, I don't see any requirement of keeping subsets of data. If you look
>> at the ER diagram, by having multiple "Processes" and "Executions" to a
>> single project, this requirement is satisfied. In a process, users can do
>> pre-processing and things like dimensionality reduction + select a
>> training-set. This means a project can have multiple training sets. (The
>> actual pre-processing will NOT be done at this point, because otherwise we
>> have to iterate through a fairly large dataset a few times, which is an
>> overhead. Hence only the configurations will be taken as user inputs and
>> the actual pre-processing will be done later as map-reduce jobs.)
>>
>> Then in "Execution", the user can use the previous pre-processing
>> configuration and run the model building with a desired algorithm. This
>> way, multiple algorithms can be applied on the same training-set.
>>
>> Apart from that, here we may be talking about gigabytes or even terabytes
>> of data. Thus keeping subsets is not the best thing in any way.
>>
>>> Another example is that once a clustering algorithm is applied, users
>>> may want to preserve data in certain clusters for further processing. In
>>> that case, users can select one or more clusters and generate subsets of
>>> data.
>>
>> WSO2 ML will only produce the models, and would NOT facilitate prediction
>> using built models. A built model will be published and will be applied by
>> CEP/BAM/ESB.
>
> Building models (mainly from classification algorithms) and using them in
> BAM, ESB, etc. is a very good use case of ML. However, are we only focusing
> on classification (and model generation)? I think clustering is also a
> major part of data analysis and thus becomes an essential component in a
> data analysis product. If we want to support clustering, it may be useful
> to persist (or export) clustered data, as it is the main result of a
> clustering algorithm.
>
> Clusters (or subsets) of data can be persisted by only recording the
> cluster assignments (i.e. data point ID -> cluster ID) without repeating
> whole data points. Still, there can be a large number of mappings for a
> large dataset. Therefore, this is a trade-off that we have to make
> carefully, as clustering may also take considerable time depending on the
> size and properties of the dataset.
>
> Regards,
> Chathura

--
*Supun Sethunga*
Software Engineer
WSO2, Inc.
lean | enterprise | middleware
Mobile : +94 716546324
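
For illustration only, here is a minimal sketch of the cluster-assignment idea raised in the thread: persist only the mapping "data point ID -> cluster ID" and rebuild a subset on demand, instead of storing copies of the data points. Everything in it is an assumption made for the example and not part of WSO2 ML: the function names (assign_clusters, select_clusters), the in-memory dictionaries, the sample points, and the already-fitted centroids are all hypothetical.

```python
def assign_clusters(points, centroids):
    """Map each point ID to the index of its nearest centroid.

    Uses squared Euclidean distance. Only the ID -> cluster ID mapping is
    returned; the coordinates themselves are not copied.
    """
    assignments = {}
    for point_id, coords in points.items():
        distances = [
            sum((a - b) ** 2 for a, b in zip(coords, centroid))
            for centroid in centroids
        ]
        assignments[point_id] = distances.index(min(distances))
    return assignments


def select_clusters(points, assignments, wanted):
    """Rebuild the subset of points that belong to the chosen clusters."""
    wanted = set(wanted)
    return {pid: points[pid] for pid, cid in assignments.items() if cid in wanted}


# Hypothetical sample data: four 2-D points keyed by ID.
points = {
    "p1": (1.0, 1.0),
    "p2": (1.2, 0.8),
    "p3": (8.0, 9.0),
    "p4": (9.0, 8.5),
}
# Hypothetical centroids, assumed to come from an earlier clustering run.
centroids = [(1.0, 1.0), (9.0, 9.0)]

assignments = assign_clusters(points, centroids)
subset = select_clusters(points, assignments, wanted=[1])
```

The mapping holds one small entry per data point, so its size still grows linearly with the dataset, which is the trade-off mentioned above.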
_______________________________________________
Architecture mailing list
[email protected]
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
