On Thu, Oct 30, 2014 at 11:22 AM, Supun Sethunga <[email protected]> wrote:
> Hi Chathura,
>
>> Can there be a requirement to maintain subsets of the initial dataset
>> under a project?
>> For example, certain methods of preprocessing or slicing the dataset in
>> different dimensions could produce multiple subsets of data. Subsequently,
>> we may want to apply various mining algorithms on these resulting subsets
>> of data. As computing such subsets can be costly, it may be beneficial to
>> store precomputed subsets of data under a project.
>
> IMHO, I don't see any requirement of keeping subsets of data. If you look
> at the ER diagram, by having multiple "Processes" and "Executions" to a
> single project, this requirement is satisfied. In a process, users can do
> pre-processing and things like dimensionality reduction + select a
> training-set. This means a project can have multiple training sets. (The
> actual pre-processing will NOT be done at this point, because otherwise we
> have to iterate through a fairly large dataset a few times, which is an
> overhead. Hence only the configurations will be taken as user inputs and
> the actual pre-processing will be done later as map-reduced jobs.)
>
> Then in "Execution", user can use the previous pre-processing
> configuration and run the model building with a desired algorithm. This
> way, multiple algorithms can be applied on the same training-set.
>
> Apart from that, here we may be talking about Gigabytes or even Terabytes
> of data. Thus keeping subsets is not the best thing in any way.
>
>> Another example is that once a clustering algorithm is applied, users may
>> want to preserve data in certain clusters for further processing. In that
>> case, users can select one or more clusters and generate subsets of data.
>
> WSO2 ML will only produce the models, and would NOT facilitate prediction
> using built models. A built model will be published and will be applied by
> CEP/BAM/ESB.

Building models (mainly from classification algorithms) and using them in
BAM, ESB, etc. is a very good use case of ML. However, are we only focusing
on classification (and model generation)? I think clustering is also a major
part of data analysis and is therefore an essential component of a data
analysis product. If we want to support clustering, it may be useful to
persist (or export) clustered data, as it is the main result of a clustering
algorithm.

Clusters (or subsets) of data can be persisted by recording only the cluster
assignments (i.e. data point ID -> cluster ID) without repeating whole data
points. There can still be a large number of mappings for a large dataset.
This is therefore a trade-off that we have to make carefully, as re-running
the clustering may also take considerable time depending on the size and
properties of the dataset.

Regards,
Chathura
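P.S. To make the cluster-assignment idea concrete, here is a minimal sketch in Python. It is only an illustration, not a proposal for the actual WSO2 ML schema: the names `invert_assignments` and `select_clusters`, and the (ID, features) record layout, are assumptions made up for this example. Only the ID -> cluster ID mapping is stored; a subset is materialized on demand by streaming over the original dataset once.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Iterator, List, Tuple

# Hypothetical record layout: (data point ID, feature vector).
Record = Tuple[Hashable, List[float]]


def invert_assignments(assignments: Dict[Hashable, int]) -> Dict[int, List[Hashable]]:
    """Group data point IDs by cluster ID (cluster ID -> list of point IDs)."""
    clusters = defaultdict(list)
    for point_id, cluster_id in assignments.items():
        clusters[cluster_id].append(point_id)
    return dict(clusters)


def select_clusters(
    dataset: Iterable[Record],
    assignments: Dict[Hashable, int],
    wanted: Iterable[int],
) -> Iterator[Record]:
    """Lazily yield only the records whose cluster is in `wanted`.

    The dataset is scanned once; no copy of the subset is stored.
    """
    wanted_set = set(wanted)
    for point_id, features in dataset:
        if assignments.get(point_id) in wanted_set:
            yield point_id, features


# Toy data, invented for illustration only.
dataset = [(1, [0.1, 0.2]), (2, [5.0, 5.1]), (3, [0.0, 0.3]), (4, [9.0, 9.2])]
# The only thing persisted after clustering: data point ID -> cluster ID.
assignments = {1: 0, 2: 1, 3: 0, 4: 2}

subset = list(select_clusters(dataset, assignments, wanted=[0, 2]))
by_cluster = invert_assignments(assignments)
```

The stored mapping grows with the number of data points rather than with their dimensionality, which is the trade-off mentioned above: one small entry per point versus re-running the clustering.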
_______________________________________________
Architecture mailing list
[email protected]
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
