[ https://issues.apache.org/jira/browse/ATLAS-3570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Barbara Eckman reassigned ATLAS-3570:
-------------------------------------

    Assignee:     (was: Barbara Eckman)

> Atlas typedefs for Machine Learning Models, Feature Sets, and Feature Engineering Engines
> -----------------------------------------------------------------------------------------
>
>                 Key: ATLAS-3570
>                 URL: https://issues.apache.org/jira/browse/ATLAS-3570
>             Project: Atlas
>          Issue Type: New Feature
>            Reporter: Barbara Eckman
>            Priority: Major
>         Attachments: MLModel_typedefs.tar
>
>
> Currently the base types in Atlas do not include Machine Learning (ML) Model
> tables. It would be nice to add typedefs for them, so they could be part of
> enterprise discovery and versioning.
>
> ENTITIES COULD INCLUDE:
>
> MLModel (overview info), with attributes:
> * uniqueId
> * version
> * businessUseCase
> * modelFramework (e.g. scikit-learn)
> * modelTypes (e.g. random forest regressor)
> * modelClass (e.g. random forest (bagging + decision trees))
> * isEnsemble (boolean)
> * outcomeTypeDescription (e.g. single float)
> * dataScienceOwnerEmail
> * githubRepoURL (where the model code is found)
> * modelDeploymentDate
> * populationScored (e.g. in Comcast, residential or business customers)
> * accuracyMeasures
>
> MLModelExecution, with attributes:
> * exampleInputDatasetURL (URL where a sample input dataset can be found)
> * outputTargetDatasetURLs
> * opsOwnerEmail
> * executionEndpointURL
> * dockerContainerURL
> * MLFlowPointerURL
> * executionNotebookURL (e.g. Databricks, Jupyter)
>
> MLModelTraining, with attributes:
> * hyperParameters
> * trainingDatasetURLs
> * trainingNotebookURL (e.g. Databricks, Jupyter)
>
> FeatureSet (a set of features prepared as input to an ML model), with
> attributes:
> * version
> * locationURL
>
> FeatureEngineeringEngine (the engine that generates the feature set for an ML
> model), with attributes:
> * version
> * ownerEmail
> * inputSourceURL
> * processingEngineInfoURL (docs on the processing engine)
> * githubRepoURL
> * outputTargetURL
>
> RELATIONSHIPS could include:
> * model to execution
> * model to training
> * model execution to example input dataset (e.g. Kafka topic)
> * model execution to output target dataset (e.g. S3 prefix or object)
> * model execution to input schema
> * model execution to output schema
> * model execution to input feature set objects
> * training to input training dataset objects
> * training to input training dataset schema
> * feature engineering engine to output feature set object
> * feature engineering engine to input source dataset (e.g. Kafka topic)
> * feature engineering engine to input source dataset's schema
> * feature engineering engine to output target dataset (e.g. S3 object)
> * feature set object to its schema
>
> ENUMs could include:
> * MLModel_type (e.g. logistic regression, random_forest_regression)
>
> PROCESSES related to MLModels could include:
> * MLPipelineDependencyEdge (dependency between two models in the ML pipeline)
> ** inputs and outputs are both MLModels
> * MLModelEvolutionEdge (lineage between 2 versions of an ML model)
> ** inputs and outputs are both MLModels
> ** only attribute is an array of strings representing changes made from one
> version to the other. This could be made more structured as we discover how
> it is used.
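
To make the proposal concrete, here is a minimal sketch of how a few of the
items above might look in the Atlas typedef JSON layout (enumDefs, entityDefs,
relationshipDefs). It is not the content of the attached MLModel_typedefs.tar,
which is not reproduced in this message. The attribute types, the optional and
unique flags, the superTypes, the relationship name "MLModel_executions" and
its end names are all assumptions made for illustration. MLModel is assumed to
extend DataSet only so that it can appear as an input or output of a
Process-based edge type.

```python
import json


def attr(name, type_name="string", optional=True):
    """Build one attributeDef. Types and flags are assumptions, not from the issue."""
    is_array = type_name.startswith("array<")
    return {
        "name": name,
        "typeName": type_name,
        "cardinality": "SET" if is_array else "SINGLE",
        "isOptional": optional,
        "isUnique": False,
        "isIndexable": False,
    }


# Enum from the issue; values normalized from its two examples.
ml_model_type = {
    "name": "MLModel_type",
    "elementDefs": [
        {"value": "logistic_regression", "ordinal": 0},
        {"value": "random_forest_regression", "ordinal": 1},
    ],
}

# A subset of the MLModel attributes listed in the issue.
ml_model = {
    "name": "MLModel",
    "superTypes": ["DataSet"],  # assumption, see lead-in
    "typeVersion": "1.0",
    "attributeDefs": [
        attr("uniqueId", optional=False),
        attr("version"),
        attr("businessUseCase"),
        attr("modelFramework"),
        attr("modelTypes", "array<MLModel_type>"),
        attr("isEnsemble", "boolean"),
        attr("dataScienceOwnerEmail"),
        attr("githubRepoURL"),
        attr("modelDeploymentDate", "date"),
    ],
}

ml_model_execution = {
    "name": "MLModelExecution",
    "superTypes": ["Asset"],  # assumption
    "typeVersion": "1.0",
    "attributeDefs": [
        attr("exampleInputDatasetURL"),
        attr("outputTargetDatasetURLs", "array<string>"),
        attr("opsOwnerEmail"),
        attr("executionEndpointURL"),
    ],
}

# Lineage between two versions of a model; inputs/outputs come from Process.
ml_model_evolution_edge = {
    "name": "MLModelEvolutionEdge",
    "superTypes": ["Process"],
    "typeVersion": "1.0",
    "attributeDefs": [attr("changes", "array<string>")],
}

# "model to execution" relationship; name and end names are hypothetical.
model_to_execution = {
    "name": "MLModel_executions",
    "relationshipCategory": "ASSOCIATION",
    "propagateTags": "NONE",
    "endDef1": {"type": "MLModel", "name": "executions",
                "isContainer": False, "cardinality": "SET"},
    "endDef2": {"type": "MLModelExecution", "name": "model",
                "isContainer": False, "cardinality": "SINGLE"},
}

typedefs = {
    "enumDefs": [ml_model_type],
    "structDefs": [],
    "classificationDefs": [],
    "entityDefs": [ml_model, ml_model_execution, ml_model_evolution_edge],
    "relationshipDefs": [model_to_execution],
}

# JSON document that could be submitted to an Atlas instance.
payload = json.dumps(typedefs, indent=2)
```

A payload of this shape is what the Atlas types REST API
(POST /api/atlas/v2/types/typedefs) accepts. The remaining entities
(MLModelTraining, FeatureSet, FeatureEngineeringEngine) and relationships
would follow the same pattern.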