[ 
https://issues.apache.org/jira/browse/ATLAS-3570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Barbara Eckman reassigned ATLAS-3570:
-------------------------------------

    Assignee:     (was: Barbara Eckman)

> Atlas typedefs for Machine Learning Models, Feature Sets, and Feature 
> Engineering Engines
> -----------------------------------------------------------------------------------------
>
>                 Key: ATLAS-3570
>                 URL: https://issues.apache.org/jira/browse/ATLAS-3570
>             Project: Atlas
>          Issue Type: New Feature
>            Reporter: Barbara Eckman
>            Priority: Major
>         Attachments: MLModel_typedefs.tar
>
>
> Currently the base types in Atlas do not include Machine Learning (ML) Model 
> tables. It would be nice to add typedefs for them, so they could be part of 
> enterprise discovery and versioning.  
> ENTITIES COULD INCLUDE:
> MLModel (overview info), with attributes:
>  * uniqueId
>  * version
>  * businessUseCase
>  * modelFramework (eg scikit-learn)
>  * modelTypes (eg random forest regressor)
>  * modelClass (eg random forest (bagging + decision trees))
>  * isEnsemble boolean
>  * outcomeTypeDescription (eg single float)
>  * dataScienceOwnerEmail
>  * githubRepoURL where the model code is found
>  * modelDeploymentDate
>  * populationScored (eg in Comcast, residential or business customers)
>  * accuracyMeasures
> MLModelExecution, with attributes:
>  * exampleInputDatasetURL (URL where a sample input dataset can be found)
>  * outputTargetDatasetURLs
>  * opsOwnerEmail
>  * executionEndpointURL
>  * dockerContainerURL
>  * MLFlowPointerURL
>  * executionNotebookURL (eg Databricks, Jupyter)
> MLModelTraining, with attributes:
>  * hyperParameters
>  * trainingDatasetURLs
>  * trainingNotebookURL (eg Databricks, Jupyter)
> FeatureSet (a set of features prepared as input to an ML model), with 
> attributes:
>  * version
>  * locationURL 
> FeatureEngineeringEngine (the engine that generates the feature set for an ML 
> model), with attributes:
>  * version
>  * ownerEmail
>  * inputSourceURL
>  * processingEngineInfoURL (docs on the processing engine)
>  * githubRepoURL 
>  * outputTargetURL
> RELATIONSHIPS could include:
>  * model to execution
>  * model to training
>  * model execution to example input dataset (eg kafka topic)
>  * model execution to output target dataset (eg S3 prefix or object)
>  * model execution to input schema
>  * model execution to output schema
>  * model execution to input feature set objects
>  * training to input training dataset objects
>  * training to input training dataset schema
>  * feature engineering engine to output feature set object
>  * feature engineering engine to input source dataset (eg kafka topic)
>  * feature engineering engine to input source dataset's schema
>  * feature engineering engine to output target dataset (eg S3 object)
>  * feature set object to its schema
> ENUMs could include:
>  * MLModel_type (eg logistic regression, random_forest_regression)
> PROCESSES related to MLModels could include:
>  * MLPipelineDependencyEdge (dependency between two models in the ML pipeline)
>  ** inputs and outputs are both MLModels
>  * MLModelEvolutionEdge (lineage between 2 versions of an ML model)
>  ** inputs and outputs are both MLModels
>  ** only attribute is an array of strings representing changes made from one 
> version to the other.  This could be made more structured as we discover how 
> it is used.
>  
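
To make the proposal above concrete, here is a minimal sketch of how a few of
the listed types might be expressed as an Atlas typedefs payload. It is not
the content of the attached MLModel_typedefs.tar. The choice of DataSet and
Process as supertypes, the attribute type names, the relationship name
MLModel_executions and its end names, and the attribute name "changes" are
all assumptions made for illustration. Only a subset of the attributes is
shown.

```python
import json


def attr(name, type_name="string", optional=True, cardinality="SINGLE"):
    """Build one attributeDef dict with conservative defaults."""
    return {
        "name": name,
        "typeName": type_name,
        "isOptional": optional,
        "cardinality": cardinality,
        "isUnique": False,
        "isIndexable": False,
    }


# Enum from the proposal; the two values are the examples given in the issue.
ml_model_type = {
    "name": "MLModel_type",
    "elementDefs": [
        {"value": "logistic_regression", "ordinal": 0},
        {"value": "random_forest_regression", "ordinal": 1},
    ],
}

# Assumption: MLModel extends DataSet so that it can appear as an input or
# output of a Process-derived type such as MLModelEvolutionEdge.
ml_model = {
    "name": "MLModel",
    "superTypes": ["DataSet"],
    "attributeDefs": [
        attr("uniqueId", optional=False),
        attr("version"),
        attr("businessUseCase"),
        attr("modelFramework"),
        attr("modelTypes", "array<MLModel_type>", cardinality="SET"),
        attr("isEnsemble", "boolean"),
        attr("dataScienceOwnerEmail"),
        attr("githubRepoURL"),
        attr("modelDeploymentDate", "date"),
    ],
}

ml_model_execution = {
    "name": "MLModelExecution",
    "superTypes": ["DataSet"],  # assumption, see above
    "attributeDefs": [
        attr("exampleInputDatasetURL"),
        attr("outputTargetDatasetURLs", "array<string>", cardinality="SET"),
        attr("opsOwnerEmail"),
        attr("executionEndpointURL"),
    ],
}

# Lineage between two versions of a model; inputs and outputs are inherited
# from Process. The attribute name "changes" is hypothetical.
ml_model_evolution_edge = {
    "name": "MLModelEvolutionEdge",
    "superTypes": ["Process"],
    "attributeDefs": [attr("changes", "array<string>", cardinality="LIST")],
}

# "model to execution" relationship; name and end names are hypothetical.
model_to_execution = {
    "name": "MLModel_executions",
    "relationshipCategory": "ASSOCIATION",
    "propagateTags": "NONE",
    "endDef1": {"type": "MLModel", "name": "executions",
                "isContainer": False, "cardinality": "SET"},
    "endDef2": {"type": "MLModelExecution", "name": "model",
                "isContainer": False, "cardinality": "SINGLE"},
}

typedefs = {
    "enumDefs": [ml_model_type],
    "entityDefs": [ml_model, ml_model_execution, ml_model_evolution_edge],
    "relationshipDefs": [model_to_execution],
}

payload = json.dumps(typedefs, indent=2)
print(payload)
```

The script only builds and prints the JSON. A payload of this shape could
then be registered through the Atlas typedefs REST endpoint, after adjusting
the supertypes and attribute types to match the real models.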



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
