vladhlinsky opened a new pull request #89: ATLAS-3646 Create new 'spark_ml_model_dataset', 'spark_ml_pipeline_dataset' relationship defs
URL: https://github.com/apache/atlas/pull/89

## What changes were proposed in this pull request?

Create new `spark_ml_model_dataset` and `spark_ml_pipeline_dataset` relationship definitions. This is required in order to integrate the Spark Atlas Connector's ML event processor.

Previously, the Spark Atlas Connector used the `spark_ml_directory` model for the ML model directory, along with the `spark_ml_model_ml_directory` and `spark_ml_pipeline_ml_directory` relationship definitions. Usage of `spark_ml_directory` was reverted in https://github.com/hortonworks-spark/spark-atlas-connector/issues/61 and https://github.com/hortonworks-spark/spark-atlas-connector/pull/62, so the ML model directory is now a `DataSet` entity (e.g. `hdfs_path`, `fs_path`). Thus, new relationship definitions must be created, since there is no straightforward way to update the existing ones to use the `DataSet` type instead of its child type `spark_ml_directory` (see [ATLAS-3640](https://issues.apache.org/jira/browse/ATLAS-3640)).
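For reference, such a definition can also be registered on a running instance through the typedefs REST API instead of editing the built-in model file. Below is a minimal sketch of the `spark_ml_model_dataset` definition, assuming a stock local Atlas install (`localhost:21000`, default `admin:admin` credentials). The `dataset` end name matches the relationship attribute used in the tests below, while the `model` end name, the cardinalities, and `propagateTags` are illustrative assumptions rather than the exact contents of the patch (`spark_ml_pipeline_dataset` would be analogous, with `spark_ml_pipeline` as `endDef1`):

```
# Sketch only: registers a spark_ml_model_dataset relationship def via the REST API.
# End names, cardinalities, and propagateTags are assumptions, not copied from the patch.
curl -s -u admin:admin -X POST 'http://localhost:21000/api/atlas/v2/types/typedefs' \
  -H 'Content-Type: application/json' \
  -d '{
        "relationshipDefs": [{
          "name": "spark_ml_model_dataset",
          "typeVersion": "1.0",
          "relationshipCategory": "ASSOCIATION",
          "propagateTags": "NONE",
          "endDef1": { "type": "spark_ml_model", "name": "dataset", "cardinality": "SINGLE" },
          "endDef2": { "type": "DataSet", "name": "model", "cardinality": "SET" }
        }]
      }'
```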
## How was this patch tested?

Created the following functions in order to test the proposed changes without the Spark Atlas Connector:

```
# Publish an ENTITY_CREATE_V2 notification for a spark_ml_directory entity
# to the ATLAS_HOOK Kafka topic.
function create_ml_directory() {
  NAME=$1
  TIMESTAMP=$(($(date +%s%N)/1000000))
  ML_DIR="{\"version\":{\"version\":\"1.0.0\",\"versionParts\":[1]},\"msgCompressionKind\":\"NONE\",
  \"msgSplitIdx\":1,\"msgSplitCount\":1,\"msgSourceIP\":\"172.27.12.6\",\"msgCreatedBy\":\"test\",
  \"msgCreationTime\":$TIMESTAMP,\"message\":{\"type\":\"ENTITY_CREATE_V2\",\"user\":\"test\",
  \"entities\":{\"entities\":[{\"typeName\":\"spark_ml_directory\",\"attributes\":
  {\"qualifiedName\":\"$NAME\",\"name\":\"$NAME\",\"uri\":\"hdfs://\",\"directory\":\"/test\"},
  \"isIncomplete\":false,\"provenanceType\":0,\"version\":0,\"proxy\":false}]}}}"
  # Unquoted expansion collapses the line breaks so the JSON is sent as a single message.
  echo $ML_DIR | ./bin/kafka-console-producer.sh --topic ATLAS_HOOK --broker-list localhost:9092
}

# Publish a spark_ml_model entity with a relationship attribute ($REL_NAME) pointing
# to an existing entity of type $DIR_TYPE identified by qualifiedName=$DIR_NAME.
function create_ml_model() {
  NAME=$1
  REL_NAME=$2
  DIR_TYPE=$3
  DIR_NAME=$4
  TIMESTAMP=$(($(date +%s%N)/1000000))
  ML_MODEL="{\"version\":{\"version\":\"1.0.0\",\"versionParts\":[1]},\"msgCompressionKind\":\"NONE\",
  \"msgSplitIdx\":1,\"msgSplitCount\":1,\"msgSourceIP\":\"172.27.12.6\",\"msgCreatedBy\":\"test\",
  \"msgCreationTime\":$TIMESTAMP,\"message\":{\"type\":\"ENTITY_CREATE_V2\",\"user\":\"test\",
  \"entities\":{\"entities\":[{\"typeName\":\"spark_ml_model\",\"attributes\":
  {\"qualifiedName\":\"$NAME\",\"name\":\"$NAME\"},\"isIncomplete\":false,\"provenanceType\":0,
  \"version\":0,\"relationshipAttributes\":{\"$REL_NAME\":{\"typeName\":\"$DIR_TYPE\",
  \"uniqueAttributes\":{\"qualifiedName\":\"$DIR_NAME\"}}},\"proxy\":false}]}}}"
  echo $ML_MODEL | ./bin/kafka-console-producer.sh --topic ATLAS_HOOK --broker-list localhost:9092
}

# Publish a spark_ml_pipeline entity with a relationship attribute; note that this
# message omits the version envelope used by the other helpers.
function create_ml_pipeline() {
  NAME=$1
  REL_NAME=$2
  DIR_TYPE=$3
  DIR_NAME=$4
  TIMESTAMP=$(($(date +%s%N)/1000000))
  ML_PIPELINE="{\"type\":\"ENTITY_CREATE_V2\",\"user\":\"test\",\"entities\":{\"entities\":[{\"typeName\":
  \"spark_ml_pipeline\",\"attributes\":{\"qualifiedName\":\"$NAME\",\"name\":\"$NAME\"},\"isIncomplete\":
  false,\"provenanceType\":0,\"version\":0,\"relationshipAttributes\":{\"$REL_NAME\":{\"typeName\":
  \"$DIR_TYPE\",\"uniqueAttributes\":{\"qualifiedName\":\"$DIR_NAME\"}}},\"proxy\":false}]}}}"
  echo $ML_PIPELINE | ./bin/kafka-console-producer.sh --topic ATLAS_HOOK --broker-list localhost:9092
}

# Publish an hdfs_path entity (a DataSet subtype) to use as the model/pipeline directory.
function create_hdfs_path() {
  NAME=$1
  TIMESTAMP=$(($(date +%s%N)/1000000))
  HDFS_PATH="{\"version\":{\"version\":\"1.0.0\",\"versionParts\":[1]},\"msgCompressionKind\":\"NONE\",
  \"msgSplitIdx\":1,\"msgSplitCount\":1,\"msgSourceIP\":\"172.27.12.6\",\"msgCreatedBy\":\"test\",
  \"msgCreationTime\":$TIMESTAMP,\"message\":{\"type\":\"ENTITY_CREATE_V2\",\"user\":\"test\",
  \"entities\":{\"entities\":[{\"typeName\":\"hdfs_path\",\"attributes\":{\"path\":\"$NAME\",
  \"qualifiedName\":\"$NAME\",\"clusterName\":\"test\",\"name\":\"$NAME\"},\"isIncomplete\":false,
  \"provenanceType\":0,\"version\":0,\"proxy\":false}]}}}"
  echo $HDFS_PATH | ./bin/kafka-console-producer.sh --topic ATLAS_HOOK --broker-list localhost:9092
}
```

Tested manually as follows:
- Start Atlas.
- Create Spark ML model and ML directory entities:
  ```
  create_ml_directory mldir_before_upgrade
  create_ml_model model_with_mldir_before_upgrade directory spark_ml_directory mldir_before_upgrade
  ```
- Verify that `model_with_mldir_before_upgrade` is accessible via the Atlas UI and has `mldir_before_upgrade` as `directory`.
- Stop Atlas.
- Update `1100-spark_model.json` with the proposed changes.
- Start Atlas.
- Verify that the previously created entities still exist.
- Create Spark ML model and ML directory entities:
  ```
  create_ml_directory mldir
  create_ml_model model_with_mldir directory spark_ml_directory mldir
  ```
- Verify that `model_with_mldir` is accessible via the Atlas UI and has `mldir` as `directory`.
- Create Spark ML model and `hdfs_path` entities:
  ```
  create_hdfs_path path
  create_ml_model model_with_path dataset hdfs_path path
  ```
- Verify that `model_with_path` is accessible via the Atlas UI and has `path` as `dataset` (the same check can also be done via the REST API; see the sketch after this list).
- Create Spark ML pipeline and ML directory entities:
  ```
  create_ml_directory mldir2
  create_ml_pipeline pipeline_with_mldir directory spark_ml_directory mldir2
  ```
- Verify that `pipeline_with_mldir` is accessible via the Atlas UI and has `mldir2` as `directory`.
- Create Spark ML pipeline and `hdfs_path` entities:
  ```
  create_hdfs_path path2
  create_ml_pipeline pipeline_with_path dataset hdfs_path path2
  ```
- Verify that `pipeline_with_path` is accessible via the Atlas UI and has `path2` as `dataset`.
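Besides the Atlas UI, each verification step can also be performed against the Atlas REST API. A minimal sketch, assuming a stock local instance with the default `admin:admin` credentials on port 21000 (the entity and attribute names come from the steps above):

```
# Fetch the model created by `create_ml_model model_with_path dataset hdfs_path path`
# via its unique attribute; the response's "relationshipAttributes" should contain
# a "dataset" entry referencing the hdfs_path entity "path".
curl -s -u admin:admin \
  'http://localhost:21000/api/atlas/v2/entity/uniqueAttribute/type/spark_ml_model?attr:qualifiedName=model_with_path'
```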
