vladhlinsky opened a new pull request #89: ATLAS-3646 Create new 'spark_ml_model_dataset', 'spark_ml_pipeline_dataset' relationship defs
URL: https://github.com/apache/atlas/pull/89

## What changes were proposed in this pull request?

Create new `spark_ml_model_dataset` and `spark_ml_pipeline_dataset` relationship definitions. This is required in order to integrate the Spark Atlas Connector's ML event processor.

Previously, the Spark Atlas Connector used the `spark_ml_directory` model for the ML model directory, along with the `spark_ml_model_ml_directory` and `spark_ml_pipeline_ml_directory` relationship definitions. Usage of `spark_ml_directory` was reverted in https://github.com/hortonworks-spark/spark-atlas-connector/issues/61 and https://github.com/hortonworks-spark/spark-atlas-connector/pull/62, so the ML model directory is now a `DataSet` entity (e.g. `hdfs_path`, `fs_path`). Thus, new relationship definitions must be created, since there is no straightforward way to update the existing ones to use the `DataSet` type instead of its child type `spark_ml_directory` (see [ATLAS-3640](https://issues.apache.org/jira/browse/ATLAS-3640)).
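For reference, such a definition can also be registered on a running instance through the typedefs REST API instead of editing the built-in model file. Below is a minimal sketch of the `spark_ml_model_dataset` definition, assuming a stock local Atlas install (`localhost:21000`, default `admin:admin` credentials). The `dataset` end name matches the relationship attribute used in the tests below, while the `model` end name, the cardinalities, and `propagateTags` are illustrative assumptions rather than the exact contents of the patch (`spark_ml_pipeline_dataset` would be analogous, with `spark_ml_pipeline` as `endDef1`):

```
# Sketch only: registers a spark_ml_model_dataset relationship def via the REST API.
# End names, cardinalities, and propagateTags are assumptions, not copied from the patch.
curl -s -u admin:admin -X POST 'http://localhost:21000/api/atlas/v2/types/typedefs' \
  -H 'Content-Type: application/json' \
  -d '{
        "relationshipDefs": [{
          "name": "spark_ml_model_dataset",
          "typeVersion": "1.0",
          "relationshipCategory": "ASSOCIATION",
          "propagateTags": "NONE",
          "endDef1": { "type": "spark_ml_model", "name": "dataset", "cardinality": "SINGLE" },
          "endDef2": { "type": "DataSet", "name": "model", "cardinality": "SET" }
        }]
      }'
```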
## How was this patch tested?

Created the following functions in order to test the proposed changes without the Spark Atlas Connector:

```
# Publish an ENTITY_CREATE_V2 notification for a spark_ml_directory entity
# to the ATLAS_HOOK Kafka topic.
function create_ml_directory() {
  NAME=$1
  TIMESTAMP=$(($(date +%s%N)/1000000))
  ML_DIR="{\"version\":{\"version\":\"1.0.0\",\"versionParts\":[1]},\"msgCompressionKind\":\"NONE\",
  \"msgSplitIdx\":1,\"msgSplitCount\":1,\"msgSourceIP\":\"172.27.12.6\",\"msgCreatedBy\":\"test\",
  \"msgCreationTime\":$TIMESTAMP,\"message\":{\"type\":\"ENTITY_CREATE_V2\",\"user\":\"test\",
  \"entities\":{\"entities\":[{\"typeName\":\"spark_ml_directory\",\"attributes\":
  {\"qualifiedName\":\"$NAME\",\"name\":\"$NAME\",\"uri\":\"hdfs://\",\"directory\":\"/test\"},
  \"isIncomplete\":false,\"provenanceType\":0,\"version\":0,\"proxy\":false}]}}}"
  # Unquoted expansion collapses the line breaks so the JSON is sent as a single message.
  echo $ML_DIR | ./bin/kafka-console-producer.sh --topic ATLAS_HOOK --broker-list localhost:9092
}

# Publish a spark_ml_model entity with a relationship attribute ($REL_NAME) pointing
# to an existing entity of type $DIR_TYPE identified by qualifiedName=$DIR_NAME.
function create_ml_model() {
  NAME=$1
  REL_NAME=$2
  DIR_TYPE=$3
  DIR_NAME=$4
  TIMESTAMP=$(($(date +%s%N)/1000000))
  ML_MODEL="{\"version\":{\"version\":\"1.0.0\",\"versionParts\":[1]},\"msgCompressionKind\":\"NONE\",
  \"msgSplitIdx\":1,\"msgSplitCount\":1,\"msgSourceIP\":\"172.27.12.6\",\"msgCreatedBy\":\"test\",
  \"msgCreationTime\":$TIMESTAMP,\"message\":{\"type\":\"ENTITY_CREATE_V2\",\"user\":\"test\",
  \"entities\":{\"entities\":[{\"typeName\":\"spark_ml_model\",\"attributes\":
  {\"qualifiedName\":\"$NAME\",\"name\":\"$NAME\"},\"isIncomplete\":false,\"provenanceType\":0,
  \"version\":0,\"relationshipAttributes\":{\"$REL_NAME\":{\"typeName\":\"$DIR_TYPE\",
  \"uniqueAttributes\":{\"qualifiedName\":\"$DIR_NAME\"}}},\"proxy\":false}]}}}"
  echo $ML_MODEL | ./bin/kafka-console-producer.sh --topic ATLAS_HOOK --broker-list localhost:9092
}

# Publish a spark_ml_pipeline entity with a relationship attribute; note that this
# message omits the version envelope used by the other helpers.
function create_ml_pipeline() {
  NAME=$1
  REL_NAME=$2
  DIR_TYPE=$3
  DIR_NAME=$4
  TIMESTAMP=$(($(date +%s%N)/1000000))
  ML_PIPELINE="{\"type\":\"ENTITY_CREATE_V2\",\"user\":\"test\",\"entities\":{\"entities\":[{\"typeName\":
  \"spark_ml_pipeline\",\"attributes\":{\"qualifiedName\":\"$NAME\",\"name\":\"$NAME\"},\"isIncomplete\":
  false,\"provenanceType\":0,\"version\":0,\"relationshipAttributes\":{\"$REL_NAME\":{\"typeName\":
  \"$DIR_TYPE\",\"uniqueAttributes\":{\"qualifiedName\":\"$DIR_NAME\"}}},\"proxy\":false}]}}}"
  echo $ML_PIPELINE | ./bin/kafka-console-producer.sh --topic ATLAS_HOOK --broker-list localhost:9092
}

# Publish an hdfs_path entity (a DataSet subtype) to use as the model/pipeline directory.
function create_hdfs_path() {
  NAME=$1
  TIMESTAMP=$(($(date +%s%N)/1000000))
  HDFS_PATH="{\"version\":{\"version\":\"1.0.0\",\"versionParts\":[1]},\"msgCompressionKind\":\"NONE\",
  \"msgSplitIdx\":1,\"msgSplitCount\":1,\"msgSourceIP\":\"172.27.12.6\",\"msgCreatedBy\":\"test\",
  \"msgCreationTime\":$TIMESTAMP,\"message\":{\"type\":\"ENTITY_CREATE_V2\",\"user\":\"test\",
  \"entities\":{\"entities\":[{\"typeName\":\"hdfs_path\",\"attributes\":{\"path\":\"$NAME\",
  \"qualifiedName\":\"$NAME\",\"clusterName\":\"test\",\"name\":\"$NAME\"},\"isIncomplete\":false,
  \"provenanceType\":0,\"version\":0,\"proxy\":false}]}}}"
  echo $HDFS_PATH | ./bin/kafka-console-producer.sh --topic ATLAS_HOOK --broker-list localhost:9092
}
```

Tested manually as follows:
- Start Atlas.
- Create Spark ML model and ML directory entities:
  ```
  create_ml_directory mldir_before_upgrade
  create_ml_model model_with_mldir_before_upgrade directory spark_ml_directory mldir_before_upgrade
  ```
- Verify that `model_with_mldir_before_upgrade` is accessible via the Atlas UI and has `mldir_before_upgrade` as `directory`.
- Stop Atlas.
- Update `1100-spark_model.json` with the proposed changes.
- Start Atlas.
- Verify that the previously created entities still exist.
- Create Spark ML model and ML directory entities:
  ```
  create_ml_directory mldir
  create_ml_model model_with_mldir directory spark_ml_directory mldir
  ```
- Verify that `model_with_mldir` is accessible via the Atlas UI and has `mldir` as `directory`.
- Create Spark ML model and `hdfs_path` entities:
  ```
  create_hdfs_path path
  create_ml_model model_with_path dataset hdfs_path path
  ```
- Verify that `model_with_path` is accessible via the Atlas UI and has `path` as `dataset` (the same check can also be done via the REST API; see the sketch after this list).
- Create Spark ML pipeline and ML directory entities:
  ```
  create_ml_directory mldir2
  create_ml_pipeline pipeline_with_mldir directory spark_ml_directory mldir2
  ```
- Verify that `pipeline_with_mldir` is accessible via the Atlas UI and has `mldir2` as `directory`.
- Create Spark ML pipeline and `hdfs_path` entities:
  ```
  create_hdfs_path path2
  create_ml_pipeline pipeline_with_path dataset hdfs_path path2
  ```
- Verify that `pipeline_with_path` is accessible via the Atlas UI and has `path2` as `dataset`.
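Besides the Atlas UI, each verification step can also be performed against the Atlas REST API. A minimal sketch, assuming a stock local instance with the default `admin:admin` credentials on port 21000 (the entity and attribute names come from the steps above):

```
# Fetch the model created by `create_ml_model model_with_path dataset hdfs_path path`
# via its unique attribute; the response's "relationshipAttributes" should contain
# a "dataset" entry referencing the hdfs_path entity "path".
curl -s -u admin:admin \
  'http://localhost:21000/api/atlas/v2/entity/uniqueAttribute/type/spark_ml_model?attr:qualifiedName=model_with_path'
```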
