turcsanyip commented on a change in pull request #3979: NIFI-7009: Atlas 
reporting task retrieves only the active flow compon…
URL: https://github.com/apache/nifi/pull/3979#discussion_r367326456
 
 

 ##########
 File path: nifi-nar-bundles/nifi-atlas-bundle/nifi-atlas-reporting-task/src/main/java/org/apache/nifi/atlas/NiFiAtlasClient.java
 ##########
 @@ -201,12 +202,12 @@ public NiFiFlow fetchNiFiFlow(String rootProcessGroupId, String clusterName) thr
         nifiFlow.setUrl(toStr(attributes.get(ATTR_URL)));
         nifiFlow.setDescription(toStr(attributes.get(ATTR_DESCRIPTION)));
 
-        nifiFlow.getQueues().putAll(toQualifiedNameIds(toAtlasObjectIds(nifiFlowEntity.getAttribute(ATTR_QUEUES))));
-        nifiFlow.getRootInputPortEntities().putAll(toQualifiedNameIds(toAtlasObjectIds(nifiFlowEntity.getAttribute(ATTR_INPUT_PORTS))));
-        nifiFlow.getRootOutputPortEntities().putAll(toQualifiedNameIds(toAtlasObjectIds(nifiFlowEntity.getAttribute(ATTR_OUTPUT_PORTS))));
+        nifiFlow.getQueues().putAll(fetchFlowComponents(TYPE_NIFI_QUEUE, nifiFlowReferredEntities));
+        nifiFlow.getRootInputPortEntities().putAll(fetchFlowComponents(TYPE_NIFI_INPUT_PORT, nifiFlowReferredEntities));
+        nifiFlow.getRootOutputPortEntities().putAll(fetchFlowComponents(TYPE_NIFI_OUTPUT_PORT, nifiFlowReferredEntities));
 
         final Map<String, NiFiFlowPath> flowPaths = nifiFlow.getFlowPaths();
-        final Map<AtlasObjectId, AtlasEntity> flowPathEntities = toQualifiedNameIds(toAtlasObjectIds(attributes.get(ATTR_FLOW_PATHS)));
+        final Map<AtlasObjectId, AtlasEntity> flowPathEntities = fetchFlowComponents(TYPE_NIFI_FLOW_PATH, nifiFlowReferredEntities);
 
         for (AtlasEntity flowPathEntity : flowPathEntities.values()) {
 
 Review comment:
   Retrieving the flow and its components takes 3 steps:
   
   1. get the flow
   2. get the flow paths (and other subcomponents of the flow)
   3. get the inputs / outputs of the flow paths
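
The three steps above can be sketched roughly as follows. This is only an illustration: `FlowSource` and `ThreeStepFetch` are hypothetical names, not part of `NiFiAtlasClient` or the Atlas client API, and the sketch says nothing about how many HTTP requests each step actually issues.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical abstraction over the three lookups (not the real NiFiAtlasClient API).
interface FlowSource {
    String fetchFlow(String rootProcessGroupId);        // step 1: get the flow
    List<String> fetchFlowPathIds(String flowId);       // step 2: get the flow paths
    List<String> fetchInputsOutputs(String flowPathId); // step 3: get inputs / outputs
}

final class ThreeStepFetch {
    private ThreeStepFetch() {}

    // Runs the steps in order and returns the inputs/outputs keyed by flow path id.
    static Map<String, List<String>> fetch(FlowSource source, String rootProcessGroupId) {
        String flowId = source.fetchFlow(rootProcessGroupId);
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (String pathId : source.fetchFlowPathIds(flowId)) {
            result.put(pathId, source.fetchInputsOutputs(pathId));
        }
        return result;
    }
}
```

If any earlier step also returns the input / output data, the same data ends up being transferred more than once.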
   
   Input / output data was returned in all 3 steps, which was definitely
   overkill.
   
   [NIFI-6945](https://issues.apache.org/jira/browse/NIFI-6945) was about
   excluding the input/output data from step 1. There are 2 reasons for doing
   this. First, getting all inputs/outputs of all flow paths in a single query
   can lead to a huge JSON response from Atlas (extremely high response time or
   a timeout). Second, there is no way to query only the ACTIVE
   `referredEntities` from Atlas, so in the case of `minExtInfo=true` all the
   deleted flow paths were retrieved with all their inputs/outputs
   unnecessarily (they are filtered out in step 2 anyway). So I believe that
   having multiple but smaller requests in step 2 is better than having one
   huge request in step 1.
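
Because Atlas cannot return only the ACTIVE `referredEntities`, the filtering has to happen on the client side. A minimal sketch of that kind of filter is below. `ReferredEntity` and `ActiveComponentFilter` are hypothetical stand-ins written for this example, not the real `AtlasEntity` class or the `fetchFlowComponents` method from the diff, and the type-name strings in the usage are illustrative.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for an Atlas entity (not the real AtlasEntity class).
final class ReferredEntity {
    enum Status { ACTIVE, DELETED }

    final String typeName;
    final String qualifiedName;
    final Status status;

    ReferredEntity(String typeName, String qualifiedName, Status status) {
        this.typeName = typeName;
        this.qualifiedName = qualifiedName;
        this.status = status;
    }
}

final class ActiveComponentFilter {
    private ActiveComponentFilter() {}

    // Keeps only ACTIVE entities of the given type, keyed by qualified name.
    static Map<String, ReferredEntity> activeOfType(String typeName, List<ReferredEntity> referred) {
        Map<String, ReferredEntity> result = new LinkedHashMap<>();
        for (ReferredEntity entity : referred) {
            if (typeName.equals(entity.typeName) && entity.status == ReferredEntity.Status.ACTIVE) {
                result.put(entity.qualifiedName, entity);
            }
        }
        return result;
    }
}
```

For example, calling `ActiveComponentFilter.activeOfType("nifi_flow_path", referred)` on a list that mixes active and deleted flow paths with other component types would return only the active flow paths.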
   
   Your comment about falling back to the old logic (in step 3) is reasonable.
   Input / output data is still retrieved multiple times, in steps 2 and 3. My
   plan is to create a separate ticket for that. I would keep the scope of this
   ticket limited to the active/deleted filtering of the flow paths.
   
   The question of `ignoreRelationship=true` would belong to the new Jira
   ticket. It needs to be investigated which solution is better:
   
   - keep the current step 3 (lots of very small requests); in this case
     `ignoreRelationship=true` can be set for step 2, which would decrease the
     size of that request significantly
   - use the relationship data returned in step 2 (one bigger request) and
     eliminate step 3 altogether
   
   The performance of these options needs to be tested with both the Simple
   Path and Complex Path lineage strategy settings of the reporting task.
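
For the comparison itself, a very small timing helper along these lines might be enough as a first pass. It is only a sketch: `FetchTimer` is a hypothetical name, the two fetch strategies would have to be supplied by the caller, and a proper measurement would also need warm-up runs and a realistic Atlas data set.

```java
import java.util.function.Supplier;

// Hypothetical helper for comparing fetch strategies; not part of NiFi or Atlas.
final class FetchTimer {
    private FetchTimer() {}

    // Runs the fetch `runs` times and returns the mean wall-clock time in milliseconds.
    static double meanMillis(Supplier<?> fetch, int runs) {
        if (runs <= 0) {
            throw new IllegalArgumentException("runs must be positive");
        }
        long totalNanos = 0;
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            fetch.get();
            totalNanos += System.nanoTime() - start;
        }
        return totalNanos / 1_000_000.0 / runs;
    }
}
```

Each option would be wrapped in its own `Supplier` and measured once per lineage strategy setting, so that the four combinations can be compared side by side.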

