turcsanyip commented on a change in pull request #3979: NIFI-7009: Atlas reporting task retrieves only the active flow compon…
URL: https://github.com/apache/nifi/pull/3979#discussion_r367326456
##########
File path: nifi-nar-bundles/nifi-atlas-bundle/nifi-atlas-reporting-task/src/main/java/org/apache/nifi/atlas/NiFiAtlasClient.java
##########
@@ -201,12 +202,12 @@ public NiFiFlow fetchNiFiFlow(String rootProcessGroupId, String clusterName) thr
         nifiFlow.setUrl(toStr(attributes.get(ATTR_URL)));
         nifiFlow.setDescription(toStr(attributes.get(ATTR_DESCRIPTION)));
-        nifiFlow.getQueues().putAll(toQualifiedNameIds(toAtlasObjectIds(nifiFlowEntity.getAttribute(ATTR_QUEUES))));
-        nifiFlow.getRootInputPortEntities().putAll(toQualifiedNameIds(toAtlasObjectIds(nifiFlowEntity.getAttribute(ATTR_INPUT_PORTS))));
-        nifiFlow.getRootOutputPortEntities().putAll(toQualifiedNameIds(toAtlasObjectIds(nifiFlowEntity.getAttribute(ATTR_OUTPUT_PORTS))));
+        nifiFlow.getQueues().putAll(fetchFlowComponents(TYPE_NIFI_QUEUE, nifiFlowReferredEntities));
+        nifiFlow.getRootInputPortEntities().putAll(fetchFlowComponents(TYPE_NIFI_INPUT_PORT, nifiFlowReferredEntities));
+        nifiFlow.getRootOutputPortEntities().putAll(fetchFlowComponents(TYPE_NIFI_OUTPUT_PORT, nifiFlowReferredEntities));
         final Map<String, NiFiFlowPath> flowPaths = nifiFlow.getFlowPaths();
-        final Map<AtlasObjectId, AtlasEntity> flowPathEntities = toQualifiedNameIds(toAtlasObjectIds(attributes.get(ATTR_FLOW_PATHS)));
+        final Map<AtlasObjectId, AtlasEntity> flowPathEntities = fetchFlowComponents(TYPE_NIFI_FLOW_PATH, nifiFlowReferredEntities);
         for (AtlasEntity flowPathEntity : flowPathEntities.values()) {

Review comment:
Retrieving the flow and its components is composed of 3 steps:
1. get the flow
2. get the flow paths (and other subcomponents of the flow)
3. get the inputs / outputs of the flow paths

Input / output data was returned in all 3 steps which was definitely overkill.
[NIFI-6945](https://issues.apache.org/jira/browse/NIFI-6945) was about excluding the input/output data from step 1.
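To make the 3 steps concrete, here is a rough sketch of the retrieval as described above. `Entity`, `Client` and `fetchActivePathIo` are hypothetical names made up for illustration, and the type and status strings are assumptions. This is not the Atlas client API and not the code in this PR.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: Entity and Client are hypothetical stand-ins,
// not Atlas classes and not the implementation in this PR.
class FlowFetchSketch {

    static final class Entity {
        final String guid;
        final String typeName;      // assumed type names, e.g. "nifi_flow_path"
        final String status;        // assumed values: "ACTIVE" or "DELETED"
        final List<String> ioGuids; // guids of inputs / outputs, may be empty

        Entity(String guid, String typeName, String status, List<String> ioGuids) {
            this.guid = guid;
            this.typeName = typeName;
            this.status = status;
            this.ioGuids = ioGuids;
        }
    }

    // Each method models one kind of request to Atlas.
    interface Client {
        Entity getFlow(String flowGuid);                 // step 1: the flow only, no input/output data
        List<Entity> getFlowComponents(String flowGuid); // step 2: flow paths, queues, ports (any status)
        Entity getEntity(String guid);                   // step 3: one small request per input/output
    }

    /** Returns the inputs / outputs of each ACTIVE flow path, keyed by flow path guid. */
    static Map<String, List<Entity>> fetchActivePathIo(Client client, String flowGuid) {
        Entity flow = client.getFlow(flowGuid);                        // step 1
        Map<String, List<Entity>> result = new LinkedHashMap<>();
        for (Entity component : client.getFlowComponents(flow.guid)) { // step 2
            boolean isActivePath = "nifi_flow_path".equals(component.typeName)
                    && "ACTIVE".equals(component.status);
            if (!isActivePath) {
                continue; // deleted paths and other component types are skipped here
            }
            List<Entity> io = new ArrayList<>();
            for (String ioGuid : component.ioGuids) {                  // step 3
                io.add(client.getEntity(ioGuid));
            }
            result.put(component.guid, io);
        }
        return result;
    }
}
```

Note that step 1 in this sketch returns no input/output data at all, which corresponds to the NIFI-6945 change mentioned above.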
There are 2 reasons for excluding the input/output data from step 1:
- Getting all inputs/outputs of all flow paths in a single query can lead to a huge response JSON from Atlas (extremely high response time / timeout).
- There is no way to query only the ACTIVE `referredEntities` from Atlas. So in case of `minExtInfo=true`, all the deleted flow paths were retrieved with all their inputs/outputs unnecessarily (they will be filtered out in step 2).

So I believe that having multiple but smaller requests in step 2 is better than having one huge request in step 1.

Your comment about falling back to the old logic (in step 3) is reasonable. Input / output data is still retrieved multiple times, in steps 2 and 3. My plan is to create a separate ticket for it. I would limit the scope of this ticket to the active/deleted filtering of the flow paths.

The question of `ignoreRelationship=true` would belong to the new Jira. It needs to be investigated which solution is better:
- keep the current step 3 (lots of very small requests); in this case `ignoreRelationship=true` can be set for step 2, which would decrease the size of that request significantly
- use the relationship data returned in step 2 (one bigger request) and eliminate step 3 altogether

The performance of these options needs to be tested with both the Simple Path and Complex Path lineage strategy settings of the reporting task.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

With regards,
Apache Git Services