turcsanyip commented on a change in pull request #3979: NIFI-7009: Atlas
reporting task retrieves only the active flow compon…
URL: https://github.com/apache/nifi/pull/3979#discussion_r367326456
##########
File path:
nifi-nar-bundles/nifi-atlas-bundle/nifi-atlas-reporting-task/src/main/java/org/apache/nifi/atlas/NiFiAtlasClient.java
##########
@@ -201,12 +202,12 @@ public NiFiFlow fetchNiFiFlow(String rootProcessGroupId, String clusterName) thr
  nifiFlow.setUrl(toStr(attributes.get(ATTR_URL)));
  nifiFlow.setDescription(toStr(attributes.get(ATTR_DESCRIPTION)));
- nifiFlow.getQueues().putAll(toQualifiedNameIds(toAtlasObjectIds(nifiFlowEntity.getAttribute(ATTR_QUEUES))));
- nifiFlow.getRootInputPortEntities().putAll(toQualifiedNameIds(toAtlasObjectIds(nifiFlowEntity.getAttribute(ATTR_INPUT_PORTS))));
- nifiFlow.getRootOutputPortEntities().putAll(toQualifiedNameIds(toAtlasObjectIds(nifiFlowEntity.getAttribute(ATTR_OUTPUT_PORTS))));
+ nifiFlow.getQueues().putAll(fetchFlowComponents(TYPE_NIFI_QUEUE, nifiFlowReferredEntities));
+ nifiFlow.getRootInputPortEntities().putAll(fetchFlowComponents(TYPE_NIFI_INPUT_PORT, nifiFlowReferredEntities));
+ nifiFlow.getRootOutputPortEntities().putAll(fetchFlowComponents(TYPE_NIFI_OUTPUT_PORT, nifiFlowReferredEntities));
  final Map<String, NiFiFlowPath> flowPaths = nifiFlow.getFlowPaths();
- final Map<AtlasObjectId, AtlasEntity> flowPathEntities = toQualifiedNameIds(toAtlasObjectIds(attributes.get(ATTR_FLOW_PATHS)));
+ final Map<AtlasObjectId, AtlasEntity> flowPathEntities = fetchFlowComponents(TYPE_NIFI_FLOW_PATH, nifiFlowReferredEntities);
  for (AtlasEntity flowPathEntity : flowPathEntities.values()) {
Review comment:
Retrieving the flow and its components is composed of 3 steps:
1. get the flow
2. get the flow paths (and other subcomponents of the flow)
3. get the inputs / outputs of the flow paths
Input / output data was returned in all 3 steps, which was definitely
overkill.
[NIFI-6945](https://issues.apache.org/jira/browse/NIFI-6945) was about
excluding the input/output data from step 1. There are 2 reasons for doing
this. First, getting all inputs/outputs of all flow paths in a single query can
lead to a huge JSON response from Atlas (extremely high response time or
timeout). Second, there is no way to query only the ACTIVE `referredEntities`
from Atlas, so in the case of `minExtInfo=true` all the deleted flow paths were
retrieved with all their inputs/outputs unnecessarily (they will be filtered
out in step 2). So I believe that having multiple but smaller requests in step
2 is better than having one huge request in step 1.
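The client-side filtering described above (keeping only the ACTIVE referred entities of one component type) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in this PR: `Entity`, `Status`, and `activeComponentsOfType` are hypothetical stand-ins for the Atlas entity model, and the type-name strings in `main` are illustrative values.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class ActiveComponentFilter {

    // Hypothetical stand-in for the entity status; only the two states discussed here.
    enum Status { ACTIVE, DELETED }

    // Hypothetical, simplified stand-in for an Atlas entity (not the real AtlasEntity class).
    static final class Entity {
        final String guid;
        final String typeName;
        final Status status;

        Entity(String guid, String typeName, Status status) {
            this.guid = guid;
            this.typeName = typeName;
            this.status = status;
        }
    }

    // Keeps only the referred entities that have the requested type AND are ACTIVE,
    // keyed by guid. Deleted entities are dropped on the client side because, as noted
    // above, the query itself cannot exclude them.
    static Map<String, Entity> activeComponentsOfType(String typeName,
                                                      Map<String, Entity> referredEntities) {
        return referredEntities.values().stream()
                .filter(e -> typeName.equals(e.typeName))
                .filter(e -> e.status == Status.ACTIVE)
                .collect(Collectors.toMap(e -> e.guid, e -> e, (a, b) -> a, LinkedHashMap::new));
    }

    public static void main(String[] args) {
        Map<String, Entity> referred = new LinkedHashMap<>();
        // Illustrative type names and guids only.
        referred.put("g1", new Entity("g1", "nifi_flow_path", Status.ACTIVE));
        referred.put("g2", new Entity("g2", "nifi_flow_path", Status.DELETED));
        referred.put("g3", new Entity("g3", "nifi_queue", Status.ACTIVE));

        // Prints the guids of the active flow paths.
        System.out.println(activeComponentsOfType("nifi_flow_path", referred).keySet());
    }
}
```

Filtering by both type and status in one pass means each component collection (queues, ports, flow paths) can be derived from the same referred-entities map without another request.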
Your comment about falling back to the old logic (in step 3) is reasonable.
Input / output data is still retrieved multiple times in steps 2 and 3. My plan
is to create a separate ticket for it. I would keep the scope of this ticket
limited to the active/deleted filtering of the flow paths.
The question of `ignoreRelationship=true` belongs in the new Jira ticket. It
needs to be investigated which solution is better:
- keep the current step 3 (lots of very small requests); in this case
`ignoreRelationship=true` can be set for step 2, which would significantly
decrease the size of that request
- use the relationship data returned in step 2 (one bigger request) and
eliminate step 3 entirely
The performance of these options needs to be tested with both the Simple
Path and Complex Path lineage strategy settings of the reporting task.