[ 
https://issues.apache.org/jira/browse/NIFI-4971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16398158#comment-16398158
 ] 

Koji Kawamura commented on NIFI-4971:
-------------------------------------

h3. Why does it happen?
This issue happens in the following scenario:

# NiFi reports a flow_path entity (P1) via an ENTITY_CREATE message for the 
GetFile and PutFile process path. At this point the flow_path entity does not 
have inputs/outputs because the corresponding DataSet entities do not exist yet.
# NiFi reports an fs_path entity (F1) representing a file received by GetFile, 
via an ENTITY_CREATE message.
# NiFi reports an fs_path entity (F2) representing a file sent by PutFile, via 
an ENTITY_CREATE message.
# NiFi fetches the existing flow_path entity (P1) from Atlas.
# NiFi reports lineage via an ENTITY_PARTIAL_UPDATE message that updates the 
inputs/outputs attributes of P1: input(F1), output(F2).

At step 4 above, the current implementation expects an existing entity to be 
returned. However, the entity will not be found if the message sent at step 1 
has not been processed by Atlas yet. In that case, step 5 is not executed, 
which leaves the 'nifi_flow_path' entity without the correct inputs/outputs 
attributes.
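
The timing problem can be reproduced with a toy model. The following is only a 
minimal sketch, not NiFi or Atlas code: FakeAtlas, report_lineage and 
consume_before_fetch are hypothetical names, and the queue stands in for the 
messages that Atlas has not processed yet.

```python
from collections import deque


class FakeAtlas:
    """Hypothetical stand-in for Atlas: messages are queued and applied later."""

    def __init__(self):
        self.pending = deque()  # messages sent but not yet processed
        self.entities = {}      # entity name -> attributes

    def send(self, kind, name, attrs):
        self.pending.append((kind, name, attrs))

    def process_all(self):
        # Apply every queued message in the order it was sent.
        while self.pending:
            kind, name, attrs = self.pending.popleft()
            if kind == "ENTITY_CREATE":
                self.entities[name] = dict(attrs)
            elif kind == "ENTITY_PARTIAL_UPDATE":
                self.entities[name].update(attrs)

    def fetch(self, name):
        # Returns None when the entity has not been created yet.
        return self.entities.get(name)


def report_lineage(atlas, consume_before_fetch):
    atlas.send("ENTITY_CREATE", "P1", {"inputs": [], "outputs": []})  # step 1
    atlas.send("ENTITY_CREATE", "F1", {})                             # step 2
    atlas.send("ENTITY_CREATE", "F2", {})                             # step 3
    if consume_before_fetch:
        atlas.process_all()  # Atlas caught up before the fetch
    existing = atlas.fetch("P1")                                      # step 4
    if existing is not None:                                          # step 5
        atlas.send("ENTITY_PARTIAL_UPDATE", "P1",
                   {"inputs": ["F1"], "outputs": ["F2"]})
    atlas.process_all()
    return atlas.fetch("P1")


lost = report_lineage(FakeAtlas(), consume_before_fetch=False)
kept = report_lineage(FakeAtlas(), consume_before_fetch=True)
```

In this sketch, lost ends up with empty inputs/outputs because the fetch at 
step 4 ran before the create message from step 1 was applied, while kept 
carries F1 and F2.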

h3. Why is it implemented like that?
ReportLineageToAtlas uses the Atlas Hook to report lineage, which means it 
sends Kafka messages that Atlas processes asynchronously.

To avoid removing existing entries from the inputs or outputs attribute of an 
existing nifi_flow_path entity, it fetches the existing nifi_flow_path entity 
before creating an ENTITY_PARTIAL_UPDATE message. For example, suppose the 
existing nifi_flow_path P1 has inputs(f1, f2) and outputs(f3). If an 
ENTITY_PARTIAL_UPDATE is sent for that entity with only the new elements, 
input(f4) and output(f5), then Atlas updates P1 to inputs(f1 - deleted, 
f2 - deleted, f4) and outputs(f3 - deleted, f5). We therefore need to send an 
ENTITY_PARTIAL_UPDATE with all existing elements plus the newly found 
elements, i.e. inputs(f1, f2, f4) and outputs(f3, f5).
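
The difference between the two updates can be sketched as follows. This is an 
illustration under stated assumptions, not the Atlas API: partial_update and 
merged_attrs are hypothetical helpers, and the model simply drops replaced 
elements where Atlas would mark them as deleted.

```python
def partial_update(entity, attrs):
    """Toy model: each attribute in the update replaces the stored value."""
    updated = dict(entity)
    updated.update(attrs)
    return updated


def merged_attrs(existing, new_inputs, new_outputs):
    """Build update attributes that keep existing elements and append new ones."""
    def merge(old, new):
        return old + [x for x in new if x not in old]

    return {
        "inputs": merge(existing["inputs"], new_inputs),
        "outputs": merge(existing["outputs"], new_outputs),
    }


p1 = {"inputs": ["f1", "f2"], "outputs": ["f3"]}

# Sending only the new elements loses f1, f2 and f3.
naive = partial_update(p1, {"inputs": ["f4"], "outputs": ["f5"]})

# Sending existing plus new elements keeps everything.
safe = partial_update(p1, merged_attrs(p1, ["f4"], ["f5"]))
```

Building the merged lists requires the current state of P1, which is why the 
entity has to be fetched before the update message is created.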

h3. How critical is it?
With 'simple_path', 'nifi_flow_path' entities are created before provenance 
events are analyzed, so this issue does not happen.
For NiFi flows that process the same 'nifi_flow_path' multiple times against 
the same inputs/outputs, the expected lineage is reported from the 2nd time 
onward, because ReportLineageToAtlas can then find the existing 
'nifi_flow_path' entities.

> ReportLineageToAtlas 'complete path' strategy can miss one-time lineages
> ------------------------------------------------------------------------
>
>                 Key: NIFI-4971
>                 URL: https://issues.apache.org/jira/browse/NIFI-4971
>             Project: Apache NiFi
>          Issue Type: Bug
>          Components: Extensions
>    Affects Versions: 1.5.0
>            Reporter: Koji Kawamura
>            Assignee: Koji Kawamura
>            Priority: Major
>
> For the simplest example, with GetFlowFIle (GFF) -> PutFlowFile (PFF), where 
> GFF gets files and PFF saves those files into a different directory, the 
> following provenance events will be generated:
>  # GFF RECEIVE file1
>  # PFF SEND file2
> From the above provenance events, the following entities and lineages should 
> be created in Atlas (labels in brackets are Atlas type names):
> {code}
> file1 (fs_path) -> GFF, PFF (nifi_flow_path) -> file2 (fs_path)
> {code}
> The entities shown in the above graph are created. However, the 
> 'nifi_flow_path' entity does not have inputs/outputs referencing 'fs_path', 
> so the lineage cannot be seen in the Atlas UI.
> This issue was discovered by [~nayakmahesh616]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
