[ 
https://issues.apache.org/jira/browse/NIFI-2551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15418871#comment-15418871
 ] 

Mark Payne commented on NIFI-2551:
----------------------------------

[~mosermw] - The good news is that I am able to reproduce this issue very 
easily. To do so, I just created a dummy processor that updates the content of 
a FlowFile multiple times within the same ProcessSession. As soon as I start 
running it, the NPE gets thrown.  Your comment above about Thread #2 tipped 
me off to this: StandardProcessSession.removeTemporaryClaim() calls 
FileSystemRepository.decrementClaimantCount() and then calls 
FileSystemRepository.remove() on claim #100.

This means that this block of code is only entered if we modify the content of 
a FlowFile multiple times in the same session.

When I apply your patch I do see the problem go away, but I spent about 
8-10 hours yesterday and this morning trying to understand this in more detail. 
What I have discovered, I believe, is that the threading problem is not really 
in FileSystemRepository but rather in the StandardResourceClaimManager. This is 
accessed through the FileSystemRepository about 95% of the time, and so that's 
why synchronizing there makes the problem go away. But I don't believe it is 
truly addressing the underlying issue.

What I saw is that in StandardResourceClaimManager we can have the case where:

[Thread-1] calls decrementClaimantCount(). This decrements count to 0 but does 
not finish this method yet.
[Thread-2] calls incrementClaimantCount(). This increments count to 1.
[Thread-1] then calls claimantCounts.remove(claim). As a result, we have 
removed the claimant count, which effectively resets it back to 0. Calls to 
getClaimantCount() now return 0 instead of 1.
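
To make the interleaving above concrete, here is a minimal sketch that replays 
it deterministically in a single thread. ClaimCountSketch and all of its method 
names are hypothetical stand-ins, not the actual StandardResourceClaimManager 
code; the decrement is split into two steps only so the Thread-2 increment can 
be placed between them. The atomicDecrement() variant shows one possible way to 
close the window, assuming the count lives in a ConcurrentHashMap; it is not 
necessarily the fix that will be applied.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical, simplified claimant counter; NOT the real StandardResourceClaimManager.
class ClaimCountSketch {
    private final ConcurrentMap<String, Integer> claimantCounts = new ConcurrentHashMap<>();

    int getClaimantCount(String claim) {
        return claimantCounts.getOrDefault(claim, 0);
    }

    int incrementClaimantCount(String claim) {
        return claimantCounts.merge(claim, 1, Integer::sum);
    }

    // Racy decrement, step 1: lower the count but leave the map entry in place.
    int racyDecrement(String claim) {
        Integer updated = claimantCounts.computeIfPresent(claim, (k, v) -> v - 1);
        return updated == null ? 0 : updated;
    }

    // Racy decrement, step 2: drop the entry based on the (possibly stale) value from step 1.
    void racyCleanup(String claim, int observed) {
        if (observed == 0) {
            claimantCounts.remove(claim);
        }
    }

    // One possible fix: decrement and removal happen in a single atomic map operation.
    int atomicDecrement(String claim) {
        Integer updated = claimantCounts.computeIfPresent(claim, (k, v) -> v > 1 ? v - 1 : null);
        return updated == null ? 0 : updated;
    }

    // Replays the Thread-1 / Thread-2 interleaving described above.
    static int replayRace() {
        ClaimCountSketch m = new ClaimCountSketch();
        m.incrementClaimantCount("claim");        // initial state: count = 1
        int observed = m.racyDecrement("claim");  // Thread-1: count = 0, method not finished
        m.incrementClaimantCount("claim");        // Thread-2: count = 1
        m.racyCleanup("claim", observed);         // Thread-1: removes the entry
        return m.getClaimantCount("claim");       // the increment has been lost
    }

    // Same calls, but the decrement cannot be interrupted between its two halves.
    static int replayAtomic() {
        ClaimCountSketch m = new ClaimCountSketch();
        m.incrementClaimantCount("claim");        // initial state: count = 1
        m.atomicDecrement("claim");               // Thread-1: entry removed atomically
        m.incrementClaimantCount("claim");        // Thread-2: fresh entry, count = 1
        return m.getClaimantCount("claim");
    }

    public static void main(String[] args) {
        System.out.println("racy:   " + replayRace());
        System.out.println("atomic: " + replayAtomic());
    }
}
```

In the racy replay the final count is 0 even though one claimant remains, which 
is the state in which a caller of getClaimantCount() would wrongly treat the 
claim as unused.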

So if we synchronize in FileSystemRepository it does in fact prevent this 
condition *if* access is only through FileSystemRepository. But there are still 
a few places where this can bite us. Specifically, WriteAheadFlowFileRepository 
calls getClaimantCount() and, if the count is 0, marks the claim as 
destructible. So this has to be addressed at a broader level.

In addition to this, when we are calling ContentRepository.remove() from 
StandardProcessSession... that really is a bug as well. This was correct when 
that code was written, as at the time there was no concept of a Resource Claim, 
so it made sense to destroy the claim then. Now, however, the Resource Claim 
may be in use by other Content Claims, so that needs to be addressed.
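
As a rough illustration of why an unconditional remove is unsafe once several 
Content Claims can share one Resource Claim, the sketch below tracks how many 
content claims still point at a resource and only deletes it when the last one 
is released. SharedResourceSketch, its methods, and the "resource-100" 
identifier are all hypothetical and greatly simplified; this is not the NiFi 
API, and it ignores threading entirely.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical model of resource claims shared by several content claims; not NiFi code.
class SharedResourceSketch {
    // resource claim id -> number of content claims still referencing it
    private final Map<String, Integer> claimants = new HashMap<>();
    // resource claims whose backing data still exists
    private final Set<String> stored = new HashSet<>();

    void addContentClaim(String resource) {
        claimants.merge(resource, 1, Integer::sum);
        stored.add(resource);
    }

    // Old assumption: one claim per resource, so dropping a claim deletes the data.
    void removeUnconditionally(String resource) {
        stored.remove(resource);
    }

    // Guarded variant: delete the data only when no content claim references it any more.
    boolean releaseAndRemoveIfUnused(String resource) {
        Integer remaining = claimants.computeIfPresent(resource, (k, v) -> v > 1 ? v - 1 : null);
        if (remaining == null) {
            stored.remove(resource);
            return true;
        }
        return false;
    }

    boolean exists(String resource) {
        return stored.contains(resource);
    }
}
```

With two content claims on the same resource, the first call to 
releaseAndRemoveIfUnused() leaves the data in place and only the second call 
deletes it, whereas removeUnconditionally() would delete it while the other 
content claim still needs it.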

The great news is that I've got a setup now where it's super trivial to 
reproduce this. I'll put together a patch and would love to have you review it 
if you can, to make sure that you agree that what we're doing is correct.

Thanks!
-Mark


> Rare condition causes FileSystemRepository NPE
> ----------------------------------------------
>
>                 Key: NIFI-2551
>                 URL: https://issues.apache.org/jira/browse/NIFI-2551
>             Project: Apache NiFi
>          Issue Type: Bug
>          Components: Core Framework
>    Affects Versions: 1.0.0, 0.7.0
>            Reporter: Michael Moser
>            Assignee: Michael Moser
>            Priority: Blocker
>
> In rare unpredictable cases when NiFi is processing a heavy load, we see 
> FileSystemRepository throw a NullPointerException
> {noformat}
> java.lang.NullPointerException
>     at o.a.n.c.r.FileSystemRepository$2.write(FileSystemRepository.java:918) [nifi-framework-core-0.7.0.jar]
>     at o.a.n.c.r.io.DisableOnCloseOutputStream.write(DisableOnCloseOutputStream.java:49)
>     ....
>     Suppressed: java.lang.NullPointerException
>         at o.a.n.c.r.FileSystemRepository$2.flush(FileSystemRepository.java:934) [nifi-framework-core-0.7.0.jar]
>         at o.a.n.c.r.io.DisableOnCloseOutputStream.close(DisableOnCloseOutputStream.java:68)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
