[ https://issues.apache.org/jira/browse/NIFI-3152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15723371#comment-15723371 ]
Bryan Bende commented on NIFI-3152:
-----------------------------------

Just wanted to document my findings here, since my GitHub comment didn't post through. I hard-coded an exception in the old code right after writer.commit() to simulate an error, then let a GenerateFlowFile -> UpdateAttribute flow run as fast as possible. After around 418k flow files, the flow essentially froze: both processors still showed an active thread, and the logs eventually showed:

{code}
2016-12-05 15:50:15,955 ERROR [Provenance Repository Rollover Thread-1] o.a.n.p.PersistentProvenanceRepository
org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: NativeFSLock@/Users/bbende/Projects/bbende-nifi/nifi-assembly/target/nifi-1.2.0-SNAPSHOT-bin/nifi-1.2.0-SNAPSHOT/provenance_repository/index-1480969697000/write.lock
        at org.apache.lucene.store.Lock.obtain(Lock.java:89) ~[na:na]
        at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:755) ~[na:na]
        at org.apache.nifi.provenance.lucene.SimpleIndexManager.borrowIndexWriter(SimpleIndexManager.java:120) ~[na:na]
        at org.apache.nifi.provenance.PersistentProvenanceRepository.mergeJournals(PersistentProvenanceRepository.java:1732) ~[na:na]
        at org.apache.nifi.provenance.PersistentProvenanceRepository$8.run(PersistentProvenanceRepository.java:1323) ~[na:na]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_74]
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [na:1.8.0_74]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_74]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [na:1.8.0_74]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_74]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_74]
        at java.lang.Thread.run(Thread.java:745) [na:1.8.0_74]
2016-12-05 15:50:15,955 WARN [Provenance Repository Rollover Thread-1] o.a.n.p.PersistentProvenanceRepository Couldn't merge journals. Will try again. journalsToMerge:
{code}

A thread dump also shows the same thread blocked on the index directory, as described in the JIRA description. After waiting a few minutes, the processor stats dropped to 0 while the active threads were still showing, so the flow was clearly stuck.

I then retried the same scenario with:

{code}
try {
    writer.commit();
    throw new Exception();
} finally {
    count.close();
}
{code}

This allowed the rollovers to succeed and the flow to continue working well past the previous point. So I'm a +1 on this patch and will merge.

> If Provenance Repository runs out of disk space, it may not recover even when
> disk space is freed up
> ----------------------------------------------------------------------------------------------------
>
>                 Key: NIFI-3152
>                 URL: https://issues.apache.org/jira/browse/NIFI-3152
>             Project: Apache NiFi
>          Issue Type: Bug
>            Reporter: Mark Payne
>            Assignee: Mark Payne
>             Fix For: 1.1.1
>
>
> If we run out of disk space in the provenance repository, we can sometimes
> get into a situation where the logs show us still waiting for the repo to
> roll over, even after disk space is freed up. A thread dump shows that the
> processors are trying to force the repo to rollover.
> However, the rollover never completes because we can't create an IndexWriter:
> {code}
> "Provenance Repository Rollover Thread-1" Id=128 TIMED_WAITING on null
>         at java.lang.Thread.sleep(Native Method)
>         at org.apache.lucene.store.Lock.obtain(Lock.java:92)
>         at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:755)
>         at org.apache.nifi.provenance.lucene.SimpleIndexManager.borrowIndexWriter(SimpleIndexManager.java:104)
>         - waiting on org.apache.nifi.provenance.lucene.SimpleIndexManager@22f9da45
>         at org.apache.nifi.provenance.PersistentProvenanceRepository.mergeJournals(PersistentProvenanceRepository.java:1711)
>         at org.apache.nifi.provenance.PersistentProvenanceRepository$8.run(PersistentProvenanceRepository.java:1311)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
>
>     Number of Locked Synchronizers: 1
>     - java.util.concurrent.ThreadPoolExecutor$Worker@850f87f
> {code}
> The IndexWriter is blocking on a lock, waiting to obtain a write lock for the Directory.
> Digging around, I believe the issue is that if we call SimpleIndexManager.returnIndexWriter, it will call IndexWriter.commit(). But if that throws an Exception, we don't properly close the writer. If we are running out of disk space, it is likely that we will throw an Exception on IndexWriter.commit(), so this appears to be the root cause.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
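The root cause described above (commit() throws, the writer is never closed, and the write lock stays held forever) can be illustrated with a self-contained sketch. Note the hedging: FakeIndexWriter and its WRITE_LOCK flag are hypothetical stand-ins for Lucene's IndexWriter and the NativeFSLock on write.lock, and returnWriterBuggy/returnWriterFixed are illustrative methods, not NiFi's actual code:

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicBoolean;

public class WriterLockDemo {

    // Hypothetical stand-in for Lucene's IndexWriter: the constructor takes a
    // "write lock" that only close() releases, and commit() fails the way it
    // would when the disk fills up.
    static class FakeIndexWriter implements AutoCloseable {
        static final AtomicBoolean WRITE_LOCK = new AtomicBoolean(false);

        FakeIndexWriter() {
            if (!WRITE_LOCK.compareAndSet(false, true)) {
                // analogous to LockObtainFailedException in the logs above
                throw new IllegalStateException("Lock obtain timed out");
            }
        }

        void commit() throws IOException {
            throw new IOException("No space left on device"); // simulate disk full
        }

        @Override
        public void close() {
            WRITE_LOCK.set(false); // closing releases the write lock
        }
    }

    // Buggy pattern: if commit() throws, close() is never reached and the
    // write lock is leaked, so no later rollover can ever open a writer.
    static void returnWriterBuggy(FakeIndexWriter w) throws IOException {
        w.commit();
        w.close();
    }

    // Fixed pattern (what the patch does conceptually): close in finally,
    // so the lock is released even when commit() fails.
    static void returnWriterFixed(FakeIndexWriter w) throws IOException {
        try {
            w.commit();
        } finally {
            w.close();
        }
    }

    public static void main(String[] args) {
        FakeIndexWriter w1 = new FakeIndexWriter();
        try { returnWriterBuggy(w1); } catch (IOException expected) { }
        System.out.println("lock leaked after buggy return: " + FakeIndexWriter.WRITE_LOCK.get()); // true
        w1.close(); // manual cleanup so the second scenario can run

        FakeIndexWriter w2 = new FakeIndexWriter();
        try { returnWriterFixed(w2); } catch (IOException expected) { }
        System.out.println("lock leaked after fixed return: " + FakeIndexWriter.WRITE_LOCK.get()); // false
    }
}
```

The key point is that the lock release must live in a finally block (or try-with-resources), because an out-of-disk commit() failure is exactly the moment the lock must still be given back.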