[ https://issues.apache.org/jira/browse/NIFI-3152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15723371#comment-15723371 ]
Bryan Bende commented on NIFI-3152:
-----------------------------------

Just wanted to document my findings here, since my GitHub comment didn't post through. I hard-coded an exception in the old code right after writer.commit() to simulate an error, then let a GenerateFlowFile -> UpdateAttribute flow run as fast as possible. After around 418k flow files, the flow essentially froze: both processors still showed an active thread, and the logs eventually showed:

{code}
2016-12-05 15:50:15,955 ERROR [Provenance Repository Rollover Thread-1] o.a.n.p.PersistentProvenanceRepository
org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: NativeFSLock@/Users/bbende/Projects/bbende-nifi/nifi-assembly/target/nifi-1.2.0-SNAPSHOT-bin/nifi-1.2.0-SNAPSHOT/provenance_repository/index-1480969697000/write.lock
        at org.apache.lucene.store.Lock.obtain(Lock.java:89) ~[na:na]
        at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:755) ~[na:na]
        at org.apache.nifi.provenance.lucene.SimpleIndexManager.borrowIndexWriter(SimpleIndexManager.java:120) ~[na:na]
        at org.apache.nifi.provenance.PersistentProvenanceRepository.mergeJournals(PersistentProvenanceRepository.java:1732) ~[na:na]
        at org.apache.nifi.provenance.PersistentProvenanceRepository$8.run(PersistentProvenanceRepository.java:1323) ~[na:na]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_74]
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [na:1.8.0_74]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_74]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [na:1.8.0_74]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_74]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_74]
        at java.lang.Thread.run(Thread.java:745) [na:1.8.0_74]
2016-12-05 15:50:15,955 WARN [Provenance Repository Rollover Thread-1] o.a.n.p.PersistentProvenanceRepository Couldn't merge journals. Will try again. journalsToMerge:
{code}

A thread dump also shows the same thread blocked on the index directory, as described in the JIRA description. After waiting a few minutes, the processor stats dropped to 0 while the active threads were still showing, so the flow was clearly stuck.

I then retried the same scenario with:

{code}
try {
    writer.commit();
    throw new Exception();
} finally {
    count.close();
}
{code}

This allowed the rollovers to succeed and the flow to continue working well past the previous point. So I'm a +1 on this patch and will merge.

> If Provenance Repository runs out of disk space, it may not recover even when
> disk space is freed up
> ----------------------------------------------------------------------------------------------------
>
>                 Key: NIFI-3152
>                 URL: https://issues.apache.org/jira/browse/NIFI-3152
>             Project: Apache NiFi
>          Issue Type: Bug
>            Reporter: Mark Payne
>            Assignee: Mark Payne
>             Fix For: 1.1.1
>
>
> If we run out of disk space in the provenance repository, we can sometimes
> get into a situation where the logs show us still waiting for the repo to
> roll over, even after disk space is freed up. A thread dump shows that the
> processors are trying to force the repo to rollover.
> However, the rollover never completes because we can't create an IndexWriter:
> {code}
> "Provenance Repository Rollover Thread-1" Id=128 TIMED_WAITING on null
>         at java.lang.Thread.sleep(Native Method)
>         at org.apache.lucene.store.Lock.obtain(Lock.java:92)
>         at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:755)
>         at org.apache.nifi.provenance.lucene.SimpleIndexManager.borrowIndexWriter(SimpleIndexManager.java:104)
>         - waiting on org.apache.nifi.provenance.lucene.SimpleIndexManager@22f9da45
>         at org.apache.nifi.provenance.PersistentProvenanceRepository.mergeJournals(PersistentProvenanceRepository.java:1711)
>         at org.apache.nifi.provenance.PersistentProvenanceRepository$8.run(PersistentProvenanceRepository.java:1311)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
>
>     Number of Locked Synchronizers: 1
>     - java.util.concurrent.ThreadPoolExecutor$Worker@850f87f
> {code}
> The IndexWriter is blocking on a lock, waiting to obtain a write lock for the Directory.
> Digging around, I believe the issue is that if we call SimpleIndexManager.returnIndexWriter, it will call IndexWriter.commit(). But if that throws an Exception, we don't properly close the writer. If we are running out of disk space, it is likely that we will throw an Exception on IndexWriter.commit(), so this appears to be the root cause.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
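The root cause described above (commit() throws, the writer is never closed, and the write lock stays held forever) can be illustrated with a self-contained sketch. Note the hedging: FakeIndexWriter and its WRITE_LOCK flag are hypothetical stand-ins for Lucene's IndexWriter and the NativeFSLock on write.lock, and returnWriterBuggy/returnWriterFixed are illustrative methods, not NiFi's actual code:

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicBoolean;

public class WriterLockDemo {

    // Hypothetical stand-in for Lucene's IndexWriter: the constructor takes a
    // "write lock" that only close() releases, and commit() fails the way it
    // would when the disk fills up.
    static class FakeIndexWriter implements AutoCloseable {
        static final AtomicBoolean WRITE_LOCK = new AtomicBoolean(false);

        FakeIndexWriter() {
            if (!WRITE_LOCK.compareAndSet(false, true)) {
                // analogous to LockObtainFailedException in the logs above
                throw new IllegalStateException("Lock obtain timed out");
            }
        }

        void commit() throws IOException {
            throw new IOException("No space left on device"); // simulate disk full
        }

        @Override
        public void close() {
            WRITE_LOCK.set(false); // closing releases the write lock
        }
    }

    // Buggy pattern: if commit() throws, close() is never reached and the
    // write lock is leaked, so no later rollover can ever open a writer.
    static void returnWriterBuggy(FakeIndexWriter w) throws IOException {
        w.commit();
        w.close();
    }

    // Fixed pattern (what the patch does conceptually): close in finally,
    // so the lock is released even when commit() fails.
    static void returnWriterFixed(FakeIndexWriter w) throws IOException {
        try {
            w.commit();
        } finally {
            w.close();
        }
    }

    public static void main(String[] args) {
        FakeIndexWriter w1 = new FakeIndexWriter();
        try { returnWriterBuggy(w1); } catch (IOException expected) { }
        System.out.println("lock leaked after buggy return: " + FakeIndexWriter.WRITE_LOCK.get()); // true
        w1.close(); // manual cleanup so the second scenario can run

        FakeIndexWriter w2 = new FakeIndexWriter();
        try { returnWriterFixed(w2); } catch (IOException expected) { }
        System.out.println("lock leaked after fixed return: " + FakeIndexWriter.WRITE_LOCK.get()); // false
    }
}
```

The key point is that the lock release must live in a finally block (or try-with-resources), because an out-of-disk commit() failure is exactly the moment the lock must still be given back.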