Brian Davis created NIFI-2395:
---------------------------------
Summary: PersistentProvenanceRepository Deadlocks caused by a blocked journal merge
Key: NIFI-2395
URL: https://issues.apache.org/jira/browse/NIFI-2395
Project: Apache NiFi
Issue Type: Bug
Components: Core Framework
Affects Versions: 0.6.0
Reporter: Brian Davis
Priority: Critical
I have a NiFi instance that has been running for about a week and has
deadlocked at least three times in that period. By deadlock I mean that the
whole NiFi instance stops making any progress on FlowFiles. I looked at the
stack traces, and many threads are stuck doing work in the
PersistentProvenanceRepository. From reading the code, I think this is what is
happening:
There is a ReadWriteLock on which all of the readers are waiting for a writer
to release the write lock. The writer is in this loop:
{code}
while (journalFileCount > journalCountThreshold || repoSize > sizeThreshold) {
    // if a shutdown happens while we are in this loop, kill the rollover thread and break
    if (this.closed.get()) {
        if (future != null) {
            future.cancel(true);
        }
        break;
    }
    if (repoSize > sizeThreshold) {
        logger.debug("Provenance Repository has exceeded its size threshold; will trigger purging of oldest events");
        purgeOldEvents();
        journalFileCount = getJournalCount();
        repoSize = getSize(getLogFiles(), 0L);
        continue;
    } else {
        // if we are constrained by the number of journal files rather than the size of the repo,
        // then we will just sleep a bit because another thread is already actively merging the journals,
        // due to the runnable that we scheduled above
        try {
            Thread.sleep(100L);
        } catch (final InterruptedException ie) {
        }
    }
    logger.debug("Provenance Repository is still behind. Keeping flow slowed down "
        + "to accommodate. Currently, there are {} journal files ({} bytes) and "
        + "threshold for blocking is {} ({} bytes)",
        journalFileCount, repoSize, journalCountThreshold, sizeThreshold);
    journalFileCount = getJournalCount();
    repoSize = getSize(getLogFiles(), 0L);
}
logger.info("Provenance Repository has now caught up with rolling over journal files. Current number of "
    + "journal files to be rolled over is {}", journalFileCount);
}
{code}
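The locking behavior described above can be illustrated with a minimal, standalone sketch. It assumes a plain `ReentrantReadWriteLock`; the class and method names are hypothetical and are not taken from the NiFi code. One thread holds the write lock (standing in for the writer stuck in the loop), and a reader on another thread fails to get the read lock within its timeout.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical demo class; not part of NiFi.
class WriteLockStarvationSketch {

    /**
     * Holds the write lock on the calling thread and reports whether a reader
     * on another thread could acquire the read lock within waitMillis.
     */
    static boolean readerAcquiresWhileWriterHolds(long waitMillis) {
        final ReadWriteLock rwLock = new ReentrantReadWriteLock();
        final boolean[] acquired = {false};

        // Stands in for the writer that never leaves the rollover loop.
        rwLock.writeLock().lock();
        try {
            Thread reader = new Thread(() -> {
                try {
                    if (rwLock.readLock().tryLock(waitMillis, TimeUnit.MILLISECONDS)) {
                        acquired[0] = true;
                        rwLock.readLock().unlock();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            reader.start();
            try {
                reader.join(); // join makes the reader's write to acquired[0] visible here
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        } finally {
            rwLock.writeLock().unlock();
        }
        return acquired[0];
    }

    public static void main(String[] args) {
        System.out.println("reader acquired lock: " + readerAcquiresWhileWriterHolds(200));
    }
}
```

In the real repository the readers presumably block without a timeout, so they would wait for as long as the writer stays in the loop.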
My NiFi instance sits at the sleep indefinitely. It cannot move forward
because the thread doing the merge is itself stuck. That thread is at:
{code}
accepted = eventQueue.offer(new Tuple<>(record, blockIndex), 10, TimeUnit.MILLISECONDS);
{code}
so the queue is full.
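As a rough illustration of why a full queue stalls the merge, the sketch below fills a bounded queue and then retries a timed `offer` with no consumer draining it. It assumes a `LinkedBlockingQueue`; the class name, method name, and capacity are hypothetical, and the retry count is capped only so that the demo terminates.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class; not part of NiFi.
class FullQueueOfferSketch {

    /**
     * Fills a bounded queue, then retries a timed offer maxAttempts times with
     * no consumer running. Returns how many of those attempts were rejected.
     */
    static int rejectedOffers(int capacity, int maxAttempts) {
        final BlockingQueue<String> eventQueue = new LinkedBlockingQueue<>(capacity);
        for (int i = 0; i < capacity; i++) {
            eventQueue.add("event-" + i); // fill the queue to capacity
        }

        int rejected = 0;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                // Same style of call as in the merge thread: wait 10 ms, then give up.
                boolean accepted = eventQueue.offer("extra", 10, TimeUnit.MILLISECONDS);
                if (!accepted) {
                    rejected++;
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return rejected;
    }

    public static void main(String[] args) {
        System.out.println("rejected offers: " + rejectedOffers(4, 5));
    }
}
```

With nothing polling the queue, every attempt is rejected, so a caller that retries until `offer` succeeds would never finish.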
What I believe happened is that the callables created here:
{code}
final Callable<Object> callable = new Callable<Object>() {
    @Override
    public Object call() throws IOException {
        while (!eventQueue.isEmpty() || !finishedAdding.get()) {
            final Tuple<StandardProvenanceEventRecord, Integer> tuple;
            try {
                tuple = eventQueue.poll(10, TimeUnit.MILLISECONDS);
            } catch (final InterruptedException ie) {
                continue;
            }
            if (tuple == null) {
                continue;
            }
            indexingAction.index(tuple.getKey(), indexWriter, tuple.getValue());
        }
        return null;
    }
{code}
finish before the offer adds its first event, because I do not see any "Index
Provenance Events" threads in the stack traces. My guess is that the while
loop condition is wrong and should use && instead of ||.
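To help weigh that guess, the sketch below tabulates when the worker loop keeps running under each form of the condition. It is a standalone illustration, not NiFi code: the class and method names are hypothetical, and the two booleans stand in for `eventQueue.isEmpty()` and `finishedAdding.get()`.

```java
// Hypothetical demo class; not part of NiFi.
class LoopConditionSketch {

    // Condition as quoted in the report: keep looping while events are queued
    // or the producer has not finished adding.
    static boolean keepLoopingWithOr(boolean queueEmpty, boolean finishedAdding) {
        return !queueEmpty || !finishedAdding;
    }

    // Proposed alternative using && instead of ||.
    static boolean keepLoopingWithAnd(boolean queueEmpty, boolean finishedAdding) {
        return !queueEmpty && !finishedAdding;
    }

    public static void main(String[] args) {
        boolean[] values = {false, true};
        for (boolean queueEmpty : values) {
            for (boolean finishedAdding : values) {
                System.out.printf("queueEmpty=%b finishedAdding=%b  ||: %b  &&: %b%n",
                        queueEmpty, finishedAdding,
                        keepLoopingWithOr(queueEmpty, finishedAdding),
                        keepLoopingWithAnd(queueEmpty, finishedAdding));
            }
        }
    }
}
```

In this table the || form stops only when the queue is empty and adding has finished, while the && form would also stop whenever the queue is momentarily empty, for example before the first offer. If that reading is right, it may be worth checking whether the callables ended for another reason, such as an exception propagating out of call().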
I raised the thread count for index creation from 1 to 3 to see whether that
helps, and I can report back later this week.