Jason Gerlowski created SOLR-18121:
--------------------------------------
Summary: Ref-counting leak in IndexFetcher's
"deleteFilesInAdvance" codepath
Key: SOLR-18121
URL: https://issues.apache.org/jira/browse/SOLR-18121
Project: Solr
Issue Type: Bug
Components: replication (java)
Reporter: Jason Gerlowski
While experimenting with TLOG and PULL replicas locally, I think I've
discovered a resource-leak in IndexFetcher.
IndexFetcher's little-known "deleteFilesInAdvance" code-path, which deletes the
local copy of the index if space is needed to download the remote copy, doesn't
obey the contract of {{SolrCoreState.closeIndexWriter}}. The problem code in
question is around
[L1273|https://github.com/apache/solr/blob/main/solr/core/src/java/org/apache/solr/handler/IndexFetcher.java#L1273]:
{code}
solrCore.getUpdateHandler().getSolrCoreState().closeIndexWriter(this.solrCore,
false);
for (String f : filesTobeDeleted) {
try {
indexDir.deleteFile(f);
} catch (FileNotFoundException | NoSuchFileException e) {
// no problem , it was deleted by someone else
}
}
{code}
The javadocs for {{closeIndexWriter}} indicate:
bq. Expert method that closes the IndexWriter - you must call {@link
#openIndexWriter(SolrCore)} in a finally block after calling this method.
...which clearly isn't happening in this case.
This causes tangible issues in the field - we (Houston and myself) uncovered
this issue while debugging a collection-delete operation that stalled out
indefinitely. The bug described above messes up the ref-count for the SolrCore
in question, so deleting this core/replica stalled forever waiting for the
ref-count to hit 0.
Side note 1: My reading of the logic here suggests that this problem would
happen reliably any time this codepath is invoked. It was a tough thing to
debug, so I wouldn't expect user bug re
ports necessarily. But our tests have pretty stringent ref-count checks, and
the fact that no tests report this issue suggests to me that this codepath is
entirely untested.
Side note 2: I have some larger questions about whether this codepath is
something we really want to continue supporting in Solr today. You can imagine
scenarios where this could lead to data-loss if a replica gets into this state
and then something happens to the leader. Safer to keep the data that this
replica does have around, even if it's slightly less up to date than what the
leader has currently. But I'll raise those on SOLR-12999 which initially
introduced the feature. Let's assume on the JIRA ticket that we do want to
keep the feature around.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]