[
https://issues.apache.org/jira/browse/SOLR-18121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18058903#comment-18058903
]
Jason Gerlowski commented on SOLR-18121:
----------------------------------------
I think there are other bugs at play here too. Look at how the 'indexDir'
Directory instance is used in this code:
{code}
solrCore.getDirectoryFactory().doneWithDirectory(indexDir);
solrCore.deleteNonSnapshotIndexFiles(indexDirPath);
this.solrCore.closeSearcher();
assert testWait.getAsBoolean();
solrCore.getUpdateHandler().getSolrCoreState().closeIndexWriter(this.solrCore,
false);
for (String f : filesTobeDeleted) {
try {
indexDir.deleteFile(f);
} catch (FileNotFoundException | NoSuchFileException e) {
// no problem , it was deleted by someone else
}
}
{code}
We call {{DirectoryFactory.doneWithDirectory}} and then go on to use "indexDir"
~10 lines later. This whole codepath needs some attention IMO if we ultimately
decide to keep it around.
> Ref-counting leak in IndexFetcher's "deleteFilesInAdvance" codepath
> -------------------------------------------------------------------
>
> Key: SOLR-18121
> URL: https://issues.apache.org/jira/browse/SOLR-18121
> Project: Solr
> Issue Type: Bug
> Components: replication (java)
> Reporter: Jason Gerlowski
> Priority: Major
>
> While experimenting with TLOG and PULL replicas locally, I think I've
> discovered a resource-leak in IndexFetcher.
> IndexFetcher's little-known "deleteFilesInAdvance" code-path, which deletes
> the local copy of the index if space is needed to download the remote copy,
> doesn't obey the contract of {{SolrCoreState.closeIndexWriter}}. The problem
> code in question is around
> [L1273|https://github.com/apache/solr/blob/main/solr/core/src/java/org/apache/solr/handler/IndexFetcher.java#L1273]:
> {code}
>
> solrCore.getUpdateHandler().getSolrCoreState().closeIndexWriter(this.solrCore,
> false);
> for (String f : filesTobeDeleted) {
> try {
> indexDir.deleteFile(f);
> } catch (FileNotFoundException | NoSuchFileException e) {
> // no problem , it was deleted by someone else
> }
> }
> {code}
> The javadocs for {{closeIndexWriter}} indicate:
> bq. Expert method that closes the IndexWriter - you must call {@link
> #openIndexWriter(SolrCore)} in a finally block after calling this method.
> ...which clearly isn't happening in this case.
> This causes tangible issues in the field - we (Houston and myself) uncovered
> this issue while debugging a collection-delete operation that stalled out
> indefinitely. The bug described above messes up the ref-count for the
> SolrCore in question, so deleting this core/replica stalled forever waiting
> for the ref-count to hit 0.
> Side note 1: My reading of the logic here suggests that this problem would
> happen reliably any time this codepath is invoked. It was a tough thing to
> debug, so I wouldn't expect user bug re
> ports necessarily. But our tests have pretty stringent ref-count checks, and
> the fact that no tests report this issue suggests to me that this codepath is
> entirely untested.
> Side note 2: I have some larger questions about whether this codepath is
> something we really want to continue supporting in Solr today. You can
> imagine scenarios where this could lead to data-loss if a replica gets into
> this state and then something happens to the leader. Safer to keep the data
> that this replica does have around, even if it's slightly less up to date
> than what the leader has currently. But I'll raise those on SOLR-12999 which
> initially introduced the feature. Let's assume on the JIRA ticket that we do
> want to keep the feature around.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]