Jason Gerlowski created SOLR-18121:
--------------------------------------

             Summary: Ref-counting leak in IndexFetcher's 
"deleteFilesInAdvance" codepath
                 Key: SOLR-18121
                 URL: https://issues.apache.org/jira/browse/SOLR-18121
             Project: Solr
          Issue Type: Bug
          Components: replication (java)
            Reporter: Jason Gerlowski


While experimenting with TLOG and PULL replicas locally, I think I've 
discovered a resource-leak in IndexFetcher.

IndexFetcher's little-known "deleteFilesInAdvance" code-path, which deletes the 
local copy of the index if space is needed to download the remote copy, doesn't 
obey the contract of {{SolrCoreState.closeIndexWriter}}.  The problem code in 
question is around 
[L1273|https://github.com/apache/solr/blob/main/solr/core/src/java/org/apache/solr/handler/IndexFetcher.java#L1273]:

{code}
    
solrCore.getUpdateHandler().getSolrCoreState().closeIndexWriter(this.solrCore, 
false);
    for (String f : filesTobeDeleted) {
      try {
        indexDir.deleteFile(f);
      } catch (FileNotFoundException | NoSuchFileException e) {
        // no problem , it was deleted by someone else
      }   
    }   
{code}

The javadocs for {{closeIndexWriter}} indicate:

bq. Expert method that closes the IndexWriter - you must call {@link 
#openIndexWriter(SolrCore)} in a finally block after calling this method.

...which clearly isn't happening in this case.

This causes tangible issues in the field - we (Houston and myself) uncovered 
this issue while debugging a collection-delete operation that stalled out 
indefinitely.  The bug described above messes up the ref-count for the SolrCore 
in question, so deleting this core/replica stalled forever waiting for the 
ref-count to hit 0.

Side note 1: My reading of the logic here suggests that this problem would 
happen reliably any time this codepath is invoked.  It was a tough thing to 
debug, so I wouldn't expect user bug re
ports necessarily.  But our tests have pretty stringent ref-count checks, and 
the fact that no tests report this issue suggests to me that this codepath is 
entirely untested. 

Side note 2: I have some larger questions about whether this codepath is 
something we really want to continue supporting in Solr today.  You can imagine 
scenarios where this could lead to data-loss if a replica gets into this state 
and then something happens to the leader.  Safer to keep the data that this 
replica does have around, even if it's slightly less up to date than what the 
leader has currently.  But I'll raise those on SOLR-12999 which initially 
introduced the feature.  Let's assume on the JIRA ticket that we do want to 
keep the feature around.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to