[
https://issues.apache.org/jira/browse/SOLR-17497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17892684#comment-17892684
]
Sanjay Dutt commented on SOLR-17497:
------------------------------------
{code:java}
@Test
public void test() {
  ExecutorService fsyncService =
      ExecutorUtil.newMDCAwareSingleThreadExecutor(new SolrNamedThreadFactory("fsyncService"));
  try {
    // submit() captures the thrown exception in the returned Future, which is
    // never inspected here, so the test passes without any visible failure.
    fsyncService.submit(
        () -> {
          throw new AlreadyClosedException("Directory is already closed!");
        });
  } catch (Exception e) {
    System.out.println(e);
  } finally {
    fsyncService.shutdown();
  }
}{code}
In [https://github.com/apache/solr/pull/2707] we essentially replaced
ExecutorService#submit with ExecutorService#execute, and execute propagates the
exception instead of suppressing it. The example above demonstrates the
difference: as written (with submit) it does not fail, but if you switch to
execute it fails immediately.
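The same difference can be shown with plain JDK executors, without any Solr classes. A minimal sketch (class and method names here are mine, not from the PR): submit() stores the throwable in the returned Future, so nothing surfaces unless someone calls get(), whereas execute() lets it propagate to the worker thread's uncaught-exception handler.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

public class SubmitVsExecute {

  // submit(): the exception never escapes the worker thread; it is captured
  // in the Future and only surfaces if someone calls get().
  static String viaSubmit() throws InterruptedException {
    ExecutorService pool = Executors.newSingleThreadExecutor();
    try {
      Future<?> f =
          pool.submit(
              () -> {
                throw new IllegalStateException("Directory is already closed!");
              });
      try {
        f.get();
        return "no exception";
      } catch (ExecutionException e) {
        return "wrapped: " + e.getCause().getMessage();
      }
    } finally {
      pool.shutdown();
    }
  }

  // execute(): the exception propagates out of the Runnable and reaches the
  // thread's uncaught-exception handler instead of being silently captured.
  static String viaExecute() throws InterruptedException {
    AtomicReference<Throwable> uncaught = new AtomicReference<>();
    CountDownLatch handlerRan = new CountDownLatch(1);
    ExecutorService pool =
        Executors.newSingleThreadExecutor(
            r -> {
              Thread t = new Thread(r);
              t.setUncaughtExceptionHandler(
                  (thread, ex) -> {
                    uncaught.set(ex);
                    handlerRan.countDown();
                  });
              return t;
            });
    pool.execute(
        () -> {
          throw new IllegalStateException("Directory is already closed!");
        });
    pool.shutdown();
    handlerRan.await(5, TimeUnit.SECONDS);
    return "uncaught: " + uncaught.get().getMessage();
  }

  public static void main(String[] args) throws InterruptedException {
    System.out.println(viaSubmit()); // wrapped: Directory is already closed!
    System.out.println(viaExecute()); // uncaught: Directory is already closed!
  }
}
```

With submit the failure stays invisible unless the Future is checked; with execute it hits the uncaught-exception handler, which is exactly what the randomized-testing framework captures in these test failures.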
> Pull replicas throws AlreadyClosedException
> ---------------------------------------------
>
> Key: SOLR-17497
> URL: https://issues.apache.org/jira/browse/SOLR-17497
> Project: Solr
> Issue Type: Task
> Security Level: Public(Default Security Level. Issues are Public)
> Reporter: Sanjay Dutt
> Priority: Major
> Attachments: Screenshot 2024-10-23 at 6.01.02 PM.png
>
>
> Recently, a common exception (org.apache.lucene.store.AlreadyClosedException:
> this Directory is closed) has been seen in multiple failed test cases.
> FAILED: org.apache.solr.cloud.TestPullReplica.testKillPullReplica
> FAILED:
> org.apache.solr.cloud.SplitShardWithNodeRoleTest.testSolrClusterWithNodeRoleWithPull
> FAILED: org.apache.solr.cloud.TestPullReplica.testAddDocs
>
>
> {code:java}
> com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=10271, name=fsyncService-6341-thread-1, state=RUNNABLE, group=TGRP-SplitShardWithNodeRoleTest]
>   at __randomizedtesting.SeedInfo.seed([3F7DACB3BC44C3C4:E5DB3E97188A8EB9]:0)
> Caused by: org.apache.lucene.store.AlreadyClosedException: this Directory is closed
>   at __randomizedtesting.SeedInfo.seed([3F7DACB3BC44C3C4]:0)
>   at app//org.apache.lucene.store.BaseDirectory.ensureOpen(BaseDirectory.java:50)
>   at app//org.apache.lucene.store.ByteBuffersDirectory.sync(ByteBuffersDirectory.java:237)
>   at app//org.apache.lucene.tests.store.MockDirectoryWrapper.sync(MockDirectoryWrapper.java:214)
>   at app//org.apache.solr.handler.IndexFetcher$DirectoryFile.sync(IndexFetcher.java:2034)
>   at app//org.apache.solr.handler.IndexFetcher$FileFetcher.lambda$fetch$0(IndexFetcher.java:1803)
>   at app//org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$1(ExecutorUtil.java:449)
>   at [email protected]/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>   at [email protected]/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>   at [email protected]/java.lang.Thread.run(Thread.java:829)
> {code}
>
> The interesting thing about these test cases is that they all share the same
> kind of setup: each has one shard and two replicas, one NRT and one PULL.
>
> Walking through the execution of one of the failing test cases:
> FAILED: org.apache.solr.cloud.TestPullReplica.testKillPullReplica
>
> Test flow
> 1. Create a collection with 1 NRT and 1 PULL replica
> 2. waitForState
> 3. waitForNumDocsInAllActiveReplicas(0); // *Name says it all*
> 4. Index another document.
> 5. waitForNumDocsInAllActiveReplicas(1);
> 6. Stop Pull replica
> 7. Index another document
> 8. waitForNumDocsInAllActiveReplicas(2);
> 9. Start Pull Replica
> 10. waitForState
> 11. waitForNumDocsInAllActiveReplicas(2);
>
> According to the logs, the whole sequence executed successfully. Here is a
> link to the logs:
> [https://ge.apache.org/s/yxydiox3gvlf2/tests/task/:solr:core:test/details/org.apache.solr.cloud.TestPullReplica/testKillPullReplica/1/output]
> (link may stop working in the future)
>
> The last step, which verifies that every active replica has both documents,
> logged an INFO line, which is further proof that it completed successfully.
>
> {code:java}
> 616575 INFO  (TEST-TestPullReplica.testKillPullReplica-seed#[F30CC837FDD0DC28]) [n: c: s: r: x: t:] o.a.s.c.TestPullReplica Replica core_node3 (https://127.0.0.1:35647/solr/pull_replica_test_kill_pull_replica_shard1_replica_n1/) has all 2 docs
> 616606 INFO  (qtp1091538342-13057-null-11348) [n:127.0.0.1:38207_solr c:pull_replica_test_kill_pull_replica s:shard1 r:core_node4 x:pull_replica_test_kill_pull_replica_shard1_replica_p2 t:null-11348] o.a.s.c.S.Request webapp=/solr path=/select params={q=*:*&wt=javabin&version=2} rid=null-11348 hits=2 status=0 QTime=0
> 616607 INFO  (TEST-TestPullReplica.testKillPullReplica-seed#[F30CC837FDD0DC28]) [n: c: s: r: x: t:] o.a.s.c.TestPullReplica Replica core_node4 (https://127.0.0.1:38207/solr/pull_replica_test_kill_pull_replica_shard1_replica_p2/) has all 2 docs{code}
>
> *Where is the issue then?*
> The logs show that after the PULL replica was restarted, the recovery process
> began, and after fetching all of the file info from the NRT replica, the
> replication was aborted with the message "User aborted replication".
>
> {code:java}
> o.a.s.h.IndexFetcher User aborted Replication =>
>   org.apache.solr.handler.IndexFetcher$ReplicationHandlerException: User aborted replication
>     at org.apache.solr.handler.IndexFetcher$FileFetcher.fetchPackets(IndexFetcher.java:1826)
> org.apache.solr.handler.IndexFetcher$ReplicationHandlerException: User aborted replication{code}
>
> Inside IndexFetcher, once the fetch is aborted, cleanup() is invoked: it
> closes the file, and deletes the resource only if the number of bytes
> downloaded does not equal the expected size.
> {code:java}
> private void cleanup() {
>   try {
>     file.close();
>   } catch (Exception e) {
>     /* no-op */
>     log.error("Error closing file: {}", this.saveAs, e);
>   }
>   if (bytesDownloaded != size) {
>     // if the download is not complete then delete the file being downloaded
>     try {
>       file.delete();
>     } catch (Exception e) {
>       log.error("Error deleting file: {}", this.saveAs, e);
>     }
>     // if the failure is due to a user abort it is returned normally else an exception is thrown
>     if (!aborted)
>       throw new SolrException(
>           SolrException.ErrorCode.SERVER_ERROR,
>           "Unable to download " + fileName + " completely. Downloaded " + bytesDownloaded + "!=" + size);
>   }
> }{code}
> After that, a sync operation is scheduled on the fsyncService thread, and
> that is where it fails.
> {code:java}
> fsyncService.execute(
>     () -> {
>       try {
>         file.sync();
>       } catch (IOException e) {
>         fsyncException = e;
>       }
>     });{code}
> Now, two questions:
> 1. Why is the replication aborted in the first place, and who triggers it?
> 2. Should the sync not be performed when the replication has been aborted?
>
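On the second question, one possible guard can be sketched outside of Solr. This is a minimal illustration, not Solr code: the class, field, and method names below are mine, and `aborted`/`fsyncException` merely stand in for the corresponding IndexFetcher state. The idea is to skip scheduling the fsync entirely when the fetch has been aborted, since cleanup() has already closed (and possibly deleted) the file by then.

```java
import java.io.IOException;
import java.util.concurrent.ExecutorService;

public class FsyncGuardSketch {
  // Hypothetical stand-ins for IndexFetcher state; names are assumptions.
  volatile boolean aborted;
  volatile IOException fsyncException;

  interface SyncableFile {
    void sync() throws IOException;
  }

  // Schedule the fsync only when the fetch was not aborted: an aborted fetch
  // has already had its file closed (and possibly deleted) by cleanup(), so
  // syncing it would hit AlreadyClosedException.
  void maybeSync(ExecutorService fsyncService, SyncableFile file) {
    if (aborted) {
      return;
    }
    fsyncService.execute(
        () -> {
          try {
            file.sync();
          } catch (IOException e) {
            fsyncException = e;
          }
        });
  }
}
```

Whether this is the right fix depends on the answer to the first question; if the abort itself is unexpected, suppressing the sync would only hide the underlying problem.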
--
This message was sent by Atlassian Jira
(v8.20.10#820010)