[jira] [Updated] (SOLR-17497) Pull replicas throws AlreadyClosedException

Sanjay Dutt (Jira) Sun, 13 Oct 2024 22:27:04 -0700


     [ 
https://issues.apache.org/jira/browse/SOLR-17497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Sanjay Dutt updated SOLR-17497:
-------------------------------
    Description: 
Recently, a common exception (org.apache.lucene.store.AlreadyClosedException: 
this Directory is closed) seen in multiple failed test cases. 

FAILED:  org.apache.solr.cloud.TestPullReplica.testKillPullReplica

FAILED:  
org.apache.solr.cloud.SplitShardWithNodeRoleTest.testSolrClusterWithNodeRoleWithPull

FAILED:  org.apache.solr.cloud.TestPullReplica.testAddDocs

 

 
{code:java}
com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught 
exception in thread: Thread[id=10271, name=fsyncService-6341-thread-1, 
state=RUNNABLE, group=TGRP-SplitShardWithNodeRoleTest]
        at 
__randomizedtesting.SeedInfo.seed([3F7DACB3BC44C3C4:E5DB3E97188A8EB9]:0)
Caused by: org.apache.lucene.store.AlreadyClosedException: this Directory is 
closed
        at __randomizedtesting.SeedInfo.seed([3F7DACB3BC44C3C4]:0)
        at 
app//org.apache.lucene.store.BaseDirectory.ensureOpen(BaseDirectory.java:50)
        at 
app//org.apache.lucene.store.ByteBuffersDirectory.sync(ByteBuffersDirectory.java:237)
        at 
app//org.apache.lucene.tests.store.MockDirectoryWrapper.sync(MockDirectoryWrapper.java:214)
        at 
app//org.apache.solr.handler.IndexFetcher$DirectoryFile.sync(IndexFetcher.java:2034)
        at 
app//org.apache.solr.handler.IndexFetcher$FileFetcher.lambda$fetch$0(IndexFetcher.java:1803)
        at 
app//org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$1(ExecutorUtil.java:449)
        at 
[email protected]/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at 
[email protected]/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at [email protected]/java.lang.Thread.run(Thread.java:829)

 {code}
 
Interesting thing about these test cases is that they all share same kind of 
setup where each has one shard and two replicas – one NRT and another is PULL.
 
Going through one of the test case execution step.
FAILED:  org.apache.solr.cloud.TestPullReplica.testKillPullReplica
 
Test flow
1. Create a collection with 1 NRT and 1 PULL replica
2. waitForState
3. waitForNumDocsInAllActiveReplicas(0); // *Name says it all*
4. Index another document.
5. waitForNumDocsInAllActiveReplicas(1);
6. Stop Pull replica
7. Index another document
8. waitForNumDocsInAllActiveReplicas(2);
9. Start Pull Replica
10. waitForState
11. waitForNumDocsInAllActiveReplicas(2);
 
As per the logs the whole sequence executed successfully. Here is the link to 
the logs: 
[https://ge.apache.org/s/yxydiox3gvlf2/tests/task/:solr:core:test/details/org.apache.solr.cloud.TestPullReplica/testKillPullReplica/1/output]
 (link may stop working in the future)
 
Last step where they are making sure that all the active replicas should have 
two documents each has logged a info which is another proof that it completed 
successfully. 
 
{code:java}
616575 INFO (TEST-TestPullReplica.testKillPullReplica-seed#[F30CC837FDD0DC28]) 
[n: c: s: r: x: t:] o.a.s.c.TestPullReplica Replica core_node3 
(https://127.0.0.1:35647/solr/pull_replica_test_kill_pull_replica_shard1_replica_n1/)
 has all 2 docs 616606 INFO (qtp1091538342-13057-null-11348) 
[n:127.0.0.1:38207_solr c:pull_replica_test_kill_pull_replica s:shard1 
r:core_node4 x:pull_replica_test_kill_pull_replica_shard1_replica_p2 
t:null-11348] o.a.s.c.S.Request webapp=/solr path=/select 
params={q=*:*&wt=javabin&version=2} rid=null-11348 hits=2 status=0 QTime=0 
616607 INFO (TEST-TestPullReplica.testKillPullReplica-seed#[F30CC837FDD0DC28]) 
[n: c: s: r: x: t:] o.a.s.c.TestPullReplica Replica core_node4 
(https://127.0.0.1:38207/solr/pull_replica_test_kill_pull_replica_shard1_replica_p2/)
 has all 2 docs{code}
 

*Where is the issue then?*

In the logs it has been observed, that after restarting the PULL replica. The 
recovery process started and after fetching all the files info from the NRT, 
the replication aborted and logged "User aborted replication"

 
{code:java}
o.a.s.h.IndexFetcher User aborted Replication => 
org.apache.solr.handler.IndexFetcher$ReplicationHandlerException: User aborted 
replication at 
org.apache.solr.handler.IndexFetcher$FileFetcher.fetchPackets(IndexFetcher.java:1826)
 org.apache.solr.handler.IndexFetcher$ReplicationHandlerException: User aborted 
replication{code}
 

Inside IndexFetcher once it's aborted, It performs cleanup() operation which do 
the closeup and delete the resource only if the operation is aborted.
{code:java}
private void cleanup() {
 try {
  file.close();
 } catch (Exception e) {
 /* no-op */
 log.error("Error closing file: {}", this.saveAs, e);
 }
 if (bytesDownloaded != size) {
 // if the download is not complete then
 // delete the file being downloaded
 try {
  file.delete();
 } catch (Exception e) {
  log.error("Error deleting file: {}", this.saveAs, e);
 }
 // if the failure is due to a user abort it is returned normally else an    
exception is thrown
 if (!aborted)
  throw new SolrException(SolrException.ErrorCode.SERVER_ERROR, "Unable to 
download " + fileName + " completely. Downloaded " + bytesDownloaded + "!=" + 
size);
 }
}{code}
After which a sync operation is performed in a thread, and that's where it 
fails.
{code:java}
fsyncService.execute(
 () -> {
   try {
     file.sync();
   } catch (IOException e) {
      fsyncException = e;
   }
});{code}
Now two things:

1. Why would replication is aborted in the first place? And who executes it?

2. Should sync not be performed when the replication is aborted?

 

  was:
Recently, a common exception (org.apache.lucene.store.AlreadyClosedException: 
this Directory is closed) seen in multiple failed test cases. 

FAILED:  org.apache.solr.cloud.TestPullReplica.testKillPullReplica

FAILED:  
org.apache.solr.cloud.SplitShardWithNodeRoleTest.testSolrClusterWithNodeRoleWithPull

FAILED:  org.apache.solr.cloud.TestPullReplica.testAddDocs

 

 
{code:java}
com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught 
exception in thread: Thread[id=10271, name=fsyncService-6341-thread-1, 
state=RUNNABLE, group=TGRP-SplitShardWithNodeRoleTest]
        at 
__randomizedtesting.SeedInfo.seed([3F7DACB3BC44C3C4:E5DB3E97188A8EB9]:0)
Caused by: org.apache.lucene.store.AlreadyClosedException: this Directory is 
closed
        at __randomizedtesting.SeedInfo.seed([3F7DACB3BC44C3C4]:0)
        at 
app//org.apache.lucene.store.BaseDirectory.ensureOpen(BaseDirectory.java:50)
        at 
app//org.apache.lucene.store.ByteBuffersDirectory.sync(ByteBuffersDirectory.java:237)
        at 
app//org.apache.lucene.tests.store.MockDirectoryWrapper.sync(MockDirectoryWrapper.java:214)
        at 
app//org.apache.solr.handler.IndexFetcher$DirectoryFile.sync(IndexFetcher.java:2034)
        at 
app//org.apache.solr.handler.IndexFetcher$FileFetcher.lambda$fetch$0(IndexFetcher.java:1803)
        at 
app//org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$1(ExecutorUtil.java:449)
        at 
[email protected]/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at 
[email protected]/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at [email protected]/java.lang.Thread.run(Thread.java:829)

 {code}
 
Interesting thing about these test cases is that they all share same kind of 
setup where each has one shard and two replicas – one NRT and another is PULL.
 
Going through one of the test case execution step.
FAILED:  org.apache.solr.cloud.TestPullReplica.testKillPullReplica
 
Test flow
1. Create a collection with 1 NRT and 1 PULL replica
2. waitForState
3. waitForNumDocsInAllActiveReplicas(0); // *Name says it all*
4. Index another document.
5. waitForNumDocsInAllActiveReplicas(1);
6. Stop Pull replica
7. Index another document
8. waitForNumDocsInAllActiveReplicas(2);
9. Start Pull Replica
10. waitForState
11. waitForNumDocsInAllActiveReplicas(2);
 
As per the logs the whole sequence executed successfully. Here is the link to 
the logs: 
[https://ge.apache.org/s/yxydiox3gvlf2/tests/task/:solr:core:test/details/org.apache.solr.cloud.TestPullReplica/testKillPullReplica/1/output]
 (link may stop working in the future)
 
Last step where they are making sure that all the active replicas should have 
two documents each has logged a info which is another proof that it completed 
successfully. 
 
{code:java}
616575 INFO (TEST-TestPullReplica.testKillPullReplica-seed#[F30CC837FDD0DC28]) 
[n: c: s: r: x: t:] o.a.s.c.TestPullReplica Replica core_node3 
(https://127.0.0.1:35647/solr/pull_replica_test_kill_pull_replica_shard1_replica_n1/)
 has all 2 docs 616606 INFO (qtp1091538342-13057-null-11348) 
[n:127.0.0.1:38207_solr c:pull_replica_test_kill_pull_replica s:shard1 
r:core_node4 x:pull_replica_test_kill_pull_replica_shard1_replica_p2 
t:null-11348] o.a.s.c.S.Request webapp=/solr path=/select 
params={q=*:*&wt=javabin&version=2} rid=null-11348 hits=2 status=0 QTime=0 
616607 INFO (TEST-TestPullReplica.testKillPullReplica-seed#[F30CC837FDD0DC28]) 
[n: c: s: r: x: t:] o.a.s.c.TestPullReplica Replica core_node4 
(https://127.0.0.1:38207/solr/pull_replica_test_kill_pull_replica_shard1_replica_p2/)
 has all 2 docs{code}
 

*Where is the issue then?*

In the logs it has been observed, that after restarting the PULL replica. The 
recovery process started and after fetching all the files info from the NRT, 
the replication aborted and logged "User aborted replication"

 
{code:java}
o.a.s.h.IndexFetcher User aborted Replication => 
org.apache.solr.handler.IndexFetcher$ReplicationHandlerException: User aborted 
replication at 
org.apache.solr.handler.IndexFetcher$FileFetcher.fetchPackets(IndexFetcher.java:1826)
 org.apache.solr.handler.IndexFetcher$ReplicationHandlerException: User aborted 
replication{code}
 

Inside IndexFetcher once it's aborted, It performs cleanup() operation which do 
the closeup and delete the resource only if the operation is aborted.
{code:java}
private void cleanup() {
 try {
  file.close();
 } catch (Exception e) {
 /* no-op */
 log.error("Error closing file: {}", this.saveAs, e);
 }
 if (bytesDownloaded != size) {
 // if the download is not complete then
 // delete the file being downloaded
 try {
  file.delete();
 } catch (Exception e) {
  log.error("Error deleting file: {}", this.saveAs, e);
 }
 // if the failure is due to a user abort it is returned normally else an    
exception is thrown
 if (!aborted)
  throw new SolrException(SolrException.ErrorCode.SERVER_ERROR, "Unable to 
download " + fileName + " completely. Downloaded " + bytesDownloaded + "!=" + 
size);
 }
}{code}
After which a sync operation is performed in a thread, and that's where it 
fails.
{code:java}
fsyncService.execute(
 () -> {
   try {
     file.sync();
   } catch (IOException e) {
      fsyncException = e;
   }
});{code}
Now two things:

1. Why would replication is aborted in the first place? And who executes it?

2. Once the resource is deleted, then sync operation cannot be performed. 
Should sync not be performed when the replication is aborted?

 


> Pull replicas throws AlreadyClosedException  
> ---------------------------------------------
>
>                 Key: SOLR-17497
>                 URL: https://issues.apache.org/jira/browse/SOLR-17497
>             Project: Solr
>          Issue Type: Task
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Sanjay Dutt
>            Priority: Major
>
> Recently, a common exception (org.apache.lucene.store.AlreadyClosedException: 
> this Directory is closed) seen in multiple failed test cases. 
> FAILED:  org.apache.solr.cloud.TestPullReplica.testKillPullReplica
> FAILED:  
> org.apache.solr.cloud.SplitShardWithNodeRoleTest.testSolrClusterWithNodeRoleWithPull
> FAILED:  org.apache.solr.cloud.TestPullReplica.testAddDocs
>  
>  
> {code:java}
> com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an 
> uncaught exception in thread: Thread[id=10271, 
> name=fsyncService-6341-thread-1, state=RUNNABLE, 
> group=TGRP-SplitShardWithNodeRoleTest]
>         at 
> __randomizedtesting.SeedInfo.seed([3F7DACB3BC44C3C4:E5DB3E97188A8EB9]:0)
> Caused by: org.apache.lucene.store.AlreadyClosedException: this Directory is 
> closed
>         at __randomizedtesting.SeedInfo.seed([3F7DACB3BC44C3C4]:0)
>         at 
> app//org.apache.lucene.store.BaseDirectory.ensureOpen(BaseDirectory.java:50)
>         at 
> app//org.apache.lucene.store.ByteBuffersDirectory.sync(ByteBuffersDirectory.java:237)
>         at 
> app//org.apache.lucene.tests.store.MockDirectoryWrapper.sync(MockDirectoryWrapper.java:214)
>         at 
> app//org.apache.solr.handler.IndexFetcher$DirectoryFile.sync(IndexFetcher.java:2034)
>         at 
> app//org.apache.solr.handler.IndexFetcher$FileFetcher.lambda$fetch$0(IndexFetcher.java:1803)
>         at 
> app//org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$1(ExecutorUtil.java:449)
>         at 
> [email protected]/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>         at 
> [email protected]/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>         at [email protected]/java.lang.Thread.run(Thread.java:829)
>  {code}
>  
> Interesting thing about these test cases is that they all share same kind of 
> setup where each has one shard and two replicas – one NRT and another is PULL.
>  
> Going through one of the test case execution step.
> FAILED:  org.apache.solr.cloud.TestPullReplica.testKillPullReplica
>  
> Test flow
> 1. Create a collection with 1 NRT and 1 PULL replica
> 2. waitForState
> 3. waitForNumDocsInAllActiveReplicas(0); // *Name says it all*
> 4. Index another document.
> 5. waitForNumDocsInAllActiveReplicas(1);
> 6. Stop Pull replica
> 7. Index another document
> 8. waitForNumDocsInAllActiveReplicas(2);
> 9. Start Pull Replica
> 10. waitForState
> 11. waitForNumDocsInAllActiveReplicas(2);
>  
> As per the logs the whole sequence executed successfully. Here is the link to 
> the logs: 
> [https://ge.apache.org/s/yxydiox3gvlf2/tests/task/:solr:core:test/details/org.apache.solr.cloud.TestPullReplica/testKillPullReplica/1/output]
>  (link may stop working in the future)
>  
> Last step where they are making sure that all the active replicas should have 
> two documents each has logged a info which is another proof that it completed 
> successfully. 
>  
> {code:java}
> 616575 INFO 
> (TEST-TestPullReplica.testKillPullReplica-seed#[F30CC837FDD0DC28]) [n: c: s: 
> r: x: t:] o.a.s.c.TestPullReplica Replica core_node3 
> (https://127.0.0.1:35647/solr/pull_replica_test_kill_pull_replica_shard1_replica_n1/)
>  has all 2 docs 616606 INFO (qtp1091538342-13057-null-11348) 
> [n:127.0.0.1:38207_solr c:pull_replica_test_kill_pull_replica s:shard1 
> r:core_node4 x:pull_replica_test_kill_pull_replica_shard1_replica_p2 
> t:null-11348] o.a.s.c.S.Request webapp=/solr path=/select 
> params={q=*:*&wt=javabin&version=2} rid=null-11348 hits=2 status=0 QTime=0 
> 616607 INFO 
> (TEST-TestPullReplica.testKillPullReplica-seed#[F30CC837FDD0DC28]) [n: c: s: 
> r: x: t:] o.a.s.c.TestPullReplica Replica core_node4 
> (https://127.0.0.1:38207/solr/pull_replica_test_kill_pull_replica_shard1_replica_p2/)
>  has all 2 docs{code}
>  
> *Where is the issue then?*
> In the logs it has been observed, that after restarting the PULL replica. The 
> recovery process started and after fetching all the files info from the NRT, 
> the replication aborted and logged "User aborted replication"
>  
> {code:java}
> o.a.s.h.IndexFetcher User aborted Replication => 
> org.apache.solr.handler.IndexFetcher$ReplicationHandlerException: User 
> aborted replication at 
> org.apache.solr.handler.IndexFetcher$FileFetcher.fetchPackets(IndexFetcher.java:1826)
>  org.apache.solr.handler.IndexFetcher$ReplicationHandlerException: User 
> aborted replication{code}
>  
> Inside IndexFetcher once it's aborted, It performs cleanup() operation which 
> do the closeup and delete the resource only if the operation is aborted.
> {code:java}
> private void cleanup() {
>  try {
>   file.close();
>  } catch (Exception e) {
>  /* no-op */
>  log.error("Error closing file: {}", this.saveAs, e);
>  }
>  if (bytesDownloaded != size) {
>  // if the download is not complete then
>  // delete the file being downloaded
>  try {
>   file.delete();
>  } catch (Exception e) {
>   log.error("Error deleting file: {}", this.saveAs, e);
>  }
>  // if the failure is due to a user abort it is returned normally else an    
> exception is thrown
>  if (!aborted)
>   throw new SolrException(SolrException.ErrorCode.SERVER_ERROR, "Unable to 
> download " + fileName + " completely. Downloaded " + bytesDownloaded + "!=" + 
> size);
>  }
> }{code}
> After which a sync operation is performed in a thread, and that's where it 
> fails.
> {code:java}
> fsyncService.execute(
>  () -> {
>    try {
>      file.sync();
>    } catch (IOException e) {
>       fsyncException = e;
>    }
> });{code}
> Now two things:
> 1. Why would replication is aborted in the first place? And who executes it?
> 2. Should sync not be performed when the replication is aborted?
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Updated] (SOLR-17497) Pull replicas throws AlreadyClosedException

Reply via email to